mirror of
https://github.com/apache/impala.git
synced 2025-12-25 02:03:09 -05:00
This change updates the Impala performance schema and test vector generation techniques. It also migrates the existing benchmark scripts from Ruby to Python. The change has a few parts:

1) Conversion of test vector generation and benchmark statement generation from Ruby to Python. As a result, the benchmark test vector and dimension files are now written in CSV format (Python doesn't have built-in YAML support).

2) Standardize the naming of benchmark tables (to somewhat match the query tests). In general the form is:
   * If file_format=text and compression=none, do not use a table suffix
   * Abbreviate sequence file as (seq), rc file as (rc), etc.
   * If using BLOCK compression, don't append anything to the table name; if using 'record', append 'record'

3) Created a new way of adding new schemas: the benchmark_schema_template.sql file. The generate_benchmark_statements.py script reads this in and breaks up the sections. The section format is:
   ====
   Data Set Name
   ---
   BASE table name
   ---
   CREATE STATEMENT Template
   ---
   INSERT ... SELECT * format
   ---
   LOAD Base statement
   ---
   LOAD STATEMENT Format

   Where BASE table is a table the other file formats/compression types can be generated from. This would generally be a local file.

   The thinking is that if the files already exist in HDFS then we can just load the file directly rather than issue an INSERT ... SELECT * statement. The generate_benchmark_statements.py script has been updated to use this new template, as well as to query HDFS for each table to determine how it should be created. It then outputs an ideal file called load-benchmark-*-generated.sql. Since this file is generated dynamically, we can remove the old benchmark statement files.

4) This has been hooked into load-benchmark-data.sh, and run_query has been updated to use the new format as well.
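The table naming rules in part 2 could be sketched as a small shell helper. This is only an illustration, not the code from generate_benchmark_statements.py: the function name `benchmark_table_name`, the `_` separator, the order of the suffix parts, the accepted argument values (`text`, `sequencefile`, `rcfile`, `none`, `block`, `record`), and the choice to append the codec name are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive a benchmark table name from its dimensions.
# Arguments (all value spellings are assumptions, not taken from the commit):
#   $1 base table name
#   $2 file format: text | sequencefile | rcfile
#   $3 compression codec, or "none"
#   $4 compression type: none | block | record
benchmark_table_name() {
  local base=$1
  local file_format=$2
  local codec=$3
  local compression_type=$4
  local suffix=""

  # Abbreviate the file format; plain text gets no format suffix.
  case "$file_format" in
    text) ;;
    sequencefile) suffix="_seq" ;;
    rcfile) suffix="_rc" ;;
    *) echo "unknown file format: $file_format" >&2; return 1 ;;
  esac

  if [ "$codec" != "none" ]; then
    # Assumption: the codec name itself becomes part of the suffix.
    suffix="${suffix}_${codec}"
    # BLOCK compression appends nothing; 'record' appends 'record'.
    if [ "$compression_type" = "record" ]; then
      suffix="${suffix}_record"
    fi
  fi

  echo "${base}${suffix}"
}

benchmark_table_name mytable text none none            # prints: mytable
benchmark_table_name mytable sequencefile snap block   # prints: mytable_seq_snap
benchmark_table_name mytable rcfile snap record        # prints: mytable_rc_snap_record
```

With text and no compression the base name is returned unchanged, which matches the first rule; the other two calls show how BLOCK and 'record' compression differ only in the trailing `_record`.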
48 lines
1.7 KiB
Bash
Executable File
#!/usr/bin/env bash
# Copyright (c) 2012 Cloudera, Inc. All rights reserved.
#
# Script that creates schema and loads data into Hive for running benchmarks.
# By default the script will load the base data for the "core" scenario.
# If 'pairwise' is specified as a parameter the pairwise combinations of workload
# + file format + compression will be loaded.
# If 'exhaustive' is passed as an argument the exhaustive set of combinations will
# be loaded.

bin=$(dirname "$0")
bin=$(cd "$bin"; pwd)
. "$bin"/impala-config.sh

set -e

# The exploration strategy defaults to "core" unless given as the first argument.
exploration_strategy=core
if [ -n "$1" ]; then
  exploration_strategy=$1
fi

BENCHMARK_SCRIPT_DIR=$IMPALA_HOME/testdata/bin

function execute_hive_query_from_file {
  # hive_args is left unquoted below on purpose so it splits into separate arguments.
  hive_args="-hiveconf hive.root.logger=WARN,console -v -f"
  "$HIVE_HOME/bin/hive" $hive_args "$1"
}

# Generate the create/load statement files for the chosen strategy.
pushd "$IMPALA_HOME/testdata/bin"
./generate_benchmark_statements.py --exploration_strategy "$exploration_strategy"
popd

if [ "$exploration_strategy" = "exhaustive" ]; then
  execute_hive_query_from_file "$BENCHMARK_SCRIPT_DIR/create-benchmark-exhaustive-generated.sql"
  execute_hive_query_from_file "$BENCHMARK_SCRIPT_DIR/load-benchmark-exhaustive-generated.sql"
elif [ "$exploration_strategy" = "pairwise" ]; then
  execute_hive_query_from_file "$BENCHMARK_SCRIPT_DIR/create-benchmark-pairwise-generated.sql"
  execute_hive_query_from_file "$BENCHMARK_SCRIPT_DIR/load-benchmark-pairwise-generated.sql"
elif [ "$exploration_strategy" = "core" ]; then
  execute_hive_query_from_file "$BENCHMARK_SCRIPT_DIR/create-benchmark-core-generated.sql"
  execute_hive_query_from_file "$BENCHMARK_SCRIPT_DIR/load-benchmark-core-generated.sql"
else
  echo "Invalid exploration strategy: $exploration_strategy"
  exit 1
fi

"$IMPALA_HOME/testdata/bin/generate-block-ids.sh"