impala/bin/load-benchmark-data.sh
Lenni Kuff 0da77037e3 Updated Impala performance schema and test vector generation
This change updates the Impala performance schema and test vector generation
techniques. It also migrates the existing Ruby benchmark scripts to Python.
The change has a few parts:

1) Conversion of test vector generation and benchmark statement generation from
Ruby to Python. As part of this, the benchmark test vector and dimension files
are now written in CSV format (Python doesn't have built-in YAML support).

2) Standardize the naming of benchmark tables to (somewhat) match the query
tests. In general the form is:
* If file_format=text and compression=none, do not use a table suffix
* Abbreviate sequence file as (seq), RC file as (rc), etc.
* If using BLOCK compression, don't append anything to the table name; if using
  'record', append 'record'
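
The naming rules above could be sketched roughly as follows. This is only an
illustration: build_table_name, the underscore separator, the order of the
suffix parts, and the abbreviation table are assumptions and are not taken
from the actual generator.

```python
# Hypothetical helper illustrating the naming rules; the separator, the suffix
# order, and this abbreviation table are assumptions, not the real script.
FORMAT_ABBREVIATIONS = {"sequence_file": "seq", "rc_file": "rc"}


def build_table_name(base_name, file_format, compression_codec="none",
                     compression_type="block"):
    """Return the table name for one file format/compression combination."""
    # Uncompressed text tables keep the plain name with no suffix.
    if file_format == "text" and compression_codec == "none":
        return base_name

    # Use the short form of the file format when one is defined.
    parts = [base_name, FORMAT_ABBREVIATIONS.get(file_format, file_format)]

    if compression_codec != "none":
        parts.append(compression_codec)
        # BLOCK compression adds nothing; record compression adds 'record'.
        if compression_type.lower() == "record":
            parts.append("record")

    return "_".join(parts)
```

Under these assumptions, a record-compressed sequence file table built from
"example_table" with a codec named "snappy" would be named
"example_table_seq_snappy_record".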

3) Created a new way to add new schemas: the
benchmark_schema_template.sql file. The generate_benchmark_statements.py script
reads this file in and breaks it up into sections. The section format is:
====
Data Set Name
---
BASE table name
---
CREATE STATEMENT Template
---
INSERT ... SELECT * format
---
LOAD Base statement
---
LOAD STATEMENT Format

Here the BASE table is the table from which the other file format/compression
variants can be generated. Its data would generally come from a local file.
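
As a rough sketch of how a file in this format might be split into sections
(not the actual code in generate_benchmark_statements.py), the snippet below
treats a line containing only ==== as a section separator and a line
containing only --- as a field separator. The SchemaSection type, its field
names, and parse_schema_template are hypothetical.

```python
import re
from collections import namedtuple

# Hypothetical container; the field names mirror the six parts listed above.
SchemaSection = namedtuple(
    "SchemaSection",
    ["data_set", "base_table", "create", "insert", "load_base", "load"])

# Separators must occupy a whole line so dashes inside SQL are left alone.
SECTION_SEPARATOR = re.compile(r"^====[ \t]*$", re.MULTILINE)
FIELD_SEPARATOR = re.compile(r"^---[ \t]*$", re.MULTILINE)


def parse_schema_template(text):
    """Split template text into a list of SchemaSection tuples."""
    sections = []
    for raw_section in SECTION_SEPARATOR.split(text):
        if not raw_section.strip():
            continue  # skip the empty chunk before the first ====
        fields = [field.strip() for field in FIELD_SEPARATOR.split(raw_section)]
        if len(fields) != len(SchemaSection._fields):
            raise ValueError(
                "expected %d fields, found %d"
                % (len(SchemaSection._fields), len(fields)))
        sections.append(SchemaSection(*fields))
    return sections
```

Each returned tuple then exposes the parts by name, for example
section.base_table or section.create.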

The thinking is that if the files already exist in HDFS then we can just load
the file directly rather than issue an INSERT ... SELECT * statement. The
generate_benchmark_statements.py script has been updated to use this new
template as well as query HDFS for each table to determine how it should be
created. It then outputs an ideal file called load-benchmark-*-generated.sql.
Since this file is generated dynamically we can remove the old benchmark
statement files.
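
The load-versus-insert decision described here might look something like the
sketch below. It is an assumption-laden illustration, not the script's real
logic: choose_load_statement, hdfs_path_exists, and the %(...)s placeholder
names are hypothetical, and the existence check shells out to
"hadoop fs -test -e", which exits with status 0 when the path exists.

```python
import subprocess


def hdfs_path_exists(path):
    """Return True if the path exists in HDFS (requires the hadoop CLI)."""
    return subprocess.call(["hadoop", "fs", "-test", "-e", path]) == 0


def choose_load_statement(table_name, base_table, hdfs_path, load_template,
                          insert_template, path_exists=hdfs_path_exists):
    """Pick a direct LOAD when the data is already in HDFS, else INSERT ... SELECT.

    The templates use hypothetical %(name)s placeholders; path_exists is
    injectable so the decision can be exercised without a cluster.
    """
    template = load_template if path_exists(hdfs_path) else insert_template
    return template % {
        "table_name": table_name,
        "base_table": base_table,
        "hdfs_path": hdfs_path,
    }
```

Passing a stub such as path_exists=lambda path: False makes it easy to check
which statement would be generated when no data is present.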

4) This has been hooked into load-benchmark-data.sh, and run_query has been
updated to use the new format as well.
2012-07-12 23:12:20 -07:00

48 lines
1.7 KiB
Bash
Executable File

#!/usr/bin/env bash
# Copyright (c) 2012 Cloudera, Inc. All rights reserved.
#
# Script that creates schema and loads data into Hive for running benchmarks.
# By default the script will load the base data for the "core" scenario.
# If 'pairwise' is specified as a parameter the pairwise combinations of workload
# + file format + compression will be loaded.
# If 'exhaustive' is passed as an argument the exhaustive set of combinations will
# be executed.
bin=$(dirname "$0")
bin=$(cd "$bin"; pwd)
. "$bin"/impala-config.sh
set -e
exploration_strategy=core
if [ -n "$1" ]; then
  exploration_strategy=$1
fi
BENCHMARK_SCRIPT_DIR=$IMPALA_HOME/testdata/bin
function execute_hive_query_from_file {
  hive_args="-hiveconf hive.root.logger=WARN,console -v -f"
  "$HIVE_HOME/bin/hive" $hive_args "$1"
}
pushd "$IMPALA_HOME/testdata/bin"
./generate_benchmark_statements.py --exploration_strategy "$exploration_strategy"
popd
if [ "$exploration_strategy" = "exhaustive" ]; then
  execute_hive_query_from_file "$BENCHMARK_SCRIPT_DIR/create-benchmark-exhaustive-generated.sql"
  execute_hive_query_from_file "$BENCHMARK_SCRIPT_DIR/load-benchmark-exhaustive-generated.sql"
elif [ "$exploration_strategy" = "pairwise" ]; then
  execute_hive_query_from_file "$BENCHMARK_SCRIPT_DIR/create-benchmark-pairwise-generated.sql"
  execute_hive_query_from_file "$BENCHMARK_SCRIPT_DIR/load-benchmark-pairwise-generated.sql"
elif [ "$exploration_strategy" = "core" ]; then
  execute_hive_query_from_file "$BENCHMARK_SCRIPT_DIR/create-benchmark-core-generated.sql"
  execute_hive_query_from_file "$BENCHMARK_SCRIPT_DIR/load-benchmark-core-generated.sql"
else
  echo "Invalid exploration strategy: $exploration_strategy"
  exit 1
fi
$IMPALA_HOME/testdata/bin/generate-block-ids.sh