This is the first set of changes required to start getting our functional test
infrastructure moved from JUnit to Python. After investigating a number of
options, I decided to go with a Python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.
As part of this change, we now use our "test vectors" for query test execution.
This will be very nice because it means if you load the "core" dataset you know you
will be able to run the "core" query tests (specified by --exploration_strategy
when running the tests).
You will see that now each combination of table format + query exec options is
treated like an individual test case. This will make it much easier to debug
exactly where something failed.
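As a rough illustration of the "one test case per combination" idea, here is a
minimal sketch. The dimension values and the build_test_cases helper are
hypothetical and are not taken from the actual test framework; a list like this
could plausibly be handed to pytest.mark.parametrize along with its ids.

```python
from itertools import product

# Hypothetical dimension values; the real ones come from the test vector files.
TABLE_FORMATS = ["text/none", "seq/snap", "rc/gzip"]
EXEC_OPTIONS = [{"batch_size": 0}, {"batch_size": 16}]


def build_test_cases(table_formats, exec_options):
    """Return one (test_id, table_format, options) tuple per combination."""
    cases = []
    for table_format, options in product(table_formats, exec_options):
        # Build a readable id so a failure points at one exact combination.
        option_str = ",".join(f"{k}={v}" for k, v in sorted(options.items()))
        test_id = f"{table_format}[{option_str}]"
        cases.append((test_id, table_format, options))
    return cases


if __name__ == "__main__":
    for test_id, _, _ in build_test_cases(TABLE_FORMATS, EXEC_OPTIONS):
        print(test_id)
```

Because every combination gets its own id, a failing run names the table format
and exec options involved instead of reporting one large failed test.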
These new tests can be run using the script at tests/run-tests.sh
This change includes a number of improvements for the test data loading framework:
* Named sections for schema template definitions
* Removal of unneeded sections from schema template definitions (ex. ANALYZE TABLE)
* More granular data loading via table name filters
* Improved robustness in detecting failed data loads
* Table level constraints for specific file formats
* Re-written compute stats script
Add support for generating ANALYZE TABLE ... COMPUTE STATISTICS statements to the data loading
workflow. This allows for capturing simple table stats such as number of rows, number of
partitions, and table size in bytes. These are stored into a new MySQL database with the same
name as the metastore except with a '_Stats' suffix. If using Derby, the results are
stored in a new Derby database.
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:
./run-benchmark --workloads=hive-benchmark,tpch
We look up the workload in the workloads directory, then read the associated
query .test files and start executing them.
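A minimal sketch of the workload lookup described above. The function names and
the recursive search for .test files are assumptions made for illustration;
the real run-benchmark script may organize this differently.

```python
from pathlib import Path


def parse_workloads_arg(value):
    """Split a --workloads value such as 'hive-benchmark,tpch' into names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def find_query_files(workloads_dir, workload_names):
    """Map each workload name to the sorted list of its .test files."""
    result = {}
    for name in workload_names:
        workload_dir = Path(workloads_dir) / name
        if not workload_dir.is_dir():
            raise ValueError(f"Unknown workload: {name}")
        # Assumption: query files may live in subdirectories of the workload.
        result[name] = sorted(workload_dir.rglob("*.test"))
    return result
```

Failing fast on an unknown workload name keeps a typo in --workloads from
silently running zero queries.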
To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/*) to the workloads
directory. You do NOT need to look through all the .test files, I've just moved
them. The one new file is the 'hive-benchmark.test' which contains the hive
benchmark queries.
Also added support for generating schema for different scale factors as well as
executing against these scale factors. For example, let's say we have a dataset
with a scale factor called "SF3". We would first generate the schema using:
./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables with names distinct from those of the other scale factors.
Run the generated .sql file to load the data. Alternatively, the data can be loaded
by running a new python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: load-data.py -w tpch -e core -s SF3
Then run against this:
./run-benchmark --workloads=<workload> --scale_factor=SF3
This changeset also includes a few other minor tweaks to some of the test
scripts.
Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6
Moved this out of the data loading framework because it is kind of a special
case. I will consider how we can update the framework to address mixed format
tables.
This change moves (almost) all the functional data loading to the new data
loading framework. This removes the need for the create.sql, load.sql, and
load-raw-data.sql files. Instead we just have the single schema template file:
testdata/datasets/functional/functional_schema_template.sql
This template can be used to generate the schema for all file formats and
compression variations. It also should help make loading data easier. Now you
can run:
bin/load-impala-data.sh "query-test" "exhaustive"
And get all data needed for running the query tests.
This change also includes the initial changes for new dataset/workload directory
structure. The new structure looks like:
testdata/workload <- Will contain query files and test vectors/dimensions
testdata/datasets <- Will contain the data files and schema templates
Note: This is the first part of the change to this directory structure - it's
not yet complete.
Rewrote TPCH queries to be more performant by removing subqueries, re-ordering joins, etc. Also did this work referring to the
official TPC-H documentation rather than the work done for Hive, which overused subqueries.
Also added a Planner test for TPCH and made a few changes so the TPC-H tests can run against ImpalaD.
This adds most of the Hive TPCH queries into the functional Impala tests. This
code review doesn't actually include the TPCH data. The data set is relatively
large. Instead I updated scripts to copy the data from a data host.
This change has a few parts:
1) Update the benchmark schema generation/test vector generation to be more
generic. This way we can use the same schema creation/data loading steps for
TPCH as we do for benchmark tests.
2) Add in schema template for the TPCH workload along with test vectors and
dimensions which are used for schema generation.
3) Add in a new test file for each TPC-H query. The Hive TPCH work broke down
the queries to generate some "temp" tables, then execute using joins/selects
from these temp tables. Since creating the temp tables does some real work
it is good to execute these via Impala. Each test a) runs all the INSERT
statements to generate the temp tables and b) runs the additional TPCH queries.
4) Updated all the TPCH insert statements and queries to be parameterized on
$TABLE name. This way we can run the tests across all combinations of file
format/compression/etc.
5) Updated data loading
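A minimal sketch of the $TABLE parameterization from item 4, using Python's
string.Template. The assumption that $TABLE expands to a suffix appended to a
base table name, and the suffix values shown, are hypothetical; the actual
expansion in the test framework may differ.

```python
from string import Template


def expand_query(query_template, table_suffix):
    """Substitute the $TABLE placeholder with a format-specific suffix."""
    return Template(query_template).substitute(TABLE=table_suffix)


if __name__ == "__main__":
    # Hypothetical template and suffixes, for illustration only.
    template = "select count(*) from lineitem$TABLE"
    for suffix in ["", "_seq", "_rc_gzip"]:
        print(expand_query(template, suffix))
```

Keeping one query template and expanding it per table variant is what lets the
same test run across all file format and compression combinations.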
Change-Id: I6891acc4c7464eaf1dc7dbbb532ddbeb6c259bab
This change updates the Impala performance schema and test vector generation
techniques. It also migrates the existing benchmark scripts that were Ruby over
to use Python. The change has a few parts:
1) Conversion of test vector generation and benchmark statement generation from
Ruby to Python. A result of this was also to update the benchmark test vector
and dimension files to be written in CSV format (python doesn't have built-in
YAML support)
2) Standardize the naming for benchmark tables (to somewhat match query
tests). In general the form is:
* If file_format=text and compression=none, do not use a table suffix
* Abbreviate sequence file as (seq), rc file as (rc), etc.
* If using BLOCK compression, don't append anything to the table name; if using
'record', append 'record'
3) Created a new way to add new schemas. This is the
benchmark_schema_template.sql file. The generate_benchmark_statements.py script
reads this in and breaks up the sections. The section format is:
====
Data Set Name
---
BASE table name
---
CREATE STATEMENT Template
---
INSERT ... SELECT * format
---
LOAD Base statement
---
LOAD STATEMENT Format
Where BASE Table is a table the other file formats/compression types can be
generated from. This would generally be a local file.
The thinking is that if the files already exist in HDFS then we can just load
the file directly rather than issue an INSERT ... SELECT * statement. The
generate_benchmark_statements.py script has been updated to use this new
template as well as query HDFS for each table to determine how it should be
created. It then outputs an ideal file called load-benchmark-*-generated.sql.
Since this file is generated dynamically we can remove the old benchmark
statement files.
4) This has been hooked into load-benchmark-data.sh and run_query has been
updated to use the new format as well
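The table naming rules from item 2 could be sketched roughly as follows. The
underscore separator, the inclusion of the codec name in the suffix, and the
dimension value spellings are all assumptions for illustration; only the
text/none, abbreviation, and block/record rules come from the description
above.

```python
# Assumed spellings of the file format dimension values (hypothetical).
FORMAT_ABBREVIATIONS = {"sequence_file": "seq", "rc_file": "rc"}


def benchmark_table_name(base, file_format, codec="none", compression_type="block"):
    """Build a table name from the base name and the format/compression settings."""
    # Rule: text with no compression uses the base name with no suffix.
    if file_format == "text" and codec == "none":
        return base
    # Rule: abbreviate the file format (seq, rc, ...).
    parts = [base, FORMAT_ABBREVIATIONS.get(file_format, file_format)]
    if codec != "none":
        parts.append(codec)  # Assumption: the codec name is part of the suffix.
        # Rule: block compression adds nothing; record compression adds 'record'.
        if compression_type == "record":
            parts.append("record")
    return "_".join(parts)
```

For example, under these assumptions a sequence file table with record
compression gets a name ending in "_record", while its block-compressed
counterpart does not.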
At the same time, this patch removes the partitionKeyRegex in favour
of explicitly sending a list of literal expressions for each file path
from the front end.
generate and execute the benchmark queries.
Updated to remove Lzo compression and add coverage of 'DefaultCodec'
Fixed up make_release to more cleanly list queries.
The scripts consist of a few parts
* generate_test_vectors.rb - This is a Ruby script (I will convert it to Python in a later checkin) which reads in a dimension file (in this case benchmark_dimensions.yaml) that describes the different dimensions that we want to explore. Currently the dimensions are: data set, file format, and compression algorithm. This script writes a list of "test vectors" to a file. It explores the input using both a fully exhaustive and a pairwise exploration strategy. The goal is to have a reduced set of test vectors that provides coverage but does not take as long to run as the exhaustive set of vectors. Note that I am checking in the vector outputs so this script only needs to be run if we want to generate a new set of vectors. I am also checking in a vector called benchmark_core.vector which just describes the current benchmark behavior (no compression, text file). This will allow running benchmarks with the coverage that exists today.
* testdata/bin/generate_benchmark_statements.rb - This script reads in the vector output and generates the statements required to create the benchmark schema and load the data into the benchmark tables with the proper compression/file format settings. It outputs "sql" files "create-benchmark-*.sql" and "load-benchmark-*.sql".
* Updated load-benchmark-data.sh to take parameters for which set of data to load (pairwise or exhaustive). If no parameters are passed it just loads the "core" data, which is what it does now.
* Updated run_benchmark.py so that you can specify what type of run you want to do - core, exhaustive, or pairwise. It will read in the corresponding vector file and generate the proper table names for the queries to select the proper data. Also added the functionality to specify a results file to compare against.
Overall notes: By default the current behavior and coverage is maintained when running these scripts without any parameters. Everything needed is checked in so the ruby scripts don't need to be run unless we want to add dimensions at a later time.
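To illustrate the difference between the exhaustive and pairwise strategies,
here is a minimal Python sketch. It is not the algorithm used by
generate_test_vectors.rb; the pairwise part is a simple greedy selection chosen
for brevity, and the dimension values in the usage note are hypothetical.

```python
from itertools import combinations, product


def exhaustive_vectors(dimensions):
    """Return every combination of dimension values as a list of dicts."""
    names = list(dimensions)
    value_lists = [dimensions[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


def pairs_of(vector):
    """Return the set of (dimension, value) pairs-of-pairs a vector covers."""
    return set(combinations(sorted(vector.items()), 2))


def pairwise_vectors(dimensions):
    """Greedily pick vectors until every pair of values is covered once."""
    candidates = exhaustive_vectors(dimensions)
    uncovered = set()
    for vector in candidates:
        uncovered |= pairs_of(vector)
    chosen = []
    while uncovered:
        # Take the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda v: len(pairs_of(v) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best)
    return chosen
```

With dimensions such as {"data_set": [...], "file_format": [...],
"compression": [...]}, the pairwise result still exercises every pairing of
two values but is never larger than the exhaustive set. Note that this sketch
returns an empty list when only one dimension is given, since there are no
pairs to cover.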