Commit Graph

26 Commits

Author SHA1 Message Date
Lenni Kuff
831ee529be Fixed data loading bugs, moved most tables out of load-dependent-tables 2014-01-08 10:48:56 -08:00
Lenni Kuff
e0a7b7cb55 Compute column stats on tables used by Planner tests 2014-01-08 10:48:48 -08:00
Lenni Kuff
328ceed4e7 Add support for generating lzo compressed text files and running tests against lzo 2014-01-08 10:48:38 -08:00
Lenni Kuff
6e1f8d178a Update utility script to compute column and table stats for given table(s) 2014-01-08 10:48:23 -08:00
ishaan
09d6d931f4 Change the way data is loaded 2014-01-08 10:48:09 -08:00
Lenni Kuff
99bb22dcac Add db name filter to compute stats, run compute stats on functional/text tables 2014-01-08 10:48:08 -08:00
Alan Choi
73c8ee3d96 IMPALA-18 Ignore hidden file prefixed with . or _ 2014-01-08 10:48:00 -08:00
Lenni Kuff
d2e4776731 Support passing snapshot file to buildall, add script to run all tests, remove old tests 2014-01-08 10:47:59 -08:00
Lenni Kuff
d5177c3c30 Update run-workload to support specifying table format test vectors from command line 2014-01-08 10:47:20 -08:00
Lenni Kuff
deea7a86b9 Remove Trevni from mixed format table and fix data loading bug 2014-01-08 10:47:10 -08:00
Lenni Kuff
30dbf59ef2 Final changes to enable Python test infrastructure and tests
With this change the Python tests will now be called as part of buildall and
the corresponding Java tests have been disabled. The new tests can also be
invoked calling ./tests/run-tests.sh directly.

This includes a fix from Nong that caused wrong results for limit on non-io
manager formats.
2014-01-08 10:46:57 -08:00
Lenni Kuff
bed633c1ae Extract config/metastore creation from buildall + script for loading warehouse snapshot 2014-01-08 10:46:53 -08:00
Lenni Kuff
ef48f65e76 Add test framework for running Impala query tests via Python
This is the first set of changes required to start getting our functional test
infrastructure moved from JUnit to Python. After investigating a number of
option, I decided to go with a python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.

As part of this change, we now use our "test vectors" for query test execution.
This will be very nice because it means if load the "core" dataset you know you
will be able to run the "core" query tests (specified by --exploration_strategy
when running the tests).

You will see that now each combination of table format + query exec options is
treated like an individual test case. this will make it much easier to debug
exactly where something failed.

These new tests can be run using the script at tests/run-tests.sh
2014-01-08 10:46:50 -08:00
Lenni Kuff
1e25c98fb4 Test data loading framework improvements
This change includes a number of improvements for the test data loading framework:
* Named sections for schema template definitions
* Removal of uneeded sections from schema template definitions (ex. ANALYZE TABLE)
* More granular data loading via table name filters
* Improved robustness in detecting failed data loads
* Table level constraints for specific file formats
* Re-written compute stats script
2014-01-08 10:46:49 -08:00
Michael Ubell
8a5297a526 Add HdfsLzoTextScanner 2014-01-08 10:46:35 -08:00
Michael Ubell
85807f6169 Start a single impalad to avoid data load race 2014-01-08 10:46:18 -08:00
Michael Ubell
37aaf06f79 IMP-390 Get rid of test dependencies on InProcessQE and Runquery 2014-01-08 10:46:18 -08:00
Michael Ubell
0c4f025a5e Fix loading of nulltable data, remove loading functional-planner data 2014-01-08 10:45:58 -08:00
Lenni Kuff
04edc8f534 Update benchmark tests to run against generic workload, data loading with scale factor, +more
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:

./run-benchmark --workloads=hive-benchmark,tpch

We lookup the workload in the workloads directory, then read the associated
query .test files and start executing them.

To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/* to the workloads
directory. You do NOT need to look through all the .test files, I've just moved
them. The one new file is the 'hive-benchmark.test' which contains the hive
benchmark queries.

Also added support for generating schema for different scale factors as well as
executing against these scale factors. For example, let's say we have a dataset
with a scale factor called "SF1". We would first generate the schema using:

./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables with a unique names from the other scale factors.

Run the generated .sql file to load the data. Alternatively, the data can loaded
by running a new python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: load-data.sh -w tpch -e core -s SF3

Then run against this:
./run-benchmark --workloads=<workload> --scale_factor=SF3

This changeset also includes a few other minor tweaks to some of the test
scripts.

Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6
2014-01-08 10:44:22 -08:00
Michael Ubell
02d63d8dc3 Trevni file support 2014-01-08 10:44:19 -08:00
Lenni Kuff
84d91fca4f Fix sequence file data loading for the alltypesmixedformat table
Moved this out of the data loading framework because it is kind of a special
case. I will consider how we can update the framework to address mixed format
tables.
2014-01-08 10:44:18 -08:00
Lenni Kuff
bf27a31f98 Move functional data loading to new framework + initial changes for workload directory structure
This change moves (almost) all the functional data loading to the new data
loading framework. This removes the need for the create.sql, load.sql, and
load-raw-data.sql file. Instead we just have the single schema template file:
testdata/datasets/functional/functional_schema_template.sql

This template can be used to generate the schema for all file formats and
compression variations. It also should help make loading data easier. Now you
can run:

bin/load-impala-data.sh "query-test" "exhaustive"

And get all data needed for running the query tests.

This change also includes the initial changes for new dataset/workload directory
structure. The new structure looks like:

testdata/workload  <- Will contain query files and test vectors/dimensions

testdata/datasets <- WIll contain the data files and schema templates

Note: This is the first part of the change to this directory structure - it's
not yet complete. # Please enter the commit message for your changes. Lines starting
2014-01-08 10:44:18 -08:00
Alan Choi
b17e24d654 A few FE fixes [rewritten by hnr]
review issue: 198

Change-Id: I84a2f38b0bce5a6f33dfb974de60c822945834e5
2014-01-08 10:44:15 -08:00
Lenni Kuff
e293164b37 Added TPCH functional query tests and schema generation
This adds most of the Hive TPCH queries into the functional Impala tests. This
code review doesn't actually include the TPCH data. The data set is relatively
large. Instead I updated scripts to copy the data from a data host.

This change has a few parts:
1) Update the benchmark schema generation/test vector generation to be more
generic. This way we can use the same schema creation/data loading steps for
TPCH as we do for benchmark tests.

2) Add in schema template for the TPCH workload along with test vectors and
dimensions which are used for schema generation.

3) Add in a new test file for each TPC-H query. The Hive TPCH work broke down
the queries to generate some "temp" tables, then execute using joins/selects
from these temp tables. Since creating the temp tables does some real work
it is good to execute these via Impala. Each test a) Runs all the Insert
statements to generate the temp tables b) runs the additional TPCH queries

4) Updated all the TPCH insert statements and queries to be parameterized on
$TABLE name. This way we can run the tests across all combinations of file
format/compression/etc.

5) Updated data loading

Change-Id: I6891acc4c7464eaf1dc7dbbb532ddbeb6c259bab
2014-01-08 10:44:06 -08:00
Henry Robinson
3ff3559805 Add support for per-partition file formats to front end and backend.
At the same time, this patch removes the partitionKeyRegex in favour
of explicitly sending a list of literal expressions for each file path
from the front end.
2012-06-05 12:00:09 -07:00
Michael Ubell
7b14187bf1 Install snappy library
add create-load-data.sh
2012-05-02 07:31:10 -07:00