Commit Graph

31 Commits

Author SHA1 Message Date
Alex Behm
3d764619f7 Run Hive data loading through beeline instead of the Hive shell.
Fixes our log configuration to put the Hive logs in cluster_logs/hive.
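
A minimal sketch of the new invocation, assuming the load script is driven
through subprocess; the JDBC URL, user, and script path are placeholders:

    import subprocess

    # Drive the load script through beeline (HiveServer2 over JDBC) rather
    # than the local Hive shell.
    subprocess.check_call([
        "beeline",
        "-u", "jdbc:hive2://localhost:10000",  # HiveServer2 URL (placeholder)
        "-n", "hive",                          # user name (placeholder)
        "-f", "load-functional-data.sql",      # load script (placeholder)
    ])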

Change-Id: I5d98581e35325f2173e4b3170e36bec42d33f8f3
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1497
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1615
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
2014-02-20 15:43:31 -08:00
ishaan
01ef3ef4c1 load-data.py should exit if a bash command returns a non-zero error code.
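
A minimal sketch of the intended behavior, assuming commands go through
subprocess (the helper name is illustrative):

    import subprocess
    import sys

    def run_or_exit(cmd):
        # Propagate a failing bash command's exit code instead of silently
        # continuing with a partially loaded dataset.
        ret = subprocess.call(cmd, shell=True)
        if ret != 0:
            sys.exit(ret)
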
Change-Id: I2f732a276a42d2697fa55bce0f18ac89e9a6f0a1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1397
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1408
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-30 15:47:13 -08:00
ishaan
4e9913b52f Fix race in data loading by creating text tables first.
While loading parquet, there are a few table creation queries that use the 'like'
keyword; this opens a small race window when all the table formats are created
concurrently. With this change, we create the text tables before attempting to
parallelize the rest of the data loading.
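
A minimal sketch of the ordering, assuming one loader callable per table
format (names are illustrative):

    from multiprocessing.pool import ThreadPool

    def load_all(loaders_by_format):
        # The text tables must exist before the other formats are created,
        # since some of those use CREATE TABLE ... LIKE against them.
        loaders_by_format.pop("text")()
        # With the 'like' targets in place, the rest can run concurrently.
        pool = ThreadPool(processes=4)
        pool.map(lambda load: load(), list(loaders_by_format.values()))
        pool.close()
        pool.join()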

Change-Id: Ib84cf0e5120b3588d3f0503d7119ca055e08e53f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1241
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-10 15:01:59 -08:00
Nong Li
056c7d94d6 Remove compute stats option from bin/load-data.py
This option is not implemented in this script, and leaving it in place
makes it non-obvious that it does nothing.

Change-Id: I1a1eff38460fd181c486cfca2840108a58e21603
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1059
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-10 14:01:35 -08:00
ishaan
0ed1781323 Invalidate metadata before loading parquet data through Impala.
During a full data load, we load all the data (except parquet) via hive, and then load the
parquet data via Impala. The catalog service does not update the metadata of tables
changed outside Impala, so we need to explicitly invalidate the metadata before loading
parquet data.
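
A minimal sketch of the sequencing through impala-shell (the script name is
a placeholder):

    import subprocess

    # Hive wrote the non-parquet tables behind the catalog service's back,
    # so refresh the catalog before Impala writes the parquet data.
    subprocess.check_call(["impala-shell", "-q", "invalidate metadata"])
    subprocess.check_call(["impala-shell", "-f", "load-parquet.sql"])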

Change-Id: Iec39db9ea46e4a11b17589881732629a56444120
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1207
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:39 -08:00
Lenni Kuff
baf79f8185 Call 'invalidate metadata' after loading test data instead of before
Instead of calling 'invalidate metadata' before loading each workload
we should call it once, after loading all test data. This will allow
us to pick up data inserted by Hive. The only reason this worked before
is that we restart Impala before running the tests. This will also
be a bit faster when loading multiple workloads.
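
A minimal sketch of the new ordering (the per-workload loader is a stub):

    import subprocess

    def load_workload_via_hive(workload):
        """Placeholder for a single workload's Hive data load."""

    def load_all(workloads):
        for workload in workloads:
            load_workload_via_hive(workload)
        # One 'invalidate metadata' after all loads picks up Hive's inserts;
        # issuing it before each load (the old order) saw none of them.
        subprocess.check_call(["impala-shell", "-q", "invalidate metadata"])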

Change-Id: I28d42bbf5d7a24b5fde687d67a4b41472ec4b897
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1153
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:37 -08:00
ishaan
287953e87c Better error logging while loading data.
Change-Id: I67cbd9fd1d915ea043a731b7951f29fec25fc446
Reviewed-on: http://gerrit.ent.cloudera.com:8080/982
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:13 -08:00
ishaan
bf5359be8d Cleanup Impala connections after data is loaded.
Change-Id: I152b09808740d5344462bcbaf4df4b71d88504cc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/953
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:02 -08:00
Lenni Kuff
f579ee8b25 Fix logging in load-data to print the query being executed
Change-Id: I4332e8d3a340f11e1bbb1f6c5126b0b9b4a2ad8e
Reviewed-on: http://gerrit.ent.cloudera.com:8080/949
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:53:58 -08:00
ishaan
fcdcf1a9d8 Parallelize data loaded through Impala to speed up data loading.
Currently, we execute all the queries involved in data loading serially. This change
creates a separate .sql file for each file format, compression codec and compression
scheme combination, and executes all the files in parallel. Additionally, we now store all the
.sql files (independent of workload) in $IMPALA_HOME/data_load_files/<dataset_name>. Note
that only data loaded through Impala is parallelized; data loaded through Hive and HBase
remains serial.

On our build machines, the time taken to load all the data from a snapshot was on the order
of 15 minutes.
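
A minimal sketch of the fan-out, assuming one generated .sql file per table
format combination (the glob path and pool size are illustrative):

    import glob
    import subprocess
    from multiprocessing.pool import ThreadPool

    def exec_sql_file(path):
        # Each generated file covers one file format / codec / scheme combo.
        subprocess.check_call(["impala-shell", "-f", path])

    sql_files = glob.glob("data_load_files/functional/*.sql")
    pool = ThreadPool(processes=8)
    pool.map(exec_sql_file, sql_files)
    pool.close()
    pool.join()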

Change-Id: If8a862c43f0e75b506ca05d83eacdc05621cbbf8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/804
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:53:53 -08:00
ishaan
565d15579c Add the ability to use a workload as the unit of execution in the Impala benchmark runner.
At the moment, a query is the default unit of execution and parallelism in the Impala
performance suite. With this change, we now have the ability to treat a workload as the
unit of execution. A workload is defined as a unique combination of the dataset, scale
factor, a subset (or all) of the queries in the dataset, and a table format (file format,
compression codec and compression scheme).

It introduces two new command line options in bin/run-workload.py (see the
usage sketch below):
  * --execution_scope
    The default scope is 'query', and it maintains previous semantics. The
    new scope is 'workload', which toggles the unit of execution to a workload.
  * --shuffle_query_exec_order
    Shuffles the order in which queries are executed (only applicable when the
    execution_scope is 'workload'); defaults to False.
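
For example, a workload-scoped run with shuffled query order might look like
this (the flag value syntax is assumed from the descriptions above):

./bin/run-workload.py --execution_scope=workload --shuffle_query_exec_order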

Change-Id: I790d75f0896210cda8eb999015b0be04246e4c45
Reviewed-on: http://gerrit.ent.cloudera.com:8080/503
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:53:07 -08:00
Lenni Kuff
a1f2f72f49 Add Impala DDL support for creation of AVRO tables + support for CREATE/ALTER SERDEPROPERTIES
This change adds Impala DDL support for creation of AVRO tables.
Additionally, it adds Impala support for CREATE and ALTER SERDEPROPERTIES
which are used when creating Avro backed tables. This syntax is not
exactly the same as the Hive support since it introduces a new
fileformat (AVROFILE) that implies the needed Serialization library,
input format, and output format.
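
A minimal sketch of the new DDL driven through impala-shell; the table,
schema URL, and exact property spelling are assumptions for illustration:

    import subprocess

    # AVROFILE implies the Avro serde plus the input and output formats.
    create = "CREATE TABLE avro_tbl (id INT, s STRING) STORED AS AVROFILE"
    # Point the table at an explicit Avro schema via SERDEPROPERTIES.
    alter = ("ALTER TABLE avro_tbl SET SERDEPROPERTIES "
             "('avro.schema.url'='hdfs:///test-warehouse/avro_tbl.json')")
    for stmt in (create, alter):
        subprocess.check_call(["impala-shell", "-q", stmt])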

Change-Id: I5047e419198a89599e9d014fdedfee1a20437a7d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/464
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:48 -08:00
ishaan
53cd9eadab Treat HBase as a file format for functional tests
Change-Id: Ia01181a1e10eb108419122d347e9d869a69e8922
Reviewed-on: http://gerrit.ent.cloudera.com:8080/102
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:36 -08:00
Alan Choi
ecee109e68 IMPALA-387 Add refresh/invalidate SQL 2014-01-08 10:51:25 -08:00
Alan Choi
b71357fc28 IMPALA-387 Reuse Hdfs and Hive metastore metadata to perform a fast incremental refresh 2014-01-08 10:51:17 -08:00
Lenni Kuff
2f7198292a Add support for auxiliary workloads, tests, and datasets
This change adds support for auxiliary workloads, tests, and datasets. This is useful
to augment the regular test runs with some additional tests that do not belong in the
main Impala repo.
2014-01-08 10:50:32 -08:00
Nong Li
1f6481382e Fix parquet test setup. 2014-01-08 10:49:41 -08:00
Lenni Kuff
cba9cd00dd Fix full data load build break due to constructing incorrect HDFS paths 2014-01-08 10:49:34 -08:00
Lenni Kuff
558d5ce755 Data loading: Exec DDL statements via Impala and don't recreate metadata if it exists 2014-01-08 10:49:28 -08:00
Lenni Kuff
831ee529be Fixed data loading bugs, moved most tables out of load-dependent-tables 2014-01-08 10:48:56 -08:00
Skye Wanderman-Milne
811d5dd00b Create Avro schema directory in test warehouse 2014-01-08 10:48:50 -08:00
Nong Li
0df9476be1 Parquet data loading. 2014-01-08 10:48:48 -08:00
Skye Wanderman-Milne
461a48df2b Refactor testing framework to generate Avro tables. 2014-01-08 10:48:45 -08:00
ishaan
09d6d931f4 Change the way data is loaded 2014-01-08 10:48:09 -08:00
Lenni Kuff
12d18631e3 Test enhancements: dynamic table format data loading, per-workload exploration strategies 2014-01-08 10:47:07 -08:00
Lenni Kuff
1e25c98fb4 Test data loading framework improvements
This change includes a number of improvements for the test data loading framework:
* Named sections for schema template definitions
* Removal of unneeded sections from schema template definitions (ex. ANALYZE TABLE)
* More granular data loading via table name filters
* Improved robustness in detecting failed data loads
* Table level constraints for specific file formats
* Re-written compute stats script
2014-01-08 10:46:49 -08:00
Michael Ubell
37aaf06f79 IMP-390 Get rid of test dependencies on InProcessQE and Runquery 2014-01-08 10:46:18 -08:00
Lenni Kuff
846b5c55be Disabling running of COMPUTE STATISTICS statements by default during data loading 2014-01-08 10:45:10 -08:00
Lenni Kuff
6e07e0b8d8 Added support for generating ANALYZE TABLE ... COMPUTE STATISTICS statements during data loading
Add support for generating ANALYZE TABLE ... COMPUTE STATISTICS statements to the data loading
workflow. This allows for capturing simple table stats such as number of rows, number of
partitions, and table size in bytes. These are stored in a new MySQL database with the same
name as the metastore except for a '_Stats' suffix. If using Derby, the results are
stored in a new Derby database.
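
A minimal sketch of the statement generation (the table list is illustrative):

    # One ANALYZE TABLE ... COMPUTE STATISTICS statement per loaded table;
    # the captured stats land in the '<metastore>_Stats' database.
    def generate_compute_stats(tables):
        return ["ANALYZE TABLE %s COMPUTE STATISTICS;" % t for t in tables]

    for stmt in generate_compute_stats(["tpch.lineitem", "tpch.orders"]):
        print(stmt)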
2014-01-08 10:44:34 -08:00
Lenni Kuff
91f51a1b39 Fixed issue with data loading of workloads that have non-word characters in their names
Fixed a problem where we were not properly looking up the dataset associated
with the given workload if its name contained non-word characters, i.e. anything
outside a-z, 0-9, and _ (such as the '-' in hive-benchmark). Also cut down
on the execution time of the hive-benchmark workload under the "core" vector.
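
A minimal sketch of the pitfall, assuming a regex-based lookup (names are
illustrative):

    import re

    # '\w' misses '-', so a lookup keyed on r'(\w+)' truncates a workload
    # like 'hive-benchmark' at the dash; widen the character class instead.
    def dataset_for(workload, mapping):
        name = re.match(r"([\w\-]+)", workload).group(1)
        return mapping[name]

    print(dataset_for("hive-benchmark", {"hive-benchmark": "hive-benchmark"}))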
2014-01-08 10:44:23 -08:00
Lenni Kuff
04edc8f534 Update benchmark tests to run against generic workload, data loading with scale factor, +more
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:

./run-benchmark --workloads=hive-benchmark,tpch

We lookup the workload in the workloads directory, then read the associated
query .test files and start executing them.

To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/*) to the workloads
directory. You do NOT need to look through all the .test files, I've just moved
them. The one new file is the 'hive-benchmark.test' which contains the hive
benchmark queries.

Also added support for generating schema for different scale factors as well as
executing against these scale factors. For example, let's say we have a dataset
with a scale factor called "SF3". We would first generate the schema using:

./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables with names distinct from those of the other scale factors.

Run the generated .sql file to load the data. Alternatively, the data can be loaded
by running a new Python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: ./bin/load-data.py -w tpch -e core -s SF3

Then run against this:
./run-benchmark --workloads=<workload> --scale_factor=SF3

This changeset also includes a few other minor tweaks to some of the test
scripts.

Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6
2014-01-08 10:44:22 -08:00