Python tests and infra scripts will now use "python" from the virtualenv
via $IMPALA_HOME/bin/impala-python. Some scripts could be simplified now
that python 2.6 and a dependable set of third-party libraries are
available, but that is not done as part of this commit.
Change-Id: If1cf96898d6350e78ea107b9026b12ba63a4162f
Reviewed-on: http://gerrit.cloudera.org:8080/603
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
This patch improves the performance of the planning phase for queries against
HBase tables. It removes an unnecessary second call to compute stats and adds
a new method for estimating the row count of a table.
The new method estimates the number of rows for a set of regions incrementally.
It starts by querying up to five regions to calculate the average row size and
uses this value to estimate the row count from the size of the regions on disk.
Only if the standard deviation from that average is larger than 15% does it
query additional regions to calculate an average with more confidence.
If the data is balanced, it is not necessary to retrieve data from all regions
but only from a subset. In the worst case, all regions are queried.
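The sampling loop amounts to roughly the following. This is a minimal Python sketch
of the idea only, not the actual Java planner code; sample_avg_row_size() and
region.size_on_disk are hypothetical stand-ins:
import statistics

def estimate_row_count(regions, initial_sample=5, max_rel_stddev=0.15):
    # Sample the first few regions to get per-region average row sizes.
    sample_size = min(initial_sample, len(regions))
    row_sizes = [sample_avg_row_size(r) for r in regions[:sample_size]]
    # Query additional regions only while the sampled averages disagree by
    # more than 15% (relative standard deviation).
    while (sample_size < len(regions) and len(row_sizes) > 1 and
           statistics.stdev(row_sizes) > max_rel_stddev * statistics.mean(row_sizes)):
        row_sizes.append(sample_avg_row_size(regions[sample_size]))
        sample_size += 1
    # Extrapolate: total on-disk size divided by the sampled average row size.
    avg_row_size = statistics.mean(row_sizes)
    total_bytes = sum(r.size_on_disk for r in regions)
    return int(total_bytes / avg_row_size)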
Change-Id: Idcb3bea81b11cb08da6d9329ba66c86aca23e170
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5258
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
This is the first iteration of a kerberized development environment.
All the daemons start and use kerberos, with the sole exception of the
hive metastore. This is sufficient to test impala authentication.
When buildall.sh is run using '-kerberize', it will stop before
loading data or attempting to run tests.
Loading data into the cluster is known not to work at this time; the
root causes are that Beeline -> HiveServer2 -> MapReduce throws
errors and Beeline -> HiveServer2 -> HBase has problems. These are
left for later work.
However, the impala daemons will happily authenticate using kerberos
both from clients (like the impala shell) and amongst each other.
This means that if you can get data into the mini-cluster, you could
query it.
Usage:
* Supply a '-kerberize' option to buildall.sh, or
* Supply a '-kerberize' option to create-test-configuration.sh, then
'run-all.sh -format', re-source impala-config.sh, and then start
impala daemons as usual. You must reformat the cluster because
kerberizing it will change the ownership of all files in HDFS.
Notable changes:
* Added clean start/stop script for the llama-minikdc
* Creation of Kerberized HDFS - namenode and datanodes
* Kerberized HBase (and Zookeeper)
* Kerberized Hive (minus the MetaStore)
* Kerberized Impala
* Loading of data very nearly working
Still to go:
* Kerberize the MetaStore
* Get data loading working
* Run all tests
* The unknown unknowns
* Extensive testing
Change-Id: Iee3f56f6cc28303821fc6a3bf3ca7f5933632160
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4019
Reviewed-by: Michael Yoder <myoder@cloudera.com>
Tested-by: jenkins
While loading parquet, a few table creation queries use the 'like' keyword; this
opens a small race window when all the table formats are created concurrently.
With this change, we create the text tables before attempting to parallelize the
rest of the data loading.
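The ordering change amounts to something like the following sketch (Python, with
a hypothetical run_sql() helper standing in for the real data-loading plumbing):
from concurrent.futures import ThreadPoolExecutor

def load_dataset(text_table_ddl, other_format_statements):
    # Create the base text tables serially first, so that later
    # 'CREATE TABLE ... LIKE <text table>' statements never race with
    # the table whose schema they copy.
    for stmt in text_table_ddl:
        run_sql(stmt)
    # The remaining file formats are independent and can load in parallel.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(run_sql, other_format_statements))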
Change-Id: Ib84cf0e5120b3588d3f0503d7119ca055e08e53f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1241
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
During a full data load, we load all the data (except parquet) via hive, and then load the
parquet data via Impala. The catalog service does not update the metadata of tables
changed outside Impala, so we need to explicitly invalidate the metadata before loading
parquet data.
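The extra step looks roughly like this. It is a sketch only: it shells out to
impala-shell, and the file list is a hypothetical stand-in for the generated
parquet load scripts:
import subprocess

def load_parquet(parquet_sql_files):
    # Hive changed the source tables outside Impala, so refresh Impala's
    # view of the catalog before running the parquet load statements.
    subprocess.check_call(["impala-shell", "-q", "invalidate metadata"])
    for sql_file in parquet_sql_files:
        subprocess.check_call(["impala-shell", "-f", sql_file])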
Change-Id: Iec39db9ea46e4a11b17589881732629a56444120
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1207
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
Instead of calling 'invalidate metadata' before loading each workload,
we should call it once, after loading all test data. This allows
us to pick up data inserted by Hive. The only reason this worked before
is that we restart Impala before running the tests. This will also
be a bit faster when loading multiple workloads.
Change-Id: I28d42bbf5d7a24b5fde687d67a4b41472ec4b897
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1153
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
Currently, we execute all the queries involved in data loading serially. This change
creates a separate .sql file for each file format, compression codec and compression
scheme combination, and executes all the files in parallel. Additionally, we now store all the
.sql files (independent of workload) in $IMPALA_HOME/data_load_files/<dataset_name>. Note
that only data loaded through Impala is parallelized; data loaded through hive and hbase
remains serial.
On our build machines, the time taken to load all the data from snapshot was on the order
of 15 minutes.
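The parallel portion of the load is conceptually similar to this sketch, where
exec_sql_file() is a hypothetical stand-in for however each generated .sql file
is executed against Impala:
import glob
import os
from concurrent.futures import ThreadPoolExecutor

def load_impala_data(dataset_name, max_workers=4):
    sql_dir = os.path.join(os.environ["IMPALA_HOME"], "data_load_files", dataset_name)
    # One .sql file per (file format, codec, compression scheme) combination.
    sql_files = glob.glob(os.path.join(sql_dir, "*.sql"))
    # The files are independent of each other, so execute them concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(exec_sql_file, sql_files))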
Change-Id: If8a862c43f0e75b506ca05d83eacdc05621cbbf8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/804
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
At the moment, a query is the default unit of execution and parallelism in the Impala
performance suite. With this change, we now have the ability to treat a workload as the
unit of execution. A workload is defined as a unique combination of the dataset, scale
factor, a subset (or all) of the queries in the dataset, and a table format (file format,
compression codec and compression scheme).
It introduces two new command line options in bin/run-workload.py:
* --execution_scope
The default scope is 'query', which maintains the previous semantics. The
new scope is 'workload', which makes the workload the unit of execution.
* --shuffle_query_exec_order
Shuffles the order in which queries are executed (only applicable when the
execution_scope is 'workload'); defaults to False.
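A sketch of how the two options might be declared (argparse is used here for
brevity; the actual script may wire them up differently):
import argparse

parser = argparse.ArgumentParser(description="Run Impala performance workloads")
parser.add_argument("--execution_scope", choices=["query", "workload"],
                    default="query",
                    help="Unit of execution; 'query' keeps the previous behaviour.")
parser.add_argument("--shuffle_query_exec_order", action="store_true",
                    default=False,
                    help="Shuffle query order; only applies when "
                         "--execution_scope=workload.")
args = parser.parse_args()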
Change-Id: I790d75f0896210cda8eb999015b0be04246e4c45
Reviewed-on: http://gerrit.ent.cloudera.com:8080/503
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
This change adds Impala DDL support for the creation of Avro tables.
Additionally, it adds Impala support for CREATE and ALTER SERDEPROPERTIES,
which are used when creating Avro-backed tables. This syntax is not
exactly the same as the Hive support, since it introduces a new
file format (AVROFILE) that implies the needed serialization library,
input format, and output format.
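For illustration, a table DDL using the new keyword might look like the
following. The exact grammar, column list, and schema property are assumptions,
and execute_ddl() is a hypothetical helper:
ddl = """
CREATE TABLE avro_test (id INT, name STRING)
STORED AS AVROFILE
WITH SERDEPROPERTIES ('avro.schema.url'='hdfs:///schemas/avro_test.avsc')
"""
execute_ddl(ddl)  # illustration only; syntax at this revision may differ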
Change-Id: I5047e419198a89599e9d014fdedfee1a20437a7d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/464
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
This change adds support for auxiliary workloads, tests, and datasets. This is useful
for augmenting the regular test runs with additional tests that do not belong in the
main Impala repo.
This change includes a number of improvements for the test data loading framework:
* Named sections for schema template definitions
* Removal of unneeded sections from schema template definitions (e.g. ANALYZE TABLE)
* More granular data loading via table name filters
* Improved robustness in detecting failed data loads
* Table level constraints for specific file formats
* Re-written compute stats script
It also adds support for generating ANALYZE TABLE ... COMPUTE STATISTICS statements in the
data loading workflow. This allows capturing simple table stats such as the number of rows,
number of partitions, and table size in bytes. These stats are stored in a new MySQL database
with the same name as the metastore database plus a '_Stats' suffix; if Derby is used, the
results are stored in a new Derby database instead.
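Statement generation is essentially the following sketch; capturing and storing
the resulting stats in the '_Stats' database is omitted:
def generate_compute_stats_statements(database, tables):
    # One ANALYZE TABLE statement per table; running these records the row
    # count, partition count, and total size for each table.
    return ["ANALYZE TABLE {0}.{1} COMPUTE STATISTICS".format(database, table)
            for table in tables]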
Fixed a problem where we were not properly looking up the dataset associated
with the given workload if its name contained non-word characters (anything other than
a-z, 0-9, and _). Also cut down on the execution time of the hive-benchmark workload
under the "core" vector.
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:
./run-benchmark --workloads=hive-benchmark,tpch
We look up the workload in the workloads directory, then read the associated
query .test files and start executing them.
To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/*) to the workloads
directory. You do NOT need to look through all the .test files; I've just moved
them. The one new file is 'hive-benchmark.test', which contains the hive
benchmark queries.
Also added support for generating schema for different scale factors as well as
executing against these scale factors. For example, let's say we have a dataset
with a scale factor called "SF3". We would first generate the schema using:
./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables whose names are unique from the other scale factors.
Run the generated .sql file to load the data. Alternatively, the data can be loaded
by running a new python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: load-data.py -w tpch -e core -s SF3
Then run against this:
./run-benchmark --workloads=<workload> --scale_factor=SF3
This changeset also includes a few other minor tweaks to some of the test
scripts.
Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6