Commit Graph

11 Commits

Author SHA1 Message Date
Bikramjeet Vig
36b4ea6f65 IMPALA-1683: Allow REFRESH on a single partition
Previously, the only way to refresh metadata for a partition was to refresh
the whole table. This is a relatively time-consuming process, especially when
a table has many partitions and only one needs to be refreshed.
This patch allows the client to REFRESH on a single partition by using the
following syntax:
REFRESH [database_name.]table_name PARTITION (partition_spec)
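
For example, a single partition of a hypothetical partitioned table could be
refreshed with (table and partition names are illustrative only):

REFRESH tpcds.store_sales PARTITION (ss_sold_date_sk=2451180)

instead of reloading metadata for every partition in the table.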

Testing:
Added parsing and authorization tests to ParserTest.java and
AuthorizationTest.java, respectively. A new test file,
"test_refresh_partition.py", was added to test the functionality.

Performance:
For a table with 10,000 partitions and 1 file per partition:

                       execResetMetadata()    Total Execution Time
Refresh Table                3795 ms                4630 ms
Refresh Partition              42 ms                 680 ms

The metadata refresh itself is about 90x faster, but a fixed overhead of
roughly 640 ms in this case limits the end-to-end improvement to about 7x.
As the table size and number of partitions grow, the improvement becomes
more significant.

Change-Id: Ia9aa25d190ada367fbebaca47ae8b2cafbea16fb
Reviewed-on: http://gerrit.cloudera.org:8080/3813
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-07-29 23:57:50 +00:00
Dimitris Tsirogiannis
6fbd35fa87 Enable TPC-H workload for Kudu tables
With this commit we enable loading TPC-H data into Kudu tables and running
the 22 TPC-H queries against Kudu. Since Kudu doesn't support the decimal
data type, we modified the queries to use the round() function and updated
the test results.
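
As an illustration only (not the exact change), a decimal aggregate in a
TPC-H query might be rewritten along these lines:

-- original, relies on decimal arithmetic
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price
-- Kudu variant, rounding the double result to match expected output
round(sum(l_extendedprice * (1 - l_discount)), 2) as sum_disc_price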

Change-Id: I3a5de71fefa92a78970226d8f49ef445d28f9289
Reviewed-on: http://gerrit.cloudera.org:8080/3789
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-07-28 04:35:11 +00:00
Matthew Jacobs
25428fdb21 Add support for streaming decompression of gzip text
Compressed text formats currently require that entire compressed files be
read into memory and decompressed in a single call to the decompression
codec. This changes the HdfsTextScanner to drive gzip in a streaming mode,
i.e. to produce partial output as input is consumed.

Change-Id: Id5c0805e18cf6b606bcf27a5df4b5f58895809fd
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5233
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 05c3cc55e7a601d97adc4eebe03f878c68a33e56)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5385
2014-11-23 01:55:55 -08:00
Skye Wanderman-Milne
a7e15b1417 Update Parquet scanner to only scan a file if assigned the first split.
Also re-enable Parquet tests.
2014-01-08 10:49:25 -08:00
Nong Li
329763e5ab Disable parquet tests. 2014-01-08 10:49:20 -08:00
Nong Li
0df9476be1 Parquet data loading. 2014-01-08 10:48:48 -08:00
Skye Wanderman-Milne
461a48df2b Refactor testing framework to generate Avro tables. 2014-01-08 10:48:45 -08:00
Nong Li
02c329b97a Update RC files to use io mgr and remove scanner support for non-io mgr. 2014-01-08 10:47:11 -08:00
Nong Li
4fd7bd9606 Updated tpch core workload to include seq/snappy and seq/gzip.
Change-Id: Ifb01ee95542fced2ae8cfa4928ffbc7e357df3a8
2014-01-08 10:44:34 -08:00
Lenni Kuff
04edc8f534 Update benchmark tests to run against generic workload, data loading with scale factor, +more
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:

./run-benchmark --workloads=hive-benchmark,tpch

We look up the workload in the workloads directory, then read the associated
query .test files and start executing them.

To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/*) to the workloads
directory. You do NOT need to look through all the .test files; I've just
moved them. The one new file is 'hive-benchmark.test', which contains the
Hive benchmark queries.

Also added support for generating schemas for different scale factors as well
as executing against them. For example, let's say we have a dataset with a
scale factor called "SF3". We would first generate the schema using:

./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables whose names are distinct from those of the other
scale factors.

Run the generated .sql file to load the data. Alternatively, the data can be
loaded by running a new python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: ./bin/load-data.py -w tpch -e core -s SF3

Then run the benchmark against this scale factor:
./run-benchmark --workloads=<workload> --scale_factor=SF3

This changeset also includes a few other minor tweaks to some of the test
scripts.

Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6
2014-01-08 10:44:22 -08:00
Lenni Kuff
bf27a31f98 Move functional data loading to new framework + initial changes for workload directory structure
This change moves (almost) all the functional data loading to the new data
loading framework. This removes the need for the create.sql, load.sql, and
load-raw-data.sql files. Instead we just have a single schema template file:
testdata/datasets/functional/functional_schema_template.sql

This template can be used to generate the schema for all file formats and
compression variations. It should also make loading data easier. Now you can
run:

bin/load-impala-data.sh "query-test" "exhaustive"

And get all data needed for running the query tests.

This change also includes the initial changes for the new dataset/workload
directory structure. The new structure looks like:

testdata/workload  <- Will contain query files and test vectors/dimensions

testdata/datasets <- Will contain the data files and schema templates

Note: This is the first part of the change to this directory structure - it's
not yet complete.
2014-01-08 10:44:18 -08:00