impala

mirror of https://github.com/apache/impala.git synced 2026-02-02 06:00:36 -05:00

Author	SHA1	Message	Date
Riza Suminto	7ca20b3c94	Revert "IMPALA-11123: Optimize count(star) for ORC scans" This reverts commit `f932d78ad0`. The commit is reverted because it cause significant regression for non-optimized counts star query in parquet format. There are several conflicts that need to be resolved manually: - Removed assertion against 'NumFileMetadataRead' counter that is lost with the revert. - Adjust the assertion in test_plain_count_star_optimization, test_in_predicate_push_down, and test_partitioned_insert of test_iceberg.py due to missing improvement in parquet optimized count star code path. - Keep the "override" specifier in hdfs-parquet-scanner.h to pass clang-tidy - Keep python3 style of RuntimeError instantiation in test_file_parser.py to pass check-python-syntax.sh Change-Id: Iefd8fd0838638f9db146f7b706e541fe2aaf01c1 Reviewed-on: http://gerrit.cloudera.org:8080/19843 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>	2023-05-06 22:55:05 +00:00
Riza Suminto	f932d78ad0	IMPALA-11123: Optimize count(star) for ORC scans This patch provides count(star) optimization for ORC scans, similar to the work done in IMPALA-5036 for Parquet scans. We use the stripes num rows statistics when computing the count star instead of materializing empty rows. The aggregate function changed from a count to a special sum function initialized to 0. This count(star) optimization is disabled for the full ACID table because the scanner might need to read and validate the 'currentTransaction' column in table's special schema. This patch drops 'parquet' from names related to the count star optimization. It also improves the count(star) operation in general by serving the result just from the file's footer stats for both Parquet and ORC. We unify the optimized count star and zero slot scan functions into HdfsColumnarScanner. The following table shows a performance comparison before and after the patch. primitive_count_star query target tpch10_parquet.lineitem table (10GB scale TPC-H). Meanwhile, count_star_parq and count_star_orc query is a modified primitive_count_star query that targets tpch_parquet.lineitem and tpch_orc_def.lineitem table accordingly. +-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+ \| Workload \| Query \| File Format \| Avg(s) \| Base Avg(s) \| Delta(Avg) \| StdDev(%) \| Base StdDev(%) \| Iters \| Median Diff(%) \| MW Zval \| Tval \| +-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+ \| tpch_parquet \| count_star_parq \| parquet / none / none \| 0.06 \| 0.07 \| -10.45% \| 2.87% \| * 25.51% * \| 9 \| -1.47% \| -1.26 \| -1.22 \| \| tpch_orc_def \| count_star_orc \| orc / def / none \| 0.06 \| 0.08 \| -22.37% \| 6.22% \| * 30.95% * \| 9 \| -1.85% \| -1.16 \| -2.14 \| \| TARGETED-PERF(10) \| primitive_count_star \| parquet / none / none \| 0.06 \| 0.08 \| I -30.40% \| 2.68% \| * 29.63% * \| 9 \| I -7.20% \| -2.42 \| -3.07 \| +-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+ Testing: - Add PlannerTest.testOrcStatsAgg - Add TestAggregationQueries::test_orc_count_star_optimization - Exercise count(star) in TestOrc::test_misaligned_orc_stripes - Pass core tests Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091 Reviewed-on: http://gerrit.cloudera.org:8080/18327 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-04-05 13:27:10 +00:00
Tim Armstrong	a154b2d6e7	IMPALA-5861: fix RowsRead for zero-slot table scan Testing: Added regression test based on JIRA and a targeted test for all HDFS file formats. Change-Id: I7a927c6a4f0b8055608cb7a5e2b550a1610cef89 Reviewed-on: http://gerrit.cloudera.org:8080/12332 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-02-12 20:37:45 +00:00
poojanilangekar	c6f9b61ec2	IMPALA-6625: Skip computing parquet conjuncts for non-Parquet scans This change ensures that the planner computes parquet conjuncts only when for scans containing parquet files. Additionally, it also handles PARQUET_DICTIONARY_FILTERING and PARQUET_READ_STATISTICS query options in the planner. Testing was carried out independently on parquet and non-parquet scans: 1. Parquet scans were tested via the existing parquet-filtering planner test. Additionally, a new test [parquet-filtering-disabled] was added to ensure that the explain plan generated skips parquet predicates based on the query options. 2. Non-parquet scans were tested manually to ensure that the functions to compute parquet conjucts were not invoked. Additional test cases were added to the parquet-filtering planner test to scan non parquet tables and ensure that the plans do not contain conjuncts based on parquet statistics. 3. A parquet partition was added to the alltypesmixedformat table in the functional database. Planner tests were added to ensure that Parquet conjuncts are constructed only when the Parquet partition is included in the query. Change-Id: I9d6c26d42db090c8a15c602f6419ad6399c329e7 Reviewed-on: http://gerrit.cloudera.org:8080/10704 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-07-06 02:06:50 +00:00
ishaan	09d6d931f4	Change the way data is loaded	2014-01-08 10:48:09 -08:00
Lenni Kuff	ef48f65e76	Add test framework for running Impala query tests via Python This is the first set of changes required to start getting our functional test infrastructure moved from JUnit to Python. After investigating a number of option, I decided to go with a python test executor named py.test (http://pytest.org/). It is very flexible, open source (MIT licensed), and will enable us to do some cool things like parallel test execution. As part of this change, we now use our "test vectors" for query test execution. This will be very nice because it means if load the "core" dataset you know you will be able to run the "core" query tests (specified by --exploration_strategy when running the tests). You will see that now each combination of table format + query exec options is treated like an individual test case. this will make it much easier to debug exactly where something failed. These new tests can be run using the script at tests/run-tests.sh	2014-01-08 10:46:50 -08:00
Lenni Kuff	04edc8f534	Update benchmark tests to run against generic workload, data loading with scale factor, +more This change updates the run-benchmark script to enable it to target one or more workloads. Now benchmarks can be run like: ./run-benchmark --workloads=hive-benchmark,tpch We lookup the workload in the workloads directory, then read the associated query .test files and start executing them. To ensure the queries are not duplicated between benchmark and query tests, I moved all existing queries (under fe/src/test/resources/* to the workloads directory. You do NOT need to look through all the .test files, I've just moved them. The one new file is the 'hive-benchmark.test' which contains the hive benchmark queries. Also added support for generating schema for different scale factors as well as executing against these scale factors. For example, let's say we have a dataset with a scale factor called "SF1". We would first generate the schema using: ./generate_schema_statements --workload=<workload> --scale_factor="SF3" This will create tables with a unique names from the other scale factors. Run the generated .sql file to load the data. Alternatively, the data can loaded by running a new python script: ./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor] For example: load-data.sh -w tpch -e core -s SF3 Then run against this: ./run-benchmark --workloads=<workload> --scale_factor=SF3 This changeset also includes a few other minor tweaks to some of the test scripts. Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6	2014-01-08 10:44:22 -08:00

7 Commits