impala/testdata/workloads/functional-query/queries/QueryTest/partition-key-scans.test
Riza Suminto f932d78ad0 IMPALA-11123: Optimize count(star) for ORC scans
This patch provides a count(star) optimization for ORC scans, similar to
the work done in IMPALA-5036 for Parquet scans. We use each stripe's
row-count statistics when computing count(star) instead of materializing
empty rows. The aggregate function changes from a count to a special sum
function initialized to 0.
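
The idea in the paragraph above can be illustrated with a minimal Python
sketch. It is not Impala's implementation: StripeStats, file_row_count and
count_star are hypothetical names chosen for illustration, and the sketch
only assumes that each file exposes a row count per stripe in its footer.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class StripeStats:
    """Hypothetical stand-in for the per-stripe metadata in a file footer."""
    num_rows: int


def file_row_count(stripes: Iterable[StripeStats]) -> int:
    """Partial count for one file, taken from footer statistics only."""
    return sum(s.num_rows for s in stripes)


def count_star(files: Iterable[List[StripeStats]]) -> int:
    """Sum the per-file partial counts.

    The accumulator starts at 0, so an input with no files yields 0,
    which is what count(star) returns for an empty table.
    """
    total = 0
    for stripes in files:
        total += file_row_count(stripes)
    return total
```

No row is materialized in this sketch: the scan side contributes one number
per file and the aggregation side only adds those numbers up.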

This count(star) optimization is disabled for full ACID tables
because the scanner might need to read and validate the
'currentTransaction' column in the table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation in general by
serving the result just from the file's footer stats for both Parquet
and ORC. We unify the optimized count star and zero slot scan functions
into HdfsColumnarScanner.

The following table shows a performance comparison before and after the
patch. The primitive_count_star query targets the tpch10_parquet.lineitem
table (10GB scale TPC-H). The count_star_parq and count_star_orc queries
are modified versions of primitive_count_star that target the
tpch_parquet.lineitem and tpch_orc_def.lineitem tables, respectively.

+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload          | Query                | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| tpch_parquet      | count_star_parq      | parquet / none / none | 0.06   | 0.07        |   -10.45%  |   2.87%    | * 25.51% *     | 9     |   -1.47%       | -1.26   | -1.22 |
| tpch_orc_def      | count_star_orc       | orc / def / none      | 0.06   | 0.08        |   -22.37%  |   6.22%    | * 30.95% *     | 9     |   -1.85%       | -1.16   | -2.14 |
| TARGETED-PERF(10) | primitive_count_star | parquet / none / none | 0.06   | 0.08        | I -30.40%  |   2.68%    | * 29.63% *     | 9     | I -7.20%       | -2.42   | -3.07 |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Reviewed-on: http://gerrit.cloudera.org:8080/18327
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-04-05 13:27:10 +00:00


====
---- QUERY
# Basic partition key scan.
select distinct year
from alltypes
---- RESULTS
2009
2010
---- TYPES
INT
---- RUNTIME_PROFILE
# Confirm that only one row per file is read.
aggregation(SUM, RowsRead): 24
---- RUNTIME_PROFILE: table_format=parquet,orc
# Confirm that no rows are read and file metadata is read once per file.
aggregation(SUM, RowsRead): 0
aggregation(SUM, NumFileMetadataRead): 24
====
---- QUERY
# Test with more complex multiple distinct aggregation.
select count(distinct year), count(distinct month)
from alltypes
---- RESULTS
2,12
---- TYPES
BIGINT,BIGINT
---- RUNTIME_PROFILE
# Confirm that only one row per file is read.
aggregation(SUM, RowsRead): 24
---- RUNTIME_PROFILE: table_format=parquet,orc
# Confirm that no rows are read and file metadata is read once per file.
aggregation(SUM, RowsRead): 0
aggregation(SUM, NumFileMetadataRead): 24
====
---- QUERY
# Distinct aggregation with multiple columns.
select distinct year, month
from alltypes
---- RESULTS
2009,1
2009,2
2009,3
2009,4
2009,5
2009,6
2009,7
2009,8
2009,9
2009,10
2009,11
2009,12
2010,1
2010,2
2010,3
2010,4
2010,5
2010,6
2010,7
2010,8
2010,9
2010,10
2010,11
2010,12
---- TYPES
INT,INT
---- RUNTIME_PROFILE
# Confirm that only one row per file is read.
aggregation(SUM, RowsRead): 24
---- RUNTIME_PROFILE: table_format=parquet,orc
# Confirm that no rows are read and file metadata is read once per file.
aggregation(SUM, RowsRead): 0
aggregation(SUM, NumFileMetadataRead): 24
====
---- QUERY
# Partition key scan combined with analytic function.
select year, row_number() over (order by year)
from alltypes group by year;
---- RESULTS
2009,1
2010,2
---- TYPES
INT,BIGINT
---- RUNTIME_PROFILE
# Confirm that only one row per file is read.
aggregation(SUM, RowsRead): 24
---- RUNTIME_PROFILE: table_format=parquet,orc
# Confirm that no rows are read and file metadata is read once per file.
aggregation(SUM, RowsRead): 0
aggregation(SUM, NumFileMetadataRead): 24
====
---- QUERY
# Partition key scan combined with sort.
select distinct year, month
from alltypes
order by year, month
---- RESULTS
2009,1
2009,2
2009,3
2009,4
2009,5
2009,6
2009,7
2009,8
2009,9
2009,10
2009,11
2009,12
2010,1
2010,2
2010,3
2010,4
2010,5
2010,6
2010,7
2010,8
2010,9
2010,10
2010,11
2010,12
---- TYPES
INT,INT
---- RUNTIME_PROFILE
# Confirm that only one row per file is read.
aggregation(SUM, RowsRead): 24
---- RUNTIME_PROFILE: table_format=parquet,orc
# Confirm that no rows are read and file metadata is read once per file.
aggregation(SUM, RowsRead): 0
aggregation(SUM, NumFileMetadataRead): 24
====
---- QUERY
# Partition key scan combined with predicate on partition columns.
select distinct year, month
from alltypes
where year - 2000 = month;
---- RESULTS
2009,9
2010,10
---- TYPES
INT,INT
---- RUNTIME_PROFILE
# Confirm that only one row per file is read.
aggregation(SUM, RowsRead): 2
---- RUNTIME_PROFILE: table_format=parquet,orc
# Confirm that no rows are read and file metadata is read once per file.
aggregation(SUM, RowsRead): 0
aggregation(SUM, NumFileMetadataRead): 2
====
---- QUERY
# Partition key scan combined with having predicate.
select year, min(month)
from alltypes
group by year
having min(month) = 1
---- RESULTS
2009,1
2010,1
---- TYPES
INT,INT
---- RUNTIME_PROFILE
# Confirm that only one row per file is read.
aggregation(SUM, RowsRead): 24
---- RUNTIME_PROFILE: table_format=parquet,orc
# Confirm that no rows are read and file metadata is read once per file.
aggregation(SUM, RowsRead): 0
aggregation(SUM, NumFileMetadataRead): 24
====
---- QUERY
# Empty table should not return any rows.
select distinct 'test'
from emptytable
---- RESULTS
---- TYPES
STRING
====