This change is a first step towards a more efficient Parquet scanner. The focus is on presenting the new code flow that materializes the table-level slots in a column-wise fashion, without going deep into actually improving scan efficiency. After these changes there are several obvious places that should be optimized to realize efficiency gains.

Summary of changes:
- The table-level tuples are materialized in a column-wise fashion with new ColumnReader::ReadValueBatch() functions.
- This is done by materializing a 'scratch' batch, and transferring scratch tuples that survive filters/conjuncts to the output batch.
- The tuples of nested collections are still materialized in a row-wise fashion using the ColumnReader::ReadValue() function, just as before.

Mini benchmark:
I ran the following queries on a single impalad before and after my change using a synthetic 'huge_lineitem' table. I modified hdfs-scan-node.cc to set the number of rows of any row batch to 0 to focus the measurement on the scan time.

Query options:
set num_scanner_threads=1;
set disable_codegen=true;
set num_nodes=1;

select * from huge_lineitem;
Before: 22.39s  After: 18.50s

select * from huge_lineitem where l_linenumber < 0;
Before: 25.11s  After: 20.56s

select * from huge_lineitem where l_linenumber % 2 = 0;
Before: 26.32s  After: 21.82s

Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Reviewed-on: http://gerrit.cloudera.org:8080/2779
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
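To make the new flow concrete, below is a self-contained toy model of the scratch-batch technique the commit describes: each column reader fills its slot across all scratch tuples in one batch call, then a row-wise pass transfers only the tuples that survive the conjuncts. Only the ReadValueBatch() name comes from this commit; every type, member, and helper here is an illustrative stand-in for Impala's internals, not its real API. The filter reuses the l_linenumber % 2 = 0 predicate from the benchmark above.

// Toy model of the scratch-batch flow; all names besides ReadValueBatch()
// are assumptions for illustration, not Impala's actual classes.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Tuple { int64_t l_linenumber; int64_t l_quantity; };

// Stand-in for one ColumnReader: fills a single slot of every scratch
// tuple in one call, which is the column-wise part of the new flow.
struct ColumnReader {
  std::vector<int64_t> values;  // decoded column data
  size_t pos = 0;               // read position within this column
  int64_t Tuple::* slot;        // which tuple slot this column materializes

  // Analogous in spirit to ColumnReader::ReadValueBatch().
  int ReadValueBatch(Tuple* scratch, int capacity) {
    int n = 0;
    while (n < capacity && pos < values.size()) scratch[n++].*slot = values[pos++];
    return n;
  }
};

int main() {
  ColumnReader linenumber{{1, 2, 3, 4}, 0, &Tuple::l_linenumber};
  ColumnReader quantity{{10, 20, 30, 40}, 0, &Tuple::l_quantity};

  Tuple scratch[3];           // the 'scratch' batch
  std::vector<Tuple> output;  // stands in for the output RowBatch

  for (;;) {
    // 1) Materialize the scratch batch column by column.
    int n = linenumber.ReadValueBatch(scratch, 3);
    quantity.ReadValueBatch(scratch, n);
    if (n == 0) break;
    // 2) Row-wise pass: transfer only scratch tuples that survive the
    //    conjuncts, here the benchmark's l_linenumber % 2 = 0 predicate.
    for (int i = 0; i < n; ++i)
      if (scratch[i].l_linenumber % 2 == 0) output.push_back(scratch[i]);
  }
  for (const Tuple& t : output)
    std::printf("%lld,%lld\n", (long long)t.l_linenumber, (long long)t.l_quantity);
}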
86 lines
1.0 KiB
Plaintext
====
---- QUERY
# Returns all results despite a discrepancy between the number of values
# scanned and the number of values stored in the file metadata.
# Set a single node and scanner thread to make this test deterministic.
set num_nodes=1;
set num_scanner_threads=1;
select id, cnt from bad_column_metadata t, (select count(*) cnt from t.int_array) v
---- TYPES
bigint,bigint
---- RESULTS
1,10
2,10
3,10
4,10
5,10
6,10
7,10
8,10
9,10
10,10
11,10
12,10
13,10
14,10
15,10
16,10
17,10
18,10
19,10
20,10
21,10
22,10
23,10
24,10
25,10
26,10
27,10
28,10
29,10
30,10
---- ERRORS
Column metadata states there are 50 values, but read 100 values from column element. file: hdfs://regex:.$
====
---- QUERY
# Same as above but only selecting a single scalar column.
set num_nodes=1;
set num_scanner_threads=1;
select id from bad_column_metadata
---- TYPES
bigint
---- RESULTS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
---- ERRORS
Column metadata states there are 11 values, but read 10 values from column id. file: hdfs://regex:.$
====
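Both ERRORS sections above exercise the same behavior: the scanner returns every row it actually decoded and merely warns when the decoded value count disagrees with the count declared in the Parquet column metadata. A hedged sketch of that kind of check follows; it is a toy model under assumed names (ColumnStats, ValidateValueCount), not Impala's actual validation code. The warning text mirrors the messages these tests match against.

// Sketch of a metadata consistency check like the one these tests exercise:
// the caller keeps all decoded results, and a mismatch only emits a warning.
// All names here are assumptions for illustration, not Impala's API.
#include <cstdint>
#include <cstdio>
#include <string>

struct ColumnStats {
  std::string name;        // column name, e.g. "id" or "element"
  int64_t metadata_count;  // value count claimed by the file metadata
  int64_t read_count;      // values actually decoded by the scanner
};

// Returns true if the counts agree; otherwise emits a warning shaped like
// the ERRORS sections above and lets the caller keep the results anyway.
bool ValidateValueCount(const ColumnStats& c, const std::string& file) {
  if (c.read_count == c.metadata_count) return true;
  std::fprintf(stderr,
               "Column metadata states there are %lld values, "
               "but read %lld values from column %s. file: %s\n",
               (long long)c.metadata_count, (long long)c.read_count,
               c.name.c_str(), file.c_str());
  return false;
}

int main() {
  // Mirrors the second test case: metadata claims 11 values, 10 were read.
  ColumnStats id{"id", 11, 10};
  ValidateValueCount(id, "hdfs://example/bad_column_metadata.parquet");  // hypothetical path
}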