impala/testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test
Alex Behm bce6b2b422 IMPALA-2736: Basic column-wise slot materialization in Parquet scanner.
This change is a first step towards a more efficient Parquet scanner.
The focus is on presenting the new code flow that materializes
the table-level slots in a column-wise fashion, without going deep
into actually improving scan efficiency.

After these changes there are several obvious places that should
be optimized to realize efficiency gains.

Summary of changes
- the table-level tuples are materialized in a column-wise fashion
  with new ColumnReader::ReadValueBatch() functions
- this is done by materializing a 'scratch' batch, and transferring
  scratch tuples that survive filters/conjuncts to the output batch
- the tuples of nested collections are still materialized in
  a row-wise fashion using the ColumnReader::ReadValue() function,
  just as before
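
The flow summarized above can be sketched roughly as follows. This is a minimal Python illustration, not the actual C++ scanner code: the class `ColumnReader`, its method `read_value_batch()`, and the function `scan()` are hypothetical stand-ins that only mirror the idea of filling a scratch batch one column at a time and then copying the rows that pass all conjuncts into the output.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


class ColumnReader:
    """Hypothetical stand-in for a per-column reader (not Impala's class)."""

    def __init__(self, name: str, values: List[int]) -> None:
        self.name = name
        self._values = values
        self._pos = 0

    def read_value_batch(self, scratch: List[Row], max_values: int) -> int:
        """Write up to max_values of this column into the scratch rows.

        Returns the number of values written.
        """
        n = min(max_values, len(self._values) - self._pos, len(scratch))
        for i in range(n):
            scratch[i][self.name] = self._values[self._pos + i]
        self._pos += n
        return n


def scan(readers: List[ColumnReader],
         conjuncts: List[Callable[[Row], bool]],
         batch_size: int) -> List[Row]:
    """Materialize rows column by column, then filter them into the output."""
    output: List[Row] = []
    while True:
        # Fresh scratch batch; each reader fills in its own column.
        scratch: List[Row] = [{} for _ in range(batch_size)]
        counts = [r.read_value_batch(scratch, batch_size) for r in readers]
        num_rows = min(counts, default=0)
        if num_rows == 0:
            break
        # Transfer only the scratch rows that survive every conjunct.
        for row in scratch[:num_rows]:
            if all(pred(row) for pred in conjuncts):
                output.append(row)
    return output


# Example: two columns of five values each, keeping even line numbers.
example_rows = scan(
    [ColumnReader("id", [1, 2, 3, 4, 5]),
     ColumnReader("l_linenumber", [1, 2, 3, 4, 5])],
    [lambda r: r["l_linenumber"] % 2 == 0],
    batch_size=2,
)
```

The sketch allocates a new scratch batch per iteration for simplicity; a real scanner would presumably reuse scratch memory, which is one of the places where efficiency gains could come from.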

Mini benchmark
I ran the following queries on a single impalad before and after my
change using a synthetic 'huge_lineitem' table.
I modified hdfs-scan-node.cc to set the number of rows of any row
batch to 0 to focus the measurement on the scan time.

Query options:
set num_scanner_threads=1;
set disable_codegen=true;
set num_nodes=1;

select * from huge_lineitem;
Before: 22.39s
After:  18.50s

select * from huge_lineitem where l_linenumber < 0;
Before: 25.11s
After:  20.56s

select * from huge_lineitem where l_linenumber % 2 = 0;
Before: 26.32s
After:  21.82s

Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Reviewed-on: http://gerrit.cloudera.org:8080/2779
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:48 -07:00

====
---- QUERY
# Returns all results despite a discrepancy between the number of values
# scanned and the number of values stored in the file metadata.
# Set a single node and scanner thread to make this test deterministic.
set num_nodes=1;
set num_scanner_threads=1;
select id, cnt from bad_column_metadata t, (select count(*) cnt from t.int_array) v
---- TYPES
bigint,bigint
---- RESULTS
1,10
2,10
3,10
4,10
5,10
6,10
7,10
8,10
9,10
10,10
11,10
12,10
13,10
14,10
15,10
16,10
17,10
18,10
19,10
20,10
21,10
22,10
23,10
24,10
25,10
26,10
27,10
28,10
29,10
30,10
---- ERRORS
Column metadata states there are 50 values, but read 100 values from column element. file: hdfs://regex:.$
====
---- QUERY
# Same as above but only selecting a single scalar column.
set num_nodes=1;
set num_scanner_threads=1;
select id from bad_column_metadata
---- TYPES
bigint
---- RESULTS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
---- ERRORS
Column metadata states there are 11 values, but read 10 values from column id. file: hdfs://regex:.$
====
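
The two test cases above expect the scan to return every row and to report the value-count mismatch as an error rather than fail the query. A minimal Python sketch of such a check is given below. It is an assumption about the general shape of the logic, not the scanner's actual code: the function `check_value_count()` and its `abort_on_error` parameter are hypothetical names, and only the message wording is taken from the ERRORS sections (the file path suffix is omitted).

```python
from typing import List


def check_value_count(column: str,
                      metadata_num_values: int,
                      values_read: int,
                      errors: List[str],
                      abort_on_error: bool = False) -> bool:
    """Compare the value count in the column metadata with the count read.

    Returns True when the counts agree. On a mismatch, either raises
    (abort_on_error=True) or records the message and returns False so
    that the caller can keep the rows it has already read.
    """
    if values_read == metadata_num_values:
        return True
    msg = (f"Column metadata states there are {metadata_num_values} values, "
           f"but read {values_read} values from column {column}.")
    if abort_on_error:
        raise ValueError(msg)
    errors.append(msg)
    return False


# Example mirroring the second test case: the metadata claims 11 values
# for column 'id', but only 10 were read.
example_errors: List[str] = []
example_ok = check_value_count("id", 11, 10, example_errors)
```

With `abort_on_error` left at `False`, the mismatch ends up in `example_errors` and the rows already scanned remain usable, which matches what the RESULTS sections above expect.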