Commit Graph

2 Commits

Author SHA1 Message Date
Alex Behm
bce6b2b422 IMPALA-2736: Basic column-wise slot materialization in Parquet scanner.
This change is a first step towards a more efficient Parquet scanner.
The focus is on presenting the new code flow that materializes
the table-level slots in a column-wise fashion, without going deep
into actually improving scan efficieny.

After these changes there are several obvious places that should
be optimized to realize efficiency gains.

Summary of changes
- the table-level tuples are materialized in a column-wise fashion
  with new ColumnReader::ReadValueBatch() functions
- this is done by materializing a 'scratch' batch, and transferring
  scratch tuples that survive filters/conjuncts to the output batch
- the tuples of nested collections are still materialized in
  a row-wise fashion using the ColumnReader::ReadValue() function,
  just as before

Mini benchmark
I ran the following queries on a single impalad before and after my
change using a synthetic 'huge_lineitem' table.
I modified hdfs-scan-node.cc to set the number of rows of any row
batch to 0 to focus the measurement on the scan time.

Query options:
set num_scanner_threads=1;
set disable_codegen=true;
set num_nodes=1;

select * from huge_lineitem;
Before: 22.39s
Afer:   18.50s

select * from huge_lineitem where l_linenumber < 0;
Before: 25.11s
After:  20.56s

select * from huge_lineitem where l_linenumber % 2 = 0;
Before: 26.32s
After:  21.82s

Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Reviewed-on: http://gerrit.cloudera.org:8080/2779
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:48 -07:00
Skye Wanderman-Milne
dd2eb951d7 IMPALA-2558: DCHECK in parquet scanner after block read error
There was an incorrect DCHECK in the parquet scanner. If abort_on_error
is false, the intended behaviour is to skip to the next row group, but
the DCHECK assumed that execution should have aborted if a parse error
was encountered.

This also:
- Fixes a DCHECK after an empty row group. InitColumns() would try to
  create empty scan ranges for the column readers.
- Uses metadata_range_->file() instead of stream_->filename() in the
  scanner. InitColumns() was using stream_->filename() in error
  messages, which used to work but now stream_ is set to NULL before
  calling InitColumns().

Change-Id: I8e29e4c0c268c119e1583f16bd6cf7cd59591701
Reviewed-on: http://gerrit.cloudera.org:8080/1257
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-10-30 22:35:57 +00:00