impala/testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
Tim Armstrong 153663c22f IMPALA-4123: Columnar decoding in Parquet
The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. Whenever the next def level is not in a repeated run, we still
fall back to decoding and caching a batch of def levels and reading them
value-by-value. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.
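
As a rough illustration of the idea above (not Impala's actual C++ column
reader), the Python sketch below keys the materialisation loop off the shape
of each def level run. The run representation, the `materialize` name and
`max_def_level` parameter are all assumptions made for this example: a tuple
`(def_level, run_length)` stands for a repeated run, and a list of def levels
stands for a batch that has to be read value-by-value.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Sequence, Tuple, Union

# Hypothetical run shapes, invented for this sketch:
#   (def_level, run_length)  -> a repeated run from the RLE decoder
#   [def_level, ...]         -> a cached batch of non-repeated def levels
Run = Union[Tuple[int, int], Sequence[int]]


def materialize(runs: Iterable[Run], values: Iterator[int],
                max_def_level: int = 1) -> List[Optional[int]]:
    """Build the output column, checking the def level once per repeated run."""
    out: List[Optional[int]] = []
    for run in runs:
        if isinstance(run, tuple):
            level, count = run
            if level == max_def_level:
                # Long run of non-NULL values: decode them in one bulk step.
                chunk = list(islice(values, count))
                if len(chunk) != count:
                    raise ValueError("value stream ended inside a run")
                out.extend(chunk)
            else:
                # Long run of NULLs: no values are stored for these slots.
                out.extend([None] * count)
        else:
            # Fallback path: inspect each cached def level individually.
            for level in run:
                out.append(next(values) if level == max_def_level else None)
    return out


runs = [(0, 3), [1, 0, 1], (1, 4)]
values = iter([10, 11, 20, 21, 22, 23])
result = materialize(runs, values)
print(result)
```

Only the list-shaped run pays a per-value def level check; the two tuple-shaped
runs are handled with a single comparison each, which is the case the change
is meant to make cheap.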

These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which are good candidates for SIMD.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H at scale factor 60 showed a ~4% overall speedup on both
uncompressed and snappy-compressed Parquet.

Microbenchmarks on uncompressed Parquet show that scans doing only
dictionary decoding are ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypesagg with a mix of null and non-null values.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Reviewed-on: http://gerrit.cloudera.org:8080/8319
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-17 01:48:05 +00:00

====
---- QUERY
# Test that we materialize the right number of nulls.
select count(*),
count(id),
count(nullcol),
sum(nullcol)
from manynulls
---- RESULTS
11000,11000,5500,28870000
---- TYPES
BIGINT,BIGINT,BIGINT,BIGINT
====
---- QUERY
# Spot check some values.
select id, nullcol
from manynulls
where id >= 4490 and id <= 4510
order by id
---- RESULTS
4490,NULL
4490,NULL
4491,NULL
4492,NULL
4493,NULL
4494,NULL
4495,NULL
4496,NULL
4497,NULL
4498,NULL
4499,NULL
4500,4500
4500,4500
4501,4501
4502,4502
4503,4503
4504,4504
4505,4505
4506,4506
4507,4507
4508,4508
4509,4509
4510,4510
4510,4510
---- TYPES
INT,INT
====
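
The expected results above are consistent with one simple layout for
manynulls, sketched below in Python. This layout is an inference from the
RESULTS sections only, not the actual generator for the table: ids 0..9999,
every id divisible by 10 stored twice, and nullcol alternating between runs
of 500 NULLs and 500 values equal to id. The function name and its parameters
are hypothetical.

```python
from typing import Iterator, Optional, Tuple


def manynulls_rows(num_ids: int = 10_000, run_length: int = 500,
                   dup_every: int = 10) -> Iterator[Tuple[int, Optional[int]]]:
    """Yield (id, nullcol) rows for the assumed layout described above."""
    for id_ in range(num_ids):
        # Even-numbered runs are NULL, odd-numbered runs hold the id itself.
        nullcol = id_ if (id_ // run_length) % 2 == 1 else None
        # Assumption: every dup_every-th id appears twice.
        copies = 2 if id_ % dup_every == 0 else 1
        for _ in range(copies):
            yield id_, nullcol


rows = list(manynulls_rows())
non_null = [v for _, v in rows if v is not None]

# Mirrors: count(*), count(id), count(nullcol), sum(nullcol)
summary = (len(rows), len(rows), len(non_null), sum(non_null))
print(summary)

# Mirrors the spot check: where id >= 4490 and id <= 4510 order by id
spot = [r for r in rows if 4490 <= r[0] <= 4510]
print(spot[:3], spot[-3:])
```

Under these assumptions the aggregates come out as 11000, 11000, 5500 and
28870000, and the spot check crosses the NULL to non-NULL boundary at id 4500
with 4490, 4500 and 4510 each appearing twice, matching the two test cases.
Other layouts could produce the same numbers, so this is a sanity check of the
expectations rather than a description of the real data.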