impala

Vuk Ercegovac db98dc6504 IMPALA-4993: extend dictionary filtering to collections

Currently, top-level scalar columns in Parquet files can
be used at runtime to prune row groups by evaluating certain
conjuncts over the column's dictionary (if available).
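
The pruning idea can be illustrated with a minimal Python sketch. It is not Impala's implementation: the function name, the predicate representation, and the sample values are hypothetical, and it assumes every conjunct rejects NULL (a dictionary does not record NULLs, so conjuncts that accept NULL are not safe to evaluate this way).

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical stand-in for a conjunct: a predicate over a single column value.
Predicate = Callable[[object], bool]


def can_prune_row_group(dictionary: Optional[Iterable[object]],
                        conjuncts: List[Predicate]) -> bool:
    """Return True if no row in the row group can satisfy all conjuncts.

    `dictionary` holds the distinct values of a fully dictionary-encoded
    column chunk, or None when no usable dictionary is available.
    Assumption: every conjunct evaluates to False for NULL.
    """
    if dictionary is None:
        # Without a dictionary nothing can be decided, so keep the row group.
        return False
    values = list(dictionary)
    # Conjuncts are ANDed: one conjunct that fails for every dictionary
    # value is enough to rule out the whole row group.
    return any(not any(pred(v) for v in values) for pred in conjuncts)


if __name__ == "__main__":
    segment_dictionary = ["AUTOMOBILE", "BUILDING"]
    print(can_prune_row_group(segment_dictionary, [lambda v: v == "MACHINERY"]))
    print(can_prune_row_group(segment_dictionary, [lambda v: v == "BUILDING"]))
```

A row group is skipped only when the dictionary proves that no value can match; in every other case it is scanned as usual.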

This change extends such pruning to scalar values that are
stored in collection type columns. Previously, dictionary
pruning worked by finding eligible conjuncts for top-level
slots only, so the slots were implicitly part of the scan
node's tuple descriptor. With this change, we track eligible
conjuncts by slot as well as by the tuple that contains the
slot (either top-level or nested collection). Since collection
conjuncts are already managed by a map that associates tuple
descriptors with a list of their conjuncts, this extension
follows the existing representation.

The frontend builds the mapping of SlotId to conjuncts that
are dictionary filterable. This mapping now includes SlotIds
that reference nested tuples. The backend is adjusted to
use the same representation. In addition, collection
readers are decomposed into scalar filterable columns and
other, non-dictionary filterable readers. When filtering
a row group using a conjunct associated with a (possibly)
nested collection type, an additional tuple buffer is
allocated per tuple descriptor.
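
The bookkeeping described above, conjuncts tracked by containing tuple and then by slot, might be sketched as follows. This is an illustrative Python model, not the frontend or backend code: all names and ids are hypothetical, each conjunct is assumed to reject NULL, and the sketch assumes a row group with no matching nested value produces no output rows.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Predicate = Callable[[object], bool]
# Hypothetical shape: tuple id -> slot id -> dictionary-filterable conjuncts.
FilterMap = Dict[int, Dict[int, List[Predicate]]]


def build_filter_map(entries: Iterable[Tuple[int, int, Predicate]]) -> FilterMap:
    """Group conjuncts by the tuple that contains the slot, then by slot."""
    result: FilterMap = defaultdict(lambda: defaultdict(list))
    for tuple_id, slot_id, pred in entries:
        result[tuple_id][slot_id].append(pred)
    return result


def row_group_is_pruned(filter_map: FilterMap,
                        dictionaries: Dict[Tuple[int, int], List[object]]) -> bool:
    """Return True if some (tuple, slot) dictionary rules out the row group.

    `dictionaries` maps (tuple id, slot id) to the distinct values of that
    column chunk; a missing entry means no usable dictionary.
    """
    for tuple_id, slots in filter_map.items():
        for slot_id, preds in slots.items():
            values = dictionaries.get((tuple_id, slot_id))
            if values is None:
                continue  # cannot decide from this column
            for pred in preds:
                if not any(pred(v) for v in values):
                    return True
    return False


if __name__ == "__main__":
    TOP_LEVEL, NESTED_ORDERS = 0, 1  # hypothetical tuple ids
    fmap = build_filter_map([
        (TOP_LEVEL, 10, lambda v: v == "BUILDING"),
        (NESTED_ORDERS, 20, lambda v: v == "F"),
    ])
    dicts = {(TOP_LEVEL, 10): ["BUILDING"], (NESTED_ORDERS, 20): ["O", "P"]}
    print(row_group_is_pruned(fmap, dicts))
```

Keying by tuple first mirrors the existing map from tuple descriptors to conjuncts, so top-level and nested slots share one representation.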

Testing:
- e2e test extended to illustrate row-groups that are pruned
  by nested collection dictionary filters.

Change-Id: If3a2abcfc3d0f7d18756816659fed77ce12668dd
Reviewed-on: http://gerrit.cloudera.org:8080/8775
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins

2018-01-19 20:37:25 +00:00

customer_multiblock.parquet

README

README

This file was created to test IMPALA-4993. It contains a subset
of tpch_nested_parquet.customer, written out using multiple row
groups. The file was created by following the instructions in
testdata/bin/load_nested.py to create the table tmp_customer, which
was then written out in Parquet format using Hive:

SET parquet.block.size=8192;

CREATE TABLE customer
STORED AS PARQUET
TBLPROPERTIES('parquet.compression'='SNAPPY')
AS SELECT * FROM tmp_customer WHERE c_custkey < 200;
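
One way to check that the resulting file really contains multiple row groups is to read its footer metadata. The sketch below assumes the third-party pyarrow package is installed; the helper names are hypothetical, and the path in the usage comment is a placeholder rather than a path taken from this repository.

```python
from typing import List, Tuple


def row_group_summary(metadata) -> List[Tuple[int, int]]:
    """Return (num_rows, total_byte_size) for each row group.

    `metadata` is expected to behave like pyarrow.parquet.FileMetaData:
    it exposes `num_row_groups` and `row_group(i)`.
    """
    summary = []
    for i in range(metadata.num_row_groups):
        group = metadata.row_group(i)
        summary.append((group.num_rows, group.total_byte_size))
    return summary


def summarize_file(path: str) -> List[Tuple[int, int]]:
    """Read the Parquet footer at `path` and summarize its row groups."""
    import pyarrow.parquet as pq  # imported lazily; third-party dependency

    return row_group_summary(pq.ParquetFile(path).metadata)


# Usage (placeholder path):
#   for rows, size in summarize_file("customer_multiblock.parquet"):
#       print(rows, size)
```

More than one entry in the result indicates that the small parquet.block.size setting split the data into several row groups.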