impala/testdata/workloads
Henry Robinson c14a6f11df IMPALA-3077: Enable runtime filters when PHJ spills
This patch changes when runtime filters are produced in the partitioned
hash-join node to allow filters to be produced even when the PHJ
spills. Filters are now produced during the level0 processing of the
PHJ's build-side input in ProcessBuildBatch().

Since this function is codegen'ed, filter production now is too. We use
constant-propagation via constant argument injection to disable filter
production at no cost when it is not needed (including in level1+
repartitioning). I inspected the IR to confirm that the constant
propagation works as expected.
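
As a rough illustration of the idea (not the actual Impala code), the
sketch below passes a build_filters flag as an ordinary argument. Once a
caller supplies a literal and the call is inlined, or a codegen step
replaces the argument with a constant, the optimizer can fold the branch
away. ToyFilter, ProcessBuildBatchSketch, ProcessLevel0 and
ProcessRepartition are hypothetical names used only for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a runtime filter; not Impala's actual class.
struct ToyFilter {
  std::vector<uint32_t> hashes;
  void Insert(uint32_t h) { hashes.push_back(h); }
};

// build_filters is a plain argument. When it is known to be a constant at
// the call site, the compiler can remove the 'if' (and, for false, the whole
// filter-insertion path) from the loop body.
inline int ProcessBuildBatchSketch(const std::vector<uint32_t>& row_hashes,
                                   bool build_filters, ToyFilter* filter) {
  int rows_processed = 0;
  for (uint32_t h : row_hashes) {
    if (build_filters) {
      assert(filter != nullptr);
      filter->Insert(h);
    }
    ++rows_processed;  // Stand-in for inserting the row into the hash table.
  }
  return rows_processed;
}

// Level-0 processing: the constant 'true' is injected at the call site.
inline int ProcessLevel0(const std::vector<uint32_t>& hashes, ToyFilter* f) {
  return ProcessBuildBatchSketch(hashes, /*build_filters=*/true, f);
}

// Level-1+ repartitioning: the constant 'false' disables filter production.
inline int ProcessRepartition(const std::vector<uint32_t>& hashes) {
  return ProcessBuildBatchSketch(hashes, /*build_filters=*/false, nullptr);
}
```

Whether the branch is really eliminated depends on the optimizer, which
is why inspecting the generated IR is the reliable check.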

This change also allows us to send filters earlier during build-side
processing. A tradeoff is that filters are still built even if the
expected FP rate is too high, although any too-permissive filters are
still not sent to the scan (see 'Performance impact' below).
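
A minimal sketch of the "build but do not send" decision, assuming a
classic Bloom filter with m bits, n inserted keys and k hash functions,
whose expected FP rate is approximately (1 - e^(-k*n/m))^k. Impala's own
filter layout and estimate may differ. ExpectedFpRate, ShouldSendFilter
and the max_fp_rate parameter are hypothetical names for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Approximate false-positive rate of a classic Bloom filter:
// (1 - exp(-k * n / m))^k for m bits, n keys and k hash functions.
inline double ExpectedFpRate(uint64_t num_bits, uint64_t num_keys,
                             int num_hashes) {
  assert(num_bits > 0 && num_hashes > 0);
  const double exponent =
      -static_cast<double>(num_hashes) * num_keys / num_bits;
  return std::pow(1.0 - std::exp(exponent), num_hashes);
}

// The filter has already been built by this point; this only decides
// whether it is selective enough to be worth sending to the scan.
inline bool ShouldSendFilter(uint64_t num_bits, uint64_t num_keys,
                             int num_hashes, double max_fp_rate) {
  return ExpectedFpRate(num_bits, num_keys, num_hashes) <= max_fp_rate;
}
```

With these assumptions, a filter that is far too small for the number of
build rows has an expected FP rate close to 1 and is dropped, while a
sparsely populated filter is sent.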

The restriction that prevented filters from being computed inside a
sub-plan is removed as part of this cleanup (since the FE handles
assigning filters correctly in subplans), and a test is added to confirm
that one of the correct cases for filters in subplans works.

This patch also fixes a bug where re-partitioning beyond level0 would
not use the codegen'ed implementation of ProcessBuildBatch().

A new test is added to test_runtime_row_filters, for Parquet only, which
spills and confirms that filtering still occurs.

Finally, the legacy --enable_phj_probe_side_filtering /
--enable_probe_side_filtering flags have been deprecated, since runtime
filtering can be disabled by setting RUNTIME_FILTER_MODE=OFF. The
implementation that the old flags referred to has been removed.

Performance impact
------------------

We measure the cost of always building runtime filters, even when the
FP rate turns out to be too high, with the following query:

SELECT STRAIGHT_JOIN COUNT(*)
FROM (SELECT id FROM functional.alltypes LIMIT 1) a
JOIN [BROADCAST] (SELECT * FROM p LIMIT 100000000) b
  ON a.id = -b.id AND b.part_col > 0

('p' is a two-column Parquet table with 1B rows).

This builds a 100M row build table (benchmarks run on one node). When
filtering is enabled, the filter is built but selects all rows from the
probe side (so that there's no benefit to having the filter, to
emphasise the cost of building the filter in the first place).

RUNTIME_FILTER_MODE    Avg. time (s) over 5 runs
OFF                    18.95
GLOBAL                 19.55
-------------------------------
Change                 +3%

Change-Id: I59a2d9ee03ccea6b674392584e4c7f272233571e
Reviewed-on: http://gerrit.cloudera.org:8080/2783
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
2016-05-12 14:17:34 -07:00

This directory contains Impala test workloads. Each workload should use the following directory layout:

workloads/
   <data set name>/<data set name>_dimensions.csv  <- The test dimension file
   <data set name>/<data set name>_core.csv  <- A test vector file
   <data set name>/<data set name>_pairwise.csv
   <data set name>/<data set name>_exhaustive.csv
   <data set name>/queries/<query test>.test <- The queries for this workload
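
As an illustration of the layout above, the following sketch lists the paths expected for one workload. ExpectedWorkloadFiles is a hypothetical helper written for this example, not part of the Impala test framework, and it assumes a single query test file per call.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the relative paths the layout above implies for one data set and
// one query test. Purely illustrative; nothing here touches the file system.
inline std::vector<std::string> ExpectedWorkloadFiles(
    const std::string& data_set, const std::string& query_test) {
  assert(!data_set.empty() && !query_test.empty());
  const std::string prefix = "workloads/" + data_set + "/";
  return {
      prefix + data_set + "_dimensions.csv",  // The test dimension file
      prefix + data_set + "_core.csv",        // A test vector file
      prefix + data_set + "_pairwise.csv",
      prefix + data_set + "_exhaustive.csv",
      prefix + "queries/" + query_test + ".test",  // Queries for the workload
  };
}
```

For a data set named "example" with a query test named "joins", the first entry is workloads/example/example_dimensions.csv and the last is workloads/example/queries/joins.test.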