Files
impala/tests/common
Tim Armstrong 2c2670e389 IMPALA-1305: streaming pre-aggregations
Aggregations are implemented as a distributed pre-aggregation, an
exchange, then a final aggregation that produces the results of the
aggregation. In many cases the pre-aggregation significantly reduces the
amount of data to be exchanged. However, in other cases, the
preaggregation does not greatly reduce the amount of data exchanged or
can use a lot of memory and starve other operators that would benefit
more from the additional memory.

In these cases we would be better off "passing through" some input tuples
by transforming them into intermediate tuples without aggregating them.

This patch adds a streaming pre-aggregation mode to
PartitionedAggregationNode that tries to aggregate input rows with a
hash table, but can switch to passing through the input tuples (after
transforming them into the appropriate tuple format). It does this if
it hits a memory limit or if the aggregation is not sufficiently
reducing the node's output (specifically, if the number of aggregated
rows in the hash table is more than half the number of unaggregated rows
consumed by the pre-aggregation). Pre-aggregations never need to spill
because they can pass through rows when under memory pressure.

This initial implementation is quite conservative: it retains the
partitioning of the previous implementation because switching to a
single partition proved to regress performance of some queries while
improving others. It also always keeps hash tables around and updates
them with matching input rows so that reduction statistics are updated
and early decisions to pass through data can be reversed.  Future work
could explore different approaches within the new framework to get
larger performance gains. Currently we see significant performance
benefits for queries with a very low reduction factor, e.g. group by on
a nearly unique column

Includes codegen support for the passthrough streaming.

Adds a query option, disable_streaming_preaggregations, in case a user
wants to revert to the old behaviour.

Adds TPC-H tests to exercise the new passthrough code path and updates
planner tests to include the new [STREAMING] detail added by the planner.

Change-Id: Ia40525340cba89a8c4e70164ae11447e96494664
Reviewed-on: http://gerrit.cloudera.org:8080/1698
Tested-by: Internal Jenkins
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
2016-02-11 19:03:51 +00:00
..
2016-01-20 23:00:25 +00:00