Files
impala/testdata/workloads
Tianyi Wang b660bd652f IMPALA-4794: Grouping distinct agg plan robust to data skew
This patch changes the query plan for grouping distinct aggregations to
be more robust to data skew in the grouping expressions. The existing
plan partitions data between phase-1 and phase-2 by the grouping exprs.
Under this strategy the data skewness on the grouping exprs directly
impacts performance. The new plan partitions data by both the grouping
exprs and distinct agg exprs, then adds one more aggregation and
exchange node. The new plan is more robust to data skew but does more
work than the old plan.

Testing: Modified existing planner tests which already provide
sufficient coverage. The pattern is that the distinct agg exprs are
added to the first exchange node, followed by an additional merge agg
and exchange node.

Change-Id: I7bdada0e328b555900c7b7ff8aabc8eb15ae8fa9
Reviewed-on: http://gerrit.cloudera.org:8080/7643
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-16 23:20:22 +00:00
..

This directory contains Impala test workloads. The directory layout for the workloads should follow:

workloads/
   <data set name>/<data set name>_dimensions.csv  <- The test dimension file
   <data set name>/<data set name>_core.csv  <- A test vector file
   <data set name>/<data set name>_pairwise.csv
   <data set name>/<data set name>_exhaustive.csv
   <data set name>/queries/<query test>.test <- The queries for this workload