Mirror of https://github.com/apache/impala.git (synced 2026-01-07 09:02:19 -05:00)
This patch changes the query plan for grouping distinct aggregations to be more robust to data skew in the grouping expressions.

The existing plan partitions data between phase-1 and phase-2 by the grouping exprs. Under this strategy, skew in the grouping exprs directly impacts performance. The new plan partitions data by both the grouping exprs and the distinct agg exprs, then adds one more aggregation and exchange node. The new plan is more robust to data skew but does more work than the old plan.

Testing: Modified existing planner tests, which already provide sufficient coverage. The pattern is that the distinct agg exprs are added to the first exchange node, followed by an additional merge agg and exchange node.

Change-Id: I7bdada0e328b555900c7b7ff8aabc8eb15ae8fa9
Reviewed-on: http://gerrit.cloudera.org:8080/7643
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
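The effect of the partitioning change described in the commit message can be sketched with a small simulation. This is not Impala code. The node count, the skewed row set, and the names `toy_hash`, `node_loads`, and `count_distinct_two_stage` are all hypothetical, and the toy hash is only a stand-in for a real hash function. The sketch compares how many rows each node receives when partitioning by the grouping key alone versus by the grouping key plus the distinct value, and shows that a follow-up merge by group still yields the correct distinct counts.

```python
from collections import Counter, defaultdict

NUM_NODES = 4  # hypothetical cluster size, chosen for illustration


def toy_hash(*keys):
    """Toy deterministic stand-in for a real hash function (illustration only)."""
    h = 0
    for k in keys:
        h = h * 31 + k
    return h


def make_skewed_rows():
    """Build (group_key, distinct_value) rows; group 0 is heavily skewed."""
    rows = [(0, d) for d in range(9000)]
    for g in range(1, 10):
        rows += [(g, d) for d in range(100)]
    return rows


def node_loads(rows, key_fn):
    """Count how many rows each node receives for a given partitioning key."""
    loads = Counter({n: 0 for n in range(NUM_NODES)})
    for row in rows:
        loads[toy_hash(*key_fn(row)) % NUM_NODES] += 1
    return [loads[n] for n in range(NUM_NODES)]


def count_distinct_two_stage(rows):
    """Partition by (group, distinct value), dedupe per node, then merge by group."""
    per_node = defaultdict(set)
    for g, d in rows:
        # Identical (g, d) pairs always land on the same node, so a local
        # set is enough to remove duplicates.
        per_node[toy_hash(g, d) % NUM_NODES].add((g, d))
    # Extra merge step: the deduplicated pairs are combined per group.
    result = Counter()
    for pairs in per_node.values():
        for g, _ in pairs:
            result[g] += 1
    return dict(result)


rows = make_skewed_rows()
old_loads = node_loads(rows, lambda r: (r[0],))       # grouping exprs only
new_loads = node_loads(rows, lambda r: (r[0], r[1]))  # grouping + distinct exprs
print("old plan loads:", old_loads)
print("new plan loads:", new_loads)
```

With this toy data, partitioning by the grouping key alone sends 9,200 of the 9,900 rows to a single node, while partitioning by both keys sends 2,475 rows to each node. The price is the extra merge step, which matches the commit message's note that the new plan does more work.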
This directory contains Impala test workloads. The directory layout for the workloads should follow:

workloads/
   <data set name>/<data set name>_dimensions.csv  <- The test dimension file
   <data set name>/<data set name>_core.csv        <- A test vector file
   <data set name>/<data set name>_pairwise.csv
   <data set name>/<data set name>_exhaustive.csv
   <data set name>/queries/<query test>.test       <- The queries for this workload
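As a rough illustration of the layout above, the following sketch builds the list of paths a workload would be expected to contain. The helper `expected_workload_files` is hypothetical and not part of the Impala test framework, and the data set name and query test name in the usage line are example values only.

```python
from pathlib import PurePosixPath

# File name suffixes taken from the layout described above.
VECTOR_SUFFIXES = ("dimensions", "core", "pairwise", "exhaustive")


def expected_workload_files(data_set, query_tests):
    """Return the relative paths expected for one workload (hypothetical helper)."""
    root = PurePosixPath("workloads") / data_set
    files = [root / f"{data_set}_{suffix}.csv" for suffix in VECTOR_SUFFIXES]
    files += [root / "queries" / f"{name}.test" for name in query_tests]
    return [str(p) for p in files]


# Example values only: "tpch" as the data set, "example_query" as a query test.
for path in expected_workload_files("tpch", ["example_query"]):
    print(path)
```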