Commit Graph

3 Commits

Author SHA1 Message Date
Zoltan Borok-Nagy
ab975c9517 IMPALA-8969: Grouping aggregator can cause segmentation fault when doing multiple aggregations
Grouping aggregator always tried to serialize the 0th tuple regardless
of the aggregation index. This could lead to a segmentation fault
because the 0th tuple might be null.

Testing:
Added a query that triggers the error to multiple-distinct-aggs.test

Change-Id: I7acdd40c63166cd4986e546a992c0816f94823d5
Reviewed-on: http://gerrit.cloudera.org:8080/14290
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-09-25 16:58:14 +00:00
Thomas Tauber-Marshall
15e8ce4f27 IMPALA-7677: Fix DCHECK failure in GroupingAggregator
After inserting all of its input into its Aggregators,
StreamingAggregationNode performs some cleanup, such as calling
InputDone() on each Aggregator.

Previously, StreamingAggregationNode only checked that all of the
child's batches had been fetched before doing this cleanup, which
causes problems if the final child batch isn't processed fully in a
single GetNext() call. In this case, multiple calls to InputDone()
lead to a DCHECK failure.

The solution is to only perform the cleanup once the final child batch
has been fully processed.

Testing:
- Added an e2e test with a query that hits this condition.

Change-Id: I851007a60472d0e53081c076c863c866c516677c
Reviewed-on: http://gerrit.cloudera.org:8080/11626
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-10-24 01:03:38 +00:00
Thomas Tauber-Marshall
df53ec2385 IMPALA-110: Support for multiple DISTINCT
This patch adds support for having multiple aggregate functions in a
single SELECT block that use DISTINCT over different sets of columns.

Planner design:
- The existing tree-based plan shape with a two-phased
  aggregation is maintained.
- Existing plans are not changed.
- Aggregates are grouped into 'aggregation classes' based on their
  expressions in the distinct portion which may be empty for
  non-distinct aggregates.
- The aggregation framework is generalized to simultaneously process
  multiple aggregation classes within the tree-based plan. This
  process splits the results of different aggregation classes into
  separate rows, so a final aggregation is needed to transpose the
  results into the desired form.
- Main challenge: Each aggregation class consumes and produces
  different tuples, so conceptually a union-type of tuples flows
  through the runtime. The tuple union is represented by a TupleRow
  with one tuple per aggregation class. Only one tuple in such a
  TupleRow is non-NULL.
- Backend exec nodes in the aggregation plan will be aware of this
  tuple-union either explicitly in their implementation or by relying
  on expressions that distinguish the aggregation classes.
- To distinguish the aggregation classes, e.g. in hash exchanges,
  CASE expressions are crafted to hash/group on the appropriate slots.

Deferred FE work:
- Beautify/condense the long CASE exprs
- Push applicable conjuncts into individual aggregators before
  the transposition step
- Added a few testing TODOs to reduce the size of this patch
- Decide whether we want to change existing plans to the new model

Execution design:
- Previous patches separated out aggregation logic from the exec node
  into Aggregators. This is extended to support multiple Aggregators
  per node, with different grouping and aggregating functions.
- There is a fast path for aggregations with only one aggregator,
  which leaves the execution essentially unchanged from before.
- When there are multiple aggregators, the first aggregation node in
  the plan replicates its input to each aggregator. The output of this
  step is rows where only a single tuple is non-null, corresponding to
  the aggregator that produced the row.
- A new expr is introduced, ValidTupleId, which takes one of these
  rows and returns which tuple is non-null.
- For additional aggregation nodes, the input is split apart into
  'mini-batches' according to which aggregator the row corresponds to.

Testing:
- Added analyzer and planner tests
- Added end-to-end queries tests
- Ran hdfs/core tests
- Added support in the query generator and ran in a loop.

Change-Id: I055402eaef6d81e5f70e850d9f8a621e766830a4
Reviewed-on: http://gerrit.cloudera.org:8080/10771
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-09-26 03:54:49 +00:00