jprdonnelly/impala

Mirror of https://github.com/apache/impala.git, synced 2026-02-01 21:00:29 -05:00

Commit Graph

Author

SHA1

Message

Date

Zoltan Borok-Nagy

ab975c9517

IMPALA-8969: Grouping aggregator can cause segmentation fault when doing multiple aggregations

The grouping aggregator always tried to serialize the 0th tuple,
regardless of the aggregation index. This could lead to a segmentation
fault because the 0th tuple might be null.
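
The failure mode can be illustrated with a minimal sketch. This is not Impala's code: a row is modelled as a list with one tuple slot per aggregator, Python's None stands in for a null tuple pointer, and a TypeError stands in for the segmentation fault. The names serialize_tuple, serialize_row_buggy, serialize_row_fixed and agg_idx are hypothetical.

```python
def serialize_tuple(t):
    # Stand-in for serialization: it "dereferences" the tuple,
    # so it fails when t is None (the analogue of a null pointer).
    return ",".join(str(v) for v in t)


def serialize_row_buggy(row, agg_idx):
    # Pre-fix behaviour as described: always reads slot 0,
    # ignoring which aggregator produced the row.
    return serialize_tuple(row[0])


def serialize_row_fixed(row, agg_idx):
    # Post-fix behaviour as described: reads the slot that
    # belongs to this aggregator.
    return serialize_tuple(row[agg_idx])


# A row produced by aggregator 1: slot 0 is null, slot 1 holds the tuple.
row = [None, (7, "x")]

print(serialize_row_fixed(row, 1))  # 7,x
# serialize_row_buggy(row, 1) raises TypeError because slot 0 is None.
```

With a single aggregator slot 0 is always populated, which would explain why the problem only appears with multiple aggregations.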

Testing:
Added a query that triggers the error to multiple-distinct-aggs.test

Change-Id: I7acdd40c63166cd4986e546a992c0816f94823d5
Reviewed-on: http://gerrit.cloudera.org:8080/14290
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

2019-09-25 16:58:14 +00:00

Thomas Tauber-Marshall

15e8ce4f27

IMPALA-7677: Fix DCHECK failure in GroupingAggregator

After inserting all of its input into its Aggregators,
StreamingAggregationNode performs some cleanup, such as calling
InputDone() on each Aggregator.

Previously, StreamingAggregationNode only checked that all of the
child's batches had been fetched before doing this cleanup, which
caused problems if the final child batch wasn't fully processed in a
single GetNext() call. In that case, InputDone() was called multiple
times, leading to a DCHECK failure.

The solution is to only perform the cleanup once the final child batch
has been fully processed.
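
The difference between the two conditions can be sketched with a toy model. This is an illustration under stated assumptions, not Impala's implementation: the classes Aggregator and StreamingNode, the rows_per_call limit and the wait_for_pending switch are all hypothetical, and a Python assert stands in for the DCHECK.

```python
class Aggregator:
    def __init__(self):
        self.rows = []
        self.input_done_calls = 0

    def add(self, row):
        self.rows.append(row)

    def input_done(self):
        # Stand-in for the DCHECK: InputDone() may only run once.
        assert self.input_done_calls == 0, "InputDone() called twice"
        self.input_done_calls += 1


class StreamingNode:
    def __init__(self, child_batches, agg, rows_per_call, wait_for_pending=True):
        self.batches = [list(b) for b in child_batches]
        self.agg = agg
        self.rows_per_call = rows_per_call
        self.wait_for_pending = wait_for_pending  # False models the old check
        self.pending = []       # unprocessed rows of the current child batch
        self.child_eos = False  # True once the last child batch was fetched

    def get_next(self):
        """Process up to rows_per_call rows; return True when finished."""
        if not self.pending and not self.child_eos:
            self.pending = self.batches.pop(0) if self.batches else []
            self.child_eos = not self.batches
        for row in self.pending[:self.rows_per_call]:
            self.agg.add(row)
        self.pending = self.pending[self.rows_per_call:]
        # Old check: child_eos only. New check: child_eos and nothing pending.
        if self.child_eos and (not self.pending or not self.wait_for_pending):
            self.agg.input_done()
        return self.child_eos and not self.pending


agg = Aggregator()
node = StreamingNode([[1, 2], [3, 4, 5]], agg, rows_per_call=2)
while not node.get_next():
    pass
print(agg.input_done_calls)  # 1
```

Constructing the node with wait_for_pending=False reproduces the described failure: the final batch of three rows needs two calls, the cleanup runs on both, and the second input_done() trips the assertion.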

Testing:
- Added an e2e test with a query that hits this condition.

Change-Id: I851007a60472d0e53081c076c863c866c516677c
Reviewed-on: http://gerrit.cloudera.org:8080/11626
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

2018-10-24 01:03:38 +00:00

Thomas Tauber-Marshall

df53ec2385

IMPALA-110: Support for multiple DISTINCT

This patch adds support for having multiple aggregate functions in a
single SELECT block that use DISTINCT over different sets of columns.

Planner design:
- The existing tree-based plan shape with a two-phased
  aggregation is maintained.
- Existing plans are not changed.
- Aggregates are grouped into 'aggregation classes' based on the
  expressions in their distinct portion, which may be empty for
  non-distinct aggregates.
- The aggregation framework is generalized to simultaneously process
  multiple aggregation classes within the tree-based plan. This
  process splits the results of different aggregation classes into
  separate rows, so a final aggregation is needed to transpose the
  results into the desired form.
- Main challenge: Each aggregation class consumes and produces
  different tuples, so conceptually a union-type of tuples flows
  through the runtime. The tuple union is represented by a TupleRow
  with one tuple per aggregation class. Only one tuple in such a
  TupleRow is non-NULL.
- Backend exec nodes in the aggregation plan will be aware of this
  tuple-union either explicitly in their implementation or by relying
  on expressions that distinguish the aggregation classes.
- To distinguish the aggregation classes, e.g. in hash exchanges,
  CASE expressions are crafted to hash/group on the appropriate slots.
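
The planner ideas above can be sketched in a few lines. This is a toy model under stated assumptions, not the planner's code: aggregates are plain triples, tuples are dicts, a row is a list with one slot per aggregation class, and the helper names group_into_classes, non_null_index, case_key and transpose are hypothetical. The transposition shown covers only a query without GROUP BY.

```python
def group_into_classes(aggregates):
    """Group (function, argument, distinct_columns) triples by distinct columns."""
    classes = {}
    for fn, arg, distinct_cols in aggregates:
        classes.setdefault(tuple(distinct_cols), []).append((fn, arg))
    return classes


def non_null_index(row):
    """Index of the single non-NULL tuple in a tuple-union row."""
    (idx,) = [i for i, t in enumerate(row) if t is not None]
    return idx


def case_key(row, key_slots):
    """Analogue of a CASE expression: choose the grouping slots of
    whichever aggregation class the row belongs to."""
    i = non_null_index(row)
    return tuple(row[i][slot] for slot in key_slots[i])


def transpose(rows, num_classes):
    """Final step: merge per-class result rows into one output row."""
    result = [None] * num_classes
    for row in rows:
        i = non_null_index(row)
        result[i] = row[i]
    return result


# SELECT COUNT(DISTINCT a), COUNT(DISTINCT b), SUM(c) FROM t
aggs = [("count", "a", ["a"]), ("count", "b", ["b"]), ("sum", "c", [])]
classes = group_into_classes(aggs)
# Three classes, keyed by ("a",), ("b",) and () for the non-distinct SUM.

key_slots = [["a"], ["b"], []]
print(case_key([None, {"b": 5}, None], key_slots))  # (5,)
```

Each class gets its own slot in the row, so a row such as [None, {"b": 5}, None] carries data for the second class only, and the CASE-style key reads just that slot.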

Deferred FE work:
- Beautify/condense the long CASE exprs
- Push applicable conjuncts into individual aggregators before
  the transposition step
- Added a few testing TODOs to reduce the size of this patch
- Decide whether we want to change existing plans to the new model

Execution design:
- Previous patches separated out aggregation logic from the exec node
  into Aggregators. This is extended to support multiple Aggregators
  per node, with different grouping and aggregating functions.
- There is a fast path for aggregations with only one aggregator,
  which leaves the execution essentially unchanged from before.
- When there are multiple aggregators, the first aggregation node in
  the plan replicates its input to each aggregator. The output of this
  step is rows where only a single tuple is non-null, corresponding to
  the aggregator that produced the row.
- A new expr is introduced, ValidTupleId, which takes one of these
  rows and returns which tuple is non-null.
- For additional aggregation nodes, the input is split apart into
  'mini-batches' according to which aggregator the row corresponds to.
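
The execution flow described above can be modelled roughly as follows. This is a sketch under stated assumptions, not the backend implementation: rows are Python lists with one slot per aggregator, each aggregator here only deduplicates its grouping key, and valid_tuple_id, first_phase and split_into_mini_batches are illustrative stand-ins (only the name ValidTupleId comes from the commit message). Bucketing rows into one mini-batch per aggregator is also an assumption about how the split works.

```python
def valid_tuple_id(row):
    """Return the index of the single non-null tuple in the row."""
    (idx,) = [i for i, t in enumerate(row) if t is not None]
    return idx


def first_phase(input_rows, group_cols_per_agg):
    """Replicate the input to every aggregator. Each aggregator emits one
    row per distinct grouping key, with only its own slot non-null."""
    num_aggs = len(group_cols_per_agg)
    out = []
    for agg_idx, cols in enumerate(group_cols_per_agg):
        seen = set()
        for r in input_rows:
            key = tuple(r[c] for c in cols)
            if key not in seen:
                seen.add(key)
                row = [None] * num_aggs
                row[agg_idx] = key
                out.append(row)
    return out


def split_into_mini_batches(rows, num_aggs):
    """Later aggregation nodes route each row to the aggregator it belongs to."""
    mini_batches = [[] for _ in range(num_aggs)]
    for row in rows:
        mini_batches[valid_tuple_id(row)].append(row)
    return mini_batches


table = [{"a": 1, "b": 1}, {"a": 1, "b": 2}, {"a": 2, "b": 2}]
rows = first_phase(table, [["a"], ["b"]])
batches = split_into_mini_batches(rows, 2)
print(len(batches[0]), len(batches[1]))  # 2 2
```

In this model the single-aggregator fast path corresponds to calling first_phase with one grouping-column list, in which case every row has exactly one slot and no splitting is needed.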

Testing:
- Added analyzer and planner tests
- Added end-to-end query tests
- Ran hdfs/core tests
- Added support in the query generator and ran in a loop.

Change-Id: I055402eaef6d81e5f70e850d9f8a621e766830a4
Reviewed-on: http://gerrit.cloudera.org:8080/10771
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

2018-09-26 03:54:49 +00:00

3 Commits