mirror of
https://github.com/apache/impala.git
synced 2026-02-03 00:00:40 -05:00
Planner often overestimates aggregation node memory estimate since it uses simple multiplication of NDVs of contributing grouping columns. This patch introduces new query options LARGE_AGG_MEM_THRESHOLD and AGG_MEM_CORRELATION_FACTOR. If the estimated perInstanceDataBytes from the NDV multiplication method exceed LARGE_AGG_MEM_THRESHOLD, recompute perInstanceDataBytes again by comparing against the max(NDV) & AGG_MEM_CORRELATION_FACTOR method. perInstanceDataBytes is kept at LARGE_AGG_MEM_THRESHOLD at a minimum so that low max(NDV) will not negatively impact query execution. Unlike PREAGG_BYTES_LIMIT, LARGE_AGG_MEM_THRESHOLD is evaluated on both preaggregation and final aggregation, and does not cap max memory reservation of the aggregation node (it may still increase memory allocation beyond the estimate if it is available). However, if a plan node is a streaming preaggregation node and PREAGG_BYTES_LIMIT is set, then PREAGG_BYTES_LIMIT will override the value of LARGE_AGG_MEM_THRESHOLD as a threshold. Testing: - Run the patch with 10 nodes, MT_DOP=12, against TPC-DS 3TB scale. Among 103 queries, 20 queries have lower "Per-Host Resource Estimates", 11 have lower "Cluster Memory Admitted", and 3 have over 10% reduced latency. No significant regression in query latency was observed. - Pass core tests. Change-Id: Ia4b4b2e519ee89f0a13fdb62d0471ee4047f6421 Reviewed-on: http://gerrit.cloudera.org:8080/20104 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>