Files
Zoltan Borok-Nagy 85d77b908b IMPALA-13756: Fix Iceberg V2 count(*) optimization for complex queries
We optimize plain count(*) queries on Iceberg tables the following way:

       AGGREGATE
       COUNT(*)
           |
       UNION ALL
      /        \
     /          \
    /            \
   SCAN all  ANTI JOIN
   datafiles  /      \
   without   /        \
   deletes  SCAN      SCAN
            datafiles deletes

            ||
          rewrite
            ||
            \/

  ArithmethicExpr: LHS + RHS
      /             \
     /               \
    /                 \
   record_count  AGGREGATE
   of all        COUNT(*)
   datafiles         |
   without       ANTI JOIN
   deletes      /         \
               /           \
               SCAN        SCAN
               datafiles   deletes

This optimization consists of two parts:
 1 Rewriting count(*) expression to count(*) + "record_count" (of data
   files without deletes)
 2 In IcebergScanPlanner we only need to consruct the right side of
   the original UNION ALL operator, i.e.:

            ANTI JOIN
           /         \
          /           \
         SCAN        SCAN
         datafiles   deletes

SelectStmt decides whether we can do the count(*) optimization, and if
so, does the following:

 1: SelectStmt sets 'TotalRecordsNumV2' in the analyzer, then during the
    expression rewrite phase the CountStarToConstRule rewrites the
    count(*) to count(*) + record_count
 2: SelectStmt sets "OptimizeCountStarForIcebergV2" in the query context
    then IcebergScanPlanner creates plan accordingly.

This mechanism works for simple queries, but can turn on count(*)
optimization in IcebergScanPlanner for all Iceberg V2 tables in complex
queries. Even if only one subquery enables count(*) optimization during
analysis.

With this patch the followings change:
 1: We introduce IcebergV2CountStarAccumulator which we use instead of
    the ArithmethicExpr. So after rewrite we still know if count(*)
    optimization should be enabled for the planner.
 2: Instead of using the query context, we pass the information to the
    IcebergScanPlanner via the TableRef object.

Testing
 * e2e tests

Change-Id: I1940031298eb634aa82c3d32bbbf16bce8eaf874
Reviewed-on: http://gerrit.cloudera.org:8080/23705
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
2025-12-19 17:53:50 +00:00
..