IMPALA-14261: Take 'impala.computeStatsSnapshotId' into account when deciding between Puffin and HMS stats

Since IMPALA-13609, Impala writes snapshot information for each column
on COMPUTE STATS for Iceberg tables (see there for why it is useful),
but this information has so far been ignored.

After this change, snapshot information is used when deciding which of
HMS and Puffin NDV stats should be used (i.e. which is more recent).
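
As a rough illustration of the per-column decision described above (not the
actual Impala code), the sketch below assumes that the
'impala.computeStatsSnapshotId' property has already been decoded into a map
from Iceberg field id to snapshot id, and that "more recent" is judged by
snapshot timestamp. The class name, method name, the timestamp lookup map and
the tie-breaking and fallback rules are all hypothetical.

```java
import java.util.Map;

/** Illustrative sketch only; names and fallback rules are assumptions. */
final class NdvStatsSourceSketch {
  public enum Source { HMS, PUFFIN }

  /**
   * Picks the NDV stats source for one column that has both HMS and Puffin stats.
   *
   * @param fieldId              Iceberg field id of the column (an int per the spec)
   * @param hmsSnapshotByFieldId hypothetical decoded form of the
   *                             'impala.computeStatsSnapshotId' table property
   * @param puffinSnapshotId     snapshot the Puffin stats belong to
   * @param snapshotTimestampMs  hypothetical lookup: snapshot id -> timestamp (ms)
   */
  static Source choose(int fieldId, Map<Integer, Long> hmsSnapshotByFieldId,
      long puffinSnapshotId, Map<Long, Long> snapshotTimestampMs) {
    Long hmsSnapshotId = hmsSnapshotByFieldId.get(fieldId);
    // Assumption: without snapshot info for the HMS stats, prefer Puffin.
    if (hmsSnapshotId == null) return Source.PUFFIN;

    // Assumption: an unknown snapshot is treated as the oldest possible one.
    long hmsTs = snapshotTimestampMs.getOrDefault(hmsSnapshotId, Long.MIN_VALUE);
    long puffinTs = snapshotTimestampMs.getOrDefault(puffinSnapshotId, Long.MIN_VALUE);

    // Assumption: on a tie (e.g. same snapshot) the HMS stats are kept.
    return puffinTs > hmsTs ? Source.PUFFIN : Source.HMS;
  }
}
```

With this shape, a column whose HMS stats were computed on an older snapshot
than the Puffin file's snapshot would resolve to PUFFIN, and vice versa.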

This change also modifies the
IcebergUtil.ComputeStatsSnapshotPropertyConverter class: previously
Iceberg fieldIds were stored as Long, but now they are stored as
Integer, in accordance with the Iceberg spec.

Documentation:
 - updated the docs about Puffin stats in docs/topics/impala_iceberg.xml
Testing:
 - modified existing tests to fit the new decision mechanism

Change-Id: I95a5b152dd504e94dea368a107d412e33f67930c
Reviewed-on: http://gerrit.cloudera.org:8080/23251
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Daniel Becker <daniel.becker@cloudera.com>
Daniel Becker
2025-07-22 16:08:17 +02:00
committed by Daniel Becker
parent a68f716458
commit 19c12e0e06
7 changed files with 70 additions and 63 deletions

@@ -896,10 +896,11 @@ ORDER BY made_current_at;
 come from different snapshots.
 </p>
 <p>
-In case there are both HMS and Puffin stats for a column, the more recent one will
-be used - for HMS stats we use the 'impala.lastComputeStatsTime' table property, and
-for Puffin stats we use the snapshot timestamp to determine which one is more
-recent.
+In case there are both HMS and Puffin NDV stats for a column, the more recent one
+will be used. For HMS stats we use the 'impala.computeStatsSnapshotId' table
+property which stores, for each column, the snapshot for which HMS stats were
+calculated. We compare this with the snapshot of the Puffin stats to decide which
+is more recent.
 </p>
 <p>
 Reading Puffin stats is disabled by default; set the "--enable_reading_puffin_stats"