mirror of
https://github.com/apache/impala.git
synced 2025-12-25 02:03:09 -05:00
Currently, when COMPUTE STATS is run from Impala, we set the
'impala.lastComputeStatsTime' table property. Iceberg Puffin stats, on
the other hand, store the snapshot id for which the stats were
calculated. Although it is possible to retrieve the timestamp of a
snapshot, comparing these two values is error-prone, e.g. in the
following situation:

 - COMPUTE STATS calculation is running on snapshot N
 - snapshot N+1 is committed at time T
 - COMPUTE STATS finishes and sets 'impala.lastComputeStatsTime' at
   time T + Delta
 - some engine writes Puffin statistics for snapshot N+1

After this, HMS stats will appear to be more recent even though they
were calculated on snapshot N, while we have Puffin stats for snapshot
N+1.

To make comparisons easier, after this change, COMPUTE STATS sets a new
table property, 'impala.computeStatsSnapshotIds'. This property stores
the snapshot id for which stats have been computed, for each column. It
is a comma-separated list of values of the form
"fieldIdRangeStart[-fieldIdRangeEndIncl]:snapshotId". The fieldId part
may be a single value or a contiguous, inclusive range.

Storing the snapshot ids on a per-column basis is needed because COMPUTE
STATS can be set to calculate stats for only a subset of the columns,
and then a different subset in a subsequent run. The recency of the
stats will then be different for each column.

Storing the Iceberg field ids instead of column names makes the format
easier to handle as we do not need to take care of escaping special
characters.

The 'impala.computeStatsSnapshotIds' table property is deleted after
DROP STATS.

Note that this change does not yet modify how Impala chooses between
Puffin and HMS stats: that will be done in a separate change.
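
To make the property format concrete, here is a minimal Python sketch of
parsing and serialising a value of the form
"fieldIdRangeStart[-fieldIdRangeEndIncl]:snapshotId". It is only an
illustration and not the code added to Impala by this change: the function
names are hypothetical, and merging adjacent field ids that share a snapshot
id into one range is an assumption about how a serialiser might behave.

```python
from typing import Dict


def parse_snapshot_ids(prop: str) -> Dict[int, int]:
    """Parse the property value into a {field_id: snapshot_id} mapping.

    Hypothetical helper: each comma-separated entry is either
    'fieldId:snapshotId' or 'start-endIncl:snapshotId'.
    """
    result: Dict[int, int] = {}
    if not prop:
        return result
    for entry in prop.split(","):
        field_part, snapshot_part = entry.split(":")
        snapshot_id = int(snapshot_part)
        start_str, sep, end_str = field_part.partition("-")
        start = int(start_str)
        # A missing '-' means the entry covers a single field id.
        end = int(end_str) if sep else start
        if end < start:
            raise ValueError(f"invalid field id range: {field_part!r}")
        for field_id in range(start, end + 1):  # the range end is inclusive
            result[field_id] = snapshot_id
    return result


def serialize_snapshot_ids(ids: Dict[int, int]) -> str:
    """Serialise {field_id: snapshot_id} back into the property format.

    Assumption: consecutive field ids with the same snapshot id are
    collapsed into one 'start-endIncl:snapshotId' entry.
    """
    runs = []  # each run is [start, end_incl, snapshot_id]
    for field_id in sorted(ids):
        snapshot_id = ids[field_id]
        if runs and runs[-1][1] + 1 == field_id and runs[-1][2] == snapshot_id:
            runs[-1][1] = field_id  # extend the current contiguous run
        else:
            runs.append([field_id, field_id, snapshot_id])
    return ",".join(
        f"{start}:{snap}" if start == end else f"{start}-{end}:{snap}"
        for start, end, snap in runs
    )
```

With this sketch, a value such as "1-3:100,5:200" maps field ids 1, 2 and 3
to snapshot 100 and field id 5 to snapshot 200, so a reader can look up the
recency of the stats for each column independently.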
Testing:
 - Added tests in iceberg-compute-stats.test checking that
   'impala.computeStatsSnapshotIds' is set correctly and is deleted
   after DROP STATS
 - added unit tests in IcebergUtilTest.java that check the parsing and
   serialisation of the table property

Change-Id: Id9998b84c4fd20d1cf5e97a34f3553832ec70ae7
Reviewed-on: http://gerrit.cloudera.org:8080/22339
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>