6 Commits

Author SHA1 Message Date
Michael Smith
5137bb94ac IMPALA-14446: Clean up pom.xml
Cleans up repetitive patterns in pom.xml.

Centralize plugin configuration in pluginManagement. Replace inline
maven-compiler-plugin configuration with newer maven.compiler.release
and update to latest plugin version.

Centralize common dependencies in dependencyManagement, including
exclusions when appropriate. Remove exclusions that are no longer
relevant.

Compared before and after with dependency:tree; only difference is that
commons-cli now comes from hadoop and jersey-serv{let,er} are
effectively excluded; all versions matched. Also ensured
USE_APACHE_COMPONENTS=true compiles.

Adds com.amazonaws:aws-java-sdk-bundle to exclusion checking to ensure
it's not accidentally included alongside impala-minimal-s3a-aws-sdk.

Removes missed io.netty exclusion from IMPALA-12816.

Updates commons-dbcp2 to 2.12.0 to match Hive.

Change-Id: If96649840e23036b4a73ee23e8d12516497994f0
Reviewed-on: http://gerrit.cloudera.org:8080/23432
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-23 02:50:22 +00:00
Riza Suminto
35aa2e2add IMPALA-14187: Add IMPALA_JAVA_TARGET env var
Impala is preparing to switch to JDK17 for Java compilation by default.
While the source version might remain in 1.8 for longer, we should
experiment with targeting binary version 17.

This patch adds IMPALA_JAVA_TARGET env var to control target binary
version. It is initialized in impala-config-java.sh, depending on value
of IMPALA_JDK_VERSION env var.

Testing:
Pass data load and FE tests with IMPALA_JDK_VERSION=17.

Change-Id: If194d87c542d416b878661403c32c6adc2930199
Reviewed-on: http://gerrit.cloudera.org:8080/23096
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-06-27 00:41:57 +00:00
Peter Rozsa
1f70269392 IMPALA-13838: Update Impala version to 5.0.0-SNAPSHOT
Change-Id: I9c5a2d817b30e14333feeb5b2de3e0c40795723f
Reviewed-on: http://gerrit.cloudera.org:8080/22596
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-08 14:13:48 +00:00
Daniel Becker
c5b474d3f5 IMPALA-13594: Read Puffin stats also from older snapshots
Before this change, Puffin stats were only read from the current
snapshot. Now we also consider older snapshots, and for each column we
choose the most recent available stats. Note that this means that the
stats for different columns may come from different snapshots.

In case there are both HMS and Puffin stats for a column, the more
recent one will be used - for HMS stats we use the
'impala.lastComputeStatsTime' table property, and for Puffin stats we
use the snapshot timestamp to determine which is more recent.
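
The selection rule described above can be sketched as follows. This is a
minimal illustration, not Impala's actual implementation: the names
ColumnNdv, newest_puffin_stats and merge_with_hms are hypothetical, both
timestamps are assumed to be in the same unit, and keeping the Puffin value
on a tie is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class ColumnNdv:
    """Hypothetical holder for one column's NDV and where it came from."""
    ndv: int
    timestamp: int  # snapshot timestamp (Puffin) or last compute-stats time (HMS)
    source: str     # "puffin" or "hms"


def newest_puffin_stats(
        snapshots: Iterable[Tuple[int, Dict[str, int]]]) -> Dict[str, ColumnNdv]:
    """For each column, keep the NDV from the most recent snapshot that has one.

    `snapshots` yields (snapshot_timestamp, {column_name: ndv}) pairs, so the
    result may mix values taken from different snapshots.
    """
    best: Dict[str, ColumnNdv] = {}
    for timestamp, column_ndvs in snapshots:
        for column, ndv in column_ndvs.items():
            current = best.get(column)
            if current is None or timestamp > current.timestamp:
                best[column] = ColumnNdv(ndv, timestamp, "puffin")
    return best


def merge_with_hms(puffin: Dict[str, ColumnNdv],
                   hms_ndvs: Dict[str, int],
                   hms_last_compute_stats_time: int) -> Dict[str, ColumnNdv]:
    """Per column, prefer whichever of the HMS and Puffin stats is more recent.

    Assumption: on equal timestamps the Puffin value is kept.
    """
    merged = dict(puffin)
    for column, ndv in hms_ndvs.items():
        existing = merged.get(column)
        if existing is None or hms_last_compute_stats_time > existing.timestamp:
            merged[column] = ColumnNdv(ndv, hms_last_compute_stats_time, "hms")
    return merged
```

For example, if column "a" has Puffin stats in a snapshot newer than the
last COMPUTE STATS run while column "b" only has them in an older snapshot,
"a" would take its NDV from Puffin and "b" from HMS.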

This commit also renames the startup flag 'disable_reading_puffin_stats'
to 'enable_reading_puffin_stats' and the table property
'impala.iceberg_disable_reading_puffin_stats' to
'impala.iceberg_read_puffin_stats' to make them more intuitive. The
default values are flipped to keep the same behaviour as before.

The documentation of Puffin reading is updated in
docs/topics/impala_iceberg.xml

Testing:
 - updated existing test cases and added new ones in
   test_iceberg_with_puffin.py
 - reorganised the tests in TestIcebergTableWithPuffinStats in
   test_iceberg_with_puffin.py: tests that modify table properties and
   other state that other tests rely on are now run separately to
   provide a clean environment for all tests.

Change-Id: Ia37abe8c9eab6d91946c8f6d3df5fb0889704a39
Reviewed-on: http://gerrit.cloudera.org:8080/22177
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-23 15:25:59 +00:00
Daniel Becker
e5919f13f9 IMPALA-13370: Read Puffin stats from metadata.json property if available
When Trino writes Puffin stats for a column, it includes the NDV as a
property (with key "ndv") in the "statistics" section of the
metadata.json file, in addition to the Theta sketch in the Puffin file.
When we are only reading the stats and not writing/updating them, it is
enough to read this property if it is present.

After this change, Impala only opens and reads a Puffin stats file if it
contains stats for at least one column for which the "ndv" property is
not set in the metadata.json file.
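
The condition for opening a Puffin file can be sketched as below. This is
an illustration under stated assumptions only: needs_puffin_file is a
hypothetical name, and the inputs are assumed to have been extracted
already from the "statistics" section of metadata.json.

```python
from typing import Dict, Iterable


def needs_puffin_file(columns_in_file: Iterable[str],
                      metadata_ndv: Dict[str, int]) -> bool:
    """Return True if the Puffin file has to be opened and read.

    `columns_in_file` lists the columns the file holds stats for;
    `metadata_ndv` maps a column to its "ndv" property from metadata.json.
    The file is needed only if at least one of its columns lacks that
    property.
    """
    return any(column not in metadata_ndv for column in columns_in_file)
```

With this rule, a file whose columns all carry the "ndv" property is never
opened, which is why a file with corrupt sketches can still yield valid
stats in the test described below.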

Testing:
 - added a test in test_iceberg_with_puffin.py that verifies that the
   Puffin stats file is not read if the metadata.json file contains
   the NDV property. It uses the newly added stats file with corrupt
   datasketches: 'metadata_ndv_ok_sketches_corrupt.stats'.

Change-Id: I5e92056ce97c4849742db6309562af3b575f647b
Reviewed-on: http://gerrit.cloudera.org:8080/21959
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-23 16:04:06 +00:00
Daniel Becker
b05b408f17 IMPALA-13247: Support Reading Puffin files for the current snapshot
This change adds support for reading NDV statistics from Puffin files
when they are available for the current snapshot. Puffin files or blobs
that were written for other snapshots than the current one are ignored.
Because this behaviour is different from what we have for HMS stats and
may therefore be unintuitive for users, reading Puffin stats is disabled
by default; set the "--disable_reading_puffin_stats" startup flag to
false to enable it.

When Puffin stats reading is enabled, the NDV values read from Puffin
files take precedence over NDV values stored in the HMS. This is because
we only read Puffin stats for the current snapshot, so these values are
always up-to-date, while the values in the HMS may be stale.

Note that it is currently not possible to drop Puffin stats from Impala.
For this reason, this patch also introduces two ways of disabling the
reading of Puffin stats:
  - globally, with the aforementioned "--disable_reading_puffin_stats"
    startup flag: when it is set to true, Impala will never read Puffin
    stats
  - for specific tables, by setting the
    "impala.iceberg_disable_reading_puffin_stats" table property to
    true.
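
How the two switches combine can be sketched as follows. This is a hedged
illustration, not the actual code: should_read_puffin_stats is a
hypothetical helper, and treating the table property as a case-insensitive
"true"/"false" string is an assumption.

```python
from typing import Dict

# Table property name taken from the commit message above.
TABLE_PROPERTY = "impala.iceberg_disable_reading_puffin_stats"


def should_read_puffin_stats(disable_reading_puffin_stats: bool,
                             table_properties: Dict[str, str]) -> bool:
    """Puffin stats are read only if neither switch disables them."""
    if disable_reading_puffin_stats:
        # The global startup flag wins: Puffin stats are never read.
        return False
    # Assumption: an absent property means reading is not disabled.
    value = table_properties.get(TABLE_PROPERTY, "false")
    return value.strip().lower() != "true"
```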

Note that this change is only about reading Puffin files; Impala does
not yet support writing them.

Testing:
 - created the PuffinDataGenerator tool which can generate Puffin files
   and metadata.json files for different scenarios (e.g. all stats are
   in the same Puffin file; stats for different columns are in different
   Puffin files; some Puffin files are corrupt etc.). The generated
   files are under the "testdata/ice_puffin/generated" directory.
 - The new custom cluster test class
   'test_iceberg_with_puffin.py::TestIcebergTableWithPuffinStats' uses
   the generated data to test various scenarios.
 - Added custom cluster tests that test the
   'disable_reading_puffin_stats' startup flag.

Change-Id: I50c1228988960a686d08a9b2942e01e366678866
Reviewed-on: http://gerrit.cloudera.org:8080/21605
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-19 22:14:59 +00:00