3209 Commits

Author SHA1 Message Date
Zoltan Borok-Nagy
6649b92cb2 IMPALA-14635: We should not check for exact file sizes in iceberg-metadata-tables.test
The Impala version string is written into the Parquet footer. This
means in our tests we shouldn't check for exact file sizes of tables
written during data loading/testing.

Change-Id: I589ade5f81879ede54ff41466b77b5db3349a14f
Reviewed-on: http://gerrit.cloudera.org:8080/23802
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-12-18 15:58:41 +00:00
Michael Smith
c3dc7f9667 IMPALA-13147: Limit concurrency of link jobs
Configure separate compile and link pools for ninja. Configures link
parallelism based on expected memory use, which can be reduced by
setting IMPALA_MINIMAL_DEBUG_INFO=true or IMPALA_SPLIT_DEBUG_INFO=true.
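
As a rough illustration of sizing link parallelism by expected memory use, the sketch below caps the number of concurrent link jobs so that their combined memory fits in RAM. The function name, its parameters, and the numbers are hypothetical; the actual formula used by the build scripts is not shown in this message.

```python
def link_jobs(total_mem_gb: float, mem_per_link_gb: float, cpu_count: int) -> int:
    """Return how many link jobs to run at once (hypothetical helper).

    The result is limited by memory first, then by the CPU count,
    and is never smaller than 1 so the build can always make progress.
    """
    by_memory = int(total_mem_gb // mem_per_link_gb)
    return max(1, min(cpu_count, by_memory))


# Assumed example: 64 GB of RAM, 8 GB per link job, 16 cores.
print(link_jobs(64, 8, 16))
```

Lowering the per-link memory estimate (for example with split or minimal debug info) raises the result, which matches the behavior described above.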

Adds IMPALA_MAKE_CMD to simplify using the ninja build tool for all make
operations in scripts. Install ninja on Ubuntu. Adds a '-make' option to
buildall.sh to force using 'make'.

Adds MOLD_JOBS=1 to avoid overloading the system when trying 'mold' and
linking test binaries. However 'mold' is not selected as the default
due to test failures around SASL/GSSAPI (see IMPALA-14527).

Switches bin/jenkins/all-tests.sh to use ninja and removes the guard in
bootstrap_development.sh limiting IMPALA_BUILD_THREADS as it's no longer
needed with ninja.

SKIP_BE_TEST_PATTERN in run-backend-tests is unused (only used with
TARGET_FILESYSTEM=local) so I don't attempt to make it work with ninja.

Tested with local 'IMPALA_SPLIT_DEBUG_INFO=true buildall.sh -skiptests'
with default (make) and IMPALA_MAKE_CMD=ninja.

Change-Id: I0952dc19ace5c9c42bed0d2ffb61499656c0a2db
Reviewed-on: http://gerrit.cloudera.org:8080/23572
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Pranav Lodha <pranav.lodha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-12-15 21:43:07 +00:00
Daniel Vanko
9d112dae23 IMPALA-14536: Fix CONVERT TO ICEBERG to not throw exception on Iceberg tables
Previously, running ALTER TABLE <table> CONVERT TO ICEBERG on an Iceberg
table produced an error. This patch fixes that, so the statement will do
nothing when called on an Iceberg table and return with the 'Table has
already been migrated.' message.

This is achieved by adding a new flag to StatementBase that signals when a
statement ends up as a NO_OP. If the flag is set, the new TStmtType::NO_OP
is set as TExecRequest's type and noop_result can be used to set the result
from the Frontend side.

Tests:
 * extended fe and e2e tests

Change-Id: I41ecbfd350d38e4e3fd7b813a4fc27211d828f73
Reviewed-on: http://gerrit.cloudera.org:8080/23699
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Peter Rozsa <prozsa@cloudera.com>
2025-12-12 15:35:28 +00:00
Xuebin Su
d54b75ccf1 IMPALA-14619: Reset levels_readahead_ for late materialization
Previously, `BaseScalarColumnReader::levels_readahead_` was not reset
when the reader did not do page filtering. If a query selected the last
row containing a collection value in a row group, `levels_readahead_`
would be set and would not be reset when advancing to the next row
group without page filtering. As a result, trying to skip collection
values at the start of the next row group would cause a check failure.

This patch fixes the failure by resetting `levels_readahead_` in
`BaseScalarColumnReader::Reset()`, which is always called when advancing
to the next row group.

`levels_readahead_` is also moved out of the "Members used for page
filtering" section as the variable is also used in late materialization.

Testing:
- Added an E2E test for the fix.

Change-Id: Idac138ffe4e1a9260f9080a97a1090b467781d00
Reviewed-on: http://gerrit.cloudera.org:8080/23779
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-12-12 15:12:50 +00:00
Nandor Kollar
65639f16b9 IMPALA-12330: Allow setting format-version in ALTER TABLE CONVERT TO
This change allows modifying the format version table property
in ALTER TABLE CONVERT TO statements. It adds verification for
the property value too: only 1 or 2 is supported as of now.
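
A minimal sketch of the kind of check described above, assuming the property value arrives as a string. The function name and error messages are hypothetical and do not reflect the actual analyzer code.

```python
SUPPORTED_FORMAT_VERSIONS = (1, 2)


def validate_format_version(raw_value: str) -> int:
    """Parse a 'format-version' property value and reject unsupported ones."""
    try:
        version = int(raw_value)
    except ValueError:
        raise ValueError("format-version must be an integer, got %r" % raw_value)
    if version not in SUPPORTED_FORMAT_VERSIONS:
        raise ValueError(
            "Unsupported format-version %d; only 1 or 2 is supported" % version)
    return version
```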

Change-Id: Iaed207feb83a277a1c2f81dcf58c42f0721c0865
Reviewed-on: http://gerrit.cloudera.org:8080/23721
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Peter Rozsa <prozsa@cloudera.com>
2025-12-12 08:18:08 +00:00
Arnab Karmakar
ddd82e02b9 IMPALA-14065: Support WHERE clause in SHOW PARTITIONS statement
This patch extends the SHOW PARTITIONS statement to allow an optional
WHERE clause that filters partitions based on partition column values.
The implementation adds support for various comparison operators,
IN lists, BETWEEN clauses, IS NULL, and logical AND/OR expressions
involving partition columns.

Non-partition columns, subqueries, and analytic expressions in the
WHERE clause are not allowed and will result in an analysis error.
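
The filtering rule can be pictured with the following sketch. It is a simplified model, not the analyzer implementation: partitions are plain dicts, the predicate is a Python callable, and the list of referenced columns is passed in by hand (all of these are assumptions made for illustration).

```python
def filter_partitions(partitions, partition_cols, referenced_cols, predicate):
    """Return the partitions for which `predicate` holds.

    partitions: list of dicts mapping partition column name -> value.
    referenced_cols: columns mentioned by the WHERE clause (hypothetical
    input; a real analyzer would extract these from the expression tree).
    """
    non_partition = sorted(set(referenced_cols) - set(partition_cols))
    if non_partition:
        # Mirrors the analysis error for non-partition columns.
        raise ValueError(
            "WHERE clause references non-partition columns: "
            + ", ".join(non_partition))
    return [p for p in partitions if predicate(p)]


parts = [
    {"year": 2009, "month": 1},
    {"year": 2010, "month": 1},
    {"year": 2010, "month": 2},
]
# Roughly: WHERE year = 2010 AND month IN (1, 2)
selected = filter_partitions(
    parts, ["year", "month"], ["year", "month"],
    lambda p: p["year"] == 2010 and p["month"] in (1, 2))
```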

New analyzer tests have been added to AnalyzeDDLTest#TestShowPartitions
to verify correct parsing, semantic validation, and error handling for
supported and unsupported cases.

Testing:
- Added new unit tests in AnalyzeDDLTest for valid and invalid WHERE
clause cases.
- Verified functional tests covering partition filtering behavior.

Change-Id: I2e2a14aabcea3fb17083d4ad6f87b7861113f89e
Reviewed-on: http://gerrit.cloudera.org:8080/23566
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-12-11 15:36:08 +00:00
Csaba Ringhofer
780e6683a2 IMPALA-14573: port critical geospatial functions to c++ (part 1)
This commit contains the simpler parts from
https://gerrit.cloudera.org/#/c/20602

This mainly means accessors for the header of the binary
format and bounding box check (st_envIntersects).
New tests for not yet covered functions / overloads are also added.
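
For readers unfamiliar with bounding box checks, here is a minimal sketch of an envelope intersection test. It only illustrates the general technique; the names are hypothetical, it does not read the binary header format, and treating touching edges as intersecting is an assumption.

```python
from typing import NamedTuple


class Envelope(NamedTuple):
    """Axis-aligned bounding box of a geometry."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def env_intersects(a: Envelope, b: Envelope) -> bool:
    """True if the two boxes overlap on both the x and the y axis."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax and
            a.ymin <= b.ymax and b.ymin <= a.ymax)
```

Because only four numbers per geometry are compared, such a check does not need to deserialize the full geometry.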

For details of the binary format see be/src/exprs/geo/shape-format.h

Differences from the PR above:

Only a subset of functions are added. The criteria were:
1. the native function must be fully compatible with the Java version*
2. must not rely on (de)serializing the full geometry
3. the function must be tested

1 implies 2 because (de)serialization is not implemented yet in
the original patch for >2d geometries, which would break compatibility
with the Java version for XYZ/XYM/XYZM geometries.

*: there are 2 known differences:
 1. NULL handling: the Java functions return error instead of NULL
    when getting a NULL parameter
 2. st_envIntersects() doesn't check if the SRID matches - the Java
    library looks inconsistent about this

Because the native functions are fairly safe replacements for the Java
ones, they are always used when geospatial_library=HIVE_ESRI.

Change-Id: I0ff950a25320549290a83a3b1c31ce828dd68e3c
Reviewed-on: http://gerrit.cloudera.org:8080/23700
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-12-06 07:50:23 +00:00
jichen0919
7e29ac23da IMPALA-14092 Part2: Support querying of paimon data table via JNI
This patch mainly implements querying of paimon data tables
through a JNI-based scanner.

Features implemented:
- support column pruning.
The partition pruning and predicate push down will be submitted
as the third part of the patch.

We implemented this by treating the paimon table as a normal
unpartitioned table. When querying a paimon table:
- PaimonScanNode decides which paimon splits need to be scanned,
  and then transfers the splits to BE to do the jni-based scan operation.

- We also collect the required columns that need to be scanned,
  and pass the columns to Scanner for column pruning. This is
  implemented by passing the field ids of the columns to BE,
  instead of column position to support schema evolution.

- In the original implementation, PaimonJniScanner will directly
  pass paimon row object to BE, and call corresponding paimon row
  field accessor, which is a java method to convert row fields to
  impala row batch tuples. We find it is slow due to overhead of
  JVM method calling.
  To minimize the overhead, we refashioned the implementation,
  the PaimonJniScanner will convert the paimon row batches to
  arrow recordbatch, which stores data in offheap region of
  impala JVM. And PaimonJniScanner will pass the arrow offheap
  record batch memory pointer to the BE backend.
  BE PaimonJniScanNode will directly read data from JVM offheap
  region, and convert the arrow record batch to impala row batch.

  The benchmark shows the latter implementation is 2.x better
  than the original implementation.

  The lifecycle of arrow row batch is mainly like this:
  the arrow row batch is generated in FE, and passed to BE.
  After the record batch is imported to BE successfully,
  BE will be in charge of freeing the row batch.
  There are two free paths: the normal path, and the
  exception path. For the normal path, when the arrow batch
  is totally consumed by BE, BE will call jni to fetch the next arrow
  batch. For this case, the arrow batch is freed automatically.
  The exceptional path happens when the query is cancelled or memory
  allocation fails. For these corner cases, the arrow batch is freed in the
  method close if it is not totally consumed by BE.
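
The reason for passing field ids rather than column positions can be shown with a small sketch. The schema representation and helper names below are hypothetical; they only model how an id-based lookup stays stable when the table schema evolves.

```python
def prune_by_field_id(schema, required_field_ids):
    """schema: list of (field_id, column_name) in current table order."""
    wanted = set(required_field_ids)
    return [name for field_id, name in schema if field_id in wanted]


def prune_by_position(schema, positions):
    """Position-based lookup, shown only for comparison."""
    return [schema[i][1] for i in positions]


old_schema = [(1, "id"), (2, "name"), (3, "price")]
# Assumed evolution: "name" was dropped and "qty" was added.
new_schema = [(1, "id"), (3, "price"), (4, "qty")]

by_id = prune_by_field_id(new_schema, [1, 3])
by_position = prune_by_position(new_schema, [0, 2])
```

With the evolved schema, the id-based lookup still selects `id` and `price`, while the position-based one silently picks up `qty`.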

Current supported impala data types for query includes:
- BOOLEAN
- TINYINT
- SMALLINT
- INTEGER
- BIGINT
- FLOAT
- DOUBLE
- STRING
- DECIMAL(P,S)
- TIMESTAMP
- CHAR(N)
- VARCHAR(N)
- BINARY
- DATE

TODO:
    - Patches pending submission:
        - Support tpcds/tpch data-loading
          for paimon data table.
        - Virtual Column query support for querying
          paimon data table.
        - Query support with time travel.
        - Query support for paimon meta tables.
    - WIP:
        - Snapshot incremental read.
        - Complex type query support.
        - Native paimon table scanner, instead of
          jni based.

Testing:
    - Create tests table in functional_schema_template.sql
    - Add TestPaimonScannerWithLimit in test_scanners.py
    - Add test_paimon_query in test_paimon.py.
    - Already passed the tpcds/tpch tests for paimon tables. The test
      table data is currently generated by Spark, which is not
      supported by Impala now; we have to do this since Hive
      doesn't support generating paimon tables for dynamic-partitioned
      tables. We plan to submit a separate patch for tpcds/tpch data
      loading and associated tpcds/tpch query tests.
    - JVM off-heap memory leak tests: ran looped tpch tests for
      1 day; no obvious off-heap memory increase was observed, and
      off-heap memory usage stayed within 10M.

Change-Id: Ie679a89a8cc21d52b583422336b9f747bdf37384
Reviewed-on: http://gerrit.cloudera.org:8080/23613
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-12-05 18:19:57 +00:00
ttttttz
5d1f1e0180 IMPALA-14183: Rename the environment variable USE_APACHE_HIVE to USE_APACHE_HIVE_3
When the environment variable USE_APACHE_HIVE is set to true, Impala is
built for Apache Hive 3.x. In order to better distinguish it
from Apache Hive 2.x later, rename USE_APACHE_HIVE to USE_APACHE_HIVE_3.
Additionally, to facilitate referencing different versions of the Hive
MetastoreShim, the major version of Hive has been added to the environment
variable IMPALA_HIVE_DIST_TYPE.

Change-Id: I11b5fe1604b6fc34469fb357c98784b7ad88574d
Reviewed-on: http://gerrit.cloudera.org:8080/21724
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-12-03 13:38:45 +00:00
Peter Rozsa
d67ab6f11f IMPALA-14569: (addendum) Fix 'partitions' row matching
IMPALA-14569 introduced a test that asserts on a profile row like
'HDFS partitions', but test environments may run on a
different storage system. This change omits the storage type from the
row_regex.

Change-Id: If9b223f2be2dfe7be8724423fefdfb56ffeeba6e
Reviewed-on: http://gerrit.cloudera.org:8080/23727
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-12-01 23:06:47 +00:00
Peter Rozsa
6cf21464b4 IMPALA-14569: Fix IllegalStateException in partition pruning on type mismatch
This fixes an IllegalStateException in HdfsPartitionPruner when
evaluating 'IN' predicates that consist of two compatible types, for
example DATE and STRING: date_col in (<date as string>).

Previously, 'canEvalUsingPartitionMd' did not check if the slot type
matched the literal type. This caused the frontend to attempt invalid
comparisons via 'LiteralExpr.compareTo', leading to
IllegalStateException or incorrect pruning.

The fix ensures 'canEvalUsingPartitionMd' returns false on type
mismatches, deferring evaluation to the backend where proper casting
occurs.
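
The gist of the added check can be sketched as follows. This is a deliberately simplified model: types are plain strings, and the function only mirrors the name of the real method without reproducing its other conditions.

```python
def can_eval_using_partition_md(slot_type, literal_types):
    """Simplified stand-in: allow frontend evaluation only when every
    literal has exactly the slot's type; otherwise defer to the backend,
    where the proper casts are applied."""
    return all(lit_type == slot_type for lit_type in literal_types)


# date_col IN (<date as string>) mixes DATE and STRING, so it is deferred.
deferred = not can_eval_using_partition_md("DATE", ["STRING"])
```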

Testing:
- Added regression test in hdfs-partition-pruning.test.

Change-Id: Idc226a628c8df559329a060cb963b81e27e21eda
Reviewed-on: http://gerrit.cloudera.org:8080/23706
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-27 02:48:28 +00:00
jasonmfehr
2ac5a24dc0 IMPALA-14455: Cleanup OpenTelemetry Tracing Startup Flags
Fixes several issues with the OpenTelemetry tracing startup flags:

1. otel_trace_beeswax -- Removes this hidden flag which enabled
   tracing of queries submitted over Beeswax. Since this protocol is
   deprecated and no tests assert the traces generated by Beeswax
   queries, this flag was removed to eliminate an extra check when
   determining if OpenTelemetry tracing should be enabled.

2. otel_trace_tls_minimum_version -- Fixes parsing of this flag's
   value. This flag is in the format "tlsv1.2" or "tlsv1.3", but the
   OpenTelemetry C++ SDK expects the minimum TLS version to be in the
   format "1.2" or "1.3". The code now removes the "tlsv" prefix before
   passing the value to the OpenTelemetry C++ SDK.

3. otel_trace_tls_insecure_skip_verify -- Fixes the guidance to only
   set this flag to true in dev/testing.
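
The prefix handling from item 2 can be sketched in a few lines. The real code is C++; this Python version uses a hypothetical function name and assumes values without the prefix are passed through unchanged.

```python
def to_sdk_tls_version(flag_value: str) -> str:
    """Convert a flag value like 'tlsv1.2' to the '1.2' form."""
    prefix = "tlsv"
    if flag_value.lower().startswith(prefix):
        return flag_value[len(prefix):]
    # Assumption: anything else is handed over as-is.
    return flag_value
```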

Adds ctest tests for the functions that configure the TraceProvider
singleton to ensure startup flags are correctly parsed and applied.

Modifies the http_exporter_config and init_otel_tracer function
signatures in otel.cc to return the actual object they create instead
of a Status since these functions only ever returned OK.

Updates the OpenTelemetry collector docker-compose file to support
the collector receiving traces over both HTTP and HTTPS. This setup
is used to manually smoke test the integration from Impala to an
OpenTelemetry collector.

Change-Id: Ie321fa37c0fd260f783dc6cf47924d53a06d82ea
Reviewed-on: http://gerrit.cloudera.org:8080/23440
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-11-24 23:46:57 +00:00
Daniel Vanko
3d22c7fe05 IMPALA-12209: Always include format-version in DESCRIBE FORMATTED and SHOW CREATE TABLE for Iceberg tables
HiveCatalog does not include format-version for Iceberg tables in the
table's parameters, therefore the output of SHOW CREATE TABLE may not
replicate the original table.
This patch makes sure to add it to both the SHOW CREATE TABLE and
DESCRIBE FORMATTED/EXTENDED output.

Additionally, adds ICEBERG_DEFAULT_FORMAT_VERSION variable to E2E
tests, derived from the IMPALA_ICEBERG_VERSION environment variable.

If Iceberg version is at least 1.4, default format-version is 2, before
1.4 it's 1. This way tests can work with multiple Iceberg versions.
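
A minimal sketch of that version rule, assuming the version string starts with dotted numeric major and minor components. The function name is hypothetical and the test framework's actual code may differ.

```python
def default_format_version(iceberg_version: str) -> int:
    """Return 2 for Iceberg 1.4 and later, otherwise 1."""
    major, minor = (int(part) for part in iceberg_version.split(".")[:2])
    return 2 if (major, minor) >= (1, 4) else 1
```

Comparing `(major, minor)` as a tuple avoids the string-comparison pitfall where "1.10" would sort before "1.4".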

Testing:
 * updated show-create-table.test and show-create-table-with-stats.test
   for Iceberg tables
 * added format-version checks to multiple DESCRIBE FORMATTED tests

Change-Id: I991edf408b24fa73e8a8abe64ac24929aeb8e2f8
Reviewed-on: http://gerrit.cloudera.org:8080/23514
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-24 21:48:17 +00:00
Csaba Ringhofer
f6ceca2b4d IMPALA-14571: increase planner cost of java functions
The main motivation is to evaluate expensive geospatial
functions (which are Java functions) last in predicates.
Java functions have a major overhead anyway from the JNI
call, so bumping all Java function costs seems beneficial.

Note that currently geospatial functions are the only
built-in Java functions.

Change-Id: I11d1652d76092ec60af18a33502dacc25b284fcc
Reviewed-on: http://gerrit.cloudera.org:8080/22733
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-24 16:52:59 +00:00
Csaba Ringhofer
f12bb87d42 IMPALA-14081: (addendum) add ';' to CREATE part in dataload
The missing ';' can cause problems for the next created
table.

Change-Id: I719872de23941bf81289340ce246d25ee113223a
Reviewed-on: http://gerrit.cloudera.org:8080/23704
Reviewed-by: Daniel Vanko <dvanko@cloudera.com>
Reviewed-by: Noemi Pap-Takacs <npaptakacs@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
2025-11-21 12:29:48 +00:00
Steve Carlin
54c0074b33 IMPALA-14405 ADDENDUM: Catch exception for bad column names
This commit is a fix on top of IMPALA-14405 for the Calcite
planner. The original commit matches column names from the
expression in the select clause.

For instance, if the query is "select 1 + 1", the label in
impala-shell will be "1 + 1". It accomplished this by
retrieving the string from the SqlNode object through the
MySql dialect.

However, when the expression doesn't succeed in the MySql
dialect, an AssertionError gets thrown, causing the query to
fail. We don't want the query to fail, we just want to go
back to using the Calcite expression, e.g. EXPR$0. This
occurred with this specific query:

"select timestamp_col + interval 3 nanoseconds"

So now the exception is caught and the default label name
is used. Eventually we should try to match what Impala has,
but this is a harder problem to fix.
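
The fallback pattern looks roughly like the sketch below. The actual code is Java in the Calcite planner; here `render_expr` is a hypothetical callable standing in for the dialect-based string conversion, and Python's AssertionError stands in for the Java one.

```python
def column_label(render_expr, position: int) -> str:
    """Use the rendered expression as the label, or fall back to the
    default Calcite-style name (e.g. EXPR$0) when rendering fails."""
    try:
        return render_expr()
    except AssertionError:
        return "EXPR$%d" % position


def render_ok():
    return "1 + 1"


def render_unsupported():
    # Models an expression the dialect cannot render.
    raise AssertionError("unsupported expression")
```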

Change-Id: I6c4d76a25fb2486eb1ef19485bce7888d45d282f
Reviewed-on: http://gerrit.cloudera.org:8080/23665
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Steve Carlin <scarlin@cloudera.com>
2025-11-18 21:34:29 +00:00
Arnab Karmakar
a2a11dec62 IMPALA-13263: Add single-argument overload for ST_ConvexHull()
Implemented a single-argument version of ST_ConvexHull() to align with
PostGIS behavior and simplify usage across geometry types.

Testing:
Added new tests in test_geospatial_functions.py for ST_ConvexHull(),
which previously had no test coverage, to verify correctness across
supported geometry types.

Change-Id: Idb17d98f5e75929ec0143aa16195a84dd6e50796
Reviewed-on: http://gerrit.cloudera.org:8080/23604
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2025-11-18 10:26:04 +00:00
Steve Carlin
52334ba426 IMPALA-14421: Calcite planner: case statement returning wrong types for char, varchar
The 'case' function resolver in the original Impala planner has a quirk in it
which caused issues in the Calcite planner.

The function resolver for the original planner resolves all case statements with
the "boolean" version.  Later on, in the analysis of the CaseExpr, the proper
types are assessed and the necessary casting is added.

The Calcite planner follows a similar path. The resolver always returns boolean
as well and the coerce nodes module determines the proper return type for
the case statement.

Two other related issues are also fixed here:

Literal strings should be treated as type STRING instead of CHAR(X), but a null
literal should not be changed from a CHAR(x) to a STRING.  This broke a
'case' test in the test framework where the columns were non-literals with type
char(x), and the return value was a "null" which should not have forced a cast
to string.

A cast from a varchar to a varchar should be ignored.

Testing:
Added a test to calcite.test.
Ensured the existing cast test in test_chars.py passed.
Ran through the Jenkins Calcite testing framework.

Change-Id: I82d657f4bfce432c458ee8198188dadf9f23f2ef
Reviewed-on: http://gerrit.cloudera.org:8080/23560
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-18 07:47:39 +00:00
Riza Suminto
f2243b76b5 IMPALA-14557: Fix flaky test_show_files_partition
TestIcebergTable.test_show_files_partition is unstable because files are
alphanumerically sorted and the order between a random UUID and
"delete-*" is not guaranteed.

This patch fixes the flakiness by specifying VERIFY_IS_SUBSET and using a
negative lookahead on the word "delete" to detect valid Iceberg data files.
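
To illustrate the negative lookahead technique, the sketch below matches Parquet file paths whose file name does not start with "delete-". The pattern and the sample paths are hypothetical and are not the regex used in the test file.

```python
import re

# ".*/" consumes the directory part; "(?!delete-)" then rejects file
# names that begin with "delete-" without consuming any characters.
DATA_FILE_RE = re.compile(r".*/(?!delete-)[^/]*\.parquet$")


def is_data_file(path: str) -> bool:
    return DATA_FILE_RE.match(path) is not None
```

Because the pattern does not depend on where a row appears in the output, it remains valid regardless of how UUID-named files sort relative to "delete-*" files.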

Testing:
- Looped test_show_files_partition 50 times and it passed every time.
  Before, it could fail in fewer than 10 loops.

Change-Id: I6243585a5b7ab7cf7c95d5a9530ce2f2825c550e
Reviewed-on: http://gerrit.cloudera.org:8080/23680
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-11-17 17:13:19 +00:00
Michael Smith
166b39547e IMPALA-14553: Run schema eval concurrently
The majority of time spent in generate-schema-statements.py is in
eval_section for schema operations that shell out, often uploading files
via the hadoop CLI or generating data files. These operations should be
independent.

Runs eval_section at the beginning so we don't repeat it for each row in
test_vectors, and executes them in parallel via a ThreadPool. Defaults
to NUM_CONCURRENT_TESTS threads because the underlying operations have
some concurrency to them (such as HDFS mirroring writes).
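
The approach can be sketched as below, assuming Python's `multiprocessing.pool.ThreadPool`. The function name and the `run` callback are hypothetical; the real script shells out in `eval_section`, which is replaced here by an arbitrary callable.

```python
from multiprocessing.pool import ThreadPool


def eval_sections(commands, num_threads, run):
    """Run each distinct command once, in parallel.

    Returns a dict mapping command -> output so that later loops
    (for example over test vectors) can reuse the cached result.
    """
    unique = list(dict.fromkeys(commands))  # de-duplicate, keep order
    with ThreadPool(processes=num_threads) as pool:
        outputs = pool.map(run, unique)
    return dict(zip(unique, outputs))


# Stand-in for a shell-out: upper-case the command text.
results = eval_sections(["load a", "load b", "load a"], 2, str.upper)
```

Threads are sufficient here because the work is dominated by waiting on external processes rather than by Python code.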

Also collects existing tables into a set to optimize lookup.

Reduces generate-schema-statements by ~60%, from 2m30s to 1m. Confirmed
that contents of logs/data_loading/sql/functional are identical.

Change-Id: I2a78d05fd6a0005c83561978713237da2dde6af2
Reviewed-on: http://gerrit.cloudera.org:8080/23627
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-11-17 16:34:22 +00:00
Steve Carlin
bc99705252 IMPALA-13902: Calcite planner: Implement is_spool_query_results
The is_spool_query_results query option is now supported in Calcite. The
returnAtMostOneRow method is now implemented to support this.
PlanRootSink is refactored to extract sanitizing query options (a new
method sanitizeSpoolingOptions()) out of
PlanRootSink.computeResourceProfile(). The bulk of memory bounding
calculation is also extracted out to a new class SpoolingMemoryBound.

Added "sleep" in ImpalaOperatorTable.java since some EE tests related to
result spooling call the sleep() function. Changed ImpalaPlanRel to extend
the RelNode interface.

A sanity test has been added to calcite.test, but the bulk of the
testing will be done through the Impala test framework when it is
enabled.

Testing:
- Pass FE tests PlannerTest#testResultSpooling, TpcdsCpuCostPlannerTest,
  and all java tests under calcite-planner project.
- Pass query_test/test_result_spooling.py and
  custom_cluster/test_result_spooling.py.

Co-authored-by: Riza Suminto

Change-Id: I5b9bf49e2874ee12de212b892bd898c296774c6f
Reviewed-on: http://gerrit.cloudera.org:8080/23562
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-16 02:33:02 +00:00
Riza Suminto
898e03e9d5 IMPALA-14552: (addendum) Fix bad testcase in show-create-table.test
The original IMPALA-14552 patch passed precommit tests before
IMPALA-12893: (part 2) (275f03f) merged. As a consequence, it did not
catch a missing comma in the updated show-create-table.test. This patch adds
that missing comma.

Testing:
Pass metadata/test_show_create_table.py

Change-Id: Ib06e690a81e6b0ca483b3647cc59c73802a0a7b7
Reviewed-on: http://gerrit.cloudera.org:8080/23673
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-15 21:34:44 +00:00
Mihaly Szjatinya
087b715a2b IMPALA-14108: Add support for SHOW FILES IN table PARTITION for Iceberg tables

This patch implements partition filtering support for the SHOW FILES
statement on Iceberg tables, based on the functionality added in
IMPALA-12243. Prior to this change, the syntax resulted in a
NullPointerException.

Key changes:
- Added ShowFilesStmt.analyzeIceberg() to validate and transform
  partition expressions using IcebergPartitionExpressionRewriter and
  IcebergPartitionPredicateConverter. After that, it collects matching
  file paths using IcebergUtil.planFiles().
- Added FeIcebergTable.Utils.getIcebergTableFilesFromPaths() to
  accept pre-filtered file lists from the analysis phase.
- Enhanced TShowFilesParams thrift struct with optional selected_files
  field to pass pre-filtered file paths from frontend to backend.

Testing:
- Analyzer tests for negative cases: non-existent partitions, invalid
  expressions, non-partition columns, unsupported transforms.
- Analyzer tests for positive cases: all transform types, complex
  expressions.
- Authorization tests for non-filtered and filtered syntaxes.
- E2E tests covering every partition transform type with various
  predicates.
- Schema evolution and rollback scenarios.

The implementation follows AlterTableDropPartition's pattern where the
analysis phase performs validation/metadata retrieval and the execution
phase handles result formatting and display.

Change-Id: Ibb9913e078e6842861bdbb004ed5d67286bd3152
Reviewed-on: http://gerrit.cloudera.org:8080/23455
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-14 21:43:10 +00:00
Zoltan Borok-Nagy
275f03f10d IMPALA-12893: (part 2): Upgrade Iceberg to version 1.5.2
This patch updates CDP_BUILD_NUMBER to 71942734 in order to
upgrade Iceberg to 1.5.2.

This patch updates some tests so they pass with Iceberg 1.5.2. The
behavior changes of Iceberg 1.5.2 are (compared to 1.3.1):
 * Iceberg V2 tables are created by default
 * Metadata tables have different schema
 * Parquet compression is explicitly set for new tables (even for ORC
   tables)
 * Sequence numbers are assigned a bit differently

Updated the tests where needed.

Code changes to accommodate the above behavior changes:
 * SHOW CREATE TABLE adds 'format-version'='1' for Iceberg V1 tables
 * CREATE TABLE statements don't throw errors when Parquet compression
   is set for ORC tables

Change-Id: Ic4f9ed3f7ee9f686044023be938d6b1d18c8842e
Reviewed-on: http://gerrit.cloudera.org:8080/23670
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-14 01:27:45 +00:00
Joe McDonnell
5f91838ada IMPALA-14545: Don't use absolute hdfs paths for JDBC table driver.url
After IMPALA-13661 merged, S3PlannerTest.testDataSourceTables
has been failing with an error trying to fetch the JDBC driver
for functional.jdbc_decimal_tbl. This particular table's
definition uses a path like 'hdfs://localhost:20500/test-warehouse/...'
which explicitly depends on HDFS rather than relying
on the default filesystem. Changing this to use a path like
'/test-warehouse/...' without the HDFS dependency fixes the
S3PlannerTest. This changes create-ext-data-source-table.sql
to a template using WAREHOUSE_LOCATION_PREFIX and replaces that
variable before executing it. This is important for Ozone, as
Ozone uses a WAREHOUSE_LOCATION_PREFIX set to the Ozone volume.
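
A minimal sketch of the template substitution, assuming a `${WAREHOUSE_LOCATION_PREFIX}` placeholder syntax. The placeholder form, the function name, and the sample path are hypothetical.

```python
PLACEHOLDER = "${WAREHOUSE_LOCATION_PREFIX}"  # assumed placeholder syntax


def render_sql_template(sql_template: str, warehouse_location_prefix: str) -> str:
    """Replace the prefix placeholder before the SQL is executed."""
    return sql_template.replace(PLACEHOLDER, warehouse_location_prefix)


template = "'driver.url'='${WAREHOUSE_LOCATION_PREFIX}/test-warehouse/example-driver.jar'"
# An empty prefix yields a path that relies on the default filesystem.
default_fs = render_sql_template(template, "")
prefixed = render_sql_template(template, "/hypothetical-volume")
```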

Testing:
 - Ran S3 and regular HDFS fe tests

Change-Id: I3f2c86fcc6c1dee75d7d9a9be04468cb197ae13c
Reviewed-on: http://gerrit.cloudera.org:8080/23658
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-12 22:17:44 +00:00
Arnab Karmakar
760eb4f2fa IMPALA-13066: Extend SHOW CREATE TABLE to include stats and partitions
Adds a new WITH STATS option to the SHOW CREATE TABLE statement to
emit additional SQL statements for recreating table statistics and
partitions.

When specified, Impala outputs:

- Base CREATE TABLE statement.

- ALTER TABLE ... SET TBLPROPERTIES for table-level stats.

- ALTER TABLE ... SET COLUMN STATS for all non-partition columns,
restoring column stats.

- For partitioned tables:

  - ALTER TABLE ... ADD PARTITION statements to recreate partitions.

  - Per-partition ALTER TABLE ... PARTITION (...) SET TBLPROPERTIES
  to restore partition-level stats.

Partition output is limited by the PARTITION_LIMIT query option
(default 1000). Setting PARTITION_LIMIT=0 includes all partitions and
emits a warning if the limit is exceeded.

Tests added to verify correctness of emitted statements. Default
behavior of SHOW CREATE TABLE remains unchanged for compatibility.

Change-Id: I87950ae9d9bb73cb2a435cf5bcad076df1570dc2
Reviewed-on: http://gerrit.cloudera.org:8080/23536
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-12 06:11:37 +00:00
Xuebin Su
6b6f7e614d IMPALA-14472: Add create/read support for ARRAY column of Kudu
The initial implementation of KUDU-1261 (array column type) was recently
merged in the upstream Apache Kudu repository. This patch adds initial Impala
support for working with Kudu tables having array type columns.

Unlike rows, the elements of a Kudu array are stored in a different
format than Impala. Instead of per-row bit flag for NULL info, values
and NULL bits are stored in separate arrays.
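
The layout difference can be pictured with the sketch below, which merges a values array and a separate validity array into per-element nullable values. Plain Python lists stand in for the real buffers, and the names are hypothetical.

```python
def to_nullable_elements(values, validity):
    """Combine parallel arrays into a list where invalid slots are None.

    values:   element values; entries at NULL positions are ignored.
    validity: one flag per element, True when the element is not NULL.
    """
    if len(values) != len(validity):
        raise ValueError("values and validity must have the same length")
    return [v if is_valid else None for v, is_valid in zip(values, validity)]
```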

The following types of queries are not supported in this patch:
- (IMPALA-14538) Queries that reference an array column as a table, e.g.
  ```sql
  SELECT item FROM kudu_array.array_int;
  ```
- (IMPALA-14539) Queries that create duplicate collection slots, e.g.
  ```sql
  SELECT array_int FROM kudu_array AS t, t.array_int AS unnested;
  ```

Testing:
- Add some FE tests in AnalyzeDDLTest and AnalyzeKuduDDLTest.
- Add EE test test_kudu.py::TestKuduArray.
  Since Impala does not support inserting complex types, including
  arrays, the data insertion part of the test is done by
  custom C++ code, kudu-array-inserter.cc, that inserts into Kudu via the
  Kudu C++ client. It would be great if we could migrate it to Python so
  that it can be moved to the same file as the test (IMPALA-14537).
- Pass core tests.

Co-authored-by: Riza Suminto

Change-Id: I9282aac821bd30668189f84b2ed8fff7047e7310
Reviewed-on: http://gerrit.cloudera.org:8080/23493
Reviewed-by: Alexey Serbin <alexey@apache.org>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-08 06:41:07 +00:00
Riza Suminto
671a7fcada IMPALA-14529: (addendum) Fix kudu_create.test
Kudu throws a different error message after IMPALA-14529. This patch
adjusts the error message in kudu_create.test to let the test pass.

Testing:
Pass TestDdlStatements.test_create_kudu and
TestKuduHMSIntegration.test_create_managed_kudu_tables.

Change-Id: Iff4cd08f46626d03b1f0800828e5872b83f522ca
Reviewed-on: http://gerrit.cloudera.org:8080/23648
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-11-06 22:42:34 +00:00
Steve Carlin
62bf609942 IMPALA-14414: Calcite planner: Added new code to handle nan/inf
The current code works for NaN and Inf, but it breaks when upgrading
to v1.40. This commit changes the code to handle these values when we
do the upgrade to 1.40 and adds a basic test to calcite.test to ensure
that when the upgrade happens, it does not break.

Change-Id: I8593a4942a2fe785a0c77134b78a9d97257225fc
Reviewed-on: http://gerrit.cloudera.org:8080/23561
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-05 12:55:39 +00:00
Riza Suminto
f34dea9b6f IMPALA-14522: Fix test_paimon_show_stats after DST ends
The test failed due to a mismatch on "Last Creation Time". This patch
fixes the assertion with a simple regex.

Testing:
Pass test_paimon.py.

Change-Id: I6855c0014111cef18318cdc4904782097a070ced
Reviewed-on: http://gerrit.cloudera.org:8080/23619
Reviewed-by: Mihaly Szjatinya <mszjat@pm.me>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-03 21:25:42 +00:00
jichen0919
541fb3f405 IMPALA-14092 Part1: Prohibit Unsupported Operation for paimon table
This patch prohibits unsupported operations against Paimon
tables. Checks for all unsupported operations are added in
the analysis stage in order to avoid misuse. Currently only
the CREATE and DROP statements are supported. Each
prohibition will be removed later, once the corresponding
operation is actually supported.

TODO:
    - Patches pending submission:
        - Support jni based query for paimon data table.
        - Support tpcds/tpch data-loading
          for paimon data table.
        - Virtual Column query support for querying
          paimon data table.
        - Query support with time travel.
        - Query support for paimon meta tables.

Testing:
    - Add unit test for AnalyzeDDLTest.java.
    - Add unit test for AnalyzerTest.java.
    - Add test_paimon_negative and test_paimon_query in test_paimon.py.

Change-Id: Ie39fa4836cb1be1b1a53aa62d5c02d7ec8fdc9d7
Reviewed-on: http://gerrit.cloudera.org:8080/23530
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-23 23:06:08 +00:00
Michael Smith
ea0ef5a799 IMPALA-14511: Fix pgrep to avoid warning
kill-all.sh tries to find a process named mini-impalad-cluster with
pgrep, which results in an (ignored) error:

    pgrep: pattern that searches for process name longer than 15
    characters will result in zero matches
    Try `pgrep -f' option to match against the complete command line.

This was accidentally changed from mini-impala-cluster in 2015. Neither
term is used anymore, so this process name will never exist. Remove it
to fix the error.

Change-Id: Id1340e85cbcd3b699b333316da618774cb4e9dcd
Reviewed-on: http://gerrit.cloudera.org:8080/23586
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-10-23 22:00:36 +00:00
pranav.lodha
7f77176970 IMPALA-13869: Support for 'hive.sql.query' property for Hive JDBC tables
This patch adds support for the hive.sql.query table property in Hive
JDBC tables accessed through Impala. Impala already supports Hive
JDBC tables using the hive.sql.table property, which limits users
to simple table access. However, many use cases demand the ability
to expose complex joins, filters, aggregations, or derived columns
as external views. The hive.sql.query property specifies a custom SQL
query that returns a virtual table (subquery) instead of pointing to a
physical table. These use cases cannot be achieved with just the
hive.sql.table property. This change allows Impala to:
 • Interact with views or complex queries defined on external
 systems without needing schema-level access to base tables.
 • Expose materialized logic (such as filters, joins, or
 transformations) via Hive to Impala consumers in a secure,
 abstracted way.
 • Better align with data virtualization use cases where
 physical data location and structure should be hidden from
 the querying engine.
This patch also lays the groundwork for future enhancements such
as predicate pushdown and performance optimizations for Hive
JDBC tables backed by queries.

Testing: End-to-end tests are included in
test_ext_data_sources.py.

Change-Id: I039fcc1e008233a3eeed8d09554195fdb8c8706b
Reviewed-on: http://gerrit.cloudera.org:8080/22865
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-23 21:34:29 +00:00
Michael Smith
a12e49e38d IMPALA-14509: Let Ozone set OZONE_OPTS
Remove our customization of OZONE_OPTS as it's redundant with
ozone-functions.sh. Our options also didn't work with Java 17.

Change-Id: If600dd160e6bc72320081ecee2cb0de3c73eb7bd
Reviewed-on: http://gerrit.cloudera.org:8080/23580
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-23 15:11:39 +00:00
Joe McDonnell
1913ab46ed IMPALA-14501: Migrate most scripts from impala-python to impala-python3
To remove the dependency on Python 2, existing scripts need to use
python3 rather than python. These commands find those
locations (for impala-python and regular python):
git grep impala-python | grep -v impala-python3 | grep -v impala-python-common | grep -v init-impala-python
git grep bin/python | grep -v python3

This removes or switches most of these locations by various means:
1. If a python file has a #!/bin/env impala-python (or python) but
   doesn't have a main function, it removes the hash-bang and makes
   sure that the file is not executable.
2. Most scripts can simply switch from impala-python to impala-python3
   (or python to python3) with minimal changes.
3. The cm-api pypi package (which doesn't support Python 3) has been
   replaced by the cm-client pypi package and interfaces have changed.
   Rather than migrating the code (which hasn't been used in years), this
   deletes the old code and stops installing cm-api into the virtualenv.
   The code can be restored and revamped if there is any interest in
   interacting with CM clusters.
4. This switches tests/comparison over to impala-python3, but this code has
   bit-rotted. Some pieces can be run manually, but it can't be fully
   verified with Python 3. It shouldn't hold back the migration on its own.
5. This also replaces locations of impala-python in comments / documentation /
   READMEs.
6. kazoo (used for interacting with HBase) needed to be upgraded to a
   version that supports Python 3. The newest version of kazoo requires
   upgrades of other component versions, so this uses kazoo 2.8.0 to avoid
   needing other upgrades.

The two remaining uses of impala-python are:
 - bin/cmake_aux/create_virtualenv.sh
 - bin/impala-env-versioned-python
These will be removed separately when we drop Python 2 support
completely. In particular, these are useful for testing impala-shell
with Python 2 until we stop supporting Python 2 for impala-shell.

The docker-based tests still use /usr/bin/python, but this can
be switched over independently (and doesn't impact impala-python).

Testing:
 - Ran core job
 - Ran build + dataload on Centos 7, Redhat 8
 - Manual testing of individual scripts (except some bitrotted areas like the
   random query generator)

Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc
Reviewed-on: http://gerrit.cloudera.org:8080/23468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-10-22 16:30:17 +00:00
Steve Carlin
c67b19daf6 IMPALA-14405: Labels for Calcite expressions not matching original planner
Calcite sets literal expressions to EXPR$<x> which did not match
expressions given by the Impala planner. For literal expressions
such as "select 1 + 1", Impala creates the column name as "1 + 1".

The field names can be found in the abstract syntax tree, so
they are now set within the CalciteRelNodeConverter before the
logical tree is created.

A small test was added to calcite.test for a basic sanity check,
but more comprehensive tests will be run in the tests/shell module
(e.g. in test_shell_commandline.py and test_shell_interactive) which
contain tests for labels.

Change-Id: Ibd3e6366a284f53807b4b2c42efafa279249c1ea
Reviewed-on: http://gerrit.cloudera.org:8080/23516
Reviewed-by: Steve Carlin <scarlin@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-22 03:37:48 +00:00
Steve Carlin
420e357b95 IMPALA-13695: Calcite planner: fix for ndv with 2 args
The NDV function was crashing when called with the "scale" arg. This
requires special processing which exists in FunctionCallExpr.

The validation for this is now done in ImpalaNdvFunction
and the special calculation is done within ImpalaAggRel.

This also fixes ndv for varchar types. The aggregation call
within CoerceNodes was not differentiating between varchar
and string. A cast to string function is needed in order
to run the ndv function on a varchar column.

Change-Id: I82419f77e043e9975865a042ffb8db75a26931f7
Reviewed-on: http://gerrit.cloudera.org:8080/23513
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-20 23:28:39 +00:00
Zoltan Borok-Nagy
bfae4d0b32 IMPALA-14496: Impala crashes when it writes multiple delete files per partition in a single DELETE operation
Impala crashes when it needs to write multiple delete files per
partition in a single DELETE operation. It is because
IcebergBufferedDeleteSink has its own DmlExecState object, but
sometimes the methods in TableSinkBase use the RuntimeState's
DmlExecState object. I.e. it can happen that we add a partition
to the IcebergBufferedDeleteSink's DmlExecState, but later we
expect to find it in the RuntimeState's DmlExecState.

This patch adds new methods to TableSinkBase that are specific
for writing delete files, and they always take a DmlExecState
object as a parameter. They are now used by IcebergBufferedDeleteSink.

Testing
 * added e2e tests

Change-Id: I46266007a6356e9ff3b63369dd855aff1396bb72
Reviewed-on: http://gerrit.cloudera.org:8080/23537
Reviewed-by: Mihaly Szjatinya <mszjat@pm.me>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-15 19:58:37 +00:00
Zoltan Borok-Nagy
7e34cabed7 IMPALA-14481: Use $JAVA instead of java in run-iceberg-rest-server.sh
Using the plain 'java' command in run-iceberg-rest-server.sh might
result in using a different Java version than what we used for
compilation.

$JAVA is set in bin/impala-config.sh to the desired Java version,
and we should use it in our scripts instead of just using 'java'.

Change-Id: I5f9c21de4c85d38dca7690fc110c4c44448840ed
Reviewed-on: http://gerrit.cloudera.org:8080/23539
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-14 21:35:47 +00:00
Steve Carlin
cde4bc016c IMPALA-14115: Calcite planner: Added top-n analytic PlanNode optimization.
Impala has an optimization for analytic expressions that have a rank filter on
top of the analytic expression. It can add a top-n plan node to reduce the number
of rows examined. This is tested in tpcds query 67.

The optimization logic relies on an unassigned rank conjunct within the analyzer
while creating the analytic plan node.

A slight reorganization of the code was needed to implement this optimization.
The SlotRefs for the AnalyticInfo needed to be created a little earlier from
where it was done in the previous commit.

A small fix was made to normalize binary predicates. A non-normalized binary
predicate prevents the optimization from being used.

A call to checkAndApplyLimitPushdown is needed for some of the optimizations
to kick in.

A new AllProjectInfo internal class was created to hold the relationships
between the Calcite RexNode objects and the Impala Analytic expressions.

Also, IMPALA-14158 is fixed by this commit. The nullsFirst value was
incorrect when the syntax was explicit in the query.

A new Calcite planner test was added in the junit tests to ensure the
optimization kicks in. The new test file is in the
PlannerTest/calcite/limit-pushdown-analytic-calcite.test file. This is a copy
of the limit-pushdown-analytic.test file in its parent directory but with some
modified results. Most of the differences are trivial, but IMPALA-14469 has been
filed to deal with one optimization that did not get fixed, which is when
the order by clause has a constant expression.

Change-Id: Ie6fa6781db56771b13b0cf49bd236f776016bf8d
Reviewed-on: http://gerrit.cloudera.org:8080/23317
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
2025-10-10 17:11:45 +00:00
pranav.lodha
a77fec6391 IMPALA-13661: Support parallelism above JDBC tables for joins/aggregates
Impala's planner generates a single-fragment, single-
threaded scan node for queries on JDBC tables because table
statistics are not properly available from the external
JDBC source. As a result, even large JDBC tables are
executed serially, causing suboptimal performance for joins,
aggregations, and scans over millions of rows.

This patch enables Impala to estimate the number of rows in a JDBC
table by issuing a COUNT(*) query at query preparation time. The
estimation is returned via TPrepareResult.setNum_rows_estimate()
and propagated into DataSourceScanNode. The scan node then uses
this cardinality to drive planner heuristics such as join order,
fragment parallelization, and scanner thread selection.

The design leverages the existing JDBC accessor layer:
- JdbcDataSource.prepare() constructs the configuration and invokes
  GenericJdbcDatabaseAccessor.getTotalNumberOfRecords().
- The accessor wraps the underlying query in:
      SELECT COUNT(*) FROM (<query>) tmptable
  ensuring correctness for both direct table scans and parameterized
  query strings.
- The result is captured as num_rows_estimate, which is then applied
  during computeStats() in DataSourceScanNode.
With accurate (or approximate) row counts, the planner can now:
- Assign multiple scanner threads to JDBC scan nodes instead of
   falling back to a single-thread plan.
- Introduce exchange nodes where beneficial, parallelizing data
   fetches across multiple JDBC connections.
- Produce better join orders by comparing JDBC row cardinalities
   against native Impala tables.
- Avoid severe underestimation that previously defaulted to wrong
   table statistics, leading to degenerate plans.

For a sample join query mentioned in the test file,
these are the improvements:

Before Optimization:
- Cardinality fixed at 1 for all JDBC scans
- Single fragment, single thread per query
- Max per-host resource reservation: ~9.7 MB, 1 thread
- No EXCHANGE or MERGING EXCHANGE operators
- No broadcast distribution; joins executed serially
- Example query runtime: ~77s

SCAN JDBC A
   \
    HASH JOIN
       \
        SCAN JDBC B
           \
            HASH JOIN
               \
                SCAN JDBC C
                   \
                    TOP-N -> ROOT

After Optimization:
- Cardinality derived from COUNT(*) (e.g. 150K, 1.5M rows)
- Multiple fragments per scan, 7 threads per query
- Max per-host resource reservation: ~123 MB, 7 threads
- Plans include EXCHANGE and MERGING EXCHANGE operators
- Broadcast joins on small sides, improving parallelism
- Example query runtime: ~38s (~2x faster)

SCAN JDBC A --> EXCHANGE(SND) --+
                                  \
                                   EXCHANGE(RCV) -> HASH JOIN(BCAST) --+
SCAN JDBC B --> EXCHANGE(SND) ----/                                   \
                                                                         HASH JOIN(BCAST) --+
SCAN JDBC C --> EXCHANGE(SND) ------------------------------------------/                 \
                                                                                             TOP-N
                                                                                               \
                                                                                                MERGING EXCHANGE -> ROOT

Also added a new backend configuration flag
--min_jdbc_scan_cardinality (default: 10) to provide a
lower bound for scan node cardinality estimates
during planning. This flag is propagated from BE
to FE via TBackendGflags and surfaced through
BackendConfig, ensuring the planner never produces
unrealistically low cardinality values.
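
The COUNT(*) wrapping and the lower bound described above can be
sketched roughly as follows. This is a minimal illustration in Python,
not the Java code in the patch: the function names build_count_query
and apply_min_cardinality are hypothetical, and stripping a trailing
semicolon before wrapping is an assumption made here for illustration.

```python
def build_count_query(query: str) -> str:
    """Wrap a table scan or query string in a COUNT(*) subquery."""
    # Assumption: a trailing semicolon is dropped so the text can be nested.
    inner = query.strip().rstrip(";").strip()
    return "SELECT COUNT(*) FROM ({}) tmptable".format(inner)


def apply_min_cardinality(num_rows_estimate: int,
                          min_jdbc_scan_cardinality: int = 10) -> int:
    """Clamp the row estimate to the configured lower bound."""
    return max(num_rows_estimate, min_jdbc_scan_cardinality)


if __name__ == "__main__":
    print(build_count_query("SELECT id FROM orders WHERE total > 10;"))
    print(apply_min_cardinality(1))
    print(apply_min_cardinality(150000))
```

With the default bound of 10, an estimate of 1 row is raised to 10,
while an estimate of 150000 rows is used unchanged.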

TODO: Add a query option for this optimization
to avoid extra JDBC round trip for smaller
queries (IMPALA-14417).

Testing: Planner test cases are in jdbc-parallel.test.
Some basic metrics are also included in this
commit message.

Change-Id: If47d29bdda5b17a1b369440f04d4e209d12133d9
Reviewed-on: http://gerrit.cloudera.org:8080/23112
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2025-10-04 15:42:38 +00:00
Steve Carlin
6aa4df4443 IMPALA-14105: Calcite planner: Runtime filters not being applied with outer joins
Before this commit, outer join conjuncts were not being placed into
the ValueTransfersGraph, which prevented them from being considered for
runtime filters. This caused a slowdown in some tpcds queries.

The conjuncts are now registered with the ImpalaJoinRel. The appropriate TableRef
objects are picked up from the underlying plan nodes.

Change-Id: I9e06d3f35a10f35ff8b57ba25dbab1bc6a35238a
Reviewed-on: http://gerrit.cloudera.org:8080/23318
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-28 23:40:32 +00:00
Steve Carlin
a6dbd4015c IMPALA-14106: Calcite planner: Register equivalent union expressions in value transfer graph
This commit registers the equivalent union expressions in the value
transfer graph when the physical union node is created for the Calcite
planner.

Change-Id: I4c858ae82a1cb7b89b0ae4e70205d8eeaeb28687
Reviewed-on: http://gerrit.cloudera.org:8080/23316
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-24 22:58:33 +00:00
Joe McDonnell
e05d92cb3d IMPALA-13548: Schedule scan ranges oldest to newest for tuple caching
Scheduling does not sort scan ranges by modification time. When a new
file is added to a table, its order in the list of scan ranges is
not based on modification time. Instead, it is based on which partition
it belongs to and what its filename is. A new file that is added early
in the list of scan ranges can cause cascading differences in scheduling.
For tuple caching, this means that multiple runtime cache keys could
change due to adding a single file.

To minimize that disruption, this adds the ability to sort the scan
ranges by modification time and schedule scan ranges oldest to newest.
This enables it for scan nodes that feed into tuple cache nodes
(similar to deterministic scan range assignment).
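
A minimal sketch of the idea, assuming a hypothetical ScanRange record
with path, offset, and mtime fields (the real scheduler structures
differ). Breaking ties by path and offset is an assumption made here to
keep the order deterministic.

```python
from typing import List, NamedTuple


class ScanRange(NamedTuple):
    """Hypothetical scan range record used only for this illustration."""
    path: str
    offset: int
    mtime: int  # file modification time, e.g. seconds since the epoch


def oldest_to_newest(ranges: List[ScanRange]) -> List[ScanRange]:
    # Sort by modification time first; path and offset break ties.
    return sorted(ranges, key=lambda r: (r.mtime, r.path, r.offset))


if __name__ == "__main__":
    existing = [
        ScanRange("part=2/b.parq", 0, 200),
        ScanRange("part=1/z.parq", 0, 100),
    ]
    # A new file that sorts early by partition and filename, but is newest.
    added = ScanRange("part=1/a.parq", 0, 300)

    before = oldest_to_newest(existing)
    after = oldest_to_newest(existing + [added])
    # The previously scheduled ranges keep their positions.
    print(after[:len(before)] == before)
```

Because the new file has the latest modification time, it lands at the
end of the order and the positions of the existing ranges do not shift.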

Testing:
 - Modified TestTupleCacheFullCluster::test_scan_range_distributed
   to have stricter checks about how many cache keys change after
   an insert (only one should change)
 - Modified TupleCacheTest#testDeterministicScheduling to verify that
   oldest to newest scheduling is also enabled.

Change-Id: Ia4108c7a00c6acf8bbfc036b2b76e7c02ae44d47
Reviewed-on: http://gerrit.cloudera.org:8080/23228
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-22 20:51:46 +00:00
Peter Rozsa
b0f1d49042 IMPALA-14016: Add multi-catalog support for local catalog mode
This patch adds a new MetaProvider called MultiMetaProvider, which is
capable of handling multiple MetaProviders at once, prioritizing one
primary provider over multiple secondary providers. The primary
provider handles some methods exclusively for deterministic behavior.
In database listings, if one database name occurs multiple times the
contained tables are merged under that database name; if the two
separate databases contain a table with the same name, query
analysis fails with an error.
This change also modifies the local catalog implementation's
initialization. If catalogd is deployed, then it instantiates the
CatalogdMetaProvider and checks if the catalog configuration directory
is set as a backend flag. If it's set, then it tries to load every
configuration from the folder, and tries to instantiate the
IcebergMetaProvider from those configs. If the instantiation fails, an
error is reported to the logs, but the startup is not interrupted.
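
The database-listing merge rule described above can be sketched as
follows. This is an illustration only: merge_listings is a hypothetical
helper, each listing is modeled as a dict from database name to a set
of table names, and the name clash is raised at merge time here for
brevity, whereas the patch reports it during query analysis.

```python
from typing import Dict, List, Set

Listing = Dict[str, Set[str]]


def merge_listings(primary: Listing, secondaries: List[Listing]) -> Listing:
    """Merge per-provider listings, primary first."""
    merged = {db: set(tables) for db, tables in primary.items()}
    for listing in secondaries:
        for db, tables in listing.items():
            existing = merged.setdefault(db, set())
            clash = existing & tables
            if clash:
                # Same table name under the same database in two providers.
                raise ValueError(
                    "ambiguous table(s) in {}: {}".format(db, sorted(clash)))
            existing |= tables
    return merged


if __name__ == "__main__":
    catalogd = {"sales": {"orders"}}
    iceberg = {"sales": {"returns"}, "lake": {"events"}}
    print(merge_listings(catalogd, [iceberg]))
```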

Tests:
 - E2E tests for multi-catalog behavior
 - Unit test for ConfigLoader

Change-Id: Ifbdd0f7085345e7954d9f6f264202699182dd1e1
Reviewed-on: http://gerrit.cloudera.org:8080/22878
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
2025-09-19 15:03:59 +00:00
Joe McDonnell
ca356a8df5 IMPALA-13437 (part 2): Implement cost-based tuple cache placement
This changes the default behavior of the tuple cache to consider
cost when placing the TupleCacheNodes. It tries to pick the best
locations within a budget. First, it eliminates unprofitable locations
via a threshold. Next, it ranks the remaining locations by their
profitability. Finally, it picks the best locations in rank order
until it reaches the budget.

The threshold is based on the ratio of processing cost for regular
execution versus the processing cost for reading from the cache.
If the ratio is below the threshold, the location is eliminated.
The threshold is specified by the tuple_cache_required_cost_reduction_factor
query option. This defaults to 3.0, which means that the cost of
reading from the cache must be less than 1/3 the cost of computing
the value normally. A higher value makes this more restrictive
about caching locations, which pushes in the direction of lower
overhead.

The ranking is based on the cost reduction per byte. This is given
by the formula:
 (regular processing cost - cost to read from cache) / estimated serialized size
This prefers locations with small results or high reduction in cost.

The budget is based on the estimated serialized size per node. This
limits the total caching that a query will do. A higher value allows more
caching, which can increase the overhead on the first run of a query. A lower
value is less aggressive and can limit the overhead at the expense of less
caching. This uses a per-node limit because the limit should scale with the
size of the executor group, as each executor brings extra capacity. The budget
is specified by tuple_cache_budget_bytes_per_executor.
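
The three steps above (threshold, ranking, budget) can be sketched as
follows. This is a simplified illustration, not the planner code: the
Location class and pick_locations function are hypothetical, and
stopping at the first location that does not fit the budget (rather
than skipping it and trying smaller ones) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Location:
    """A candidate tuple cache location (hypothetical structure)."""
    name: str
    regular_cost: float     # processing cost of regular execution
    cache_read_cost: float  # processing cost of reading from the cache
    serialized_bytes: int   # estimated serialized size of the result


def cost_ratio(loc: Location) -> float:
    # Ratio of regular processing cost to the cost of reading from cache.
    if loc.cache_read_cost <= 0:
        return float("inf")
    return loc.regular_cost / loc.cache_read_cost


def reduction_per_byte(loc: Location) -> float:
    # (regular cost - cost to read from cache) / estimated serialized size
    return (loc.regular_cost - loc.cache_read_cost) / max(loc.serialized_bytes, 1)


def pick_locations(locations: List[Location],
                   required_cost_reduction_factor: float,
                   budget_bytes: int) -> List[Location]:
    # 1. Eliminate locations whose ratio is below the threshold.
    profitable = [l for l in locations
                  if cost_ratio(l) >= required_cost_reduction_factor]
    # 2. Rank the remaining locations by cost reduction per byte.
    ranked = sorted(profitable, key=reduction_per_byte, reverse=True)
    # 3. Pick in rank order until the budget is reached.
    #    Assumption: stop at the first location that does not fit.
    picked, used = [], 0
    for loc in ranked:
        if used + loc.serialized_bytes > budget_bytes:
            break
        picked.append(loc)
        used += loc.serialized_bytes
    return picked


if __name__ == "__main__":
    candidates = [
        Location("A", 300, 50, 100),
        Location("B", 100, 50, 10),    # ratio 2.0, below the 3.0 threshold
        Location("C", 900, 100, 1000),
        Location("D", 400, 100, 50),
    ]
    chosen = pick_locations(candidates, 3.0, 200)
    print([l.name for l in chosen])  # ['D', 'A']
```

In this example B is dropped by the threshold, D and A are picked in
rank order, and C no longer fits in the remaining budget.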

The old behavior to place the tuple cache at all eligible locations is
still available via the tuple_cache_placement_policy query option. The
default is the cost_based policy described above, but the old behavior
is available via the all_eligible policy. This is useful for correctness
testing (and the existing tuple cache test cases).

This changes the explain plan output:
 - The hash trace is only enabled at VERBOSE level. This means that the regular
   profile will not contain the hash trace, as the regular profile uses EXTENDED.
 - This adds additional information at VERBOSE to display the cost information
   for each plan node. This can help trace why a particular location was
   not picked.

Testing:
 - This adds a TPC-DS planner test with tuple caching enabled (based on the
   existing TpcdsCpuCostPlannerTest)
 - This modifies existing tests to adapt to changes in the explain plan output

Change-Id: Ifc6e7b95621a7937d892511dc879bf7c8da07cdc
Reviewed-on: http://gerrit.cloudera.org:8080/23219
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-18 21:02:51 +00:00
Mihaly Szjatinya
591bf48c72 IMPALA-14013: DROP INCREMENTAL STATS throws NullPointerException for Iceberg tables

Similarly to 'COMPUTE INCREMENTAL STATS', 'DROP INCREMENTAL STATS'
should prohibit the partition variant for Iceberg tables.

Testing:
- FE: fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
- EE: tests/query_test/test_iceberg.py

Change-Id: If3d9ef45a9c9ddce9a5e43c5058ae84f919e0283
Reviewed-on: http://gerrit.cloudera.org:8080/23394
Reviewed-by: Noemi Pap-Takacs <npaptakacs@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-18 09:54:26 +00:00
Noemi Pap-Takacs
821c7347d1 IMPALA-13267: Display number of partitions for Iceberg tables
Before this change, query plans and profiles reported only a single
partition even for partitioned Iceberg tables, which was misleading
for users.
Now we can display the number of scanned partitions correctly for
both partitioned and unpartitioned Iceberg tables. This is achieved by
extracting the partition values from the file descriptors and storing
them in the IcebergContentFileStore. Instead of storing this information
redundantly in all file descriptors, we store them in one place and
reference the partition metadata in the FDs with an id.
This also gives the opportunity to optimize memory consumption in the
Catalog and Coordinator as well as reduce network traffic between them
in the future.
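
The "store once, reference by id" idea can be sketched like this. It
is a simplified Python illustration rather than the Java code in
IcebergContentFileStore: PartitionStore, intern, and the (path, id)
file descriptor tuples are hypothetical names used only here.

```python
from typing import Dict, List, Optional, Tuple

PartitionKey = Tuple[Tuple[str, Optional[str]], ...]


class PartitionStore:
    """Keeps each distinct set of partition values once and hands out ids."""

    def __init__(self) -> None:
        self._ids: Dict[PartitionKey, int] = {}
        self._values: List[PartitionKey] = []

    def intern(self, partition_values: Dict[str, Optional[str]]) -> int:
        # Field names are unique, so sorting the items gives a stable key.
        key = tuple(sorted(partition_values.items()))
        if key not in self._ids:
            self._ids[key] = len(self._values)
            self._values.append(key)
        return self._ids[key]

    def __len__(self) -> int:
        return len(self._values)


def num_scanned_partitions(file_descriptors: List[Tuple[str, int]]) -> int:
    # Each file descriptor is (path, partition id); count distinct ids.
    return len({part_id for _, part_id in file_descriptors})


if __name__ == "__main__":
    store = PartitionStore()
    fds = [
        ("f1.parq", store.intern({"day": "2025-01-01"})),
        ("f2.parq", store.intern({"day": "2025-01-01"})),
        ("f3.parq", store.intern({"day": "2025-01-02"})),
    ]
    print(num_scanned_partitions(fds), len(store))
```

Each file descriptor then carries only a small integer, and the number
of scanned partitions is the number of distinct ids among the scanned
files.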

Time travel is handled similarly to oldFileDescMap. In that case
we don't know the total number of partitions in the old snapshot,
so the output is [Num scanned partitions]/unknown.

Testing:
 - Planner tests
 - E2E tests
   - partition transforms
   - partition evolution
   - DROP PARTITION
   - time travel

Change-Id: Ifb2f654bc6c9bdf9cfafc27b38b5ca2f7b6b4872
Reviewed-on: http://gerrit.cloudera.org:8080/23113
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-12 20:36:10 +00:00
Daniel Vanko
492b2b7f46 IMPALA-14406: Fix test_column_case_sensitivity for newer Iceberg versions
The new test introduced in IMPALA-14290 depends on the Iceberg version,
because newer versions (e.g. 1.5.2) show

{"start_time_month":"646","end_time_day":"19916"}

instead of

{"start_time_day":null,"end_time_month":null,"start_time_month":"646","end_time_day":"19916"}

The test now accepts both cases.
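
One way to accept both outputs is to compare the partition JSON after
dropping null-valued fields. The sketch below only illustrates that
idea; it is an assumption and may differ from how the test in this
patch actually checks the two cases. The helper name same_partition is
hypothetical.

```python
import json


def same_partition(actual: str, expected: str) -> bool:
    """Compare two partition JSON strings, ignoring null-valued fields."""
    def without_nulls(text: str) -> dict:
        return {k: v for k, v in json.loads(text).items() if v is not None}
    return without_nulls(actual) == without_nulls(expected)


if __name__ == "__main__":
    newer = '{"start_time_month":"646","end_time_day":"19916"}'
    older = ('{"start_time_day":null,"end_time_month":null,'
             '"start_time_month":"646","end_time_day":"19916"}')
    print(same_partition(newer, older))
```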

Testing:
 * ran query_test/test_iceberg.py with both Iceberg 1.3.1 and 1.5.2

Change-Id: I17e368ac043d1fbf80a78dcac6ab1be5a297b6ea
Reviewed-on: http://gerrit.cloudera.org:8080/23389
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-11 18:28:46 +00:00
jichen0919
826c8cf9b0 IMPALA-14081: Support create/drop paimon table for impala
This patch mainly implements creating and dropping Paimon tables
through Impala.

Supported impala data types:
- BOOLEAN
- TINYINT
- SMALLINT
- INTEGER
- BIGINT
- FLOAT
- DOUBLE
- STRING
- DECIMAL(P,S)
- TIMESTAMP
- CHAR(N)
- VARCHAR(N)
- BINARY
- DATE

Syntax for creating paimon table:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
(
[col_name data_type ,...]
[PRIMARY KEY (col1,col2)]
)
[PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
STORED AS PAIMON
[LOCATION 'hdfs_path']
[TBLPROPERTIES (
'primary-key'='col1,col2',
'file.format' = 'orc/parquet',
'bucket' = '2',
'bucket-key' = 'col3'
)];

Two types of paimon catalogs are supported.

(1) Create table with hive catalog:

CREATE TABLE paimon_hive_cat(userid INT,movieId INT)
STORED AS PAIMON;

(2) Create table with hadoop catalog:

CREATE [EXTERNAL] TABLE paimon_hadoop_cat
STORED AS PAIMON
TBLPROPERTIES('paimon.catalog'='hadoop',
'paimon.catalog_location'='/path/to/paimon_hadoop_catalog',
'paimon.table_identifier'='paimondb.paimontable');

SHOW TABLE STATS/SHOW COLUMN STATS/SHOW PARTITIONS/SHOW FILES
statements are also supported.

TODO:
    - Patches pending submission:
        - Query support for paimon data files.
        - Partition pruning and predicate push down.
        - Query support with time travel.
        - Query support for paimon meta tables.
    - WIP:
        - Complex type query support.
        - Virtual Column query support for querying
          paimon data table.
        - Native paimon table scanner, instead of
          jni based.
Testing:
    - Add unit test for paimon impala type conversion.
    - Add unit test for ToSqlTest.java.
    - Add unit test for AnalyzeDDLTest.java.
    - Update default_file_format TestEnumCase in
      be/src/service/query-options-test.cc.
    - Update test case in
      testdata/workloads/functional-query/queries/QueryTest/set.test.
    - Add test cases in metadata/test_show_create_table.py.
    - Add custom test test_paimon.py.

Change-Id: I57e77f28151e4a91353ef77050f9f0cd7d9d05ef
Reviewed-on: http://gerrit.cloudera.org:8080/22914
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-09-10 21:24:49 +00:00