Commit Graph

12336 Commits

Author SHA1 Message Date
jasonmfehr
336034debd IMPALA-14480: Optional OpenTelemetry DCHECKs
The code in span-manager.cc contains aggressive DCHECKS that rely on
the query lifecycle to be deterministic. In reality, the query
lifecycle is not completely deterministic due to multiple threads
being involved in execution, result retrieval, query shutdown, etc.

On debug builds only, a new flag named otel_trace_exhaustive_dchecks
will be available with a default of 'false'. If set to 'true', then
optional DCHECKs will be enabled in the SpanManager class to enable
identification of edge cases where the query lifecycle proceeds in an
unexpected way.

The DCHECKs that are controlled by the new flag are those that rely
on a specific ordering of start/end child span and add child span
event calls.

Change-Id: Id6507f3f0e23ecf7c2bece9a6b6c2d86bfac1e57
Reviewed-on: http://gerrit.cloudera.org:8080/23518
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-26 04:48:46 +00:00
Noemi Pap-Takacs
fdad9d3204 IMPALA-13725: Add Iceberg table repair functionalities
In some cases users delete files directly from storage without
going through the Iceberg API, e.g. they remove old partitions.

This corrupts the table, and makes queries that try to read the
missing files fail.
This change introduces a repair statement that deletes the
dangling references of missing files from the metadata.
Note that the table cannot be repaired if there are missing
delete files because Iceberg's DeleteFiles API which is used
to execute the operation allows removing only data files.

Testing:
 - E2E
   - HDFS
   - S3, Ozone
 - analysis

Change-Id: I514403acaa3b8c0a7b2581d676b82474d846d38e
Reviewed-on: http://gerrit.cloudera.org:8080/23512
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-25 13:03:52 +00:00
jasonmfehr
2ac5a24dc0 IMPALA-14455: Cleanup OpenTelemetry Tracing Startup Flags
Fixes several issues with the OpenTelemetry tracing startup flags:

1. otel_trace_beeswax -- Removes this hidden flag which enabled
   tracing of queries submitted over Beeswax. Since this protocol is
   deprecated and no tests assert the traces generated by Beeswax
   queries, this flag was removed to eliminate an extra check when
   determining if OpenTelemetry tracing should be enabled.

2. otel_trace_tls_minimum_version -- Fixes parsing of this flag's
   value. This flag is in the format "tlsv1.2" or "tlsv1.3", but the
   OpenTelemetry C++ SDK expects the minimum TLS version to be in the
   format "1.2" or "1.3". The code now removes the "tlsv" prefix before
   passing the value to the OpenTelemetry C++ SDK.

3. otel_trace_tls_insecure_skip_verify -- Fixes the guidance to only
   set this flag to true in dev/testing.
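
The prefix handling described in item 2 can be sketched as follows. This is
a Python illustration only: the actual fix is in Impala's C++ code, and the
function name normalize_tls_min_version is hypothetical.

```python
def normalize_tls_min_version(flag_value):
    """Turn 'tlsv1.2' into '1.2' (hypothetical helper, not Impala code)."""
    prefix = "tlsv"
    # Only strip the prefix when it is actually present.
    if flag_value.lower().startswith(prefix):
        return flag_value[len(prefix):]
    return flag_value


print(normalize_tls_min_version("tlsv1.2"))
print(normalize_tls_min_version("tlsv1.3"))
```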

Adds ctest tests for the functions that configure the TraceProvider
singleton to ensure startup flags are correctly parsed and applied.

Modifies the http_exporter_config and init_otel_tracer function
signatures in otel.cc to return the actual object they create instead
of a Status since these functions only ever returned OK.

Updates the OpenTelemetry collector docker-compose file to support
the collector receiving traces over both HTTP and HTTPS. This setup
is used to manually smoke test the integration from Impala to an
OpenTelemetry collector.

Change-Id: Ie321fa37c0fd260f783dc6cf47924d53a06d82ea
Reviewed-on: http://gerrit.cloudera.org:8080/23440
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-11-24 23:46:57 +00:00
Daniel Vanko
3d22c7fe05 IMPALA-12209: Always include format-version in DESCRIBE FORMATTED and SHOW CREATE TABLE for Iceberg tables
HiveCatalog does not include format-version for Iceberg tables in the
table's parameters, therefore the output of SHOW CREATE TABLE may not
replicate the original table.
This patch makes sure to add it to both the SHOW CREATE TABLE and
DESCRIBE FORMATTED/EXTENDED output.

Additionally, adds an ICEBERG_DEFAULT_FORMAT_VERSION variable to E2E
tests, derived from the IMPALA_ICEBERG_VERSION environment variable.

If the Iceberg version is at least 1.4, the default format-version is 2;
before 1.4 it is 1. This way tests can work with multiple Iceberg versions.
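
The version rule above could look roughly like this. It is a sketch under
the assumption that the version is a dotted string whose first two
components are the major and minor numbers; default_format_version is a
hypothetical name, not the helper used in the tests.

```python
def default_format_version(iceberg_version):
    """Return 2 for Iceberg 1.4 and later, otherwise 1 (hypothetical helper)."""
    parts = iceberg_version.split(".")
    # Only major and minor are compared; any trailing components are ignored.
    major, minor = int(parts[0]), int(parts[1])
    return 2 if (major, minor) >= (1, 4) else 1


print(default_format_version("1.3.1"))
print(default_format_version("1.5.2"))
```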

Testing:
 * updated show-create-table.test and show-create-table-with-stats.test
   for Iceberg tables
 * added format-version checks to multiple DESCRIBE FORMATTED tests

Change-Id: I991edf408b24fa73e8a8abe64ac24929aeb8e2f8
Reviewed-on: http://gerrit.cloudera.org:8080/23514
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-24 21:48:17 +00:00
Csaba Ringhofer
f6ceca2b4d IMPALA-14571: increase planner cost of java functions
The main motivation is to evaluate expensive geospatial
functions (which are Java functions) last in predicates.
Java functions have a major overhead anyway from the JNI
call, so bumping all Java function costs seems beneficial.

Note that currently geospatial functions are the only
built-in Java functions.

Change-Id: I11d1652d76092ec60af18a33502dacc25b284fcc
Reviewed-on: http://gerrit.cloudera.org:8080/22733
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-24 16:52:59 +00:00
Csaba Ringhofer
f12bb87d42 IMPALA-14081: (addendum) add ';' to CREATE part in dataload
The missing ';' can cause problems for the next created
table.

Change-Id: I719872de23941bf81289340ce246d25ee113223a
Reviewed-on: http://gerrit.cloudera.org:8080/23704
Reviewed-by: Daniel Vanko <dvanko@cloudera.com>
Reviewed-by: Noemi Pap-Takacs <npaptakacs@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
2025-11-21 12:29:48 +00:00
Joe McDonnell
5eea4f6f79 IMPALA-14559: Ship calcite-planner jar in Impala packages
This adds the java/impala-package Maven project to make it easier
to ship / test the Calcite planner. impala-package has a dependency
on impala-frontend and calcite-planner, so no extra work is required
when constructing the classpath.

An additional cleanup is that this no longer puts the
impala-frontend-*-tests.jar on the classpath by default. This requires
updating the query event hooks test, as it relies on that jar being
present.

This does not change the default value for the use_calcite_planner
query option, so there is no change in behavior.

Testing:
 - Ran a core job
 - Built docker images and OS packages locally

Change-Id: I81dec2a5b59e279229a735c8bb1a23c77111a793
Reviewed-on: http://gerrit.cloudera.org:8080/23497
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-21 03:36:12 +00:00
Zoltan Borok-Nagy
5ea4dc342e IMPALA-14565: Update Apache component versions after CDP_BUILD_NUMBER bump to 71942734
CDP_BUILD_NUMBER was bumped to 71942734 which upgraded Iceberg to
version 1.5.2. We should update our Apache component dependencies
(not just Iceberg) accordingly.

Change-Id: Ic353bbef64a59365b708a20bd0d5ed502cb6d44e
Reviewed-on: http://gerrit.cloudera.org:8080/23678
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-21 01:40:05 +00:00
Steve Carlin
e67b627858 IMPALA-14408: (addendum) Log Calcite exception in profile
This addendum logs the exception thrown in the runtime profile
under the CalciteFailureReason key.

Testing: test_ranger.py uses this.

Change-Id: Ia18a52c488f9c73d51690997b277fd8e918c645f
Reviewed-on: http://gerrit.cloudera.org:8080/23686
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-20 21:08:48 +00:00
Steve Carlin
a6bb0c7c45 IMPALA-14408: Use regular path for Calcite planner instead of CalciteJniFrontend
When the --use_calcite_planner=true option is set at the server level,
the queries will no longer go through CalciteJniFrontend. Instead, they
will go through the regular JniFrontend, which is the path that is used
when the query option for "use_calcite_planner" is set.

The CalciteJniFrontend will be removed in a later commit.

This commit also enables fallback to the original planner when an unsupported
feature exception is thrown. This needed to be added to allow the tests to run
properly. During initial database load, there are queries that access complex
columns, which throw the unsupported feature exception.

Change-Id: I732516ca8f7ea64f73484efd67071910c9b62c8f
Reviewed-on: http://gerrit.cloudera.org:8080/23523
Reviewed-by: Steve Carlin <scarlin@cloudera.com>
Tested-by: Steve Carlin <scarlin@cloudera.com>
2025-11-20 21:08:48 +00:00
Riza Suminto
64c4abe6ed IMPALA-14547: Bumping Kudu version to pickup KUDU-3716
Redhat 9 environments recently switched to OpenSSL 3.5.1. On those
machines, the Kudu minicluster fails to start up with CSR signature
verification error. KUDU-3716 fixed this issue.

This patch updates the Toolchain and Kudu versions to pick up KUDU-3716.

Testing:
Passed data loading on Redhat 9.

Change-Id: I7262267939a9f08650af85443240950afbb3323f
Reviewed-on: http://gerrit.cloudera.org:8080/23697
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-20 15:16:57 +00:00
Joe McDonnell
3ce0004c12 IMPALA-14512: Remove dependency on sh python package
This modifies bin/single_node_perf_run.py to stop using the sh
python package. It replaces sh with calls to subprocess. It
stops installing sh for both the Python 2 and 3 virtualenvs.
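
As a rough illustration of the replacement, a command that would have been
run through sh can be run with the standard subprocess module instead. The
helper name run_and_capture and the command shown are placeholders, not
code from bin/single_node_perf_run.py.

```python
import subprocess
import sys


def run_and_capture(argv):
    """Run a command and return its stdout as text (hypothetical helper)."""
    # check_output raises CalledProcessError on a non-zero exit status.
    return subprocess.check_output(argv, universal_newlines=True)


# Placeholder command: ask the current interpreter to print a word.
out = run_and_capture([sys.executable, "-c", "print('ok')"])
print(out.strip())
```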

Testing:
 - Ran perf-AB-test job with it and examined the logs

Change-Id: Ic5f9316a5d83c5c0dc37d4a94c55b6a655765fe3
Reviewed-on: http://gerrit.cloudera.org:8080/23600
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-20 03:29:48 +00:00
Joe McDonnell
001263f58a IMPALA-14514: Handle serializing bytes in bin/run-workload.py
On python 3, when Impyla receives a result with a string that is
not valid UTF-8, it returns that as bytes. TPC-DS Q30 on scale 20
has a result that contains invalid UTF-8, so bin/run-workload.py
can fail while trying to dump this to JSON.

This modifies CustomJSONEncoder to handle serializing bytes by
converting it to a string with invalid unicode handled with
backslashes.
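
One way to get this behavior is Python's "backslashreplace" error handler.
The sketch below assumes that approach; the class name BytesTolerantEncoder
is hypothetical and the code is not copied from CustomJSONEncoder.

```python
import json


class BytesTolerantEncoder(json.JSONEncoder):
    """JSON encoder that accepts bytes values (hypothetical sketch)."""

    def default(self, obj):
        if isinstance(obj, bytes):
            # Invalid UTF-8 bytes become backslash escapes such as \xff.
            return obj.decode("utf-8", errors="backslashreplace")
        return super().default(obj)


# 0xff is never valid in UTF-8, so this value cannot be decoded strictly.
encoded = json.dumps({"col": b"caf\xff"}, cls=BytesTolerantEncoder)
print(encoded)
```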

Testing:
 - Ran bin/run-workload.py against TPC-DS scale 20

Change-Id: Ibe31c656de4fc65f8580c7b3b49bf655b8a5ecea
Reviewed-on: http://gerrit.cloudera.org:8080/23602
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-11-20 03:29:48 +00:00
Gabriella Gyorgyevics
c4c9adf592 IMPALA-14386: Add benchmarks for Byte Stream Split encoding
This patch adds benchmarks to the Byte Stream Split encoding. It
compares different ways to use the decoder.

I added benchmarks for the following comparisons:
  * Compile VS Runtime initialized decoder
  * Float VS Int VS Double VS Long VS 6 and 11 byte size types
  * Repeating VS Sequential VS Random ordered data
  * Decoding one by one VS in batch VS with stride (!= byte_size)
  * Small VS Medium (10x small) VS Large (100x small) stride

Conclusions:
  * Passing the byte size as a template parameter is almost 5 times
    as fast as passing it in the constructor.
  * The size of the type heavily influences the speed
  * The data variation doesn't influence the speed at all
  * Reading values in batch is much faster than one-by-one
  * The stride sizes have a small influence on the speed

For more details and graphs, go to
https://docs.google.com/spreadsheets/d/129LwvR6gpZInlRhlVWktn6Haugwo_fnloAAYfI0Qp2s

Change-Id: I708af625348b0643aa3f37525b8a6e74f0c47057
Reviewed-on: http://gerrit.cloudera.org:8080/23401
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-19 17:49:42 +00:00
m-sanjana19
134c28d445 IMPALA-13788: [DOCS] Docs for query options SYNC_HMS_EVENTS_WAIT_TIME_S
and SYNC_HMS_EVENTS_STRICT_MODE

The commit documents query options SYNC_HMS_EVENTS_WAIT_TIME_S
and SYNC_HMS_EVENTS_STRICT_MODE

Url: https://impala.apache.org/docs/build/html/topics/impala_set.html

Change-Id: Ia11663c5e84794d4bca658124cde59bf97aa7158
Reviewed-on: http://gerrit.cloudera.org:8080/23592
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
2025-11-19 07:42:54 +00:00
Steve Carlin
54c0074b33 IMPALA-14405 ADDENDUM: Catch exception for bad column names
This commit is a fix on top of IMPALA-14405 for the Calcite
planner. The original commit matches column names from the
expression in the select clause.

For instance, if the query is "select 1 + 1", the label in
impala-shell will be "1 + 1". It accomplished this by
retrieving the string from the SqlNode object through the
MySql dialect.

However, when the expression doesn't succeed in the MySql
dialect, an AssertionError gets thrown, causing the query to
fail. We don't want the query to fail, we just want to go
back to using the Calcite expression, e.g. EXPR$0. This
occurred with this specific query:

"select timestamp_col + interval 3 nanoseconds"

So now the exception is caught and the default label name
is used. Eventually we should try to match what Impala has,
but this is a harder problem to fix.

Change-Id: I6c4d76a25fb2486eb1ef19485bce7888d45d282f
Reviewed-on: http://gerrit.cloudera.org:8080/23665
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Steve Carlin <scarlin@cloudera.com>
2025-11-18 21:34:29 +00:00
Zoltan Borok-Nagy
454cb07e7c IMPALA-14556: Move Hive ACID stress tests to exhaustive tests
Currently Hive ACID stress tests run with "core" exploration strategy.
It was important to get instant feedback about this feature when this
was actively developed. Since then development activity around Hive ACID
decreased significantly, as focus shifted towards Iceberg.

This patch moves Hive ACID tests to exhaustive tests where they will
still be executed regularly, but won't slow down pre-commit tests.

Change-Id: Id7181fea62e2e3f8fcf7897a70e54a1708ef3f3e
Reviewed-on: http://gerrit.cloudera.org:8080/23677
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-18 16:20:16 +00:00
Arnab Karmakar
a2a11dec62 IMPALA-13263: Add single-argument overload for ST_ConvexHull()
Implemented a single-argument version of ST_ConvexHull() to align with
PostGIS behavior and simplify usage across geometry types.

Testing:
Added new tests in test_geospatial_functions.py for ST_ConvexHull(),
which previously had no test coverage, to verify correctness across
supported geometry types.

Change-Id: Idb17d98f5e75929ec0143aa16195a84dd6e50796
Reviewed-on: http://gerrit.cloudera.org:8080/23604
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2025-11-18 10:26:04 +00:00
Steve Carlin
52334ba426 IMPALA-14421: Calcite planner: case statement returning wrong types for char, varchar
The 'case' function resolver in the original Impala planner has a quirk in it
which caused issues in the Calcite planner.

The function resolver for the original planner resolves all case statements with
the "boolean" version.  Later on, in the analysis of the CaseExpr, the proper
types are assessed and the necessary casting is added.

The Calcite planner follows a similar path. The resolver always returns boolean
as well and the coerce nodes module determines the proper return type for
the case statement.

Two other related issues are also fixed here:

Literal strings should be treated as type STRING instead of CHAR(x), but a null
literal should not be changed from a CHAR(x) to a STRING.  This broke a
'case' test in the test framework where the columns were non-literals with type
char(x), and the return value was a "null" which should not have forced a cast
to string.

A cast from a varchar to a varchar should be ignored.

Testing:
Added a test to calcite.test.
Ensured the existing cast test in test_chars.py passed.
Ran through the Jenkins Calcite testing framework.

Change-Id: I82d657f4bfce432c458ee8198188dadf9f23f2ef
Reviewed-on: http://gerrit.cloudera.org:8080/23560
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-18 07:47:39 +00:00
Peter Rozsa
8eb1d87edc IMPALA-14272: Add extra flags option for coverage_helper.sh
This change adds an optional flag to coverage_helper.sh script that
accepts additional parameters for the wrapped gcovr call.

Tests:
 - manually validated that the script has the original behaviour if the
newly added flag is not set, also if it's set, the parameters are pushed
down correctly.

Change-Id: Iea26c9967b62b06ded6a0cb4c0346f0e789beb80
Reviewed-on: http://gerrit.cloudera.org:8080/23290
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Peter Rozsa <prozsa@cloudera.com>
2025-11-18 07:12:28 +00:00
Arnab Karmakar
068158e495 IMPALA-12401: Support more info types for HS2 GetInfo() API
This patch adds support for 40+ additional TGetInfoType values in the
HiveServer2 GetInfo() API, improving ODBC/JDBC driver compatibility.

Previously, only 3 info types were supported (CLI_SERVER_NAME,
CLI_DBMS_NAME, CLI_DBMS_VER).

The implementation follows the ODBC CLI specification and matches the
behavior of Hive's GetInfo implementation where applicable.

Testing:
- Added unit tests in test_hs2.py for new info types
- Tests verify correct return values and data types for each info type

Change-Id: I1ce5f2b9dcc2e4633b4679b002f57b5b4ea3e8bf
Reviewed-on: http://gerrit.cloudera.org:8080/23528
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2025-11-17 19:32:50 +00:00
Riza Suminto
f2243b76b5 IMPALA-14557: Fix flaky test_show_files_partition
TestIcebergTable.test_show_files_partition is unstable because files are
alphanumerically sorted and the order between a random UUID and
"delete-*" is not guaranteed.

This patch fixes the flakiness by specifying VERIFY_IS_SUBSET and using
a negative lookahead on the word "delete" to detect valid Iceberg data files.
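
A negative lookahead of this kind can be sketched as below. The pattern and
the file names are made up for illustration and are not the ones used in
the test.

```python
import re

# Hypothetical pattern: accept a file name unless it begins with "delete".
DATA_FILE_RE = re.compile(r"^(?!delete)\S+\.parq$")


def is_data_file(name):
    """Return True when the name does not look like a delete file."""
    return DATA_FILE_RE.match(name) is not None


print(is_data_file("3f2a9c-data.parq"))
print(is_data_file("delete-3f2a9c.parq"))
```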

Testing:
- Looped test_show_files_partition 50 times and it passed every time.
  Before, it could fail in fewer than 10 loops.

Change-Id: I6243585a5b7ab7cf7c95d5a9530ce2f2825c550e
Reviewed-on: http://gerrit.cloudera.org:8080/23680
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-11-17 17:13:19 +00:00
Michael Smith
166b39547e IMPALA-14553: Run schema eval concurrently
The majority of time spent in generate-schema-statements.py is in
eval_section for schema operations that shell out, often uploading files
via the hadoop CLI or generating data files. These operations should be
independent.

Runs eval_section at the beginning so we don't repeat it for each row in
test_vectors, and executes them in parallel via a ThreadPool. Defaults
to NUM_CONCURRENT_TESTS threads because the underlying operations have
some concurrency to them (such as HDFS mirroring writes).
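
The approach can be sketched with the standard library's ThreadPool. The
function eval_section_stub stands in for the real eval_section, and the
thread count is an arbitrary example value rather than NUM_CONCURRENT_TESTS.

```python
from multiprocessing.pool import ThreadPool


def eval_section_stub(section):
    """Stand-in for work that shells out (hypothetical)."""
    return section.upper()


def eval_all(sections, num_threads=4):
    """Evaluate every section once, in parallel, keeping input order."""
    pool = ThreadPool(processes=num_threads)
    try:
        # map() blocks until all results are ready and preserves order.
        return pool.map(eval_section_stub, sections)
    finally:
        pool.close()
        pool.join()


print(eval_all(["create", "load", "dependent_load"]))
```

Evaluating each section once up front and reusing the results is what
avoids repeating the work for every row in test_vectors.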

Also collects existing tables into a set to optimize lookup.

Reduces generate-schema-statements by ~60%, from 2m30s to 1m. Confirmed
that contents of logs/data_loading/sql/functional are identical.

Change-Id: I2a78d05fd6a0005c83561978713237da2dde6af2
Reviewed-on: http://gerrit.cloudera.org:8080/23627
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-11-17 16:34:22 +00:00
Steve Carlin
bc99705252 IMPALA-13902: Calcite planner: Implement is_spool_query_results
The is_spool_query_results query option is now supported in Calcite. The
returnAtMostOneRow method is now implemented to support this.
PlanRootSink is refactored to extract sanitizing query options (a new
method sanitizeSpoolingOptions()) out of
PlanRootSink.computeResourceProfile(). The bulk of memory bounding
calculation is also extracted out to a new class SpoolingMemoryBound.

Added "sleep" in ImpalaOperatorTable.java since some EE tests related to
result spooling call the sleep() function. Changed ImpalaPlanRel to extend
the RelNode interface.

A sanity test has been added to calcite.test, but the bulk of the
testing will be done through the Impala test framework when it is
enabled.

Testing:
- Pass FE tests PlannerTest#testResultSpooling, TpcdsCpuCostPlannerTest,
  and all java tests under calcite-planner project.
- Pass query_test/test_result_spooling.py and
  custom_cluster/test_result_spooling.py.

Co-authored-by: Riza Suminto

Change-Id: I5b9bf49e2874ee12de212b892bd898c296774c6f
Reviewed-on: http://gerrit.cloudera.org:8080/23562
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-16 02:33:02 +00:00
Riza Suminto
898e03e9d5 IMPALA-14552: (addendum) Fix bad testcase in show-create-table.test
The original IMPALA-14552 patch passed precommit tests before
IMPALA-12893: (part 2) (275f03f) was merged. As a consequence, it did not
catch a missing comma in the updated show-create-table.test. This patch adds
that missing comma.

Testing:
Pass metadata/test_show_create_table.py

Change-Id: Ib06e690a81e6b0ca483b3647cc59c73802a0a7b7
Reviewed-on: http://gerrit.cloudera.org:8080/23673
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-15 21:34:44 +00:00
Zoltan Borok-Nagy
6810368c10 IMPALA-14552: test_show_create_table should be more strict with TBLPROPERTIES contents
Currently we use this regex to parse the contents of TBLPROPERTIES:

  kv_regex = "'([^\']+)'\\s*=\\s*'([^\']+)'"
  kv_results = dict(re.findall(kv_regex, map_match.group(1)))

This allows strings like:
 'X'='Y'='Z'
 'X'='Z'$'A'='B'

This means it's easy to write strings in .test files that are not valid
SQL. This patch adds a few extra checks to validate the TBLPROPERTIES
contents.
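
The looseness of the current regex is easy to reproduce. The stricter check
at the end is a hypothetical sketch of what such validation could look
like; it is not the set of checks this patch adds.

```python
import re

kv_regex = "'([^\']+)'\\s*=\\s*'([^\']+)'"

# findall() silently skips whatever it cannot match, so both malformed
# inputs still produce key/value pairs.
loose_1 = re.findall(kv_regex, "'X'='Y'='Z'")
loose_2 = re.findall(kv_regex, "'X'='Z'$'A'='B'")

# Hypothetical stricter rule: the whole string must be a comma-separated
# list of 'key'='value' pairs and nothing else.
pair = r"'[^']+'\s*=\s*'[^']+'"
strict_regex = re.compile(r"\s*" + pair + r"(\s*,\s*" + pair + r")*\s*")


def is_valid_tblproperties(text):
    return strict_regex.fullmatch(text) is not None


print(loose_1, loose_2)
print(is_valid_tblproperties("'X'='Y'='Z'"))
print(is_valid_tblproperties("'a'='b', 'c'='d'"))
```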

Change-Id: I94110f50720c01dc7807ee56c794d235f4990282
Reviewed-on: http://gerrit.cloudera.org:8080/23671
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-11-14 23:58:47 +00:00
Mihaly Szjatinya
087b715a2b IMPALA-14108: Add support for SHOW FILES IN table PARTITION for Iceberg
tables

This patch implements partition filtering support for the SHOW FILES
statement on Iceberg tables, based on the functionality added in
IMPALA-12243. Prior to this change, the syntax resulted in a
NullPointerException.

Key changes:
- Added ShowFilesStmt.analyzeIceberg() to validate and transform
  partition expressions using IcebergPartitionExpressionRewriter and
  IcebergPartitionPredicateConverter. After that, it collects matching
  file paths using IcebergUtil.planFiles().
- Added FeIcebergTable.Utils.getIcebergTableFilesFromPaths() to
  accept pre-filtered file lists from the analysis phase.
- Enhanced TShowFilesParams thrift struct with optional selected_files
  field to pass pre-filtered file paths from frontend to backend.

Testing:
- Analyzer tests for negative cases: non-existent partitions, invalid
  expressions, non-partition columns, unsupported transforms.
- Analyzer tests for positive cases: all transform types, complex
  expressions.
- Authorization tests for non-filtered and filtered syntaxes.
- E2E tests covering every partition transform type with various
  predicates.
- Schema evolution and rollback scenarios.

The implementation follows AlterTableDropPartition's pattern where the
analysis phase performs validation/metadata retrieval and the execution
phase handles result formatting and display.

Change-Id: Ibb9913e078e6842861bdbb004ed5d67286bd3152
Reviewed-on: http://gerrit.cloudera.org:8080/23455
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-14 21:43:10 +00:00
Zoltan Borok-Nagy
275f03f10d IMPALA-12893: (part 2): Upgrade Iceberg to version 1.5.2
This patch updates CDP_BUILD_NUMBER to 71942734 in order to
upgrade Iceberg to 1.5.2.

This patch updates some tests so they pass with Iceberg 1.5.2. The
behavior changes of Iceberg 1.5.2 are (compared to 1.3.1):
 * Iceberg V2 tables are created by default
 * Metadata tables have different schema
 * Parquet compression is explicitly set for new tables (even for ORC
   tables)
 * Sequence numbers are assigned a bit differently

Updated the tests where needed.

Code changes to accommodate the above behavior changes:
 * SHOW CREATE TABLE adds 'format-version'='1' for Iceberg V1 tables
 * CREATE TABLE statements don't throw errors when Parquet compression
   is set for ORC tables

Change-Id: Ic4f9ed3f7ee9f686044023be938d6b1d18c8842e
Reviewed-on: http://gerrit.cloudera.org:8080/23670
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-14 01:27:45 +00:00
Xuebin Su
e4a508529c IMPALA-14544: Fix use-after-poison for Kudu arrays
This patch fixes the use-after-poison error caused by using the memory
in the MemPool after calling `MemPool::Clear()` when reading Kudu
arrays.

Testing:
- The ASAN build passed the core tests.

Change-Id: I9b729fc6003e64856ea0e197b1e3c74dad7247a1
Reviewed-on: http://gerrit.cloudera.org:8080/23668
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-12 22:38:32 +00:00
Joe McDonnell
5f91838ada IMPALA-14545: Don't use absolute hdfs paths for JDBC table driver.url
After IMPALA-13661 merged, S3PlannerTest.testDataSourceTables
has been failing with an error trying to fetch the JDBC driver
for functional.jdbc_decimal_tbl. This particular table's
definition uses a path like 'hdfs://localhost:20500/test-warehouse/...'
which explicitly depends on HDFS rather than relying
on the default filesystem. Changing this to use a path like
'/test-warehouse/...' without the HDFS dependency fixes the
S3PlannerTest. This changes create-ext-data-source-table.sql
to a template using WAREHOUSE_LOCATION_PREFIX and replaces that
variable before executing it. This is important for Ozone, as
Ozone uses a WAREHOUSE_LOCATION_PREFIX set to the Ozone volume.

Testing:
 - Ran S3 and regular HDFS fe tests

Change-Id: I3f2c86fcc6c1dee75d7d9a9be04468cb197ae13c
Reviewed-on: http://gerrit.cloudera.org:8080/23658
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-12 22:17:44 +00:00
Arnab Karmakar
760eb4f2fa IMPALA-13066: Extend SHOW CREATE TABLE to include stats and partitions
Adds a new WITH STATS option to the SHOW CREATE TABLE statement to
emit additional SQL statements for recreating table statistics and
partitions.

When specified, Impala outputs:

- Base CREATE TABLE statement.

- ALTER TABLE ... SET TBLPROPERTIES for table-level stats.

- ALTER TABLE ... SET COLUMN STATS for all non-partition columns,
restoring column stats.

- For partitioned tables:

  - ALTER TABLE ... ADD PARTITION statements to recreate partitions.

  - Per-partition ALTER TABLE ... PARTITION (...) SET TBLPROPERTIES
  to restore partition-level stats.

Partition output is limited by the PARTITION_LIMIT query option
(default 1000). Setting PARTITION_LIMIT=0 includes all partitions and
emits a warning if the limit is exceeded.

Tests added to verify correctness of emitted statements. Default
behavior of SHOW CREATE TABLE remains unchanged for compatibility.

Change-Id: I87950ae9d9bb73cb2a435cf5bcad076df1570dc2
Reviewed-on: http://gerrit.cloudera.org:8080/23536
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-12 06:11:37 +00:00
ttttttz
75c639c9cd IMPALA-14498: Fix a bug in initial code review checks
When conducting a code review using flake8-diff, it may fail in some code sections
due to the use of non-raw strings. This patch modifies one instance to successfully
pass the initial code review. Although it is currently working, it may not cover
all instances.
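
The commit does not say which check failed. One common case, assumed here,
is a regex written in a non-raw string, where a sequence such as \d is an
invalid escape that pycodestyle reports as W605. The pattern below is made
up for illustration.

```python
import re

# Non-raw form: each backslash must be doubled to remain a valid escape.
escaped = "\\d+\\s*rows"
# Raw form: backslashes are taken literally, so the pattern reads as written.
raw = r"\d+\s*rows"

same_pattern = escaped == raw
match = re.search(raw, "returned 42 rows")
print(same_pattern, match.group(0))
```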

Change-Id: I71889a117c64500bab13928971a2bce063a72cd4
Reviewed-on: http://gerrit.cloudera.org:8080/23656
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Quanlong Huang <huangquanlong@gmail.com>
2025-11-12 01:05:10 +00:00
Michael Smith
d09940b5dd IMPALA-13563: Cleanup logging
Cleans up calls to logDebug and a few other locations:
- exit early if producing debug message input is expensive
- use slf4j parameterized logging
- normalize on logDebug handling isDebugEnabled checks

Change-Id: I32e1c62511c292d36aa879c60ae3d91ed4f65697
Reviewed-on: http://gerrit.cloudera.org:8080/22090
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-11 05:29:58 +00:00
Xuebin Su
6b6f7e614d IMPALA-14472: Add create/read support for ARRAY column of Kudu
The initial implementation of KUDU-1261 (array column type) was recently
merged in the upstream Apache Kudu repository. This patch adds initial Impala
support for working with Kudu tables having array type columns.

Unlike rows, the elements of a Kudu array are stored in a different
format than Impala. Instead of per-row bit flag for NULL info, values
and NULL bits are stored in separate arrays.
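
The idea of separate value and NULL arrays can be illustrated as follows.
This is a conceptual sketch only: the real layout is defined by the Kudu
client, and the meaning of each flag here (truthy means non-NULL) is an
assumption.

```python
def decode_array(values, validity):
    """Merge a values array and a parallel validity array into one list.

    Assumption: validity[i] is truthy when element i is non-NULL.
    """
    if len(values) != len(validity):
        raise ValueError("values and validity must have the same length")
    # A NULL element keeps its slot in values but is reported as None.
    return [v if ok else None for v, ok in zip(values, validity)]


print(decode_array([10, 0, 30], [True, False, True]))
```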

The following types of queries are not supported in this patch:
- (IMPALA-14538) Queries that reference an array column as a table, e.g.
  ```sql
  SELECT item FROM kudu_array.array_int;
  ```
- (IMPALA-14539) Queries that create duplicate collection slots, e.g.
  ```sql
  SELECT array_int FROM kudu_array AS t, t.array_int AS unnested;
  ```

Testing:
- Add some FE tests in AnalyzeDDLTest and AnalyzeKuduDDLTest.
- Add EE test test_kudu.py::TestKuduArray.
  Since Impala does not support inserting complex types, including
  arrays, the data insertion part of the test is achieved through
  custom C++ code, kudu-array-inserter.cc, that inserts into Kudu via
  the Kudu C++ client. It would be great if we could migrate it to Python
  so that it can be moved to the same file as the test (IMPALA-14537).
- Pass core tests.

Co-authored-by: Riza Suminto

Change-Id: I9282aac821bd30668189f84b2ed8fff7047e7310
Reviewed-on: http://gerrit.cloudera.org:8080/23493
Reviewed-by: Alexey Serbin <alexey@apache.org>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-08 06:41:07 +00:00
Riza Suminto
671a7fcada IMPALA-14529: (addendum) Fix kudu_create.test
Kudu throws a different error message after IMPALA-14529. This patch
adjusts the error message in kudu_create.test to let the test pass.

Testing:
Pass TestDdlStatements.test_create_kudu and
TestKuduHMSIntegration.test_create_managed_kudu_tables.

Change-Id: Iff4cd08f46626d03b1f0800828e5872b83f522ca
Reviewed-on: http://gerrit.cloudera.org:8080/23648
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-11-06 22:42:34 +00:00
Yida Wu
f2f297a00f IMPALA-14533: Fix crash in ASAN/TSAN builds due to nullptr TcmallocMetric::BYTES_IN_USE
Impala uses SanitizerMallocMetric::BYTES_ALLOCATED instead of
TcmallocMetric::BYTES_IN_USE in ASAN or TSAN builds. However, the
admissiond logic in IMPALA-14493 still uses uninitialized
TcmallocMetric::BYTES_IN_USE under these builds, leading to a
nullptr crash.

To fix this issue, the admission controller now uses
SanitizerMallocMetric::BYTES_ALLOCATED for ASAN and TSAN builds. This
mirrors the logic in memory-metrics.cc, which uses a different metric
for those builds.

Tests:
Passed ASAN and TSAN builds testing.
Passed core tests.

Change-Id: Ic4fbdc134ea302f7302d177d073eb49136ba775c
Reviewed-on: http://gerrit.cloudera.org:8080/23646
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-11-06 21:56:53 +00:00
Michael Smith
8ed6d5c3ba IMPALA-14530: Use minimal debug info in Jenkins
Uses IMPALA_MINIMAL_DEBUG_INFO=true in Jenkins
build-all-flag-combinations.sh to reduce memory usage during linking and
avoid OOM kills. This script uses -skiptests to build all test binaries,
but doesn't run them, so debug info is not needed.

Change-Id: I4605b98d8d197e07c2eaac8218ff985c798875ed
Reviewed-on: http://gerrit.cloudera.org:8080/23641
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-06 16:09:56 +00:00
Michael Smith
2688e30ae5 IMPALA-14532: Fix SKIP_TOOLCHAIN_BOOTSTRAP
Fixes 'NATIVE_TOOLCHAIN_HOME: unbound variable' error when setting
'SKIP_TOOLCHAIN_BOOTSTRAP=true'.

Change-Id: I6562d49114590d89d2f43a4c23bba4a65e8abd74
Reviewed-on: http://gerrit.cloudera.org:8080/23640
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-11-05 22:42:22 +00:00
Michael Smith
0b9d6a7059 IMPALA-14531: Ignore new Hive config
Change-Id: Ic59caffc1f8b2a4e8693cb5e2770787f4817167e
Reviewed-on: http://gerrit.cloudera.org:8080/23639
Reviewed-by: Sai Hemanth Gantasala <saihemanth@cloudera.com>
Reviewed-by: Fang-Yu Rao <fangyu.rao@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-11-05 22:41:04 +00:00
Riza Suminto
0572dba245 IMPALA-14529: Bumping Kudu version to pickup latest KUDU-1261 patch
This commit bumps the Impala toolchain to pick up the latest Kudu
version, up to commit 60f5e5267b92c39485a66121d3ce3cc7ef57b0e0
(KUDU-1261 make ArrayCellMetadataView::Init() more robust).

Change-Id: I68009e5fefd053882f5504cd2520bacb189a1b04
Reviewed-on: http://gerrit.cloudera.org:8080/23631
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-11-05 16:41:51 +00:00
Steve Carlin
62bf609942 IMPALA-14414: Calcite planner: Added new code to handle nan/inf
The current code works for NaN and Inf, but it breaks when upgrading
to v1.40. This commit changes the code so that it handles these values
after the upgrade to 1.40, and adds a basic test to calcite.test to
ensure that the upgrade does not break them.

Change-Id: I8593a4942a2fe785a0c77134b78a9d97257225fc
Reviewed-on: http://gerrit.cloudera.org:8080/23561
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-05 12:55:39 +00:00
Riza Suminto
f34dea9b6f IMPALA-14522: Fix test_paimon_show_stats after DST ends
The test failed due to a mismatch on the "Last Creation Time" value.
This patch fixes the assertion by using a simple regex.
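
The idea of matching the timestamp by shape rather than by exact value can be
sketched in Python as follows. The row format, the helper name
`matches_creation_time`, and the pattern are hypothetical and chosen only for
illustration; the actual output and the regex used in test_paimon.py may
differ.

```python
import re

# Hypothetical row layout: a label followed by a timestamp value.
# Matching the timestamp's shape keeps the assertion independent of
# time-zone or DST offsets that shift the rendered value.
CREATION_TIME_RE = re.compile(
    r"Last Creation Time \| \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"
)


def matches_creation_time(row: str) -> bool:
    """Return True if the row has a timestamp-shaped creation time."""
    return CREATION_TIME_RE.fullmatch(row) is not None


# The same row rendered one hour apart both satisfy the check.
assert matches_creation_time("Last Creation Time | 2025-11-02 01:30:00")
assert matches_creation_time("Last Creation Time | 2025-11-02 02:30:00")
```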

Testing:
Pass test_paimon.py.

Change-Id: I6855c0014111cef18318cdc4904782097a070ced
Reviewed-on: http://gerrit.cloudera.org:8080/23619
Reviewed-by: Mihaly Szjatinya <mszjat@pm.me>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-03 21:25:42 +00:00
stiga-huang
d358f6e87e IMPALA-14520: Fix wrong column numbers in document impala_workload_mgmt.xml
The tables in the doc actually have 4 columns. This patch fixes the
wrong properties in the doc, which caused the tables to render
incorrectly in the PDF.

Tests:
 - Build PDF, plain-html and asf-site-html of the doc.

Change-Id: Ic05d8d963d3791ada6f5a4ac144796b710f9af70
Reviewed-on: http://gerrit.cloudera.org:8080/23615
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
2025-11-03 17:17:02 +00:00
Michael Smith
599b89306d IMPALA-13145: Upgrade mold to 2.40.4
Upgrades mold to the latest release.

Change-Id: If926b8065cccc4c9038c064c274b6ba97fdc2888
Reviewed-on: http://gerrit.cloudera.org:8080/23582
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-27 15:05:01 +00:00
Michael Smith
1152eef9bb IMPALA-14501: (Addendum) Fix single node perf run
Fixes the open call in generate_profile_files to read in binary mode
with Python 3, matching generate_profile_file.
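
For context, a minimal Python 3 sketch of the difference: opening a file with
mode "rb" returns raw bytes, whereas text mode decodes the content and can
fail on non-text data. The helper name `read_profile_bytes` and the sample
payload are hypothetical and are not taken from the perf scripts.

```python
import os
import tempfile


def read_profile_bytes(path: str) -> bytes:
    """Read a file as raw bytes.

    In Python 3, mode "rb" yields bytes without decoding, so content
    that is not valid text cannot raise UnicodeDecodeError here.
    """
    with open(path, "rb") as f:
        return f.read()


# Demonstrate with a temporary file holding bytes that are not valid UTF-8.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\xff\xfe profile data")
    tmp_path = tmp.name

data = read_profile_bytes(tmp_path)
os.remove(tmp_path)
print(type(data).__name__)
```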

Change-Id: Ibd815e7eb989d7a2bcf52cadfcde4f355c18a148
Reviewed-on: http://gerrit.cloudera.org:8080/23596
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-10-25 17:31:06 +00:00
Joe McDonnell
3398f20afe IMPALA-14491: Fix run-workload.py's handling of HS2's exec summary
Recently, we switched bin/run-workload.py to use HS2. It turns
out that the HS2 client code is not producing the same data
structure for the exec summary. report_benchmark_results.py
relies on that data structure and fails for HS2.

This changes the HS2 client code to use the same representation
as the beeswax client. There is already a function that does this
conversion (build_summary_table_from_thrift) for our regular
tests, so this reuses that function.

Testing:
 - Ran bin/run-workload.py twice to produce json files and
   processed them with report_benchmark_results.py. This
   failed before the change and passed afterward.

Change-Id: I0a041bdebe748b6b3a05b552584e0ca2327cff67
Reviewed-on: http://gerrit.cloudera.org:8080/23597
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-25 16:37:46 +00:00
jichen0919
541fb3f405 IMPALA-14092 Part1: Prohibit Unsupported Operation for paimon table
This patch prohibits unsupported operations against
paimon tables. Checks for all unsupported operations are
added in the analysis stage to avoid accidental
misuse. Currently only CREATE/DROP statements
are supported. Each prohibition will be removed later,
once the corresponding operation is actually supported.

TODO:
    - Patches pending submission:
        - Support jni based query for paimon data table.
        - Support tpcds/tpch data-loading
          for paimon data table.
        - Virtual Column query support for querying
          paimon data table.
        - Query support with time travel.
        - Query support for paimon meta tables.

Testing:
    - Add unit test for AnalyzeDDLTest.java.
    - Add unit test for AnalyzerTest.java.
    - Add test_paimon_negative and test_paimon_query in test_paimon.py.

Change-Id: Ie39fa4836cb1be1b1a53aa62d5c02d7ec8fdc9d7
Reviewed-on: http://gerrit.cloudera.org:8080/23530
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-23 23:06:08 +00:00
Michael Smith
ea0ef5a799 IMPALA-14511: Fix pgrep to avoid warning
kill-all.sh tries to find a process named mini-impalad-cluster with
pgrep, which results in an (ignored) error:

    pgrep: pattern that searches for process name longer than 15
    characters will result in zero matches
    Try `pgrep -f' option to match against the complete command line.

This was accidentally changed from mini-impala-cluster in 2015. Neither
term is used anymore, so this process name will never exist. Remove it
to fix the error.
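
The length limit in the warning can be checked with a short Python sketch.
The 15-character limit comes from the pgrep message quoted above; the helper
`exceeds_comm_limit` is hypothetical and treats the pattern as a literal
process name rather than a regular expression.

```python
# pgrep without -f matches against the process name, which the warning
# above says cannot be longer than 15 characters.
MAX_PROCESS_NAME_LEN = 15


def exceeds_comm_limit(name: str) -> bool:
    """Return True if a literal process name can never match via pgrep."""
    return len(name) > MAX_PROCESS_NAME_LEN


for candidate in ("mini-impalad-cluster", "impalad"):
    print(candidate, len(candidate), exceeds_comm_limit(candidate))
```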

Change-Id: Id1340e85cbcd3b699b333316da618774cb4e9dcd
Reviewed-on: http://gerrit.cloudera.org:8080/23586
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-10-23 22:00:36 +00:00
pranav.lodha
7f77176970 IMPALA-13869: Support for 'hive.sql.query' property for Hive JDBC tables
This patch adds support for the hive.sql.query table property in Hive
JDBC tables accessed through Impala. Impala has support for Hive
JDBC tables using the hive.sql.table property, which limits users
to simple table access. However, many use cases demand the ability
to expose complex joins, filters, aggregations, or derived columns
as external views, which cannot be achieved with the hive.sql.table
property alone. The hive.sql.query property specifies a custom SQL
query that returns a virtual table (subquery) instead of pointing to
a physical table. This change allows Impala to:
 • Interact with views or complex queries defined on external
 systems without needing schema-level access to base tables.
 • Expose materialized logic (such as filters, joins, or
 transformations) via Hive to Impala consumers in a secure,
 abstracted way.
 • Better align with data virtualization use cases where
 physical data location and structure should be hidden from
 the querying engine.
This patch also lays the groundwork for future enhancements such
as predicate pushdown and performance optimizations for Hive
JDBC tables backed by queries.

Testing: End-to-end tests are included in
test_ext_data_sources.py.

Change-Id: I039fcc1e008233a3eeed8d09554195fdb8c8706b
Reviewed-on: http://gerrit.cloudera.org:8080/22865
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-23 21:34:29 +00:00
Michael Smith
a12e49e38d IMPALA-14509: Let Ozone set OZONE_OPTS
Remove our customization of OZONE_OPTS as it's redundant with
ozone-functions.sh. Our options also didn't work with Java 17.

Change-Id: If600dd160e6bc72320081ecee2cb0de3c73eb7bd
Reviewed-on: http://gerrit.cloudera.org:8080/23580
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-23 15:11:39 +00:00