Commit Graph

12312 Commits

Author SHA1 Message Date
Riza Suminto
898e03e9d5 IMPALA-14552: (addendum) Fix bad testcase in show-create-table.test
The original IMPALA-14552 patch passed precommit tests before
IMPALA-12893: (part 2) (275f03f) merged. As a consequence, it did not
catch a missing comma in the updated show-create-table.test. This patch
adds that missing comma.

Testing:
Pass metadata/test_show_create_table.py

Change-Id: Ib06e690a81e6b0ca483b3647cc59c73802a0a7b7
Reviewed-on: http://gerrit.cloudera.org:8080/23673
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-15 21:34:44 +00:00
Zoltan Borok-Nagy
6810368c10 IMPALA-14552: test_show_create_table should be more strict with TBLPROPERTIES contents
Currently we use this regex to parse the contents of TBLPROPERTIES:

  kv_regex = "'([^\']+)'\\s*=\\s*'([^\']+)'"
  kv_results = dict(re.findall(kv_regex, map_match.group(1)))

This allows strings like:
 'X'='Y'='Z'
 'X'='Z'$'A'='B'

This means it's easy to write strings in .test files that are not valid
SQL. This patch adds a few extra checks to validate the TBLPROPERTIES
contents.
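
A minimal sketch of the problem, assuming a hypothetical stricter check (`parse_strict` and `STRICT_REGEX` are illustrative names, not the patch's actual code). Because `re.findall` silently skips any text that does not match, malformed input still yields a dict:

```python
import re

# Same pattern as the kv_regex quoted above, written as a raw string.
KV_REGEX = r"'([^']+)'\s*=\s*'([^']+)'"

# Hypothetical strict form: the whole string must consist of
# comma-separated 'key'='value' pairs and nothing else.
PAIR = r"'[^']+'\s*=\s*'[^']+'"
STRICT_REGEX = r"\s*" + PAIR + r"(?:\s*,\s*" + PAIR + r")*\s*"


def parse_loose(props):
    """Mimics the old behavior: findall ignores unmatched text."""
    return dict(re.findall(KV_REGEX, props))


def parse_strict(props):
    """Rejects input that is not entirely made of 'k'='v' pairs."""
    if re.fullmatch(STRICT_REGEX, props) is None:
        raise ValueError("invalid TBLPROPERTIES contents: %s" % props)
    return dict(re.findall(KV_REGEX, props))
```

With this sketch, `parse_loose("'X'='Y'='Z'")` returns `{'X': 'Y'}` without complaint, while `parse_strict` raises `ValueError` for the same input.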

Change-Id: I94110f50720c01dc7807ee56c794d235f4990282
Reviewed-on: http://gerrit.cloudera.org:8080/23671
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-11-14 23:58:47 +00:00
Mihaly Szjatinya
087b715a2b IMPALA-14108: Add support for SHOW FILES IN table PARTITION for Iceberg tables

This patch implements partition filtering support for the SHOW FILES
statement on Iceberg tables, based on the functionality added in
IMPALA-12243. Prior to this change, the syntax resulted in a
NullPointerException.

Key changes:
- Added ShowFilesStmt.analyzeIceberg() to validate and transform
  partition expressions using IcebergPartitionExpressionRewriter and
  IcebergPartitionPredicateConverter. After that, it collects matching
  file paths using IcebergUtil.planFiles().
- Added FeIcebergTable.Utils.getIcebergTableFilesFromPaths() to
  accept pre-filtered file lists from the analysis phase.
- Enhanced TShowFilesParams thrift struct with optional selected_files
  field to pass pre-filtered file paths from frontend to backend.

Testing:
- Analyzer tests for negative cases: non-existent partitions, invalid
  expressions, non-partition columns, unsupported transforms.
- Analyzer tests for positive cases: all transform types, complex
  expressions.
- Authorization tests for non-filtered and filtered syntaxes.
- E2E tests covering every partition transform type with various
  predicates.
- Schema evolution and rollback scenarios.

The implementation follows AlterTableDropPartition's pattern where the
analysis phase performs validation/metadata retrieval and the execution
phase handles result formatting and display.

Change-Id: Ibb9913e078e6842861bdbb004ed5d67286bd3152
Reviewed-on: http://gerrit.cloudera.org:8080/23455
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-14 21:43:10 +00:00
Zoltan Borok-Nagy
275f03f10d IMPALA-12893: (part 2): Upgrade Iceberg to version 1.5.2
This patch updates CDP_BUILD_NUMBER to 71942734 in order to
upgrade Iceberg to 1.5.2.

This patch updates some tests so they pass with Iceberg 1.5.2. The
behavior changes of Iceberg 1.5.2 are (compared to 1.3.1):
 * Iceberg V2 tables are created by default
 * Metadata tables have different schema
 * Parquet compression is explicitly set for new tables (even for ORC
   tables)
 * Sequence numbers are assigned a bit differently

Updated the tests where needed.

Code changes to accommodate the above behavior changes:
 * SHOW CREATE TABLE adds 'format-version'='1' for Iceberg V1 tables
 * CREATE TABLE statements don't throw errors when Parquet compression
   is set for ORC tables

Change-Id: Ic4f9ed3f7ee9f686044023be938d6b1d18c8842e
Reviewed-on: http://gerrit.cloudera.org:8080/23670
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-14 01:27:45 +00:00
Xuebin Su
e4a508529c IMPALA-14544: Fix use-after-poison for Kudu arrays
This patch fixes the use-after-poison error caused by using the memory
in the MemPool after calling `MemPool::Clear()` when reading Kudu
arrays.

Testing:
- The ASAN build passed the core tests.

Change-Id: I9b729fc6003e64856ea0e197b1e3c74dad7247a1
Reviewed-on: http://gerrit.cloudera.org:8080/23668
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-12 22:38:32 +00:00
Joe McDonnell
5f91838ada IMPALA-14545: Don't use absolute hdfs paths for JDBC table driver.url
After IMPALA-13661 merged, S3PlannerTest.testDataSourceTables
has been failing with an error trying to fetch the JDBC driver
for functional.jdbc_decimal_tbl. This particular table's
definition uses a path like 'hdfs://localhost:20500/test-warehouse/...'
which explicitly depends on HDFS rather than relying
on the default filesystem. Changing this to use a path like
'/test-warehouse/...' without the HDFS dependency fixes the
S3PlannerTest. This changes create-ext-data-source-table.sql
to a template using WAREHOUSE_LOCATION_PREFIX and replaces that
variable before executing it. This is important for Ozone, as
Ozone uses a WAREHOUSE_LOCATION_PREFIX set to the Ozone volume.
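
A minimal sketch of that templating step. The `${WAREHOUSE_LOCATION_PREFIX}` placeholder syntax, the `render_sql_template` helper, the table name, and the jar path are all assumptions for illustration; the real template contents are not shown in this message.

```python
from string import Template

# Hypothetical template line; the real SQL file is not reproduced here.
SQL_TEMPLATE = (
    "ALTER TABLE hypothetical_tbl SET TBLPROPERTIES ("
    "'driver.url'="
    "'${WAREHOUSE_LOCATION_PREFIX}/test-warehouse/hypothetical-driver.jar')"
)


def render_sql_template(template_text, warehouse_location_prefix):
    """Substitute the prefix so paths resolve on the default filesystem."""
    return Template(template_text).substitute(
        WAREHOUSE_LOCATION_PREFIX=warehouse_location_prefix)
```

With an empty prefix the path stays relative to the default filesystem, and a non-empty prefix (for example an Ozone volume) is simply prepended.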

Testing:
 - Ran S3 and regular HDFS fe tests

Change-Id: I3f2c86fcc6c1dee75d7d9a9be04468cb197ae13c
Reviewed-on: http://gerrit.cloudera.org:8080/23658
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-12 22:17:44 +00:00
Arnab Karmakar
760eb4f2fa IMPALA-13066: Extend SHOW CREATE TABLE to include stats and partitions
Adds a new WITH STATS option to the SHOW CREATE TABLE statement to
emit additional SQL statements for recreating table statistics and
partitions.

When specified, Impala outputs:

- Base CREATE TABLE statement.

- ALTER TABLE ... SET TBLPROPERTIES for table-level stats.

- ALTER TABLE ... SET COLUMN STATS for all non-partition columns,
restoring column stats.

- For partitioned tables:

  - ALTER TABLE ... ADD PARTITION statements to recreate partitions.

  - Per-partition ALTER TABLE ... PARTITION (...) SET TBLPROPERTIES
  to restore partition-level stats.

Partition output is limited by the PARTITION_LIMIT query option
(default 1000). Setting PARTITION_LIMIT=0 includes all partitions and
emits a warning if the limit is exceeded.

Tests added to verify correctness of emitted statements. Default
behavior of SHOW CREATE TABLE remains unchanged for compatibility.

Change-Id: I87950ae9d9bb73cb2a435cf5bcad076df1570dc2
Reviewed-on: http://gerrit.cloudera.org:8080/23536
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-12 06:11:37 +00:00
ttttttz
75c639c9cd IMPALA-14498: Fix a bug in initial code review checks
The initial code review check, which runs flake8-diff, may fail on code sections
that use non-raw strings. This patch converts one such instance so that the check
passes. The check works for now, but the fix may not cover all instances.
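
For illustration, a minimal sketch of the kind of change involved. The pattern and the `split_assignment` helper are hypothetical, not taken from the patch, and W605 is assumed to be the relevant check. A regex escape such as `\s` inside a non-raw string literal is an invalid string escape sequence, which flake8 reports as W605; the raw-string spelling avoids the warning.

```python
import re

# Writing this pattern as a plain (non-raw) string literal would be
# flagged by flake8 as W605 (invalid escape sequence). The raw string
# below is the accepted spelling of the same regex.
ASSIGNMENT_RE = re.compile(r"\s*=\s*")


def split_assignment(text):
    """Split 'key = value' into (key, value) around the first '='."""
    key, value = ASSIGNMENT_RE.split(text, maxsplit=1)
    return key, value
```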

Change-Id: I71889a117c64500bab13928971a2bce063a72cd4
Reviewed-on: http://gerrit.cloudera.org:8080/23656
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Quanlong Huang <huangquanlong@gmail.com>
2025-11-12 01:05:10 +00:00
Michael Smith
d09940b5dd IMPALA-13563: Cleanup logging
Cleans up calls to logDebug and a few other locations:
- exit early if producing debug message input is expensive
- use slf4j parameterized logging
- normalize on logDebug handling isDebugEnabled checks

Change-Id: I32e1c62511c292d36aa879c60ae3d91ed4f65697
Reviewed-on: http://gerrit.cloudera.org:8080/22090
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-11 05:29:58 +00:00
Xuebin Su
6b6f7e614d IMPALA-14472: Add create/read support for ARRAY column of Kudu
The initial implementation of KUDU-1261 (array column type) was recently
merged in the upstream Apache Kudu repository. This patch adds initial
Impala support for working with Kudu tables having array type columns.

Unlike rows, the elements of a Kudu array are stored in a different
format than Impala. Instead of per-row bit flag for NULL info, values
and NULL bits are stored in separate arrays.
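
A minimal sketch of reading such a layout. The encoding here is an assumption for illustration only: validity is a packed bitmap with one bit per element, least significant bit first, and a set bit means non-NULL. The actual Kudu format is not described in this message, and `decode_array` is a hypothetical name.

```python
def decode_array(values, validity_bitmap, num_elements):
    """Combine a values array and a separate validity bitmap into a list.

    Assumed layout (hypothetical): bit i of the bitmap, LSB first,
    is 1 when element i is non-NULL.
    """
    result = []
    for i in range(num_elements):
        byte = validity_bitmap[i // 8]
        is_valid = (byte >> (i % 8)) & 1
        result.append(values[i] if is_valid else None)
    return result
```

For example, values `[10, 0, 30]` with bitmap byte `0b101` decode to a three-element list whose middle entry is NULL.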

The following types of queries are not supported in this patch:
- (IMPALA-14538) Queries that reference an array column as a table, e.g.
  ```sql
  SELECT item FROM kudu_array.array_int;
  ```
- (IMPALA-14539) Queries that create duplicate collection slots, e.g.
  ```sql
  SELECT array_int FROM kudu_array AS t, t.array_int AS unnested;
  ```

Testing:
- Add some FE tests in AnalyzeDDLTest and AnalyzeKuduDDLTest.
- Add EE test test_kudu.py::TestKuduArray.
  Since Impala does not support inserting complex types, including
  arrays, the data insertion part of the test is done by custom C++
  code, kudu-array-inserter.cc, which inserts into Kudu via the Kudu C++
  client. It would be great if we could migrate it to Python so
  that it can be moved to the same file as the test (IMPALA-14537).
- Pass core tests.

Co-authored-by: Riza Suminto

Change-Id: I9282aac821bd30668189f84b2ed8fff7047e7310
Reviewed-on: http://gerrit.cloudera.org:8080/23493
Reviewed-by: Alexey Serbin <alexey@apache.org>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-08 06:41:07 +00:00
Riza Suminto
671a7fcada IMPALA-14529: (addendum) Fix kudu_create.test
Kudu throws a different error message after IMPALA-14529. This patch
adjusts the error message in kudu_create.test to let the test pass.

Testing:
Pass TestDdlStatements.test_create_kudu and
TestKuduHMSIntegration.test_create_managed_kudu_tables.

Change-Id: Iff4cd08f46626d03b1f0800828e5872b83f522ca
Reviewed-on: http://gerrit.cloudera.org:8080/23648
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-11-06 22:42:34 +00:00
Yida Wu
f2f297a00f IMPALA-14533: Fix crash in ASAN/TSAN builds due to nullptr TcmallocMetric::BYTES_IN_USE
Impala uses SanitizerMallocMetric::BYTES_ALLOCATED instead of
TcmallocMetric::BYTES_IN_USE in ASAN or TSAN builds. However, the
admissiond logic in IMPALA-14493 still uses uninitialized
TcmallocMetric::BYTES_IN_USE under these builds, leading to a
nullptr crash.

To fix this issue, the admission controller now uses
SanitizerMallocMetric::BYTES_ALLOCATED for ASAN and TSAN builds,
following the same logic that memory-metrics.cc uses to select a
different metric for those builds.

Tests:
Passed ASAN and TSAN builds testing.
Passed core tests.

Change-Id: Ic4fbdc134ea302f7302d177d073eb49136ba775c
Reviewed-on: http://gerrit.cloudera.org:8080/23646
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-11-06 21:56:53 +00:00
Michael Smith
8ed6d5c3ba IMPALA-14530: Use minimal debug info in Jenkins
Uses IMPALA_MINIMAL_DEBUG_INFO=true in Jenkins
build-all-flag-combinations.sh to reduce memory usage during linking and
avoid OOM kills. This script uses -skiptests to build all test binaries,
but doesn't run them, so debug info is not needed.

Change-Id: I4605b98d8d197e07c2eaac8218ff985c798875ed
Reviewed-on: http://gerrit.cloudera.org:8080/23641
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-06 16:09:56 +00:00
Michael Smith
2688e30ae5 IMPALA-14532: Fix SKIP_TOOLCHAIN_BOOTSTRAP
Fixes 'NATIVE_TOOLCHAIN_HOME: unbound variable' error when setting
'SKIP_TOOLCHAIN_BOOTSTRAP=true'.

Change-Id: I6562d49114590d89d2f43a4c23bba4a65e8abd74
Reviewed-on: http://gerrit.cloudera.org:8080/23640
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-11-05 22:42:22 +00:00
Michael Smith
0b9d6a7059 IMPALA-14531: Ignore new Hive config
Change-Id: Ic59caffc1f8b2a4e8693cb5e2770787f4817167e
Reviewed-on: http://gerrit.cloudera.org:8080/23639
Reviewed-by: Sai Hemanth Gantasala <saihemanth@cloudera.com>
Reviewed-by: Fang-Yu Rao <fangyu.rao@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-11-05 22:41:04 +00:00
Riza Suminto
0572dba245 IMPALA-14529: Bumping Kudu version to pickup latest KUDU-1261 patch
This commit bumps the Impala toolchain to pick up the latest Kudu version,
up to commit 60f5e5267b92c39485a66121d3ce3cc7ef57b0e0 (KUDU-1261 make
ArrayCellMetadataView::Init() more robust).

Change-Id: I68009e5fefd053882f5504cd2520bacb189a1b04
Reviewed-on: http://gerrit.cloudera.org:8080/23631
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-11-05 16:41:51 +00:00
Steve Carlin
62bf609942 IMPALA-14414: Calcite planner: Added new code to handle nan/inf
The current code works for NaN and Inf, but it breaks when upgrading
to v1.40. This commit changes the code so that these values are still
handled after the upgrade to 1.40, and adds a basic test to calcite.test
to ensure that the upgrade does not break it.

Change-Id: I8593a4942a2fe785a0c77134b78a9d97257225fc
Reviewed-on: http://gerrit.cloudera.org:8080/23561
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-05 12:55:39 +00:00
Riza Suminto
f34dea9b6f IMPALA-14522: Fix test_paimon_show_stats after DST ends
The test failed due to a mismatch when matching "Last Creation Time". This
patch fixes the assertion with a simple regex.

Testing:
Pass test_paimon.py.

Change-Id: I6855c0014111cef18318cdc4904782097a070ced
Reviewed-on: http://gerrit.cloudera.org:8080/23619
Reviewed-by: Mihaly Szjatinya <mszjat@pm.me>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-03 21:25:42 +00:00
stiga-huang
d358f6e87e IMPALA-14520: Fix wrong column numbers in document impala_workload_mgmt.xml
The tables in the doc actually have 4 columns. This patch fixes the
wrong properties in the doc, which cause the tables to render incorrectly
in the PDF.

Tests:
 - Build PDF, plain-html and asf-site-html of the doc.

Change-Id: Ic05d8d963d3791ada6f5a4ac144796b710f9af70
Reviewed-on: http://gerrit.cloudera.org:8080/23615
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
2025-11-03 17:17:02 +00:00
Michael Smith
599b89306d IMPALA-13145: Upgrade mold to 2.40.4
Upgrades mold to the latest release.

Change-Id: If926b8065cccc4c9038c064c274b6ba97fdc2888
Reviewed-on: http://gerrit.cloudera.org:8080/23582
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-27 15:05:01 +00:00
Michael Smith
1152eef9bb IMPALA-14501: (Addendum) Fix single node perf run
Fixes open in generate_profile_files to read binary with Python 3,
matching generate_profile_file.
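
A minimal sketch of the Python 3 behavior behind this fix, using a hypothetical helper (`read_profile_bytes`) rather than the actual generate_profile_files code. Opening a file in "rb" mode returns bytes and skips text decoding, so content that is not valid text in the default encoding does not raise `UnicodeDecodeError`:

```python
def read_profile_bytes(path):
    """Read a file as raw bytes; no text decoding is attempted."""
    with open(path, "rb") as f:
        return f.read()
```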

Change-Id: Ibd815e7eb989d7a2bcf52cadfcde4f355c18a148
Reviewed-on: http://gerrit.cloudera.org:8080/23596
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-10-25 17:31:06 +00:00
Joe McDonnell
3398f20afe IMPALA-14491: Fix run-workload.py's handling of HS2's exec summary
Recently, we switched bin/run-workload.py to use HS2. It turns
out that the HS2 client code is not producing the same data
structure for the exec summary. report_benchmark_results.py
relies on that data structure and fails for HS2.

This changes the HS2 client code to use the same representation
as the Beeswax client. There is already a function that does this
conversion (build_summary_table_from_thrift) for our regular
tests, so this reuses that function.

Testing:
 - Ran bin/run-workload.py twice to produce json files and
   processed them with report_benchmark_results.py. This
   failed before the change and passed afterward.

Change-Id: I0a041bdebe748b6b3a05b552584e0ca2327cff67
Reviewed-on: http://gerrit.cloudera.org:8080/23597
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-25 16:37:46 +00:00
jichen0919
541fb3f405 IMPALA-14092 Part1: Prohibit Unsupported Operation for paimon table
This patch prohibits unsupported operations against Paimon
tables. Checks for all unsupported operations are added in
the analysis stage in order to avoid misoperation. Currently
only the CREATE/DROP statements are supported; each prohibition
will be removed later, once the corresponding operation is
truly supported.

TODO:
    - Patches pending submission:
        - Support jni based query for paimon data table.
        - Support tpcds/tpch data-loading
          for paimon data table.
        - Virtual Column query support for querying
          paimon data table.
        - Query support with time travel.
        - Query support for paimon meta tables.

Testing:
    - Add unit test for AnalyzeDDLTest.java.
    - Add unit test for AnalyzerTest.java.
    - Add test_paimon_negative and test_paimon_query in test_paimon.py.

Change-Id: Ie39fa4836cb1be1b1a53aa62d5c02d7ec8fdc9d7
Reviewed-on: http://gerrit.cloudera.org:8080/23530
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-23 23:06:08 +00:00
Michael Smith
ea0ef5a799 IMPALA-14511: Fix pgrep to avoid warning
kill-all.sh tries to find a process named mini-impalad-cluster with
pgrep, which results in an (ignored) error:

    pgrep: pattern that searches for process name longer than 15
    characters will result in zero matches
    Try `pgrep -f' option to match against the complete command line.

This was accidentally changed from mini-impala-cluster in 2015. Neither
term is used anymore, so this process name will never exist. Remove it
to fix the error.
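
A minimal sketch of why that pattern can never match, assuming a literal (non-regex) process name. Linux keeps at most 15 characters of a process name, which is the limit the pgrep warning refers to; `can_match_process_name` is a hypothetical helper for illustration.

```python
# Linux stores at most 15 characters of the process name.
MAX_COMM_LEN = 15


def can_match_process_name(literal_name):
    """Return True if a literal name could match without `pgrep -f`."""
    return len(literal_name) <= MAX_COMM_LEN
```

The name mini-impalad-cluster is 20 characters long, so the check returns False for it.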

Change-Id: Id1340e85cbcd3b699b333316da618774cb4e9dcd
Reviewed-on: http://gerrit.cloudera.org:8080/23586
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-10-23 22:00:36 +00:00
pranav.lodha
7f77176970 IMPALA-13869: Support for 'hive.sql.query' property for Hive JDBC tables
This patch adds support for the hive.sql.query table property in Hive
JDBC tables accessed through Impala. Impala has support for Hive
JDBC tables using the hive.sql.table property, which limits users
to simple table access. However, many use cases demand the ability
to expose complex joins, filters, aggregations, or derived columns
as external views. The hive.sql.query property specifies a custom SQL
query that returns a virtual table (subquery) instead of pointing to a
physical table. These use cases cannot be achieved with just the
hive.sql.table property. This change allows Impala to:
 • Interact with views or complex queries defined on external
 systems without needing schema-level access to base tables.
 • Expose materialized logic (such as filters, joins, or
 transformations) via Hive to Impala consumers in a secure,
 abstracted way.
 • Better align with data virtualization use cases where
 physical data location and structure should be hidden from
 the querying engine.
This patch also lays the groundwork for future enhancements such
as predicate pushdown and performance optimizations for Hive
JDBC tables backed by queries.

Testing: End-to-end tests are included in
test_ext_data_sources.py.

Change-Id: I039fcc1e008233a3eeed8d09554195fdb8c8706b
Reviewed-on: http://gerrit.cloudera.org:8080/22865
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-23 21:34:29 +00:00
Michael Smith
a12e49e38d IMPALA-14509: Let Ozone set OZONE_OPTS
Remove our customization of OZONE_OPTS as it's redundant with
ozone-functions.sh. Our options also didn't work with Java 17.

Change-Id: If600dd160e6bc72320081ecee2cb0de3c73eb7bd
Reviewed-on: http://gerrit.cloudera.org:8080/23580
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-23 15:11:39 +00:00
Abhishek Rawat
a8618c6a65 IMPALA-10204: Make AdmitQuery params more efficient
The admission request may contain lineage graphs and other
data that the admission control service doesn't need.
For example, currently the admission controller service would
hold onto the full TQueryExecRequest object for the entire
lifetime of a query, even after the admission decision was
complete. This led to unnecessary memory consumption.

This commit introduces two optimizations for reducing the
memory footprint:
1.  A lightweight copy of TQueryExecRequest is now created
on the client side before sending to the admission control
service. Fields that are not required for admission
decisions (e.g., query_plan, lineage_graph) are cleared from
this copy.
2.  The AdmissionState now uses a unique_ptr to manage the
TQueryExecRequest. This allows the object's memory to be
explicitly released as soon as the query schedule is generated
and the request object is no longer needed.
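
The two optimizations above can be sketched as follows. This is a Python analogy under stated assumptions, not the C++ implementation: `QueryExecRequest` is a hypothetical stand-in for TQueryExecRequest with only a few fields, and `make_admission_copy` and `on_schedule_generated` are illustrative names.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryExecRequest:
    """Hypothetical stand-in for TQueryExecRequest."""
    query_id: str
    query_plan: Optional[str] = None
    lineage_graph: Optional[dict] = None


def make_admission_copy(request):
    """Shallow-copy the request and clear fields admission does not need."""
    slim = copy.copy(request)
    slim.query_plan = None
    slim.lineage_graph = None
    return slim


class AdmissionState:
    """Owns the request only until the schedule has been generated."""

    def __init__(self, request):
        self.request = request

    def on_schedule_generated(self):
        # Drop the only reference so the memory can be reclaimed early,
        # analogous to resetting a unique_ptr.
        self.request = None
```

The caller keeps its full request untouched; only the slimmed copy is handed to the admission side.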

During a customized, highly concurrent TPCDS run, without the
change, the peak memory usage in admissiond was around 2GB.
With this change, it required less than half that memory.

Tests:
Passed exhaustive tests.

Change-Id: I1ba5e8818336bd1fc3ad604a0acee5eb7a1116c4
Reviewed-on: http://gerrit.cloudera.org:8080/23546
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
2025-10-23 14:33:57 +00:00
Yida Wu
1bc7cdbff6 IMPALA-14493: Cap memory usage of global admission service
The global admission service can experience OOM errors under
high concurrency because its process memory tracker is inaccurate
and doesn't account for all memory allocations.

Ensuring that the memory tracker accurately accounts for every
allocation could be difficult, so this patch uses a simpler solution:
it introduces a hard memory cap based on tcmalloc statistics, which
accurately reflect the true process memory usage. If a new query
is submitted while tcmalloc memory usage is over the process
limit, the query will be rejected immediately to protect from OOM.

Adds a new flag, enable_admission_service_mem_safeguard, allowing
this feature to be enabled or disabled. By default, this feature is
turned on.
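
A minimal sketch of the safeguard decision, assuming hypothetical names (`should_reject_query`, `bytes_in_use`, `process_mem_limit`); the real check lives in the C++ admission controller and reads tcmalloc statistics.

```python
def should_reject_query(bytes_in_use, process_mem_limit,
                        safeguard_enabled=True):
    """Reject a new query when allocator-reported usage exceeds the limit.

    safeguard_enabled plays the role of the
    enable_admission_service_mem_safeguard flag.
    """
    return safeguard_enabled and bytes_in_use > process_mem_limit
```

The design choice is to trust the allocator's own accounting rather than the process memory tracker, since the former reflects allocations the tracker misses.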

Tests:
Added test test_admission_service_low_mem_limit.
Passed exhaustive tests.

Change-Id: I2ee2c942a73fcd69358851fc2fdc0fc4fe531c73
Reviewed-on: http://gerrit.cloudera.org:8080/23542
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-23 12:13:11 +00:00
stiga-huang
ff8bb33b91 IMPALA-12870: Tag query id for Java pool threads
Logs from Java threads running in ExecutorService are missing the query
id which is stored in the C++ thread-local ThreadDebugInfo variable.
This patch adds JNI calls for Java threads to manage the ThreadDebugInfo
variable. Currently two thread pools are changed:
 - MissingTable loading pool in StmtMetadataLoader.parallelTableLoad().
 - Table loading pool in TableLoadingMgr.

MissingTable loading pool only lives within the parallelTableLoad()
method. So we initialize ThreadDebugInfo with the queryId at the
beginning of the thread and delete it at the end of the thread. Note
that a thread might be reused to load different tables, but they all
belong to the same query.

The table loading pool is a long-running pool in catalogd that never
shuts down. Threads in it are used to load tables triggered by different
queries. We initialize ThreadDebugInfo as the above but update it when
the thread starts loading table for a different query id, and reset it
when the loading is done. The query id is passed down from the catalogd
RPC request headers.
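
The set-then-reset pattern for reused pool threads can be sketched in Python with a thread-local variable. This is an analogy only: the patch itself uses JNI calls to manage the C++ ThreadDebugInfo, and every name below (`run_tagged`, `current_query_id`, `load_table`) is hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Thread-local storage standing in for the C++ ThreadDebugInfo.
_debug_info = threading.local()


def current_query_id():
    """Return the query id tagged on the calling thread, if any."""
    return getattr(_debug_info, "query_id", None)


def run_tagged(query_id, task, *args):
    """Tag the worker thread with query_id for the duration of task."""
    _debug_info.query_id = query_id
    try:
        return task(*args)
    finally:
        # Reset so a reused thread never logs a stale query id.
        _debug_info.query_id = None


def load_table(name):
    """Pretend log line that includes the current query id."""
    return "%s] loading %s" % (current_query_id(), name)


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        print(pool.submit(run_tagged, "q1", load_table, "t1").result())
        print(pool.submit(run_tagged, "q2", load_table, "t2").result())
```

The reset in the `finally` block matters for a long-lived pool, where the same thread serves many queries.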

Tests:
 - Added e2e test to verify the logs.
 - Ran existing CORE tests.

Change-Id: I83cca55edc72de35f5e8c5422efc104e6aa894c1
Reviewed-on: http://gerrit.cloudera.org:8080/23558
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-23 03:35:29 +00:00
Joe McDonnell
1913ab46ed IMPALA-14501: Migrate most scripts from impala-python to impala-python3
To remove the dependency on Python 2, existing scripts need to use
python3 rather than python. These commands find those
locations (for impala-python and regular python):
git grep impala-python | grep -v impala-python3 | grep -v impala-python-common | grep -v init-impala-python
git grep bin/python | grep -v python3

This removes or switches most of these locations by various means:
1. If a python file has a #!/bin/env impala-python (or python) but
   doesn't have a main function, it removes the hash-bang and makes
   sure that the file is not executable.
2. Most scripts can simply switch from impala-python to impala-python3
   (or python to python3) with minimal changes.
3. The cm-api pypi package (which doesn't support Python 3) has been
   replaced by the cm-client pypi package and interfaces have changed.
   Rather than migrating the code (which hasn't been used in years), this
   deletes the old code and stops installing cm-api into the virtualenv.
   The code can be restored and revamped if there is any interest in
   interacting with CM clusters.
4. This switches tests/comparison over to impala-python3, but this code has
   bit-rotted. Some pieces can be run manually, but it can't be fully
   verified with Python 3. It shouldn't hold back the migration on its own.
5. This also replaces locations of impala-python in comments / documentation /
   READMEs.
6. kazoo (used for interacting with HBase) needed to be upgraded to a
   version that supports Python 3. The newest version of kazoo requires
   upgrades of other component versions, so this uses kazoo 2.8.0 to avoid
   needing other upgrades.

The two remaining uses of impala-python are:
 - bin/cmake_aux/create_virtualenv.sh
 - bin/impala-env-versioned-python
These will be removed separately when we drop Python 2 support
completely. In particular, these are useful for testing impala-shell
with Python 2 until we stop supporting Python 2 for impala-shell.

The docker-based tests still use /usr/bin/python, but this can
be switched over independently (and doesn't impact impala-python).

Testing:
 - Ran core job
 - Ran build + dataload on Centos 7, Redhat 8
 - Manual testing of individual scripts (except some bitrotted areas like the
   random query generator)

Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc
Reviewed-on: http://gerrit.cloudera.org:8080/23468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-10-22 16:30:17 +00:00
Steve Carlin
c67b19daf6 IMPALA-14405: Labels for Calcite expressions not matching original planner
Calcite sets literal expressions to EXPR$<x> which did not match
expressions given by the Impala planner. For literal expressions
such as "select 1 + 1", Impala creates the column name as "1 + 1".

The field names can be found in the abstract syntax tree, so
they are now set within the CalciteRelNodeConverter before the
logical tree is created.

A small test was added to calcite.test for a basic sanity check,
but more comprehensive tests will be run in the tests/shell module
(e.g. in test_shell_commandline.py and test_shell_interactive) which
contain tests for labels.

Change-Id: Ibd3e6366a284f53807b4b2c42efafa279249c1ea
Reviewed-on: http://gerrit.cloudera.org:8080/23516
Reviewed-by: Steve Carlin <scarlin@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-22 03:37:48 +00:00
Steve Carlin
420e357b95 IMPALA-13695: Calcite planner: fix for ndv with 2 args
The NDV function was crashing when called with the "scale" arg. This
requires special processing which exists in FunctionCallExpr.

The validation for this is now done in ImpalaNdvFunction,
and the special calculation is done within ImpalaAggRel.

This also fixes ndv for varchar types. The aggregation call
within CoerceNodes was not differentiating between varchar
and string. A cast to string function is needed in order
to run the ndv function on a varchar column.

Change-Id: I82419f77e043e9975865a042ffb8db75a26931f7
Reviewed-on: http://gerrit.cloudera.org:8080/23513
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-20 23:28:39 +00:00
Michael Smith
512a73771f IMPALA-14452: Fix impala-shell SSL with Python 3.12
Removes deprecated ImpalaHttpClient constructor that supported port and
path as it has been deprecated since at least 2020 and appears unused.

Removes cert_file and key_file as they were also never used, and if
required must now be passed in via ssl_context.

Updates TSSLSocket fixes for Thrift 0.16 and Python 3.12. _validate_cert
was removed by Thrift 0.16, but everything worked because Thrift used
ssl.match_hostname instead. With Python 3.12 ssl.match_hostname no
longer exists so we rely on OpenSSL to handle verification with
ssl.PROTOCOL_TLS_CLIENT.

Only uses ssl.PROTOCOL_TLS_CLIENT when match_hostname is unavailable to
avoid changing existing behavior. THRIFT-792 identifies that TSocket
suppresses connection errors, where we would otherwise see SSL hostname
verification errors like

    ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED]
    certificate verify failed: IP address mismatch, certificate is not
    valid for '::1'. (_ssl.c:1131)
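
A minimal sketch of that fallback, using only the standard `ssl` module. The helper names (`needs_openssl_hostname_check`, `make_client_context`) are hypothetical and this is not the impala-shell code. A context created with `ssl.PROTOCOL_TLS_CLIENT` has `check_hostname` enabled and `verify_mode` set to `CERT_REQUIRED` by default, so OpenSSL performs the hostname verification.

```python
import ssl


def needs_openssl_hostname_check():
    """True when ssl.match_hostname is gone (as in Python 3.12)."""
    return not hasattr(ssl, "match_hostname")


def make_client_context(ca_file=None):
    """Build a client context that verifies the certificate and hostname."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_file is not None:
        # Trust the given CA bundle instead of relying on system defaults.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

A caller would wrap its socket with `ctx.wrap_socket(sock, server_hostname=host)` so that the hostname is available for verification.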

Python 2.7.9 and 3.2 are minimum required versions; both have been EOL
for several years.

Testing:
- ran custom_cluster/{test_client_ssl.py,test_ipv6.py} on Ubuntu 24 with
  Python 3.12, OpenSSL 3.0.13.
- ran custom_cluster/test_client_ssl.py on RHEL 7.9 with Python 2.7.5
  and Python 3.6.8, OpenSSL 1.0.2k-fips.
- adds test that hostname checking is configured.

Change-Id: I046a9010ac4cb1f7d705935054b306cddaf8bdc7
Reviewed-on: http://gerrit.cloudera.org:8080/23519
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2025-10-20 09:55:22 +00:00
stiga-huang
ec31324eb5 IMPALA-14502: Not tracking metrics in IncompleteTable
Tables that are in unloaded state are represented as IncompleteTable.
Their table-level metrics are never used but occupy around 7KB of
memory per table. This is a significant amount compared to the
table name strings.

This patch skips initializing these metrics for IncompleteTable to save
memory usage. This reduces the initial memory requirement to launch
catalogd.

To prevent other code from unintentionally adding new metrics to
IncompleteTable, this patch overrides all Table methods that use metrics_
to return simple results, e.g. IncompleteTable.getMedianTableLoadingTime()
always returns 0.

IncompleteTable.getMetrics() shouldn't be used. Added a Precondition
check for this.

Tests:
 - Verified in a heap dump file after loading 1.3M IncompleteTables that
   the heap usage reduces to 2GB and only a few instances of
   com.codahale.metrics.Timer are created. Previously catalogd hit OOM
   with a heap size of 18GB when running global IM, and the number of
   com.codahale.metrics.Timer instances was similar to the number of
   IncompleteTables.
 - Passed CORE tests.

Change-Id: If0fcfeab99bbfbefe618d0abf7f2482a0cc5ef9f
Reviewed-on: http://gerrit.cloudera.org:8080/23547
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
2025-10-17 20:17:48 +00:00
Michael Smith
7fb986e47a IMPALA-14504: Use shaded hbase, protobuf from Hadoop
Switches to shaded HBase so it can include its own versions of
dependencies. Note that hbase-client includes hbase-common,
hbase-protocol.

Excludes older protobuf-java from mysql-connector so we get it from
Hadoop.

Allows orc-format 1.0, which is a dependency in future ORC releases.

Change-Id: I386d03c3123ce1159abc54c505f60e0ae619f5fe
Reviewed-on: http://gerrit.cloudera.org:8080/23553
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-17 01:47:18 +00:00
Steve Carlin
69813a8c40 IMPALA-14464: Calcite planner should allow semi-colon in statement
The Calcite planner now handles a SQL statement that ends with a
semicolon. Note that impala-shell doesn't pass the semicolon to the
server, so this is only seen with a direct call to the server.

Change-Id: Ie690159cd03f28f6b793628aa946292af71b6970
Reviewed-on: http://gerrit.cloudera.org:8080/23517
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-17 00:59:44 +00:00
stiga-huang
f0a781806f IMPALA-14494: Tag catalogd logs of GetPartialCatalogObject requests with correct query ids
Catalogd logs of GetPartialCatalogObject requests are not tagged with
the correct query ids. Instead, the logs show the id of the query that
previously used that thread. This is fixed by using ScopedThreadContext,
which resets the query id at the end of the RPC code.
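
The scoped-reset idea can be illustrated with a minimal Python sketch. It
is not the ScopedThreadContext implementation from this patch; the names
scoped_query_id and current_query_id are hypothetical and exist only for
this illustration. The point is that the id is cleared on scope exit, so a
reused thread cannot leak a stale id into later log lines.

```python
import threading
from contextlib import contextmanager

# Per-thread storage for the id used to tag log lines (hypothetical).
_thread_info = threading.local()


def current_query_id():
    """Return the query id tagged on this thread, or None if unset."""
    return getattr(_thread_info, "query_id", None)


@contextmanager
def scoped_query_id(query_id):
    """Tag the current thread with query_id for the duration of the scope."""
    _thread_info.query_id = query_id
    try:
        yield
    finally:
        # Always reset, even if the request handler raised, so the next
        # request served by this thread does not inherit the old id.
        _thread_info.query_id = None
```

A request handler would wrap its body in "with scoped_query_id(...):" so
that every log call inside the scope sees the right id.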

Add DCHECKs to make sure ThreadDebugInfo is initialized before being
used in Catalog methods. An instance is added in CatalogdMain() for
this.

This patch also adds the query id in GetPartialCatalogObject requests so
catalogd can tag the responding thread with it.

Some code is copied from Michael Smith's patch: https://gerrit.cloudera.org/c/22738/

Tested by enabling TRACE logging in org.apache.impala.common.JniUtil to
verify logs of GetPartialCatalogObject requests.

I20251014 09:39:39.685225 342587 JniUtil.java:165] 964e37e9303d6f8a:eab7096000000000] getPartialCatalogObject request: Getting partial catalog object of CATALOG_SERVICE_ID
I20251014 09:39:39.690346 342587 JniUtil.java:176] 964e37e9303d6f8a:eab7096000000000] Finished getPartialCatalogObject request: Getting partial catalog object of CATALOG_SERVICE_ID. Time spent: 5ms
I20251014 09:39:39.699471 342587 JniUtil.java:165] 964e37e9303d6f8a:eab7096000000000] getPartialCatalogObject request: Getting partial catalog object of DATABASE:functional
I20251014 09:39:39.701821 342587 JniUtil.java:176] 964e37e9303d6f8a:eab7096000000000] Finished getPartialCatalogObject request: Getting partial catalog object of DATABASE:functional. Time spent: 2ms
I20251014 09:39:39.711462 341074 TAcceptQueueServer.cpp:368] New connection to server CatalogService from client <Host: 127.0.0.1 Port: 42084>
I20251014 09:39:39.719146 342588 JniUtil.java:165] 964e37e9303d6f8a:eab7096000000000] getPartialCatalogObject request: Getting partial catalog object of TABLE:functional.alltypestiny

Change-Id: Ie63363ac60e153e3a69f2a4cf6a0f4ce10701674
Reviewed-on: http://gerrit.cloudera.org:8080/23535
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-16 07:06:29 +00:00
Riza Suminto
3560621931 IMPALA-14503: Log maven dependency when building frontend
The Impala frontend has many dependencies, along with numerous
dependency exclusion/inclusion rules. This patch logs the Maven
dependency tree to logs/mvn/mvn.log when invoking the "make java"
command.

Testing:
Manually ran "make java" from $IMPALA_HOME and verified that the
dependency trees are logged to logs/mvn/mvn.log.

Change-Id: I8cbe20faeab24bae708733d54996bd6c1dd97757
Reviewed-on: http://gerrit.cloudera.org:8080/23551
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-10-15 22:56:05 +00:00
Zoltan Borok-Nagy
bfae4d0b32 IMPALA-14496: Impala crashes when it writes multiple delete files per partition in a single DELETE operation
Impala crashes when it needs to write multiple delete files per
partition in a single DELETE operation. It is because
IcebergBufferedDeleteSink has its own DmlExecState object, but
sometimes the methods in TableSinkBase use the RuntimeState's
DmlExecState object. I.e. it can happen that we add a partition
to the IcebergBufferedDeleteSink's DmlExecState, but later we
expect to find it in the RuntimeState's DmlExecState.

This patch adds new methods to TableSinkBase that are specific
for writing delete files, and they always take a DmlExecState
object as a parameter. They are now used by IcebergBufferedDeleteSink.

Testing
 * added e2e tests

Change-Id: I46266007a6356e9ff3b63369dd855aff1396bb72
Reviewed-on: http://gerrit.cloudera.org:8080/23537
Reviewed-by: Mihaly Szjatinya <mszjat@pm.me>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-15 19:58:37 +00:00
Michael Smith
1a74ee03f3 IMPALA-14500: Clarify usage of SYSTEM_VERSION
Clarifies that SYSTEM_VERSION in Iceberg queries refers to a snapshot
id.

Change-Id: I64c4dc9ce82af320602f8de7c435242aa2f90d77
Reviewed-on: http://gerrit.cloudera.org:8080/23543
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
2025-10-14 22:59:52 +00:00
Zoltan Borok-Nagy
7e34cabed7 IMPALA-14481: Use $JAVA instead of java in run-iceberg-rest-server.sh
Using the plain 'java' command in run-iceberg-rest-server.sh might
result in using a different Java version than what we used for
compilation.

$JAVA is set in bin/impala-config.sh to the desired Java version,
and we should use it in our scripts instead of just using 'java'.

Change-Id: I5f9c21de4c85d38dca7690fc110c4c44448840ed
Reviewed-on: http://gerrit.cloudera.org:8080/23539
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-14 21:35:47 +00:00
Riza Suminto
141f8b97ff IMPALA-14492: Document delete orphan files for Iceberg table
This patch adds documentation for REMOVE_ORPHAN_FILES query added by
IMPALA-12337.

Change-Id: Ie8de6112bf9ccd879ea3e14d86e67b99e1087c0f
Reviewed-on: http://gerrit.cloudera.org:8080/23532
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
2025-10-14 16:13:23 +00:00
Riza Suminto
1008decc07 IMPALA-14447: Parallelize table loading in getMissingTables()
StmtMetadataLoader.getMissingTables() loads missing tables serially.
In local catalog mode, loading a large number of tables serially can
incur significant round-trip latency to CatalogD. This patch
parallelizes table loading by using an executor service to look up and
gather all non-null FeTables from the given TableName set.

Modify LocalCatalog.loadDbs() and LocalDb.loadTableNames() slightly to
make them thread-safe. Change FrontendProfile.Scope to support nested
scopes referencing the same FrontendProfile instance.

Added a new flag, max_stmt_metadata_loader_threads, to control the maximum
number of threads used for loading table metadata during query
compilation. It defaults to 8 threads per query compilation.

If there is only one table to load, max_stmt_metadata_loader_threads is
set to 1, or a RejectedExecutionException is raised, fall back to loading
tables serially.
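
The parallel-with-fallback pattern described above can be sketched in
Python as follows. This is an illustration only, not the Java code in
StmtMetadataLoader: load_tables and load_one are hypothetical names, and
the RuntimeError that ThreadPoolExecutor.submit() raises on a shut-down
pool stands in for Java's RejectedExecutionException.

```python
from concurrent.futures import ThreadPoolExecutor


def load_tables(table_names, load_one, max_threads=8, executor=None):
    """Load tables in parallel, falling back to serial loading.

    load_one(name) returns a table object, or None if it cannot be found.
    """
    names = list(table_names)

    def load_serially():
        return [t for t in map(load_one, names) if t is not None]

    # A single table or a single thread gains nothing from a pool.
    if len(names) <= 1 or max_threads <= 1:
        return load_serially()

    pool = executor or ThreadPoolExecutor(
        max_workers=min(max_threads, len(names)))
    try:
        futures = [pool.submit(load_one, name) for name in names]
    except RuntimeError:
        # The pool rejected the work (e.g. it was already shut down).
        return load_serially()
    finally:
        if executor is None:
            pool.shutdown(wait=True)

    # Gather results in submission order, keeping only non-null tables.
    return [t for t in (f.result() for f in futures) if t is not None]
```

Submission and result gathering are kept separate so that an error raised
by load_one itself is not mistaken for a rejected submission.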

Testing:
Ran and passed a few tests such as test_catalogd_ha.py,
test_concurrent_ddls.py, and test_observability.py.
Added FE test CatalogdMetaProviderTest.testProfileParallelLoad.
Manually ran the following query and observed parallel loading by setting
TRACE level logging in CatalogdMetaProvider.java.

use functional;
select count(*) from alltypesnopart
union select count(*) from alltypessmall
union select count(*) from alltypestiny
union select count(*) from alltypesagg;

Change-Id: I97a5165844ae846b28338d62e93a20121488d79f
Reviewed-on: http://gerrit.cloudera.org:8080/23436
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-13 12:53:47 +00:00
Steve Carlin
cde4bc016c IMPALA-14115: Calcite planner: Added top-n analytic PlanNode optimization.
Impala has an optimization for analytic expressions that have a rank filter on
top of the analytic expression. It can add a top-n plan node to reduce the number
of rows examined. This is tested in tpcds query 67.

The optimization logic relies on an unassigned rank conjunct within the analyzer
while creating the analytic plan node.

A slight reorganization of the code was needed to implement this optimization.
The SlotRefs for the AnalyticInfo needed to be created a little earlier than
they were in the previous commit.

A small fix was made to normalize binary predicates. A non-normalized binary
predicate prevents the optimization from being used.

A call to checkAndApplyLimitPushdown is needed for some of the optimizations
to kick in.

A new AllProjectInfo internal class was created to hold the relationships
between the Calcite RexNode objects and the Impala Analytic expressions.

Also, IMPALA-14158 is fixed by this commit. The nullsFirst value was
incorrect when the syntax was explicit in the query.

A new Calcite planner test was added in the junit tests to ensure the
optimization kicks in. The new test file is in the
PlannerTest/calcite/limit-pushdown-analytic-calcite.test file. This is a copy
of the limit-pushdown-analytic.test file in its parent directory but with some
modified results. Most of the differences are trivial, but IMPALA-14469 has been
filed to deal with one optimization that did not get fixed, which is when
the order by clause has a constant expression.

Change-Id: Ie6fa6781db56771b13b0cf49bd236f776016bf8d
Reviewed-on: http://gerrit.cloudera.org:8080/23317
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
2025-10-10 17:11:45 +00:00
Michael Smith
98f993da43 IMPALA-14478: Add CDP ORC build
Adds CDP_ORC_JAVA_VERSION so we can build and test with Apache or CDP
versions of ORC.

Change-Id: Id9ba78051aff9c9129c244b1734b6f8a523858b5
Reviewed-on: http://gerrit.cloudera.org:8080/23506
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-10-08 23:34:55 +00:00
Riza Suminto
3d61c5ea9f IMPALA-14476: Workaround TSAN issue in KuduClient
Since the toolchain was bumped to pick up Kudu's array column
feature (KUDU-1261), Impala's TSAN builds on the master branch
consistently break during dataload with a data race detected by TSAN.

The source of the data race lies within libkudu_client.so and only triggers
if the Impala build machine has both IPv4 and IPv6 addresses associated
with localhost. Until the exact root cause is found and fixed, this patch
works around the TSAN issue by fixing the KUDU_MASTER_HOSTS env var to
127.0.0.1.

Testing:
Ran TSAN build and confirmed no data race error is emitted.

Change-Id: I511ab625d18c6007567083557fcdf98980a6ac6f
Reviewed-on: http://gerrit.cloudera.org:8080/23507
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-10-08 14:40:50 +00:00
jasonmfehr
c0b3580754 IMPALA-14372: Output OpenTelemetry SDK Logs to Impala Logs
Emits log messages from the OpenTelemetry SDK to the Impalad DEBUG,
INFO, WARNING, and ERROR logs. Previously, these SDK log messages
were dropped.

Modifies the function of the 'otel_debug' startup flag. This flag
defaults to 'false' which causes log messages from the SDK to be
dropped. When set to 'true', log messages from the OpenTelemetry SDK
will be sent to the Impala logging system. The overall glog level is
applied to all messages sent from the OpenTelemetry SDK, thus DEBUG
SDK logs will not appear in the Impalad logs unless the glog level
is greater than or equal to 2.

When a trace is successfully sent to the OpenTelemetry collector,
zero log lines are generated. When a trace cannot be sent, local
testing showed 12 lines with a total size around 3k were written
between the impalad.ERROR and impalad.WARNING log files. The request
body is not included in these log messages unless the glog level is
greater than or equal to 2 thus log message size will not grow or
shrink based on the size of the trace(s).

This patch also removes the completely useless
'LoggingInstrumentation' class. Previously, the 'otel_debug' flag
caused this class to log messages, but those messages provided no
insightful information.

Generated-by: Github Copilot (Claude Sonnet 3.7)
Change-Id: I41aba21f46233e6430eede9606be1e791071717a
Reviewed-on: http://gerrit.cloudera.org:8080/23418
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-08 00:00:01 +00:00
Peter Rozsa
570ab71c9d IMPALA-14287: Resolve environment variables in REST server configurations
This change adds a step to REST server configuration loading that
resolves environment variable references written in the
${ENV:VARIABLE_NAME} format. If the environment variable is not set, the
reference text is left unchanged and Impala logs an error.
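
A minimal Python sketch of this substitution rule is shown below. It is
not the code added by this patch: the function name resolve_env_refs and
the exact set of characters accepted in a variable name are assumptions
made for the illustration.

```python
import logging
import re

# Matches ${ENV:VARIABLE_NAME}; the allowed name characters are an
# assumption of this sketch.
ENV_REF = re.compile(r"\$\{ENV:([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_refs(value, env):
    """Replace ${ENV:NAME} references in value using the mapping env.

    Unset variables are logged as errors and their reference text is
    left in place.
    """
    def substitute(match):
        name = match.group(1)
        if name not in env:
            logging.error("Environment variable %s is not set", name)
            return match.group(0)
        return env[name]

    return ENV_REF.sub(substitute, value)
```

In real use env would be os.environ; passing the mapping explicitly keeps
the function easy to unit test.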

Tests:
  - unit tests added

Change-Id: I3faccc15d012c389703c58371a4d38cca82bef60
Reviewed-on: http://gerrit.cloudera.org:8080/23457
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-07 17:29:45 +00:00
Joe McDonnell
762fe0a4f5 IMPALA-14473: Fix absolute path logic for sorting scan ranges oldest to newest
When IMPALA-14462 added tie-breaking logic to
ScanRangeOldestToNewestComparator, it relied on the absolute path
being unset if the relative path is set. However, the code
always sets the absolute path and uses an empty string to indicate
that it is unset. This caused the tie-breaking logic to see
two unrelated scan ranges as equal, triggering a DCHECK when
running query_test/test_tuple_cache_tpc_queries.py.

The fix is to rearrange the logic to check whether the relative
path is not empty rather than checking whether the absolute
path is set.
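
The rearranged check can be illustrated with a small Python sketch. The
ScanRange fields and the sort key below are hypothetical simplifications,
not the real comparator, which may compare more fields. What the sketch
shows is the fix itself: pick the path by testing whether the relative
path is non-empty, because an "absolute path is set" test is always true.

```python
from dataclasses import dataclass


@dataclass
class ScanRange:
    mtime: int
    relative_path: str = ""
    # Always populated in this sketch; "" means "no absolute path".
    absolute_path: str = ""


def path_key(scan_range):
    """Tie-break key that prefers the relative path when it is non-empty."""
    if scan_range.relative_path:
        return (0, scan_range.relative_path)
    return (1, scan_range.absolute_path)


def sort_oldest_to_newest(ranges):
    """Order by modification time, breaking ties by path."""
    return sorted(ranges, key=lambda r: (r.mtime, path_key(r)))
```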

Testing:
 - Ran query_test/test_tuple_cache_tpc_queries.py
 - Ran custom_cluster/test_tuple_cache.py

Change-Id: I449308f4a0efdca7fc238e3dda24985a2931dd37
Reviewed-on: http://gerrit.cloudera.org:8080/23495
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-10-07 17:02:43 +00:00
pranav.lodha
a77fec6391 IMPALA-13661: Support parallelism above JDBC tables for joins/aggregates
Impala's planner generates a single-fragment, single-threaded
scan node for queries on JDBC tables because table
statistics are not properly available from the external
JDBC source. As a result, even large JDBC tables are
executed serially, causing suboptimal performance for joins,
aggregations, and scans over millions of rows.

This patch enables Impala to estimate the number of rows in a JDBC
table by issuing a COUNT(*) query at query preparation time. The
estimation is returned via TPrepareResult.setNum_rows_estimate()
and propagated into DataSourceScanNode. The scan node then uses
this cardinality to drive planner heuristics such as join order,
fragment parallelization, and scanner thread selection.

The design leverages the existing JDBC accessor layer:
- JdbcDataSource.prepare() constructs the configuration and invokes
  GenericJdbcDatabaseAccessor.getTotalNumberOfRecords().
- The accessor wraps the underlying query in:
      SELECT COUNT(*) FROM (<query>) tmptable
  ensuring correctness for both direct table scans and parameterized
  query strings.
- The result is captured as num_rows_estimate, which is then applied
  during computeStats() in DataSourceScanNode.

With accurate (or approximate) row counts, the planner can now:
- Assign multiple scanner threads to JDBC scan nodes instead of
  falling back to a single-thread plan.
- Introduce exchange nodes where beneficial, parallelizing data
  fetches across multiple JDBC connections.
- Produce better join orders by comparing JDBC row cardinalities
  against native Impala tables.
- Avoid severe underestimation that previously defaulted to wrong
  table statistics, leading to degenerate plans.

For a sample join query mentioned in the test file,
these are the improvements:

Before Optimization:
- Cardinality fixed at 1 for all JDBC scans
- Single fragment, single thread per query
- Max per-host resource reservation: ~9.7 MB, 1 thread
- No EXCHANGE or MERGING EXCHANGE operators
- No broadcast distribution; joins executed serially
- Example query runtime: ~77s

SCAN JDBC A
   \
    HASH JOIN
       \
        SCAN JDBC B
           \
            HASH JOIN
               \
                SCAN JDBC C
                   \
                    TOP-N -> ROOT

After Optimization:
- Cardinality derived from COUNT(*) (e.g. 150K, 1.5M rows)
- Multiple fragments per scan, 7 threads per query
- Max per-host resource reservation: ~123 MB, 7 threads
- Plans include EXCHANGE and MERGING EXCHANGE operators
- Broadcast joins on small sides, improving parallelism
- Example query runtime: ~38s (~2x faster)

SCAN JDBC A --> EXCHANGE(SND) --+
                                  \
                                   EXCHANGE(RCV) -> HASH JOIN(BCAST) --+
SCAN JDBC B --> EXCHANGE(SND) ----/                                   \
                                                                         HASH JOIN(BCAST) --+
SCAN JDBC C --> EXCHANGE(SND) ------------------------------------------/                 \
                                                                                             TOP-N
                                                                                               \
                                                                                                MERGING EXCHANGE -> ROOT

Also added a new backend configuration flag
--min_jdbc_scan_cardinality (default: 10) to provide a
lower bound for scan node cardinality estimates
during planning. This flag is propagated from BE
to FE via TBackendGflags and surfaced through
BackendConfig, ensuring the planner never produces
unrealistically low cardinality values.
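
The two pieces of arithmetic described in this message, wrapping the query
in a COUNT(*) and applying the lower bound, can be sketched in Python as
follows. The function names count_query and scan_cardinality are
hypothetical. Only the SQL shape and the default of 10 come from the text
above.

```python
def count_query(query):
    """Wrap a query so the external database returns only its row count."""
    return "SELECT COUNT(*) FROM ({}) tmptable".format(query)


def scan_cardinality(num_rows_estimate, min_cardinality=10):
    """Apply the lower bound to a row estimate.

    min_cardinality mirrors the default of --min_jdbc_scan_cardinality.
    """
    return max(num_rows_estimate, min_cardinality)
```

With the default bound, an estimate of 3 rows is planned as 10 rows, while
an estimate of 150000 rows passes through unchanged.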

TODO: Add a query option for this optimization
to avoid extra JDBC round trip for smaller
queries (IMPALA-14417).

Testing: All planner test cases are in
jdbc-parallel.test. Some basic metrics
are also listed in this commit message.

Change-Id: If47d29bdda5b17a1b369440f04d4e209d12133d9
Reviewed-on: http://gerrit.cloudera.org:8080/23112
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2025-10-04 15:42:38 +00:00