188 Commits

Author SHA1 Message Date
Riza Suminto
d4992d532b Revert "IMPALA-14454: Exclude log4j 2 dependencies"
This reverts commit 52b87fcefd.

The original commit caused an issue when Impala is deployed together
with Apache Atlas. The coordinator failed to start with the error message:

java.lang.NoClassDefFoundError: org/apache/logging/log4j/core/Layout

Resolved a minor conflict in impala-config.sh caused by IMPALA-14478
being applied after IMPALA-14454.

Change-Id: I77127db8d833c675c18c30eb3d6542ca906cd2a9
Reviewed-on: http://gerrit.cloudera.org:8080/23788
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-12-16 00:26:34 +00:00
Arnab Karmakar
ddd82e02b9 IMPALA-14065: Support WHERE clause in SHOW PARTITIONS statement
This patch extends the SHOW PARTITIONS statement to allow an optional
WHERE clause that filters partitions based on partition column values.
The implementation adds support for various comparison operators,
IN lists, BETWEEN clauses, IS NULL, and logical AND/OR expressions
involving partition columns.

Non-partition columns, subqueries, and analytic expressions in the
WHERE clause are not allowed and will result in an analysis error.
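
The filtering rule above can be illustrated with a minimal Python sketch.
This is not the analyzer code from the patch; the names PARTITION_COLUMNS
and show_partitions, and the sample partition values, are hypothetical.

```python
# Hypothetical model: each partition is a dict of partition column -> value.
PARTITION_COLUMNS = {"year", "month"}


def show_partitions(partitions, referenced_columns, predicate):
    """Return partitions matching predicate; reject non-partition columns."""
    unknown = set(referenced_columns) - PARTITION_COLUMNS
    if unknown:
        # Stands in for the analysis error described in the commit message.
        raise ValueError(f"non-partition columns in WHERE: {sorted(unknown)}")
    return [p for p in partitions if predicate(p)]


partitions = [
    {"year": 2024, "month": 12},
    {"year": 2025, "month": 1},
    {"year": 2025, "month": None},
]

# Roughly equivalent to: WHERE year = 2025 AND month IS NULL
matched = show_partitions(
    partitions,
    ["year", "month"],
    lambda p: p["year"] == 2025 and p["month"] is None,
)
```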

New analyzer tests have been added to AnalyzeDDLTest#TestShowPartitions
to verify correct parsing, semantic validation, and error handling for
supported and unsupported cases.

Testing:
- Added new unit tests in AnalyzeDDLTest for valid and invalid WHERE
clause cases.
- Verified functional tests covering partition filtering behavior.

Change-Id: I2e2a14aabcea3fb17083d4ad6f87b7861113f89e
Reviewed-on: http://gerrit.cloudera.org:8080/23566
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-12-11 15:36:08 +00:00
jichen0919
7e29ac23da IMPALA-14092 Part2: Support querying of paimon data table via JNI
This patch implements querying of paimon data tables through a
JNI-based scanner.

Features implemented:
- Support column pruning.
Partition pruning and predicate push down will be submitted
as the third part of the patch.

We implemented this by treating the paimon table as a normal
unpartitioned table. When querying a paimon table:
- PaimonScanNode decides which paimon splits need to be scanned,
  and then transfers the splits to BE to do the JNI-based scan operation.

- We also collect the required columns that need to be scanned,
  and pass the columns to the scanner for column pruning. This is
  implemented by passing the field ids of the columns to BE,
  instead of column positions, to support schema evolution.

- In the original implementation, PaimonJniScanner passed the paimon
  row object directly to BE, which called the corresponding paimon row
  field accessor, a Java method, to convert row fields to
  impala row batch tuples. We found it slow due to the overhead of
  JVM method calls.
  To minimize the overhead, we reworked the implementation:
  PaimonJniScanner converts the paimon row batches to an
  arrow record batch, which stores data in the off-heap region of the
  impala JVM, and passes the arrow off-heap
  record batch memory pointer to BE.
  BE PaimonJniScanNode reads data directly from the JVM off-heap
  region and converts the arrow record batch to an impala row batch.

  The benchmark shows the latter implementation is 2.x better
  than the original implementation.

  The lifecycle of an arrow record batch is as follows:
  the arrow record batch is generated in FE and passed to BE.
  After the record batch is imported to BE successfully,
  BE is in charge of freeing it.
  There are two free paths: the normal path and the
  exception path. On the normal path, when the arrow batch
  is totally consumed by BE, BE calls JNI to fetch the next arrow
  batch, and the consumed batch is freed automatically.
  The exception path happens when the query is cancelled or a memory
  allocation fails. In these corner cases, the arrow batch is freed in the
  close method if it was not totally consumed by BE.
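
The two free paths described above can be modelled with a small Python
sketch. It is only an illustration of the ownership rule (free on the next
fetch, or in close if the batch was not consumed), not the BE code; the
class names ArrowBatch and ScannerSketch are hypothetical.

```python
class ArrowBatch:
    """Stand-in for an off-heap arrow record batch (hypothetical)."""

    def __init__(self, rows):
        self.rows = rows
        self.freed = False

    def free(self):
        self.freed = True


class ScannerSketch:
    """Owns at most one imported batch at a time and frees it once."""

    def __init__(self, batches):
        self._pending = iter(batches)
        self._current = None

    def next_batch(self):
        # Normal path: the previous batch is fully consumed, so release it
        # before fetching the next one.
        self._release_current()
        self._current = next(self._pending, None)
        return self._current

    def close(self):
        # Exception path (cancellation, allocation failure): release a batch
        # that was imported but not fully consumed.
        self._release_current()

    def _release_current(self):
        if self._current is not None and not self._current.freed:
            self._current.free()
        self._current = None


scanner = ScannerSketch([ArrowBatch([1, 2]), ArrowBatch([3])])
first = scanner.next_batch()
second = scanner.next_batch()  # frees `first`
scanner.close()                # frees `second`, as if the query were cancelled
```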

Currently supported impala data types for queries:
- BOOLEAN
- TINYINT
- SMALLINT
- INTEGER
- BIGINT
- FLOAT
- DOUBLE
- STRING
- DECIMAL(P,S)
- TIMESTAMP
- CHAR(N)
- VARCHAR(N)
- BINARY
- DATE

TODO:
    - Patches pending submission:
        - Support tpcds/tpch data-loading
          for paimon data table.
        - Virtual Column query support for querying
          paimon data table.
        - Query support with time travel.
        - Query support for paimon meta tables.
    - WIP:
        - Snapshot incremental read.
        - Complex type query support.
        - Native paimon table scanner, instead of
          jni based.

Testing:
    - Create tests table in functional_schema_template.sql
    - Add TestPaimonScannerWithLimit in test_scanners.py
    - Add test_paimon_query in test_paimon.py.
    - Already passed the tpcds/tpch tests for paimon tables. The test
      table data is currently generated by Spark, which is not
      supported by impala now; we have to do this because hive
      doesn't support generating paimon tables for dynamic-partitioned
      tables. We plan to submit a separate patch for tpcds/tpch data
      loading and the associated tpcds/tpch query tests.
    - JVM off-heap memory leak tests: ran looped tpch tests for
      1 day; no obvious off-heap memory increase was observed, and
      off-heap memory usage stayed within 10M.

Change-Id: Ie679a89a8cc21d52b583422336b9f747bdf37384
Reviewed-on: http://gerrit.cloudera.org:8080/23613
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-12-05 18:19:57 +00:00
Joe McDonnell
5eea4f6f79 IMPALA-14559: Ship calcite-planner jar in Impala packages
This adds the java/impala-package Maven project to make it easier
to ship / test the Calcite planner. impala-package has a dependency
on impala-frontend and calcite-planner, so its classpath requires
no extra work when constructing the classpath.

An additional cleanup is that this no longer puts the
impala-frontend-*-tests.jar on the classpath by default. This requires
updating the query event hooks test, as it relies on that jar being
present.

This does not change the default value for the use_calcite_planner
query option, so there is no change in behavior.

Testing:
 - Ran a core job
 - Built docker images and OS packages locally

Change-Id: I81dec2a5b59e279229a735c8bb1a23c77111a793
Reviewed-on: http://gerrit.cloudera.org:8080/23497
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-21 03:36:12 +00:00
Steve Carlin
e67b627858 IMPALA-14408: (addendum) Log Calcite exception in profile
This addendum logs the thrown exception in the runtime profile
under the CalciteFailureReason key.

Testing: test_ranger.py uses this.

Change-Id: Ia18a52c488f9c73d51690997b277fd8e918c645f
Reviewed-on: http://gerrit.cloudera.org:8080/23686
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-20 21:08:48 +00:00
Steve Carlin
a6bb0c7c45 IMPALA-14408: Use regular path for Calcite planner instead of CalciteJniFrontend
When the --use_calcite_planner=true option is set at the server level,
the queries will no longer go through CalciteJniFrontend. Instead, they
will go through the regular JniFrontend, which is the path that is used
when the query option for "use_calcite_planner" is set.

The CalciteJniFrontend will be removed in a later commit.

This commit also enables fallback to the original planner when an unsupported
feature exception is thrown. This was needed to allow the tests to run
properly: during initial database load, some queries access complex
columns, which throws the unsupported feature exception.
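
The fallback behavior can be sketched in Python as follows. This is an
illustration only: UnsupportedFeatureError, calcite_plan, original_plan and
the "complex_col" trigger are hypothetical stand-ins, not Impala classes.

```python
class UnsupportedFeatureError(Exception):
    """Hypothetical stand-in for the unsupported feature exception."""


def calcite_plan(sql):
    # Hypothetical rule: pretend complex-column access is unsupported.
    if "complex_col" in sql:
        raise UnsupportedFeatureError("complex types are not supported")
    return ("calcite", sql)


def original_plan(sql):
    return ("original", sql)


def plan(sql, use_calcite_planner):
    if not use_calcite_planner:
        return original_plan(sql)
    try:
        return calcite_plan(sql)
    except UnsupportedFeatureError:
        # Only this exception triggers the fallback; other errors propagate.
        return original_plan(sql)
```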

Change-Id: I732516ca8f7ea64f73484efd67071910c9b62c8f
Reviewed-on: http://gerrit.cloudera.org:8080/23523
Reviewed-by: Steve Carlin <scarlin@cloudera.com>
Tested-by: Steve Carlin <scarlin@cloudera.com>
2025-11-20 21:08:48 +00:00
Steve Carlin
54c0074b33 IMPALA-14405 ADDENDUM: Catch exception for bad column names
This commit is a fix on top of IMPALA-14405 for the Calcite
planner. The original commit matches column names from the
expression in the select clause.

For instance, if the query is "select 1 + 1", the label in
impala-shell will be "1 + 1". It accomplished this by
retrieving the string from the SqlNode object through the
MySql dialect.

However, when the expression cannot be rendered in the MySql
dialect, an AssertionError is thrown, causing the query to
fail. We don't want the query to fail; we just want to fall
back to the Calcite expression label, e.g. EXPR$0. This
occurred with this specific query:

"select timestamp_col + interval 3 nanoseconds"

So now the exception is caught and the default label name
is used. Eventually we should try to match what Impala has,
but this is a harder problem to fix.
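
A minimal Python sketch of the catch-and-default pattern described above.
The helper render_in_dialect and its failure condition are hypothetical;
the real code works on Calcite SqlNode objects.

```python
def render_in_dialect(expr):
    # Hypothetical renderer: fails on an expression it cannot unparse.
    if "nanoseconds" in expr:
        raise AssertionError("cannot unparse expression")
    return expr


def column_label(expr, ordinal):
    default_label = f"EXPR${ordinal}"
    try:
        return render_in_dialect(expr)
    except AssertionError:
        # Keep the query alive and fall back to the default label.
        return default_label
```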

Change-Id: I6c4d76a25fb2486eb1ef19485bce7888d45d282f
Reviewed-on: http://gerrit.cloudera.org:8080/23665
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Steve Carlin <scarlin@cloudera.com>
2025-11-18 21:34:29 +00:00
Steve Carlin
52334ba426 IMPALA-14421: Calcite planner: case statement returning wrong types for char, varchar
The 'case' function resolver in the original Impala planner has a quirk in it
which caused issues in the Calcite planner.

The function resolver for the original planner resolves all case statements with
the "boolean" version.  Later on, in the analysis of the CaseExpr, the proper
types are assessed and the necessary casting is added.

The Calcite planner follows a similar path. The resolver always returns boolean
as well and the coerce nodes module determines the proper return type for
the case statement.

Two other related issues are also fixed here:

Literal strings should be treated as type STRING instead of CHAR(X), but a null
literal should not be changed from CHAR(X) to STRING.  This broke a
'case' test in the test framework where the columns were non-literals with type
char(x), and the return value was a "null" which should not have forced a cast
to string.

A cast from a varchar to a varchar should be ignored.

Testing:
Added a test to calcite.test.
Ensured the existing cast test in test_chars.py passed.
Ran through the Jenkins Calcite testing framework.

Change-Id: I82d657f4bfce432c458ee8198188dadf9f23f2ef
Reviewed-on: http://gerrit.cloudera.org:8080/23560
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-18 07:47:39 +00:00
Steve Carlin
bc99705252 IMPALA-13902: Calcite planner: Implement is_spool_query_results
The is_spool_query_results query option is now supported in Calcite. The
returnAtMostOneRow method is now implemented to support this.
PlanRootSink is refactored to extract sanitizing query options (a new
method sanitizeSpoolingOptions()) out of
PlanRootSink.computeResourceProfile(). The bulk of memory bounding
calculation is also extracted out to a new class SpoolingMemoryBound.

Added "sleep" in ImpalaOperatorTable.java since some EE tests related to
result spooling call the sleep() function. Changed ImpalaPlanRel to extend the
RelNode interface.

A sanity test has been added to calcite.test, but the bulk of the
testing will be done through the Impala test framework when it is
enabled.

Testing:
- Pass FE tests PlannerTest#testResultSpooling, TpcdsCpuCostPlannerTest,
  and all java tests under calcite-planner project.
- Pass query_test/test_result_spooling.py and
  custom_cluster/test_result_spooling.py.

Co-authored-by: Riza Suminto

Change-Id: I5b9bf49e2874ee12de212b892bd898c296774c6f
Reviewed-on: http://gerrit.cloudera.org:8080/23562
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-16 02:33:02 +00:00
Michael Smith
d09940b5dd IMPALA-13563: Cleanup logging
Cleans up calls to logDebug and a few other locations:
- exit early if producing debug message input is expensive
- use slf4j parameterized logging
- normalize on logDebug handling isDebugEnabled checks
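
The same three ideas can be shown with Python's standard logging module,
used here only as an analogy to slf4j: guard expensive message inputs,
pass parameters instead of pre-formatted strings, and keep the enabled
check in one helper. The helper names are hypothetical.

```python
import logging

logger = logging.getLogger("sketch")


def expensive_summary(items):
    return ", ".join(sorted(items))


def log_debug_items(items):
    # Exit early when producing the debug message input is expensive.
    if not logger.isEnabledFor(logging.DEBUG):
        return False
    # Parameterized form: formatting is deferred to the logging framework.
    logger.debug("items: %s", expensive_summary(items))
    return True
```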

Change-Id: I32e1c62511c292d36aa879c60ae3d91ed4f65697
Reviewed-on: http://gerrit.cloudera.org:8080/22090
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-11 05:29:58 +00:00
Steve Carlin
62bf609942 IMPALA-14414: Calcite planner: Added new code to handle nan/inf
The current code works for NaN and Inf, but it breaks when upgrading
to v1.40.  This commit changes the code to handle these when we do
the upgrade to 1.40 and adds a basic test into the calcite.test to ensure
that when the upgrade happens, it does not break.

Change-Id: I8593a4942a2fe785a0c77134b78a9d97257225fc
Reviewed-on: http://gerrit.cloudera.org:8080/23561
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-11-05 12:55:39 +00:00
Steve Carlin
c67b19daf6 IMPALA-14405: Labels for Calcite expressions not matching original planner
Calcite sets literal expressions to EXPR$<x> which did not match
expressions given by the Impala planner. For literal expressions
such as "select 1 + 1", Impala creates the column name as "1 + 1".

The field names can be found in the abstract syntax tree, so
they are now set within the CalciteRelNodeConverter before the
logical tree is created.

A small test was added to calcite.test for a basic sanity check,
but more comprehensive tests will be run in the tests/shell module
(e.g. in test_shell_commandline.py and test_shell_interactive) which
contain tests for labels.

Change-Id: Ibd3e6366a284f53807b4b2c42efafa279249c1ea
Reviewed-on: http://gerrit.cloudera.org:8080/23516
Reviewed-by: Steve Carlin <scarlin@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-22 03:37:48 +00:00
Steve Carlin
420e357b95 IMPALA-13695: Calcite planner: fix for ndv with 2 args
The NDV function was crashing when called with the "scale" arg. This
requires special processing which exists in FunctionCallExpr.

The validation for this is now done in ImpalaNdvFunction
and the special calculation is done within ImpalaAggRel

This also fixes ndv for varchar types. The aggregation call
within CoerceNodes was not differentiating between varchar
and string. A cast to string function is needed in order
to run the ndv function on a varchar column.

Change-Id: I82419f77e043e9975865a042ffb8db75a26931f7
Reviewed-on: http://gerrit.cloudera.org:8080/23513
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-20 23:28:39 +00:00
Michael Smith
7fb986e47a IMPALA-14504: Use shaded hbase, protobuf from Hadoop
Switches to shaded Hbase so it can include its own versions of
dependencies. Note that hbase-client includes hbase-common,
hbase-protocol.

Excludes older protobuf-java from mysql-connector so we get it from
Hadoop.

Allows orc-format 1.0, which is a dependency in future ORC releases.

Change-Id: I386d03c3123ce1159abc54c505f60e0ae619f5fe
Reviewed-on: http://gerrit.cloudera.org:8080/23553
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-17 01:47:18 +00:00
Steve Carlin
69813a8c40 IMPALA-14464: Calcite planner should allow semi-colon in statement
The Calcite planner now handles a sql statement that has a semi-colon
at the end. Note that impala-shell doesn't pass the semi-colon into
the server. This is only seen with a direct call to the server.
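
One simple way to tolerate a trailing semi-colon is to strip it before
parsing. The Python sketch below shows that idea under the assumption that
only a single terminating semi-colon matters; it is not necessarily how the
patch handles it, and strip_trailing_semicolon is a hypothetical name.

```python
def strip_trailing_semicolon(sql):
    """Remove one terminating semi-colon and surrounding whitespace."""
    stripped = sql.rstrip()
    if stripped.endswith(";"):
        stripped = stripped[:-1].rstrip()
    return stripped
```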

Change-Id: Ie690159cd03f28f6b793628aa946292af71b6970
Reviewed-on: http://gerrit.cloudera.org:8080/23517
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-17 00:59:44 +00:00
Riza Suminto
3560621931 IMPALA-14503: Log maven dependency when building frontend
Impala Frontend has many dependencies, along with numerous
dependency exclusion/inclusion rules. This patch logs the maven
dependency tree to logs/mvn/mvn.log when invoking the "make java"
command.

Testing:
Manually ran "make java" from $IMPALA_HOME and verified that the
dependency trees are logged to logs/mvn/mvn.log.

Change-Id: I8cbe20faeab24bae708733d54996bd6c1dd97757
Reviewed-on: http://gerrit.cloudera.org:8080/23551
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-10-15 22:56:05 +00:00
Steve Carlin
cde4bc016c IMPALA-14115: Calcite planner: Added top-n analytic PlanNode optimization.
Impala has an optimization for analytic expressions that have a rank filter on
top of the analytic expression. It can add a top-n plan node to reduce the number
of rows examined. This is tested in tpcds query 67.

The optimization logic relies on an unassigned rank conjunct within the analyzer
while creating the analytic plan node.

A slight reorganization of the code was needed to implement this optimization.
The SlotRefs for the AnalyticInfo needed to be created a little earlier from
where it was done in the previous commit.

A small fix was made to normalize binary predicates. A non-normalized binary
predicate prevents the optimization from being used.

A call to the checkAndApplyLimitPushdown is needed for some of the optimizations
to kick in.

A new AllProjectInfo internal class was created to hold the relationships
between the Calcite RexNode objects and the Impala Analytic expressions.

Also, IMPALA-14158 is fixed by this commit. The nullsFirst value was
incorrect when the syntax was explicit in the query.

A new Calcite planner test was added in the junit tests to ensure the
optimization kicks in. The new test file is in the
PlannerTest/calcite/limit-pushdown-analytic-calcite.test file. This is a copy
of the limit-pushdown-analytic.test file in its parent directory but with some
modified results. Most of the differences are trivial, but IMPALA-14469 has been
filed to deal with one optimization that did not get fixed, which is when
the order by clause has a constant expression.

Change-Id: Ie6fa6781db56771b13b0cf49bd236f776016bf8d
Reviewed-on: http://gerrit.cloudera.org:8080/23317
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
2025-10-10 17:11:45 +00:00
Michael Smith
17b3f9ee88 IMPALA-14470: Migrate fair scheduler to slf4j
Moves our fair scheduler code off commons-logging to use slf4j like the
rest of Impala. Relies on the reload4j implementation to add an appender
for message capture.

Change-Id: Ia94d512f61c7e959c17e1139dceac31ad1a01bf2
Reviewed-on: http://gerrit.cloudera.org:8080/23478
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-01 04:59:58 +00:00
Steve Carlin
6aa4df4443 IMPALA-14105: Calcite planner: Runtime filters not being applied with outer joins
Prior to this commit, outer join conjuncts were not being placed into
the ValueTransfersGraph which prevented them from being considered for
runtime filters.  This caused a slowdown in some tpcds queries.

The conjuncts are now registered with the ImpalaJoinRel. The appropriate TableRef
objects are picked up from the underlying plan nodes.

Change-Id: I9e06d3f35a10f35ff8b57ba25dbab1bc6a35238a
Reviewed-on: http://gerrit.cloudera.org:8080/23318
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-28 23:40:32 +00:00
Steve Carlin
a6dbd4015c IMPALA-14106: Calcite planner: Register equivalent union expressions in value transfer graph
This commit registers the equivalent union expressions in the value
transfer graph when the physical union node is created for the Calcite
planner.

Change-Id: I4c858ae82a1cb7b89b0ae4e70205d8eeaeb28687
Reviewed-on: http://gerrit.cloudera.org:8080/23316
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-24 22:58:33 +00:00
Michael Smith
52b87fcefd IMPALA-14454: Exclude log4j 2 dependencies
While we use reload4j, we can safely exclude log4j 2 dependencies to
reduce the size of our artifacts.

Change-Id: Ic060bdd969a6e5cd01646376b27c7355ce841819
Reviewed-on: http://gerrit.cloudera.org:8080/23439
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-09-24 18:04:06 +00:00
Michael Smith
5137bb94ac IMPALA-14446: Clean up pom.xml
Cleans up repetitive patterns in pom.xml.

Centralize plugin configuration in pluginManagement. Replace inline
maven-compiler-plugin configuration with newer maven.compiler.release
and update to latest plugin version.

Centralize common dependencies in dependencyManagement, including
exclusions when appropriate. Remove exclusions that are no longer
relevant.

Compared before and after with dependency:tree; only difference is that
commons-cli now comes from hadoop and jersey-serv{let,er} are
effectively excluded; all versions matched. Also ensured
USE_APACHE_COMPONENTS=true compiles.

Adds com.amazonaws:aws-java-sdk-bundle to exclusion checking to ensure
it's not accidentally included alongside impala-minimal-s3a-aws-sdk.

Removes missed io.netty exclusion from IMPALA-12816.

Updates commons-dbcp2 to 2.12.0 to match Hive.

Change-Id: If96649840e23036b4a73ee23e8d12516497994f0
Reviewed-on: http://gerrit.cloudera.org:8080/23432
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-23 02:50:22 +00:00
Peter Rozsa
b0f1d49042 IMPALA-14016: Add multi-catalog support for local catalog mode
This patch adds a new MetaProvider called MultiMetaProvider, which is
capable of handling multiple MetaProviders at once, prioritizing one
primary provider over multiple secondary providers. The primary
provider handles some methods exclusively for deterministic behavior.
In database listings, if a database name occurs multiple times, the
contained tables are merged under that database name; if two
separate databases contain a table with the same name, query
analysis fails with an error.
This change also modifies the local catalog implementation's
initialization. If catalogd is deployed, then it instantiates the
CatalogdMetaProvider and checks if the catalog configuration directory
is set as a backend flag. If it's set, then it tries to load every
configuration from the folder, and tries to instantiate the
IcebergMetaProvider from those configs. If the instantiation fails, an
error is reported to the logs, but the startup is not interrupted.
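
The database listing merge can be sketched in Python. This is a simplified
model, not the MultiMetaProvider code: listings are plain dicts, the
function name is hypothetical, and the name clash is raised at merge time
here, whereas the patch reports it during query analysis.

```python
def merge_database_listings(primary, secondaries):
    """Merge {db_name: set(table_names)} listings from several providers."""
    merged = {db: set(tables) for db, tables in primary.items()}
    for listing in secondaries:
        for db, tables in listing.items():
            existing = merged.setdefault(db, set())
            clash = existing & set(tables)
            if clash:
                # Same table name under the same database name is ambiguous.
                raise ValueError(f"ambiguous tables in {db}: {sorted(clash)}")
            existing |= set(tables)
    return merged
```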

Tests:
 - E2E tests for multi-catalog behavior
 - Unit test for ConfigLoader

Change-Id: Ifbdd0f7085345e7954d9f6f264202699182dd1e1
Reviewed-on: http://gerrit.cloudera.org:8080/22878
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
2025-09-19 15:03:59 +00:00
jichen0919
826c8cf9b0 IMPALA-14081: Support create/drop paimon table for impala
This patch implements creating and dropping paimon tables
through impala.

Supported impala data types:
- BOOLEAN
- TINYINT
- SMALLINT
- INTEGER
- BIGINT
- FLOAT
- DOUBLE
- STRING
- DECIMAL(P,S)
- TIMESTAMP
- CHAR(N)
- VARCHAR(N)
- BINARY
- DATE

Syntax for creating paimon table:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
(
[col_name data_type ,...]
[PRIMARY KEY (col1,col2)]
)
[PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
STORED AS PAIMON
[LOCATION 'hdfs_path']
[TBLPROPERTIES (
'primary-key'='col1,col2',
'file.format' = 'orc/parquet',
'bucket' = '2',
'bucket-key' = 'col3'
)];

Two types of paimon catalogs are supported.

(1) Create table with hive catalog:

CREATE TABLE paimon_hive_cat(userid INT,movieId INT)
STORED AS PAIMON;

(2) Create table with hadoop catalog:

CREATE [EXTERNAL] TABLE paimon_hadoop_cat
STORED AS PAIMON
TBLPROPERTIES('paimon.catalog'='hadoop',
'paimon.catalog_location'='/path/to/paimon_hadoop_catalog',
'paimon.table_identifier'='paimondb.paimontable');

SHOW TABLE STATS/SHOW COLUMN STATS/SHOW PARTITIONS/SHOW FILES
statements are also supported.

TODO:
    - Patches pending submission:
        - Query support for paimon data files.
        - Partition pruning and predicate push down.
        - Query support with time travel.
        - Query support for paimon meta tables.
    - WIP:
        - Complex type query support.
        - Virtual Column query support for querying
          paimon data table.
        - Native paimon table scanner, instead of
          jni based.
Testing:
    - Add unit test for paimon impala type conversion.
    - Add unit test for ToSqlTest.java.
    - Add unit test for AnalyzeDDLTest.java.
    - Update default_file_format TestEnumCase in
      be/src/service/query-options-test.cc.
    - Update test case in
      testdata/workloads/functional-query/queries/QueryTest/set.test.
    - Add test cases in metadata/test_show_create_table.py.
    - Add custom test test_paimon.py.

Change-Id: I57e77f28151e4a91353ef77050f9f0cd7d9d05ef
Reviewed-on: http://gerrit.cloudera.org:8080/22914
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-09-10 21:24:49 +00:00
Steve Carlin
8b057881c7 IMPALA-14102: [part 2] Fixed the JoinTranspose rule.
This fix is needed before the join optimization fix can be committed.

The JoinTranspose rule provided by Calcite had two issues:

1) For tpcds-q10 and q35, an exception was being thrown. There is
a bug in the Calcite code when the Join-Project gets matched but
the Join is of reltype SemiJoin. In this case, the Projects do not
get created correctly and the exception gets thrown.

2) We only want to transpose a Project above a Join if there is an
underlying Join underneath the Project. The whole purpose is to
be able to create adjacent Join RelNodes. We do not have to transpose
the Project when it is not sandwiched between two Join nodes. It is
preferable to keep it underneath the Join since the row width
calculation would be affected (the Project may reduce the number of
columns, thus reducing the row width).
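
Restriction 2) can be expressed as a small matching predicate. The Python
sketch below uses a hypothetical Rel node type in place of Calcite RelNode
objects and only covers the "Project sandwiched between two Joins" check;
it does not model the SemiJoin issue from 1).

```python
from dataclasses import dataclass, field


@dataclass
class Rel:
    kind: str  # "Join", "Project", or "Scan"
    inputs: list = field(default_factory=list)


def should_transpose(join):
    """True if some input of `join` is a Project sitting directly on a Join."""
    return any(
        child.kind == "Project"
        and child.inputs
        and child.inputs[0].kind == "Join"
        for child in join.inputs
    )


lower_join = Rel("Join", [Rel("Scan"), Rel("Scan")])
sandwiched = Rel("Join", [Rel("Project", [lower_join]), Rel("Scan")])
not_sandwiched = Rel("Join", [Rel("Project", [Rel("Scan")]), Rel("Scan")])
```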

This commit extends the given JoinProjectTranspose rule by Calcite
and handles these added restrictions.

Change-Id: I7f62ec030fc8fbe36e6150bf96c9673c44b7da1b
Reviewed-on: http://gerrit.cloudera.org:8080/23313
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-08 02:49:21 +00:00
Fang-Yu Rao
1ff4e1b682 IMPALA-13767: Do not treat CTEs as names of actual tables
This patch implements an additional check when collecting table names
that are used in the given query. Specifically, for a table name that
is not fully qualified, we make sure the derived fully qualified table
name is not a common table expression (CTE) of a SqlWithItem in a
WITH clause since such CTE's are not actual tables.
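
A minimal Python sketch of the check, assuming table references and CTE
names have already been extracted as strings. The function name and the
"default" database are hypothetical; the patch itself works on Calcite
SqlNode objects.

```python
def collect_table_names(referenced, cte_names, default_db):
    """Qualify table names, skipping unqualified names that are CTEs."""
    tables = set()
    for name in referenced:
        if "." not in name and name in cte_names:
            # A CTE from a WITH clause is not an actual table.
            continue
        qualified = name if "." in name else f"{default_db}.{name}"
        tables.add(qualified)
    return tables


# with t as (select * from alltypes) select * from t
tables = collect_table_names(["alltypes", "t"], {"t"}, "default")
```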

Testing:
 - Added a test in test_ranger.py to verify the issue is fixed.

Change-Id: I3f51af42d64cdcff3c26ad5a96c7f53ebef431b3
Reviewed-on: http://gerrit.cloudera.org:8080/23209
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Fang-Yu Rao <fangyu.rao@cloudera.com>
2025-09-06 05:25:03 +00:00
Steve Carlin
83499e5be7 IMPALA-14102: [part 1] Calcite Planner: optimize join rule
This is part 1 of the commit for optimizing join rules for
Calcite.  This commit is just a copy of the LoptOptimizeJoinRule.java
from Calcite v1.37 for subsequent modification.

The purpose of this commit is to serve as a starting point, so
that the Impala-specific modifications to the rule, which will be
made in subsequent commits for IMPALA-14102, can easily be seen by
comparing against this unmodified copy.

Change-Id: I63daf6dacf0547a0488c1ecf0bc185b548e00d87
Reviewed-on: http://gerrit.cloudera.org:8080/23312
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-03 04:05:09 +00:00
Steve Carlin
e74495e656 IMPALA-14101: [part 2] Calcite planner: Add cost model calculations
This commit adds the cost model and calculations to be used in the join
optimizer rule. The ImpalaCost object implements the RelOptCost interface
and contains values which contribute to a cost.  The ImpalaCost object
roughly mirrors the Calcite VolcanoCost object with some slight variations.
The ImpalaCost object only looks at the cpu and io cost and ignores the
rowCount cost. The rowCount cost is not needed because it is already
baked into the cpu and io results. That is to say, we determine the cpu cost
and io cost by using the rowCount cost.
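
The idea of comparing costs on cpu and io only can be sketched in Python.
This is not the ImpalaCost class: the name CostSketch is hypothetical, and
summing cpu and io into one total is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostSketch:
    row_count: float
    cpu: float
    io: float

    def total(self):
        # row_count is deliberately ignored: it is already baked into
        # the cpu and io values.
        return self.cpu + self.io

    def is_le(self, other):
        return self.total() <= other.total()

    def plus(self, other):
        return CostSketch(
            self.row_count + other.row_count,
            self.cpu + other.cpu,
            self.io + other.io,
        )


a = CostSketch(row_count=1000, cpu=10, io=5)
b = CostSketch(row_count=10, cpu=20, io=5)
```

Here a compares as cheaper than b even though its row count is larger,
because only cpu and io take part in the comparison.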

The ImpalaCost object is generated in the ImpalaRelMdNonCumulativeCost
class which is called from Calcite for a given RelNode. The cost generated
by this object uses the various inputs of the RelNode to calculate the
cpu and io time for the given logical node. Note that this is a
non-cumulative cost.  A cumulative cost exists within Calcite as well, but
there was no need to change the cumulative cost logic.

The cost is used by the Calcite LoptOptimizeJoinRule when determining join
ordering. It will compare costs of different join ordering and choose the
join ordering with a lower cost.

With the current iteration, we only customize the costs for Impala for
aggregates, table scans, and joins.

A TODO in this commit is to allow various cpu and io costs to be configurable.

Change-Id: I1e52b0e11e9a6d5814b0313117dd9c56602f3ff5
Reviewed-on: http://gerrit.cloudera.org:8080/23311
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-09-03 04:05:09 +00:00
Fang-Yu Rao
0b9e2b2cd1 IMPALA-13011: Support authorization for Calcite in Impala
This patch adds support for authorization when Calcite is the planner.
Specifically, this patch focuses on the authorization of table-level
and column-level privilege requests, including the case when a table
is a regular view, whether or not the view was created by a superuser.
Note that CalciteAnalysisDriver would throw an exception from analysis()
if given a query that requires table masking, i.e., column masking or row
filtering, since this feature is not yet supported by the Calcite
planner.

Moreover, we register the VIEW_METADATA privilege for each function
involved in the given query. We hardcode the database associated with
the function to 'BuiltinsDb', which is a bit hacky. We should not be
doing this once each function could be associated with a database when
we are using the Calcite planner. We may need to change Calcite's
parser for this.

The issue reported in IMPALA-13767 will be taken care of in another
separate patch and hence this patch could incorrectly register the
privilege request for a common table expression (CTE) in a WITH
clause, preventing a legitimate user from executing a query involving
CTE's.

Testing:
 - We manually verified that the patch could pass the test cases in
   AuthorizationStmtTest#testPrivilegeRequests() except for
   "with t as (select * from alltypes) select * from t", for which
   the fix will be provided via IMPALA-13767.
 - Added various tests in test_ranger.py.

Change-Id: I9a7f7e4dc9a86a2da9e387832e552538e34029c1
Reviewed-on: http://gerrit.cloudera.org:8080/22716
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-03 00:15:22 +00:00
Steve Carlin
048b5689fd IMPALA-14080: Support LocalFsTable table types in Calcite planner.
IMPALA-13947 changes the use_local_catalog default to true. This causes
failures when the use_calcite_planner query option is set to true.

The Calcite planner was only handling HdfsTable table types. It will
now handle LocalFsTable table types as well.

Currently, if the row count is missing from a table, the Calcite planner
will load all partitions and iterate over them to estimate it. This is
inefficient in local catalog mode and ideally should happen later, after
partition pruning. Follow-up work is needed to improve this.

Testing:
Reenable local catalog mode in
TestCalcitePlanner.test_calcite_frontend
TestWorkloadManagementSQLDetailsCalcite.test_tpcds_8_decimal

Co-authored-by: Riza Suminto

Change-Id: Ic855779aa64d11b7a8b19dd261c0164e65604e44
Reviewed-on: http://gerrit.cloudera.org:8080/23341
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-08-27 03:17:13 +00:00
Steve Carlin
ff58c5d42f IMPALA-14101: [part 1] Commit Cost file from Calcite
This commit is just a copy of the VolcanoCost.java file from Calcite
into this Impala repository.  The file can be found here:

https://github.com/apache/calcite/blob/calcite-1.37.0/core/src/main/...
      .../java/org/apache/calcite/plan/volcano/VolcanoCost.java

The only differences between this file and the Calcite file are:
1) All VolcanoCost strings have been changed to ImpalaCost
2) The package name is an Impala package.

This will make it easier to show the changes made for the Impala cost
model change in IMPALA-14101.

Change-Id: I864e20fb63c0ae4f2f88016128d2a68f39e17dfb
Reviewed-on: http://gerrit.cloudera.org:8080/23310
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-08-20 21:01:29 +00:00
Steve Carlin
5244f6169e IMPALA-14061: Calcite Planner: added Calcite rules
This commit adds Calcite optimization rules to create more efficient
plans. These rules should be considered a work in progress.  These
were tested against a 3TB tpcds database so they are fairly efficient
as is, but we can make improvements as we see them along the way.

Most of the changes have been added to the CalciteOptimizer file. There
are several phases of rules that are applied, which are as follows:

- expand nodes:  These rules change the plan to a plan that can be
handled by Impala. For instance, there are RelNodes such as
"LogicalIntersect" which are not directly applicable to the Impala
physical nodes so they need to be expanded.
- coerce nodes: This module changes the nodes so they have the
correct datatype values (e.g. literal strings in Calcite are char
but need to be varchar for Impala)
- optimize nodes: first pass on reordering the logical RelNode ordering.
- join: Squishes the join RelNodes together, pushes them into one
"multiJoin" and then lets Calcite's join optimizer reorder the joins
into a more optimal plan.  A note on this:  with this iteration,
statistics are still not being applied. This will come in with later
commits to make better plans.
- post join optimize nodes: Reruns the optimize nodes since the
join ordering may present new optimization opportunities
- pre Impala commit: Extra massaging after optimization that is
done at the end
- conversion to Impala RelNodes: Maps Calcite RelNodes into Impala
RelNodes which will then be mapped to Impala PlanNodes

In addition to this general change, there is also a change with
removing the "toCNF" rule. Calcite has multiple places where it
creates a SEARCH operator via "simplifying" the RexNodes within
various rules. This operator is not supported directly in Impala
and we need to call "expandSearch" to handle this. Because Impala
does this under the covers in the rules, this has been fixed
by overriding the RexBuilder (with ImpalaRexBuilder) and expanding
the SEARCH operator whenever it is called (sidenote: we could have
changed the rules that called simplify, but that would have resulted
in too much code duplication).
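
As a rough illustration of what expanding a SEARCH operator means, the
sketch below rewrites a search over points and closed ranges into plain
comparisons joined by OR. It is a simplified Python model, not the Calcite
or Impala implementation; the name expand_search and the tuple-based range
encoding are hypothetical.

```python
def expand_search(column, points=(), ranges=()):
    """Rewrite SEARCH(column, {points, ranges}) as ORed comparisons.

    Hypothetical model: `points` are single values and `ranges` are
    (low, high) pairs treated as closed intervals.
    """
    terms = [f"{column} = {v!r}" for v in points]
    terms += [f"({column} >= {lo!r} AND {column} <= {hi!r})" for lo, hi in ranges]
    if not terms:
        return "FALSE"  # an empty search set matches nothing
    return " OR ".join(terms)
```

Doing the rewrite in one place (here a single function, in the commit the
overridden RexBuilder) avoids repeating it in every rule that simplifies.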

The toCNF rule was removed and placed as a call within the
CoerceOperandShuttle, which already manipulates all the RexNodes, so
all that code is now in one place.

Change-Id: I6671f7ed298a18965ef0b7a5fc10f4912333a52b
Reviewed-on: http://gerrit.cloudera.org:8080/22870
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-08-20 12:03:16 +00:00
Steve Carlin
922443da46 IMPALA-14165: Type coercion code accidentally omitted from analysis
On the first cut of creating the Calcite planner, the Calcite planner
was standalone and ran its own JniFrontend.

In the current version, the parsing, validating, and single node
planning are called from the Impala framework.

There is some code in the first cut regarding the
"ImpalaTypeCoercionFactory" class which handles deriving the correct
data type for various expressions, for instance (found in exprs.test):

select count(*) from alltypesagg where
10.1 in (tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col)

Without this patch, the query returns the following error:
UDF ERROR: Decimal expression overflowed

This code can be found in CalciteValidator.java, but was accidentally omitted
from CalciteAnalysisDriver.

Change-Id: I74c4c714504400591d1ec6313f040191613c25d9
Reviewed-on: http://gerrit.cloudera.org:8080/23039
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Steve Carlin <scarlin@cloudera.com>
2025-08-10 17:00:54 +00:00
Steve Carlin
98b6c0208f IMPALA-14094: Calcite planner: Use table and column statistics for optimization
This commit enables the Calcite planner join optimization rule to make use
of table and column statistics in Impala.

The ImpalaRelMetadataProvider class provides the metadata classes to the
rule optimizer.

All the ImpalaRelMd* classes are extensions of Calcite Metadata classes. The
ones overridden are:

ImpalaRelMdRowCount:
    This provides the cardinality of a given type of RelNode.
    The default implementation in the RelMdRowCount is used for some of the
    RelNodes. The ones overridden are:

    TableScan: Gets the row count from the Table object.

    Filter: Calls the FilterSelectivityEstimator and adjusts the number of
    rows based on the selectivity of the filter condition.

    Join: Uses our own algorithm to determine the number of rows that will
    be created by the join condition using the JoinRelationInfo (more on this
    below).

ImpalaRelMdDistinctRowCount:
    This provides the number of distinct rows returned by the RelNode.
    The default implementation in the RelMdDistinctRowCount is used for
    some of the RelNodes. The ones overridden are:

    TableScan: Uses the stats. If stats are not defined, all rows will
    be marked as distinct.

    Aggregate: For some reason, Calcite sometimes returns a number of
    distinct rows greater than the number of rows, which doesn't make
    sense. So this ensures the number of distinct rows never exceeds
    the number of rows.

    Filter: The number of distinct rows is reduced by the calculated
    selectivity.

    Join: same as aggregate.

ImpalaRelMdRowSize:
    Provides the Impala interpreted size of the Calcite datatypes.

ImpalaRelMdSelectivity:
    The selectivity is calculated within the RowCount. An initial attempt
    was made to use this class for selectivity, but it seemed rather clunky
    since the row counts and selectivity are very closely intertwined and
    the pruned row counts (a future commit) made this even more complicated.
    So the selectivity metadata is overridden for all our RelNodes as full
    selectivity (1.0).
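
The ImpalaRelMdDistinctRowCount overrides described above can be sketched
roughly as follows. This is a minimal Python model of the stated rules
only; the function names are hypothetical and the real classes work on
Calcite RelNodes and metadata queries rather than plain numbers.

```python
def table_scan_distinct_rows(ndv_stat, row_count):
    """Use the stat when present; otherwise treat every row as distinct."""
    return row_count if ndv_stat is None else ndv_stat


def capped_distinct_rows(distinct_rows, row_count):
    """Never report more distinct rows than rows (Aggregate and Join)."""
    return min(distinct_rows, row_count)


def filtered_distinct_rows(distinct_rows, selectivity):
    """Scale the distinct row count by the filter selectivity (Filter)."""
    return distinct_rows * selectivity
```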

As mentioned above, the FilterSelectivityEstimator class tries to approximate
the number of rows filtered out with the given condition. Some work still
needs to be done to make this more in line with the Expr selectivities; a Jira
will be filed for this.

The JoinRelationInfo is the helper class that estimates the number of rows
that will be output by the Join RelNode. The join condition is split up into
multiple conditions broken up by the AND keyword. This first pass has some major
flaws which need to be corrected, including:
   - Only equality conditions limit the number of rows. Non-equality conditions
   will be ignored.  If there are only non-equality conditions, the cardinality
   will be the equivalent of a cross join.
   - Left joins take the maximum of the calculated join and the total number
   of rows on the left side. This can probably be improved upon if we find
   the matching rows provide a cardinality that is greater than one for each
   row. (Of course, right joins and outer joins have this same logic).
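
The join estimation rules listed above can be sketched as below. This is an
illustration under assumptions, not the JoinRelationInfo code: the
per-condition reduction by max(left NDV, right NDV) is the textbook
equi-join estimate and is assumed here, the handling of "full" joins is a
guess at what "the same logic" means, and estimate_join_rows is a
hypothetical name.

```python
def estimate_join_rows(left_rows, right_rows, equi_ndvs, join_type="inner"):
    """Estimate join output rows.

    equi_ndvs: list of (left_ndv, right_ndv) pairs, one per equality
    condition. Non-equality conditions are simply not passed in,
    mirroring the note that they are ignored.
    """
    rows = float(left_rows) * float(right_rows)  # cross join baseline
    for left_ndv, right_ndv in equi_ndvs:
        # Assumed textbook reduction; the real estimator may differ.
        rows /= max(left_ndv, right_ndv, 1)
    if join_type == "left":
        rows = max(rows, left_rows)  # every left row is output at least once
    elif join_type == "right":
        rows = max(rows, right_rows)
    elif join_type == "full":
        rows = max(rows, left_rows, right_rows)  # assumption
    return rows
```

With no equality conditions the loop never runs, so the estimate stays at
the cross join cardinality, as the first bullet describes.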

Change-Id: I9d5bb50eb562c28e4b7c7a6529d140f98e77295c
Reviewed-on: http://gerrit.cloudera.org:8080/23122
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Steve Carlin <scarlin@cloudera.com>
2025-08-10 01:20:43 +00:00
Steve Carlin
8fa383d9eb IMPALA-14166: Calcite Planner: Ensure 'unsupported' functions are handled correctly
There are some datasketches functions which return a Function
object where the "isUnsupported" method returns true.

This needs to be explicitly handled in the Calcite code as unsupported.

Change-Id: Ic2c4a96005fc7571bde28643ea4cecda61839c77
Reviewed-on: http://gerrit.cloudera.org:8080/23041
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-07-05 23:50:44 +00:00
Riza Suminto
35aa2e2add IMPALA-14187: Add IMPALA_JAVA_TARGET env var
Impala is preparing to switch to JDK17 for Java compilation by default.
While the source version might remain in 1.8 for longer, we should
experiment with targeting binary version 17.

This patch adds IMPALA_JAVA_TARGET env var to control target binary
version. It is initialized in impala-config-java.sh, depending on value
of IMPALA_JDK_VERSION env var.

Testing:
Pass data load and FE tests with IMPALA_JDK_VERSION=17.

Change-Id: If194d87c542d416b878661403c32c6adc2930199
Reviewed-on: http://gerrit.cloudera.org:8080/23096
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-06-27 00:41:57 +00:00
Fang-Yu Rao
aba3a705a4 IMPALA-13982: Support regular views for Calcite planner in Impala
Before this patch, the Calcite planner in Impala only supported inline
views like 'temp' in the following query.

  select id from (
    select * from functional.alltypes
  ) as temp;

Regular views, on the other hand, were not supported. For instance, the
Calcite planner in Impala did not support regular views like
'functional.alltypes_view' created via the following statement and
hence queries against such regular views like
"select id from functional.alltypes_view" were not supported.

  CREATE VIEW functional.alltypes_view
  AS SELECT * FROM functional.alltypes;

This patch adds the support for regular views to the Calcite planner
via adding a ViewTable for each regular view in the given query
when populating the Calcite schema. This is similar to how regular
views are supported in PlannerTest#testView() at
https://github.com/apache/calcite/blob/main/core/src/test/java/org/apache/calcite/tools/PlannerTest.java
where the regular view to be tested is added in
https://github.com/apache/calcite/blob/main/testkit/src/main/java/org/apache/calcite/test/CalciteAssert.java.
We do not have to use or extend ViewTableMacro in
Apache Calcite because the information about the data types
returned from a regular view is already available in its respective
FeTable. Therefore, there is no need to parse the SQL statement
representing the regular view and collect the metadata of tables
referenced by the regular view as done by ViewTableMacro.

The patch supports the following cases, where
'functional.alltypes_view' is a regular view defined as
"SELECT * FROM functional.alltypes".
1. select id from functional.alltypes_view.
2. select alltypes_view.id from functional.alltypes_view.
3. select functional.alltypes_view.id from functional.alltypes_view.

Joining a regular view with an HDFS table like the following is also
supported.

  select alltypestiny.id
  from functional.alltypes_view, functional.alltypestiny

Note that after this patch, queries against regular views are supported
only in the legacy catalog mode but not the local catalog mode. In
fact, queries against HDFS tables in the local catalog mode are not
supported yet by the Calcite planner either. We will deal with this in
IMPALA-14080.

Testing:
 - Added test cases mentioned above to calcite.test. This makes sure
   the test cases are supported when we start the Impala server with
   the flag of '--use_calcite_planner=true'.
 - Manually verified the test cases above are supported if we start
   the Impala server with the environment variable USE_CALCITE_PLANNER
   set to true and the query option use_calcite_planner set to 1.

Change-Id: I600aae816727ae942fb221fae84c2aac63ae1893
Reviewed-on: http://gerrit.cloudera.org:8080/22883
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-06-24 22:30:15 +00:00
Steve Carlin
e9d7c152dc IMPALA-13582: Calcite planner: return proper labels for columns
The field names were not getting passed up to the output expressions.
These are found on the RelNode row type object.

The change is made in two different flows:

The first flow is in CalciteSingleNodePlanner which gets hit when
running from impala-shell and the use_calcite_planner query option
is used.

The second flow is in ExecRequestCreator and gets hit when running
with the start-up option that loads a different JniFrontend jar. This
mode will soon be deprecated, but is still used for testing purposes.

Change-Id: I42818646d98f87d8744585010fc166f9d416aec1
Reviewed-on: http://gerrit.cloudera.org:8080/22117
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-06-05 21:49:30 +00:00
Steve Carlin
006f9ba589 IMPALA-14041: Enable planner tests
This commit will enable some junit tests for the Calcite planner.

To run these tests, use the following command from $IMPALA_HOME:

(pushd java/calcite-planner && mvn -B -fae test -Dtest=TpcdsCpuCostPlannerTest)

Change-Id: Idaab4e9068bb64e9a9ee12d83cd2b6b55b99b9bf
Reviewed-on: http://gerrit.cloudera.org:8080/22864
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-06-03 00:27:34 +00:00
Zoltan Borok-Nagy
afa329fd89 IMPALA-13931: TestIcebergRestCatalog.test_rest_catalog_basic failed at setup
There were several issues with test_rest_catalog_basic which made it
fail in environments that used Ozone or S3.

Missing dependency of Ozone and S3 classes:
* This is resolved in iceberg-rest-catalog-test/pom.xml by adding
  a dependency to impala-executor-deps

Hadoop configuration was not initialized properly:
* run-iceberg-rest-server.sh used Maven to run Iceberg REST Catalog in
  which case Maven is in charge of setting the CLASSPATH but the
  core-site/ozone-site/etc. config files were not on it, so the
  REST Catalog used a default Hadoop configuration that wasn't good
  for our environment.
* To overcome the CLASSPATH problem now we create a runnable JAR in
  iceberg-rest-catalog-test/pom.xml and also generate the proper
  CLASSPATH during compilation.
* run-iceberg-rest-server.sh now uses java -cp to run the REST
  Catalog

S3 builds threw NoSuchMethodException for the "create" method of
ApacheHttpClientConfigurations:
* The Iceberg library dynamically loads its HTTP client builders
  to work around an error, see details in
  https://github.com/apache/iceberg/issues/6715
* So the Iceberg lib dynamically wants to load the "create" method
  of its own ApacheHttpClientConfigurations class but it fails
  with NoSuchMethodException.
* The critical code is invoked from Impala's IcebergMetadataScanner's
  ScanMetadataTable() method which happens to be invoked through
  JNI from the C++ backend.
* The context class loader of such threads is NULL, which means
  Java will use the bootstrap class loader to load classes and methods,
  but that doesn't have the proper resources on its classpath.
* To overcome this issue we set the context class loader for the thread
  to the class loader that originally loaded the IcebergMetadataScanner
  class.

Change-Id: I9dc0e30aeaff0b8de41426ba38506383b4af472c
Reviewed-on: http://gerrit.cloudera.org:8080/22818
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-05-09 17:01:56 +00:00
Steve Carlin
55804f7874 IMPALA-12959: Calcite planner: Implement count star optimization...
IMPALA-13779: Handle partition key scan optimization

IMPALA-13780: Handle full acid selects

The 3 commits referenced here are somewhat related in that they all
involve changes for the HdfsScanRel column layout and have been
combined.

For the optimizations, some infrastructure code was added. Information
from the Aggregation RelNode is needed by the TableScan RelNode and
vice versa. The mechanism to send information to children RelNodes is
by using the ParentPlanRelContext. The mechanism for sending information
up to the parent is by using the NodeWithExprs object. If the conditions
are met for the optimizations (equivalent to the conditions in the current
Impala planner), the optimizations are applied.

For count star optimization, the STAT_NUM_ROWS fake column is added to hold
the information, and then the aggregate applies a sum_init_zero on this
column.
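
A minimal sketch of the idea, assuming the STAT_NUM_ROWS column carries one
row count per scanned file (that granularity is an assumption, and both
function names below are hypothetical Python stand-ins):

```python
def sum_init_zero(values):
    """Sum that starts at 0, so an empty input yields 0 rather than NULL."""
    total = 0
    for v in values:
        total += v
    return total


def count_star_from_stats(file_num_rows):
    """count(*) computed from row-count stats instead of scanning rows.

    file_num_rows: hypothetical list of row counts standing in for the
    values of the STAT_NUM_ROWS column.
    """
    return sum_init_zero(file_num_rows)
```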

For partition key scan, if the conditions are met, the Impala HdfsScanNode
is sent a flag in its constructor that handles the optimization.

For acid selects, the SingleNodePlanner has code to handle the additional
PlanNodes needed. Some code involving column number calculation was needed
to deal with the extra columns that are present in a full acid table.

One extra note: In HdfsScanNode, a Preconditions check was removed.
This state check ensured that the countStarSlot only existed when the
aggregation substitution map was set. This does not apply to Calcite which
does not use the substitution map to handle the count star optimization.

Change-Id: I975beefedd2cceb34dad0f93343a46d1b7094c13
Reviewed-on: http://gerrit.cloudera.org:8080/22425
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-05-04 03:16:36 +00:00
Steve Carlin
30979e7d30 IMPALA-13517: Support overloaded || operator
The || operator is used for both "or" and "concat". A new
Impala custom operator is created to handle both of them,
treating the precedence of the operator as if it's an "or".

The "or" is chosen if both parameters are null or boolean, as
taken from logic in CompoundVerticalBarExpr.

At convertlet time (when converting from SqlNode to RelNode),
the real operator is placed into the RexNode.
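
The selection rule can be sketched as follows. This is a simplified Python
model; the types are plain strings and resolve_vertical_bar is a
hypothetical name, not the operator class added by this commit.

```python
def resolve_vertical_bar(left_type, right_type):
    """Return 'or' or 'concat' for `left || right`.

    Types are plain strings here ('boolean', 'null', 'string', ...),
    a hypothetical stand-in for the planner's type objects.
    """
    logical = {"boolean", "null"}
    if left_type in logical and right_type in logical:
        return "or"
    return "concat"
```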

Change-Id: Iabaf02e84b769db1419bd96e1d1b30b8f83d56f5
Reviewed-on: http://gerrit.cloudera.org:8080/22105
Reviewed-by: Steve Carlin <scarlin@cloudera.com>
Tested-by: Steve Carlin <scarlin@cloudera.com>
2025-05-01 19:33:45 +00:00
Steve Carlin
e473f034dc IMPALA-13042: Calcite Planner; Enable partition pruning
Enables partition pruning in the HdfsScan RelNode.

The PrunedPartitionHelper is used to separate the conjuncts and
is a wrapper around the HdfsPartitionPruner.

There are tests that currently exist in the Impala test framework
that check the runtime profile for the number of files read from
the table scan which are fixed with this commit.

One small modification was made to the "preserveRootTypes" parameter
in HdfsPartitionPruner. When it was set to false, the Calcite planner
failed in one place. It makes sense for the main code line that the
root type should not change, and this was tested on a full Jenkins run.

Change-Id: I8c698b857555baeae347835b4a6b39d035f12405
Reviewed-on: http://gerrit.cloudera.org:8080/22409
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Steve Carlin <scarlin@cloudera.com>
2025-05-01 00:07:03 +00:00
Steve Carlin
3d24f45f9c IMPALA-13796: Calcite planner: Improper casting for char on join condition
For the following query:

SELECT COUNT(*) from orders t1 LEFT OUTER JOIN orders t2
ON cast(t1.o_comment as char(120)) = cast(t2.o_comment as char(120));

The join condition uses the Function "=(CHAR,CHAR)".  The function
defined within Impala uses a wildcard for the length of the char (-1).

Previous to the fix, the code detected that the char(120) needed casting,
would cast it to a char(1), and this produced erroneous results.

The fix is to make sure we don't cast from a char(x) to a char(-1).
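
A minimal sketch of the check, assuming lengths are compared as plain
integers (needs_char_cast and WILDCARD_LEN are hypothetical names used
only for illustration):

```python
WILDCARD_LEN = -1  # the function signature accepts a char of any length


def needs_char_cast(arg_len, param_len):
    """Decide whether a char(arg_len) argument must be cast to char(param_len)."""
    if param_len == WILDCARD_LEN:
        return False  # char(x) already satisfies char(-1); keep its length
    return arg_len != param_len
```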

Change-Id: Ib9f44e3d5a7623a20d9841541bb496c1dee32d1e
Reviewed-on: http://gerrit.cloudera.org:8080/22541
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Steve Carlin <scarlin@cloudera.com>
2025-04-29 12:38:08 +00:00
Steve Carlin
706e1f026c IMPALA-13657: Connect Calcite planner to Impala Frontend framework
This commit adds the plumbing created by IMPALA-13653. The Calcite
planner is now called from Impala's Frontend code via 4 hooks which
are:

- CalciteCompilerFactory: the factory class that creates
    the implementations of the parser, analysis, and single node
    planner hooks.
- CalciteParsedStatement: The class which holds the Calcite SqlNode
    AST.
- CalciteAnalysisDriver: The class that does the validation of the
    SqlNode AST
- CalciteSingleNodePlanner: The class that converts the AST to a
    logical plan, optimizes it, and converts it into an Impala
    PlanNode physical plan.

To run on Calcite, one needs to do two things:

1) set the USE_CALCITE_PLANNER env variable to true before starting
the cluster. This adds the jar file into the path in the
bin/setclasspath.sh file, which is not there by default at the time
of this commit.
2) set the use_calcite_planner query option to true.

This commit makes the CalciteJniFrontend class obsolete. Once the
test cases are moved out of there, that class and others can be
removed.

Change-Id: I3b30571beb797ede827ef4d794b8daefb130ccb1
Reviewed-on: http://gerrit.cloudera.org:8080/22319
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-04-09 23:55:15 +00:00
Zoltan Borok-Nagy
bd3486c051 IMPALA-13586: Initial support for Iceberg REST Catalogs
This patch adds initial support for Iceberg REST Catalogs. This means
now it's possible to run an Impala cluster without the Hive Metastore,
and without the Impala CatalogD. Impala Coordinators can directly
connect to an Iceberg REST server and fetch metadata for databases and
tables from there. The support is read-only, i.e. DDL and DML statements
are not supported yet.

This was initially developed in the context of a company Hackathon
program, i.e. it was a team effort that I squashed into a single commit
and polished the code a bit.

The Hackathon team members were:
* Daniel Becker
* Gabor Kaszab
* Kurt Deschler
* Peter Rozsa
* Zoltan Borok-Nagy

The Iceberg REST Catalog support can be configured via a Java properties
file, the location of it can be specified via:
 --catalog_config_dir: Directory of configuration files

Currently only one configuration file can be in the directory as we only
support a single Catalog at a time. The following properties are mandatory
in the config file:
* connector.name=iceberg
* iceberg.catalog.type=rest
* iceberg.rest-catalog.uri

The first two properties can only be 'iceberg' and 'rest' for now, they
are needed for extensibility in the future.
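
The mandatory properties can be checked with a sketch like the one below.
It is a Python illustration, not Impala's loader: the parser handles only
simple key=value lines rather than the full Java properties syntax, and
both function names are hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def validate_rest_catalog_config(props):
    """Return a list of problems; an empty list means the config is usable."""
    problems = []
    if props.get("connector.name") != "iceberg":
        problems.append("connector.name must be 'iceberg'")
    if props.get("iceberg.catalog.type") != "rest":
        problems.append("iceberg.catalog.type must be 'rest'")
    if not props.get("iceberg.rest-catalog.uri"):
        problems.append("iceberg.rest-catalog.uri is mandatory")
    return problems
```

For example, a file with connector.name=iceberg, iceberg.catalog.type=rest
and a placeholder iceberg.rest-catalog.uri such as http://localhost:9084
(a made-up address) yields an empty problem list.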

Moreover, Impala Daemons need to specify the following flags to connect
to an Iceberg REST Catalog:
 --use_local_catalog=true
 --catalogd_deployed=false

Testing
* e2e test added to cover basic functionality against a custom-built
  Iceberg REST server that delegates to HadoopCatalog under the hood
* Further testing, e.g. Ranger tests are expected in subsequent
  commits

TODO:
* manual testing against Polaris / Lakekeeper, we could add automated
  tests in a later patch

Change-Id: I1722b898b568d2f5689002f2b9bef59320cb088c
Reviewed-on: http://gerrit.cloudera.org:8080/22353
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-04-02 20:04:12 +00:00
Peter Rozsa
1f70269392 IMPALA-13838: Update Impala version to 5.0.0-SNAPSHOT
Change-Id: I9c5a2d817b30e14333feeb5b2de3e0c40795723f
Reviewed-on: http://gerrit.cloudera.org:8080/22596
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-08 14:13:48 +00:00
Steve Carlin
5b4427ed1b IMPALA-13587: Calcite planner: Outer join not aggregating nulls properly
The following query is producing incorrect results:

select t2.int_col y from alltypessmall t1 left outer join
alltypestiny t2 on t1.int_col = t2.int_col group by 1

... due to nulls not being aggregated properly on multiple nodes.
This is because the value equivalency graph is being set for the
join conjunct on an outer join. When a hash join partition node is
being used, there is an optimization that skips the aggregation step
that combines groups across nodes if, based on the value transfer
graph, it deduces that all data for the partition column is being
sent to the same node.

The bug here is that even though an outer join is using an
equi-conjunct, the left and right side are different when data is not
found on the outer join side, where it becomes null.

The fix is to avoid registering the equi-conjunct if the values are
not always equal.
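
To see why the two sides of the equi-conjunct are not always equal after a
left outer join, consider this small Python simulation (single-column
inputs, hypothetical helper name, unrelated to the planner code itself):

```python
def left_outer_join(left_vals, right_vals):
    """Join two single-column inputs on equality, keeping unmatched left rows."""
    out = []
    for l in left_vals:
        matches = [r for r in right_vals if r == l]
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))  # unmatched row: right side becomes NULL
    return out


rows = left_outer_join([0, 1, 2], [0, 1])
# Rows where the two sides of the equi-conjunct differ in the output.
unequal = [(l, r) for l, r in rows if l != r]
```

The unmatched left value 2 is paired with a NULL right side, so grouping on
the right column cannot assume it is distributed the same way as the left
column.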

Change-Id: I57e9d4ad4c4af5a4c268e43ac2937064dab6ffd7
Reviewed-on: http://gerrit.cloudera.org:8080/22138
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Steve Carlin <scarlin@cloudera.com>
2025-03-06 21:14:01 +00:00
Steve Carlin
b449381cc8 IMPALA-13575: Calcite planner: Fix exception when null is in values clause
The following query was failing with an exception:

select * from (values(0), (null))

The null type was not being assigned a type correctly. After this fix,
the null type will be created as an AnalyzedNullLiteral with the correct
type.

Change-Id: I4e78fb0ed63b9525540ad537cfb7aabd8bbfe7ea
Reviewed-on: http://gerrit.cloudera.org:8080/22109
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-06 21:13:06 +00:00
Steve Carlin
796c25fc57 IMPALA-13716: Calcite Planner: TupleIsNullPredicate fix for analytic functions
There is some special logic to materialize the TupleIsNullPredicate
functions that are created by join nodes for outer joins for analytic
functions. This commit refactors some of the code in the current Impala
planner and materializes them with the Analytic RelNode.

An example query from the test framework that causes this issue is:

select avg(g) over (order by f) af3
from alltypestiny t1
      left outer join
        (select
           id as a, coalesce(bigint_col, 30) as f,
           bigint_col  as g
         from alltypestiny) t2
      on (t1.id = t2.a);

Change-Id: Iaec363c2fa93a1e21bf74a40e5399e21ddd9bd60
Reviewed-on: http://gerrit.cloudera.org:8080/22411
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-26 16:22:26 +00:00