impala

mirror of https://github.com/apache/impala.git synced 2025-12-25 02:03:09 -05:00

Author	SHA1	Message	Date
Zoltan Borok-Nagy	85d77b908b	IMPALA-13756: Fix Iceberg V2 count(*) optimization for complex queries We optimize plain count(*) queries on Iceberg tables the following way. The original plan is AGGREGATE COUNT(*) over a UNION ALL of two branches: a SCAN of all datafiles without deletes, and an ANTI JOIN of SCAN datafiles and SCAN deletes. It is rewritten to an ArithmeticExpr LHS + RHS, where LHS is the record_count of all datafiles without deletes and RHS is AGGREGATE COUNT(*) over the ANTI JOIN of SCAN datafiles and SCAN deletes. This optimization consists of two parts: 1. Rewriting the count(*) expression to count(*) + "record_count" (of data files without deletes). 2. In IcebergScanPlanner we only need to construct the right side of the original UNION ALL operator, i.e. the ANTI JOIN of SCAN datafiles and SCAN deletes. SelectStmt decides whether we can do the count(*) optimization, and if so, does the following: 1: SelectStmt sets 'TotalRecordsNumV2' in the analyzer, then during the expression rewrite phase the CountStarToConstRule rewrites the count(*) to count(*) + record_count 2: SelectStmt sets "OptimizeCountStarForIcebergV2" in the query context then IcebergScanPlanner creates the plan accordingly. This mechanism works for simple queries, but can turn on count(*) optimization in IcebergScanPlanner for all Iceberg V2 tables in complex queries, even if only one subquery enables count(*) optimization during analysis. This patch changes the following: 1: We introduce IcebergV2CountStarAccumulator which we use instead of the ArithmeticExpr. So after rewrite we still know if count(*) optimization should be enabled for the planner. 2: Instead of using the query context, we pass the information to the IcebergScanPlanner via the TableRef object. Testing * e2e tests Change-Id: I1940031298eb634aa82c3d32bbbf16bce8eaf874 Reviewed-on: http://gerrit.cloudera.org:8080/23705 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>	2025-12-19 17:53:50 +00:00
Zoltan Borok-Nagy	6649b92cb2	IMPALA-14635: We should not check for exact file sizes in iceberg-metadata-tables.test The Impala version string is written into the Parquet footer. This means in our tests we shouldn't check for exact file sizes of tables written during data loading/testing. Change-Id: I589ade5f81879ede54ff41466b77b5db3349a14f Reviewed-on: http://gerrit.cloudera.org:8080/23802 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-12-18 15:58:41 +00:00
Daniel Vanko	9d112dae23	IMPALA-14536: Fix CONVERT TO ICEBERG to not throw exception on Iceberg tables Previously, running ALTER TABLE <table> CONVERT TO ICEBERG on an Iceberg table produced an error. This patch fixes that, so the statement will do nothing when called on an Iceberg table and return with 'Table has already been migrated.' message. This is achieved by adding a new flag to StatementBase to signal when a statement ends up NO_OP, if that's true, the new TStmtType::NO_OP will be set as TExecRequest's type and noop_result can be used to set result from Frontend-side. Tests: * extended fe and e2e tests Change-Id: I41ecbfd350d38e4e3fd7b813a4fc27211d828f73 Reviewed-on: http://gerrit.cloudera.org:8080/23699 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Peter Rozsa <prozsa@cloudera.com>	2025-12-12 15:35:28 +00:00
Xuebin Su	d54b75ccf1	IMPALA-14619: Reset levels_readahead_ for late materialization Previously, `BaseScalarColumnReader::levels_readahead_` was not reset when the reader did not do page filtering. If a query selected the last row containing a collection value in a row group, `levels_readahead_` would be set and would not be reset when advancing to the next row group without page filtering. As a result, trying to skip collection values at the start of the next row group would cause a check failure. This patch fixes the failure by resetting `levels_readahead_` in `BaseScalarColumnReader::Reset()`, which is always called when advancing to the next row group. `levels_readahead_` is also moved out of the "Members used for page filtering" section as the variable is also used in late materialization. Testing: - Added an E2E test for the fix. Change-Id: Idac138ffe4e1a9260f9080a97a1090b467781d00 Reviewed-on: http://gerrit.cloudera.org:8080/23779 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-12-12 15:12:50 +00:00
Nandor Kollar	65639f16b9	IMPALA-12330: Allow setting format-version in ALTER TABLE CONVERT TO This change allows modifying the format version table property in ALTER TABLE CONVERT TO statements. It adds verification for the property value too: only 1 or 2 is supported as of now. Change-Id: Iaed207feb83a277a1c2f81dcf58c42f0721c0865 Reviewed-on: http://gerrit.cloudera.org:8080/23721 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Peter Rozsa <prozsa@cloudera.com>	2025-12-12 08:18:08 +00:00
Arnab Karmakar	ddd82e02b9	IMPALA-14065: Support WHERE clause in SHOW PARTITIONS statement This patch extends the SHOW PARTITIONS statement to allow an optional WHERE clause that filters partitions based on partition column values. The implementation adds support for various comparison operators, IN lists, BETWEEN clauses, IS NULL, and logical AND/OR expressions involving partition columns. Non-partition columns, subqueries, and analytic expressions in the WHERE clause are not allowed and will result in an analysis error. New analyzer tests have been added to AnalyzeDDLTest#TestShowPartitions to verify correct parsing, semantic validation, and error handling for supported and unsupported cases. Testing: - Added new unit tests in AnalyzeDDLTest for valid and invalid WHERE clause cases. - Verified functional tests covering partition filtering behavior. Change-Id: I2e2a14aabcea3fb17083d4ad6f87b7861113f89e Reviewed-on: http://gerrit.cloudera.org:8080/23566 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-12-11 15:36:08 +00:00
Csaba Ringhofer	780e6683a2	IMPALA-14573: port critical geospatial functions to c++ (part 1) This commit contains the simpler parts from https://gerrit.cloudera.org/#/c/20602 This mainly means accessors for the header of the binary format and bounding box check (st_envIntersects). New tests for not yet covered functions / overloads are also added. For details of the binary format see be/src/exprs/geo/shape-format.h Differences from the PR above: Only a subset of functions is added. The criteria were: 1. the native function must be fully compatible with the Java version* 2. must not rely on (de)serializing the full geometry 3. the function must be tested 1 implies 2 because (de)serialization is not implemented yet in the original patch for >2d geometries, which would break compatibility with the Java version for XYZ/XYM/XYZM geometries. *: there are 2 known differences: 1. NULL handling: the Java functions return an error instead of NULL when getting a NULL parameter 2. st_envIntersects() doesn't check if the SRID matches - the Java library looks inconsistent about this Because the native functions are fairly safe replacements for the Java ones, they are always used when geospatial_library=HIVE_ESRI. Change-Id: I0ff950a25320549290a83a3b1c31ce828dd68e3c Reviewed-on: http://gerrit.cloudera.org:8080/23700 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-12-06 07:50:23 +00:00
jichen0919	7e29ac23da	IMPALA-14092 Part2: Support querying of paimon data table via JNI This patch mainly implements querying of paimon data tables through a JNI based scanner. Features implemented: - support column pruning. The partition pruning and predicate push down will be submitted as the third part of the patch. We implemented this by treating the paimon table as a normal unpartitioned table. When querying a paimon table: - PaimonScanNode decides which paimon splits need to be scanned, and then transfers the splits to BE to do the jni-based scan operation. - We also collect the required columns that need to be scanned, and pass the columns to Scanner for column pruning. This is implemented by passing the field ids of the columns to BE, instead of column positions, to support schema evolution. - In the original implementation, PaimonJniScanner directly passed the paimon row object to BE, and called the corresponding paimon row field accessor, which is a java method, to convert row fields to impala row batch tuples. We found it slow due to the overhead of JVM method calls. To minimize the overhead, we refashioned the implementation: PaimonJniScanner converts the paimon row batches to arrow record batches, which store data in the offheap region of the impala JVM, and passes the arrow offheap record batch memory pointer to the BE. BE PaimonJniScanNode directly reads data from the JVM offheap region, and converts the arrow record batch to an impala row batch. The benchmark shows the latter implementation is 2.x better than the original implementation. The lifecycle of the arrow row batch is mainly like this: the arrow row batch is generated in FE, and passed to BE. After the record batch is imported to BE successfully, BE is in charge of freeing the row batch. There are two free paths: the normal path, and the exception path. For the normal path, when the arrow batch is totally consumed by BE, BE will call jni to fetch the next arrow batch. In this case, the arrow batch is freed automatically. The exception path happens when the query is cancelled, or memory allocation fails. For these corner cases, the arrow batch is freed in the method close if it is not totally consumed by BE. Currently supported impala data types for queries include: - BOOLEAN - TINYINT - SMALLINT - INTEGER - BIGINT - FLOAT - DOUBLE - STRING - DECIMAL(P,S) - TIMESTAMP - CHAR(N) - VARCHAR(N) - BINARY - DATE TODO: - Patches pending submission: - Support tpcds/tpch data-loading for paimon data table. - Virtual Column query support for querying paimon data table. - Query support with time travel. - Query support for paimon meta tables. - WIP: - Snapshot incremental read. - Complex type query support. - Native paimon table scanner, instead of jni based. Testing: - Create tests table in functional_schema_template.sql - Add TestPaimonScannerWithLimit in test_scanners.py - Add test_paimon_query in test_paimon.py. - Already passed the tpcds/tpch tests for paimon tables. The testing table data is currently generated by spark, which is not supported by impala now; we have to do this since hive doesn't support generating paimon tables for dynamic-partitioned tables. We plan to submit a separate patch for tpcds/tpch data loading and associated tpcds/tpch query tests. - JVM offheap memory leak tests: we have run looped tpch tests for 1 day, no obvious offheap memory increase is observed, and offheap memory usage is within 10M. Change-Id: Ie679a89a8cc21d52b583422336b9f747bdf37384 Reviewed-on: http://gerrit.cloudera.org:8080/23613 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com> Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>	2025-12-05 18:19:57 +00:00
Peter Rozsa	d67ab6f11f	IMPALA-14569: (addendum) Fix 'partitions' row matching IMPALA-14569 introduced a test that asserts for a profile row like 'HDFS partitions' and it's possible for test environments to run on a different storage system. This change omits the storage type from the row_regex. Change-Id: If9b223f2be2dfe7be8724423fefdfb56ffeeba6e Reviewed-on: http://gerrit.cloudera.org:8080/23727 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Tested-by: Riza Suminto <riza.suminto@cloudera.com>	2025-12-01 23:06:47 +00:00
Peter Rozsa	6cf21464b4	IMPALA-14569: Fix IllegalStateException in partition pruning on type mismatch This fixes an IllegalStateException in HdfsPartitionPruner when evaluating 'IN' predicates that consist of two compatible types, for example DATE and STRING: date_col in (<date as string>). Previously, 'canEvalUsingPartitionMd' did not check if the slot type matched the literal type. This caused the frontend to attempt invalid comparisons via 'LiteralExpr.compareTo', leading to IllegalStateException or incorrect pruning. The fix ensures 'canEvalUsingPartitionMd' returns false on type mismatches, deferring evaluation to the backend where proper casting occurs. Testing: - Added regression test in hdfs-partition-pruning.test. Change-Id: Idc226a628c8df559329a060cb963b81e27e21eda Reviewed-on: http://gerrit.cloudera.org:8080/23706 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-11-27 02:48:28 +00:00
Daniel Vanko	3d22c7fe05	IMPALA-12209: Always include format-version in DESCRIBE FORMATTED and SHOW CREATE TABLE for Iceberg tables HiveCatalog does not include format-version for Iceberg tables in the table's parameters, therefore the output of SHOW CREATE TABLE may not replicate the original table. This patch makes sure to add it to both the SHOW CREATE TABLE and DESCRIBE FORMATTED/EXTENDED output. Additionally, adds ICEBERG_DEFAULT_FORMAT_VERSION variable to E2E tests, deducting from IMPALA_ICEBERG_VERSION environment variable. If Iceberg version is at least 1.4, default format-version is 2, before 1.4 it's 1. This way tests can work with multiple Iceberg versions. Testing: * updated show-create-table.test and show-create-table-with-stats.test for Iceberg tables * added format-version checks to multiple DESCRIBE FORMATTED tests Change-Id: I991edf408b24fa73e8a8abe64ac24929aeb8e2f8 Reviewed-on: http://gerrit.cloudera.org:8080/23514 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-11-24 21:48:17 +00:00
Csaba Ringhofer	f6ceca2b4d	IMPALA-14571: increase planner cost of java functions The main motivation is to evaluate expensive geospatial functions (which are Java functions) last in predicates. Java functions have a major overhead anyway from the JNI call, so bumping all Java function costs seems beneficial. Note that currently geospatial functions are the only built-in Java functions. Change-Id: I11d1652d76092ec60af18a33502dacc25b284fcc Reviewed-on: http://gerrit.cloudera.org:8080/22733 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-11-24 16:52:59 +00:00
Steve Carlin	54c0074b33	IMPALA-14405 ADDENDUM: Catch exception for bad column names This commit is a fix on top of IMPALA-14405 for the Calcite planner. The original commit matches column names from the expression in the select clause. For instance, if the query is "select 1 + 1", the label in impala-shell will be "1 + 1". It accomplished this by retrieving the string from the SqlNode object through the MySql dialect. However, when the expression doesn't succeed in the MySql dialect, an AssertionError gets thrown, causing the query to fail. We don't want the query to fail, we just want to go back to using the Calcite expression, e.g. EXPR$0. This occurred with this specific query: "select timestamp_col + interval 3 nanoseconds" So now the exception is caught and the default label name is used. Eventually we should try to match what Impala has, but this is a harder problem to fix. Change-Id: I6c4d76a25fb2486eb1ef19485bce7888d45d282f Reviewed-on: http://gerrit.cloudera.org:8080/23665 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Steve Carlin <scarlin@cloudera.com>	2025-11-18 21:34:29 +00:00
Arnab Karmakar	a2a11dec62	IMPALA-13263: Add single-argument overload for ST_ConvexHull() Implemented a single-argument version of ST_ConvexHull() to align with PostGIS behavior and simplify usage across geometry types. Testing: Added new tests in test_geospatial_functions.py for ST_ConvexHull(), which previously had no test coverage, to verify correctness across supported geometry types. Change-Id: Idb17d98f5e75929ec0143aa16195a84dd6e50796 Reviewed-on: http://gerrit.cloudera.org:8080/23604 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>	2025-11-18 10:26:04 +00:00
Steve Carlin	52334ba426	IMPALA-14421: Calcite planner: case statement returning wrong types for char, varchar The 'case' function resolver in the original Impala planner has a quirk in it which caused issues in the Calcite planner. The function resolver for the original planner resolves all case statements with the "boolean" version. Later on, in the analysis of the CaseExpr, the proper types are assessed and the necessary casting is added. The Calcite planner follows a similar path. The resolver always returns boolean as well and the coerce nodes module determines the proper return type for the case statement. Two other related issues are also fixed here: Literal strings should be treated as type STRING instead of CHAR(X), but a null literal should not be changed from a CHAR(x) to a STRING. This broke a 'case' test in the test framework where the columns were non-literals with type char(x), and the return value was a "null" which should not have forced a cast to string. A cast from a varchar to a varchar should be ignored. Testing: Added a test to calcite.test. Ensured the existing cast test in test_chars.py passed. Ran through the Jenkins Calcite testing framework. Change-Id: I82d657f4bfce432c458ee8198188dadf9f23f2ef Reviewed-on: http://gerrit.cloudera.org:8080/23560 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-11-18 07:47:39 +00:00
Riza Suminto	f2243b76b5	IMPALA-14557: Fix flaky test_show_files_partition TestIcebergTable.test_show_files_partition is unstable because files are alphanumerically sorted and the order between a random UUID and "delete-*" is not guaranteed. This patch fixes the flakiness by specifying VERIFY_IS_SUBSET and using a negative lookahead on the word "delete" to detect valid Iceberg data files. Testing: - Looped test_show_files_partition 50 times and it passed every time. Before, it could fail in fewer than 10 loops. Change-Id: I6243585a5b7ab7cf7c95d5a9530ce2f2825c550e Reviewed-on: http://gerrit.cloudera.org:8080/23680 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2025-11-17 17:13:19 +00:00
Steve Carlin	bc99705252	IMPALA-13902: Calcite planner: Implement is_spool_query_results The is_spool_query_results query option is now supported in Calcite. The returnAtMostOneRow method is now implemented to support this. PlanRootSink is refactored to extract sanitizing query options (a new method sanitizeSpoolingOptions()) out of PlanRootSink.computeResourceProfile(). The bulk of memory bounding calculation is also extracted out to a new class SpoolingMemoryBound. Added "sleep" in ImpalaOperatorTable.java since some EE tests related to result spooling calls sleep() function. Changed ImpalaPlanRel to extends RelNode interface. A sanity test has been added to calcite.test, but the bulk of the testing will be done through the Impala test framework when it is enabled. Testing: - Pass FE tests PlannerTest#testResultSpooling, TpcdsCpuCostPlannerTest, and all java tests under calcite-planner project. - Pass query_test/test_result_spooling.py and custom_cluster/test_result_spooling.py. Co-authored-by: Riza Suminto Change-Id: I5b9bf49e2874ee12de212b892bd898c296774c6f Reviewed-on: http://gerrit.cloudera.org:8080/23562 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-11-16 02:33:02 +00:00
Riza Suminto	898e03e9d5	IMPALA-14552: (addendum) Fix bad testcase in show-create-table.test The original IMPALA-14552 patch passed precommit tests before IMPALA-12893: (part 2) (`275f03f`) was merged. As a consequence, it did not catch a missing comma in the updated show-create-table.test. This patch adds that missing comma. Testing: Pass metadata/test_show_create_table.py Change-Id: Ib06e690a81e6b0ca483b3647cc59c73802a0a7b7 Reviewed-on: http://gerrit.cloudera.org:8080/23673 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-11-15 21:34:44 +00:00
Mihaly Szjatinya	087b715a2b	IMPALA-14108: Add support for SHOW FILES IN table PARTITION for Iceberg tables This patch implements partition filtering support for the SHOW FILES statement on Iceberg tables, based on the functionality added in IMPALA-12243. Prior to this change, the syntax resulted in a NullPointerException. Key changes: - Added ShowFilesStmt.analyzeIceberg() to validate and transform partition expressions using IcebergPartitionExpressionRewriter and IcebergPartitionPredicateConverter. After that, it collects matching file paths using IcebergUtil.planFiles(). - Added FeIcebergTable.Utils.getIcebergTableFilesFromPaths() to accept pre-filtered file lists from the analysis phase. - Enhanced TShowFilesParams thrift struct with optional selected_files field to pass pre-filtered file paths from frontend to backend. Testing: - Analyzer tests for negative cases: non-existent partitions, invalid expressions, non-partition columns, unsupported transforms. - Analyzer tests for positive cases: all transform types, complex expressions. - Authorization tests for non-filtered and filtered syntaxes. - E2E tests covering every partition transform type with various predicates. - Schema evolution and rollback scenarios. The implementation follows AlterTableDropPartition's pattern where the analysis phase performs validation/metadata retrieval and the execution phase handles result formatting and display. Change-Id: Ibb9913e078e6842861bdbb004ed5d67286bd3152 Reviewed-on: http://gerrit.cloudera.org:8080/23455 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-11-14 21:43:10 +00:00
Zoltan Borok-Nagy	275f03f10d	IMPALA-12893: (part 2): Upgrade Iceberg to version 1.5.2 This patch updates CDP_BUILD_NUMBER to 71942734 in order to upgrade Iceberg to 1.5.2. This patch updates some tests so they pass with Iceberg 1.5.2. The behavior changes of Iceberg 1.5.2 are (compared to 1.3.1): * Iceberg V2 tables are created by default * Metadata tables have different schema * Parquet compression is explicitly set for new tables (even for ORC tables) * Sequence numbers are assigned a bit differently Updated the tests where needed. Code changes to accommodate the above behavior changes: * SHOW CREATE TABLE adds 'format-version'='1' for Iceberg V1 tables * CREATE TABLE statements don't throw errors when Parquet compression is set for ORC tables Change-Id: Ic4f9ed3f7ee9f686044023be938d6b1d18c8842e Reviewed-on: http://gerrit.cloudera.org:8080/23670 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-11-14 01:27:45 +00:00
Arnab Karmakar	760eb4f2fa	IMPALA-13066: Extend SHOW CREATE TABLE to include stats and partitions Adds a new WITH STATS option to the SHOW CREATE TABLE statement to emit additional SQL statements for recreating table statistics and partitions. When specified, Impala outputs: - Base CREATE TABLE statement. - ALTER TABLE ... SET TBLPROPERTIES for table-level stats. - ALTER TABLE ... SET COLUMN STATS for all non-partition columns, restoring column stats. - For partitioned tables: - ALTER TABLE ... ADD PARTITION statements to recreate partitions. - Per-partition ALTER TABLE ... PARTITION (...) SET TBLPROPERTIES to restore partition-level stats. Partition output is limited by the PARTITION_LIMIT query option (default 1000). Setting PARTITION_LIMIT=0 includes all partitions and emits a warning if the limit is exceeded. Tests added to verify correctness of emitted statements. Default behavior of SHOW CREATE TABLE remains unchanged for compatibility. Change-Id: I87950ae9d9bb73cb2a435cf5bcad076df1570dc2 Reviewed-on: http://gerrit.cloudera.org:8080/23536 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-11-12 06:11:37 +00:00
Xuebin Su	6b6f7e614d	IMPALA-14472: Add create/read support for ARRAY column of Kudu The initial implementation of KUDU-1261 (array column type) was recently merged in the upstream Apache Kudu repository. This patch adds initial Impala support for working with Kudu tables having array type columns. Unlike rows, the elements of a Kudu array are stored in a different format than in Impala. Instead of a per-row bit flag for NULL info, values and NULL bits are stored in separate arrays. The following types of queries are not supported in this patch: - (IMPALA-14538) Queries that reference an array column as a table, e.g. `SELECT item FROM kudu_array.array_int;` - (IMPALA-14539) Queries that create duplicate collection slots, e.g. `SELECT array_int FROM kudu_array AS t, t.array_int AS unnested;` Testing: - Add some FE tests in AnalyzeDDLTest and AnalyzeKuduDDLTest. - Add EE test test_kudu.py::TestKuduArray. Since Impala does not support inserting complex types, including array, the data insertion part of the test is achieved through custom C++ code kudu-array-inserter.cc that inserts into Kudu via the Kudu C++ client. It would be great if we could migrate it to Python so that it can be moved to the same file as the test (IMPALA-14537). - Pass core tests. Co-authored-by: Riza Suminto Change-Id: I9282aac821bd30668189f84b2ed8fff7047e7310 Reviewed-on: http://gerrit.cloudera.org:8080/23493 Reviewed-by: Alexey Serbin <alexey@apache.org> Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-11-08 06:41:07 +00:00
Riza Suminto	671a7fcada	IMPALA-14529: (addendum) Fix kudu_create.test Kudu throws a different error message after IMPALA-14529. This patch adjusts the error message in kudu_create.test to let the test pass. Testing: Pass TestDdlStatements.test_create_kudu and TestKuduHMSIntegration.test_create_managed_kudu_tables. Change-Id: Iff4cd08f46626d03b1f0800828e5872b83f522ca Reviewed-on: http://gerrit.cloudera.org:8080/23648 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2025-11-06 22:42:34 +00:00
Steve Carlin	62bf609942	IMPALA-14414: Calcite planner: Added new code to handle nan/inf The current code works for NaN and Inf, but it breaks when upgrading to v1.40. This commit changes the code to handle these when we do the upgrade to 1.40 and adds a basic test into the calcite.test to ensure that when the upgrade happens, it does not break. Change-Id: I8593a4942a2fe785a0c77134b78a9d97257225fc Reviewed-on: http://gerrit.cloudera.org:8080/23561 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-11-05 12:55:39 +00:00
Riza Suminto	f34dea9b6f	IMPALA-14522: Fix test_paimon_show_stats after DST ends The test failed due to a mismatch when matching "Last Creation Time". This patch fixes the assertion with a simple regex. Testing: Pass test_paimon.py. Change-Id: I6855c0014111cef18318cdc4904782097a070ced Reviewed-on: http://gerrit.cloudera.org:8080/23619 Reviewed-by: Mihaly Szjatinya <mszjat@pm.me> Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-11-03 21:25:42 +00:00
jichen0919	541fb3f405	IMPALA-14092 Part1: Prohibit Unsupported Operation for paimon table This patch prohibits unsupported operations against paimon tables. All unsupported operations are checked in the analysis stage in order to avoid misoperation. Currently only CREATE/DROP statements are supported; the prohibition will be removed later, after the corresponding operation is truly supported. TODO: - Patches pending submission: - Support jni based query for paimon data table. - Support tpcds/tpch data-loading for paimon data table. - Virtual Column query support for querying paimon data table. - Query support with time travel. - Query support for paimon meta tables. Testing: - Add unit test for AnalyzeDDLTest.java. - Add unit test for AnalyzerTest.java. - Add test_paimon_negative and test_paimon_query in test_paimon.py. Change-Id: Ie39fa4836cb1be1b1a53aa62d5c02d7ec8fdc9d7 Reviewed-on: http://gerrit.cloudera.org:8080/23530 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-10-23 23:06:08 +00:00
pranav.lodha	7f77176970	IMPALA-13869: Support for 'hive.sql.query' property for Hive JDBC tables This patch adds support for the hive.sql.query table property in Hive JDBC tables accessed through Impala. Impala has support for Hive JDBC tables using the hive.sql.table property, which limits users to simple table access. However, many use cases demand the ability to expose complex joins, filters, aggregations, or derived columns as external views. Hive.sql.query leads to a custom SQL query that returns a virtual table(subquery) instead of pointing to a physical table. These use cases cannot be achieved with just the hive.sql.table property. This change allows Impala to: • Interact with views or complex queries defined on external systems without needing schema-level access to base tables. • Expose materialized logic (such as filters, joins, or transformations) via Hive to Impala consumers in a secure, abstracted way. • Better align with data virtualization use cases where physical data location and structure should be hidden from the querying engine. This patch also lays the groundwork for future enhancements such as predicate pushdown and performance optimizations for Hive JDBC tables backed by queries. Testing: End-to-end tests are included in test_ext_data_sources.py. Change-Id: I039fcc1e008233a3eeed8d09554195fdb8c8706b Reviewed-on: http://gerrit.cloudera.org:8080/22865 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-10-23 21:34:29 +00:00
Steve Carlin	c67b19daf6	IMPALA-14405: Labels for Calcite expressions not matching original planner Calcite sets literal expressions to EXPR$<x> which did not match expressions given by the Impala planner. For literal expressions such as "select 1 + 1", Impala creates the column name as "1 + 1". The field names can be found in the abstract syntax tree, so they are not set within the CalciteRelNodeConverter before the logical tree is created. A small test was added to calcite.test for a basic sanity check, but more comprehensive tests will be run in the tests/shell module (e.g. in test_shell_commandline.py and test_shell_interactive) which contain tests for labels. Change-Id: Ibd3e6366a284f53807b4b2c42efafa279249c1ea Reviewed-on: http://gerrit.cloudera.org:8080/23516 Reviewed-by: Steve Carlin <scarlin@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-10-22 03:37:48 +00:00
Steve Carlin	420e357b95	IMPALA-13695: Calcite planner: fix for ndv with 2 args The NDV function was crashing when called with the "scale" arg. This requires special processing which exists in FunctionCallExpr. The validation for this is now done in ImpalaNdvFunction and the special calculation is done within ImpalaAggRel This also fixes ndv for varchar types. The aggregation call within CoerceNodes was not differentiating between varchar and string. A cast to string function is needed in order to run the ndv function on a varchar column. Change-Id: I82419f77e043e9975865a042ffb8db75a26931f7 Reviewed-on: http://gerrit.cloudera.org:8080/23513 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-10-20 23:28:39 +00:00
Zoltan Borok-Nagy	bfae4d0b32	IMPALA-14496: Impala crashes when it writes multiple delete files per partition in a single DELETE operation Impala crashes when it needs to write multiple delete files per partition in a single DELETE operation. It is because IcebergBufferedDeleteSink has its own DmlExecState object, but sometimes the methods in TableSinkBase use the RuntimeState's DmlExecState object. I.e. it can happen that we add a partition to the IcebergBufferedDeleteSink's DmlExecState, but later we expect to find it in the RuntimeState's DmlExecState. This patch adds new methods to TableSinkBase that are specific for writing delete files, and they always take a DmlExecState object as a parameter. They are now used by IcebergBufferedDeleteSink. Testing * added e2e tests Change-Id: I46266007a6356e9ff3b63369dd855aff1396bb72 Reviewed-on: http://gerrit.cloudera.org:8080/23537 Reviewed-by: Mihaly Szjatinya <mszjat@pm.me> Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-10-15 19:58:37 +00:00
Peter Rozsa	b0f1d49042	IMPALA-14016: Add multi-catalog support for local catalog mode This patch adds a new MetaProvider called MultiMetaProvider, which is capable of handling multiple MetaProviders at once, prioritizing one primary provider over multiple secondary providers. The primary provider handles some methods exclusively for deterministic behavior. In database listings, if one database name occurs multiple times the contained tables are merged under that database name; if the two separate databases contain a table with the same name, the query analyzation fails with an error. This change also modifies the local catalog implementation's initialization. If catalogd is deployed, then it instantiates the CatalogdMetaProvider and checks if the catalog configuration directory is set as a backend flag. If it's set, then it tries to load every configuration from the folder, and tries to instantiate the IcebergMetaProvider from those configs. If the instantiation fails, an error is reported to the logs, but the startup is not interrupted. Tests: - E2E tests for multi-catalog behavior - Unit test for ConfigLoader Change-Id: Ifbdd0f7085345e7954d9f6f264202699182dd1e1 Reviewed-on: http://gerrit.cloudera.org:8080/22878 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>	2025-09-19 15:03:59 +00:00
Joe McDonnell	ca356a8df5	IMPALA-13437 (part 2): Implement cost-based tuple cache placement This changes the default behavior of the tuple cache to consider cost when placing the TupleCacheNodes. It tries to pick the best locations within a budget. First, it eliminates unprofitable locations via a threshold. Next, it ranks the remaining locations by their profitability. Finally, it picks the best locations in rank order until it reaches the budget. The threshold is based on the ratio of processing cost for regular execution versus the processing cost for reading from the cache. If the ratio is below the threshold, the location is eliminated. The threshold is specified by the tuple_cache_required_cost_reduction_factor query option. This defaults to 3.0, which means that the cost of reading from the cache must be less than 1/3 the cost of computing the value normally. A higher value makes this more restrictive about caching locations, which pushes in the direction of lower overhead. The ranking is based on the cost reduction per byte. This is given by the formula: (regular processing cost - cost to read from cache) / estimated serialized size This prefers locations with small results or high reduction in cost. The budget is based on the estimated serialized size per node. This limits the total caching that a query will do. A higher value allows more caching, which can increase the overhead on the first run of a query. A lower value is less aggressive and can limit the overhead at the expense of less caching. This uses a per-node limit as the limit should scale based on the size of the executor group as each executor brings extra capacity. The budget is specified by the tuple_cache_budget_bytes_per_executor. The old behavior to place the tuple cache at all eligible locations is still available via the tuple_cache_placement_policy query option. The default is the cost_based policy described above, but the old behavior is available via the all_eligible policy. 
This is useful for correctness testing (and the existing tuple cache test cases). This changes the explain plan output: - The hash trace is only enabled at VERBOSE level. This means that the regular profile will not contain the hash trace, as the regular profile uses EXTENDED. - This adds additional information at VERBOSE to display the cost information for each plan node. This can help trace why a particular location was not picked. Testing: - This adds a TPC-DS planner test with tuple caching enabled (based on the existing TpcdsCpuCostPlannerTest) - This modifies existing tests to adapt to changes in the explain plan output Change-Id: Ifc6e7b95621a7937d892511dc879bf7c8da07cdc Reviewed-on: http://gerrit.cloudera.org:8080/23219 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-09-18 21:02:51 +00:00
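The three placement steps in IMPALA-13437 (threshold, ranking, budget) can be illustrated with a minimal Python sketch. It is not the planner's code: the `Location` class, the `pick_locations` function and all field names are hypothetical, costs and sizes are assumed to be positive, and skipping a location that does not fit the budget (rather than stopping at it) is an assumption the commit message does not confirm.

```python
from dataclasses import dataclass


@dataclass
class Location:
    """Hypothetical candidate location for a tuple cache node."""
    name: str
    regular_cost: float      # processing cost of regular execution
    cache_read_cost: float   # processing cost of reading from the cache
    serialized_bytes: int    # estimated serialized size of the result


def reduction_per_byte(loc):
    # Ranking formula from the commit message:
    # (regular processing cost - cost to read from cache) / serialized size
    return (loc.regular_cost - loc.cache_read_cost) / loc.serialized_bytes


def pick_locations(candidates, budget_bytes, required_cost_reduction_factor=3.0):
    # Step 1: eliminate locations whose cost ratio is below the threshold.
    profitable = [
        c for c in candidates
        if c.regular_cost / c.cache_read_cost >= required_cost_reduction_factor
    ]
    # Step 2: rank the remaining locations, most profitable first.
    ranked = sorted(profitable, key=reduction_per_byte, reverse=True)
    # Step 3: take locations in rank order while they fit in the budget.
    # Assumption: a location that does not fit is skipped, not a stop signal.
    chosen, used = [], 0
    for c in ranked:
        if used + c.serialized_bytes <= budget_bytes:
            chosen.append(c)
            used += c.serialized_bytes
    return chosen
```

With the default factor of 3.0, a location whose regular cost is only twice its cache read cost is dropped in step 1 regardless of how small its result is.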
Mihaly Szjatinya	591bf48c72	IMPALA-14013: DROP INCREMENTAL STATS throws NullPointerException for Iceberg tables Similarly to 'COMPUTE INCREMENTAL STATS', 'DROP INCREMENTAL STATS' should prohibit the partition variant for Iceberg tables. Testing: - FE: fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java - EE: tests/query_test/test_iceberg.py Change-Id: If3d9ef45a9c9ddce9a5e43c5058ae84f919e0283 Reviewed-on: http://gerrit.cloudera.org:8080/23394 Reviewed-by: Noemi Pap-Takacs <npaptakacs@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-09-18 09:54:26 +00:00
Noemi Pap-Takacs	821c7347d1	IMPALA-13267: Display number of partitions for Iceberg tables Before this change, query plans and profile reported only a single partition even for partitioned Iceberg tables, which was misleading for users. Now we can display the number of scanned partitions correctly for both partitioned and unpartitioned Iceberg tables. This is achieved by extracting the partition values from the file descriptors and storing them in the IcebergContentFileStore. Instead of storing this information redundantly in all file descriptors, we store them in one place and reference the partition metadata in the FDs with an id. This also gives the opportunity to optimize memory consumption in the Catalog and Coordinator as well as reduce network traffic between them in the future. Time travel is handled similarly to oldFileDescMap. In that case we don't know the total number of partitions in the old snapshot, so the output is [Num scanned partitions]/unknown. Testing: - Planner tests - E2E tests - partition transforms - partition evolution - DROP PARTITION - time travel Change-Id: Ifb2f654bc6c9bdf9cfafc27b38b5ca2f7b6b4872 Reviewed-on: http://gerrit.cloudera.org:8080/23113 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-09-12 20:36:10 +00:00
Daniel Vanko	492b2b7f46	IMPALA-14406: Fix test_column_case_sensitivity for newer Iceberg versions New test introduced in IMPALA-14290 depends on the Iceberg versions, because newer ones (e.g. 1.5.2) will show {"start_time_month":"646","end_time_day":"19916"} instead of {"start_time_day":null,"end_time_month":null,"start_time_month":"646","end_time_day":"19916"} The test now accepts both cases. Testing: * ran query_test/test_iceberg.py with both Iceberg 1.3.1 and 1.5.2 Change-Id: I17e368ac043d1fbf80a78dcac6ab1be5a297b6ea Reviewed-on: http://gerrit.cloudera.org:8080/23389 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-09-11 18:28:46 +00:00
jichen0919	826c8cf9b0	IMPALA-14081: Support create/drop paimon table for impala This patch mainly implement the creation/drop of paimon table through impala. Supported impala data types: - BOOLEAN - TINYINT - SMALLINT - INTEGER - BIGINT - FLOAT - DOUBLE - STRING - DECIMAL(P,S) - TIMESTAMP - CHAR(N) - VARCHAR(N) - BINARY - DATE Syntax for creating paimon table: CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name ( [col_name data_type ,...] [PRIMARY KEY (col1,col2)] ) [PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)] STORED AS PAIMON [LOCATION 'hdfs_path'] [TBLPROPERTIES ( 'primary-key'='col1,col2', 'file.format' = 'orc/parquet', 'bucket' = '2', 'bucket-key' = 'col3', ]; Two types of paimon catalogs are supported. (1) Create table with hive catalog: CREATE TABLE paimon_hive_cat(userid INT,movieId INT) STORED AS PAIMON; (2) Create table with hadoop catalog: CREATE [EXTERNAL] TABLE paimon_hadoop_cat STORED AS PAIMON TBLPROPERTIES('paimon.catalog'='hadoop', 'paimon.catalog_location'='/path/to/paimon_hadoop_catalog', 'paimon.table_identifier'='paimondb.paimontable'); SHOW TABLE STAT/SHOW COLUMN STAT/SHOW PARTITIONS/SHOW FILES statements are also supported. TODO: - Patches pending submission: - Query support for paimon data files. - Partition pruning and predicate push down. - Query support with time travel. - Query support for paimon meta tables. - WIP: - Complex type query support. - Virtual Column query support for querying paimon data table. - Native paimon table scanner, instead of jni based. Testing: - Add unit test for paimon impala type conversion. - Add unit test for ToSqlTest.java. - Add unit test for AnalyzeDDLTest.java. - Update default_file_format TestEnumCase in be/src/service/query-options-test.cc. - Update test case in testdata/workloads/functional-query/queries/QueryTest/set.test. - Add test cases in metadata/test_show_create_table.py. - Add custom test test_paimon.py. 
Change-Id: I57e77f28151e4a91353ef77050f9f0cd7d9d05ef Reviewed-on: http://gerrit.cloudera.org:8080/22914 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>	2025-09-10 21:24:49 +00:00
Daniel Vanko	321429eac6	IMPALA-14237: Fix Iceberg partition values encoding This patch modifies the string overload of IcebergFunctions::TruncatePartitionTransform so that it always handles strings as UTF-8-encoded ones, because the Iceberg specification states that strings are UTF-8 encoded. Also, for an Iceberg table UrlEncode is now called not in the Hive-compatible way but in the standard way, similar to Java's URLEncoder.encode() (which the Iceberg API also uses), to conform with existing practices by Hive, Spark and Trino. This included a change in the set of characters which are not escaped to follow the URL Standard's application/x-www-form-urlencoded format. [1] Also renamed it from ShouldNotEscape to IsUrlSafe for better readability. Testing: * add and extend e2e tests to check partitions with Unicode characters * add be tests to coding-util-test.cc [1]: https://url.spec.whatwg.org/#application-x-www-form-urlencoded-percent-encode-set Change-Id: Iabb39727f6dd49b76c918bcd6b3ec62532555755 Reviewed-on: http://gerrit.cloudera.org:8080/23190 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-09-08 18:54:07 +00:00
Daniel Vanko	edd5ff6e2a	IMPALA-14290: Make Iceberg partitioning column names case insensitive When creating or altering partitions of Iceberg tables, Impala only accepts column names if they are in lowercase and throws ImpalaRuntimeException otherwise. This patch allows the usage of other cases as well in PARTITION SPEC clauses. IcebergPartitionField converts field names to lower case in its constructor, similar to ColumnDef and PartitionKeyValue. Testing: * ran existing tests * add new test with mixed letter case columns Change-Id: I4080a6b7468fff940435e2e780322d4ba1f0de49 Reviewed-on: http://gerrit.cloudera.org:8080/23334 Reviewed-by: Daniel Becker <daniel.becker@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-09-04 13:20:52 +00:00
stiga-huang	0dfed88861	IMPALA-14383: Fix crash in casting timestamp string with timezone offsets to DATE A timestamp string can have a timezone offset at its end, e.g. "2025-08-31 06:23:24.9392129 +08:00" has "+08:00" as the timezone offset. When casting strings to DATE type, we try to find the default format by matching the separators, i.e. '-', ':', ' ', etc. in SimpleDateFormatTokenizer::GetDefaultFormatContext(). The one that matches this example is DEFAULT_DATE_TIME_CTX[], which represents the default date/time context for "yyyy-MM-dd HH:mm:ss.SSSSSSSSS". The fractional part at the end can have a length from 0 to 9, matching DEFAULT_DATE_TIME_CTX[0] to DEFAULT_DATE_TIME_CTX[9] respectively. When calculating which item in DEFAULT_DATE_TIME_CTX is the matched format, we use the index str_len - 20, where 20 is the length of "yyyy-MM-dd HH:mm:ss.". This causes an index overflow if the string length is larger than 29. A wild pointer is returned from GetDefaultFormatContext(), leading to a crash in the code that follows. This patch fixes the issue by adding a check to make sure the string length is smaller than the max length of the default date time format, i.e. DEFAULT_DATE_TIME_FMT_LEN (29). Longer strings will use a DateTimeFormatContext created lazily. Note that this just fixes the crash. Converting timestamp strings with a timezone offset at the end to DATE type is not supported yet and will be followed up in IMPALA-14391. Tests - Added e2e tests on constant expressions. Also added a test table with such timestamp strings and added a test on it. Change-Id: I36d73f4a71432588732b2284ac66552f75628a62 Reviewed-on: http://gerrit.cloudera.org:8080/23371 Reviewed-by: Daniel Becker <daniel.becker@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-09-03 20:54:59 +00:00
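The index arithmetic behind IMPALA-14383 can be reproduced in a few lines. The sketch below is a simplified Python stand-in, not the C++ code: `get_default_format_context` and the list contents are hypothetical, separator matching is omitted, and the guard is written as "length does not exceed 29", which is an assumption about how the check is bounded.

```python
# Length of the fixed prefix "yyyy-MM-dd HH:mm:ss." (20 characters).
PREFIX_LEN = len("yyyy-MM-dd HH:mm:ss.")
# Longest default format, "yyyy-MM-dd HH:mm:ss.SSSSSSSSS" (29 characters).
DEFAULT_DATE_TIME_FMT_LEN = len("yyyy-MM-dd HH:mm:ss.SSSSSSSSS")

# Stand-in for DEFAULT_DATE_TIME_CTX[0..9]: one entry per fraction length.
DEFAULT_DATE_TIME_CTX = ["yyyy-MM-dd HH:mm:ss." + "S" * n for n in range(10)]


def get_default_format_context(value):
    """Return the default format for value, or None if none applies.

    None models the case where a format context has to be created lazily.
    """
    # Without this bounds check, len(value) - PREFIX_LEN can exceed 9
    # and index past the end of the table.
    if len(value) < PREFIX_LEN or len(value) > DEFAULT_DATE_TIME_FMT_LEN:
        return None
    return DEFAULT_DATE_TIME_CTX[len(value) - PREFIX_LEN]
```

The example string from the commit message is 34 characters long, so the unchecked index would be 14, well past the last valid entry at index 9.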
Riza Suminto	28cff4022d	IMPALA-14333: Run impala-py.test using Python3 Running exhaustive tests with env var IMPALA_USE_PYTHON3_TESTS=true reveals some tests that require adjustment. This patch made such adjustment, which mostly revolves around encoding differences and string vs bytes type in Python3. This patch also switch the default to run pytest with Python3 by setting IMPALA_USE_PYTHON3_TESTS=true. The following are the details: Change hash() function in conftest.py to crc32() to produce deterministic hash. Hash randomization is enabled by default since Python 3.3 (see https://docs.python.org/3/reference/datamodel.html#object.__hash__). This cause test sharding (like --shard_tests=1/2) produce inconsistent set of tests per shard. Always restart minicluster during custom cluster tests if --shard_tests argument is set, because test order may change and affect test correctness, depending on whether running on fresh minicluster or not. Moved one test case from delimited-latin-text.test to test_delimited_text.py for easier binary comparison. Add bytes_to_str() as a utility function to decode bytes in Python3. This is often needed when inspecting the return value of subprocess.check_output() as a string. Implement DataTypeMetaclass.__lt__ to substitute DataTypeMetaclass.__cmp__ that is ignored in Python3 (see https://peps.python.org/pep-0207/). Fix WEB_CERT_ERR difference in test_ipv6.py. Fix trivial integer parsing in test_restart_services.py. Fix various encoding issues in test_saml2_sso.py, test_shell_commandline.py, and test_shell_interactive.py. Change timeout in Impala.for_each_impalad() from sys.maxsize to 2^31-1. Switch to binary comparison in test_iceberg.py where needed. Specify text mode when calling tempfile.NamedTemporaryFile(). Simplify create_impala_shell_executable_dimension to skip testing dev and python2 impala-shell when IMPALA_USE_PYTHON3_TESTS=true. 
The reason is that several UTF-8 related tests in test_shell_commandline.py break in Python3 pytest + Python2 impala-shell combo. This skipping already happen automatically in build OS without system Python2 available like RHEL9 (IMPALA_SYSTEM_PYTHON2 env var is empty). Removed unused vector argument and fixed some trivial flake8 issues. Several test logic require modification due to intermittent issue in Python3 pytest. These include: Add _run_query_with_client() in test_ranger.py to allow reusing a single Impala client for running several queries. Ensure clients are closed when the test is done. Mark several tests in test_ranger.py with SkipIfFS.hive because they run queries through beeline + HiveServer2, but Ozone and S3 build environment does not start HiveServer2 by default. Increase the sleep period from 0.1 to 0.5 seconds per iteration in test_statestore.py and mark TestStatestore to execute serially. This is because TServer appears to shut down more slowly when run concurrently with other tests. Handle the deprecation of Thread.setDaemon() as well. Always force_restart=True each test method in TestLoggingCore, TestShellInteractiveReconnect, and TestQueryRetries to prevent them from reusing minicluster from previous test method. Some of these tests destruct minicluster (kill impalad) and will produce minidump if metrics verifier for next tests fail to detect healthy minicluster state. Testing: Pass exhaustive tests with IMPALA_USE_PYTHON3_TESTS=true. Change-Id: I401a93b6cc7bcd17f41d24e7a310e0c882a550d4 Reviewed-on: http://gerrit.cloudera.org:8080/23319 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-09-03 10:01:29 +00:00
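The switch from `hash()` to `crc32()` described in IMPALA-14333 matters because Python 3 randomizes string hashes per process, so a shard computed with `hash()` can differ between runs. A minimal sketch of deterministic sharding follows; `shard_of` and `select_shard` are hypothetical names, the shard index is zero-based here, and the actual logic in conftest.py may differ.

```python
import zlib


def shard_of(test_id, num_shards):
    """Map a test id to a shard in [0, num_shards) using CRC32.

    zlib.crc32 depends only on the input bytes, so the result is the
    same in every process, unlike the built-in hash() on strings.
    """
    return zlib.crc32(test_id.encode("utf-8")) % num_shards


def select_shard(test_ids, shard_index, num_shards):
    """Return the tests that belong to the given (zero-based) shard."""
    return [t for t in test_ids if shard_of(t, num_shards) == shard_index]
```

Every test id lands in exactly one shard, so running all shards covers the full test set without overlap.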
Csaba Ringhofer	843de44788	IMPALA-13125: Fix pairwise test vector generation Replaced allpairspy with a homemade pair finder that seems to find a somewhat less optimal (larger) covering vector set but works reliably with filters. For details see tests/common/test_vector.py Also fixes a few test issues uncovered. Some fixes are copied from https://gerrit.cloudera.org/#/c/23319/ Added the possibility of shuffling vectors to get a different test set (env var IMPALA_TEST_VECTOR_SEED). By default the algorithm is deterministic so the test set won't change between runs (similarly to allpairspy). Added a new constraint to test only a single compression per file format in some tests to reduce the number of new vectors. EE + custom_cluster test count in exhaustive runs: before patch: ~11000 after patch: ~16000 without compression constraint: ~17000 Change-Id: I419c24659a08d8d6592fadbbd5b764ff73cbba3e Reviewed-on: http://gerrit.cloudera.org:8080/23342 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-08-28 15:27:02 +00:00
Zoltan Borok-Nagy	2fb56afb5e	IMPALA-14336: Avoid loading tables during table listing in the IcebergMetaProvider IcebergMetaProvider unnecessarily loads Iceberg tables in loadTableList(). Table loading is a slow operation which can make simple table listings painfully slow. This behavior is also in contrast to CatalogdMetaProvider, which lists tables without loading them. In our tests there were unloadable Iceberg tables, which was never intended; some test tables were just wrongly created under iceberg_test/hadoop_catalog/, but they didn't use the HadoopCatalog. Normally we can assume that the tables returned by an Iceberg REST Catalog are loadable. Even if they are not, it shouldn't be too problematic to get an exception a bit later. Also, the new behavior is aligned with CatalogdMetaProvider, i.e. the tables are listed without fully loading them, and we only get an error when we want to use an unloadable table. This patch moves the Iceberg tables that do not conform to HadoopCatalog out of iceberg_test/hadoop_catalog/. Testing * existing tests updated with the new paths Change-Id: I9ff75a751be5ad4b5159a1294eaaa304049c454a Reviewed-on: http://gerrit.cloudera.org:8080/23326 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-08-22 19:14:33 +00:00
Daniel Vanko	80f5e3bfa3	IMPALA-14322: Fix typo in IMPALA-12520 In IMPALA-12520 affected paths were modified to test_warehouse (separated with an underscore) instead of test-warehouse (with a hyphen). This commit replaces the underscore with a hyphen. Change-Id: I3a9737af3e6169cc0cd144df53fd35e9e2b20468 Reviewed-on: http://gerrit.cloudera.org:8080/23304 Reviewed-by: Daniel Becker <daniel.becker@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-08-18 19:06:36 +00:00
pranav.lodha	acca24fe02	IMPALA-14005: Support for quoted reserved words column names This change updates the way column names are projected in the SQL query generated for JDBC external tables. Instead of relying on optional mapping or default behavior, all column names are now explicitly quoted using appropriate quote characters. Column names are now wrapped with quote characters based on the JDBC driver being used: 1. Backticks (`) for Hive, Impala and MySQL 2. Double quotes (") for all other databases This helps in the support for case-sensitive or reserved column names. Change-Id: I5da5bc7ea5df8f094b7e2877a0ebf35662f93805 Reviewed-on: http://gerrit.cloudera.org:8080/23066 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>	2025-08-12 15:01:13 +00:00
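The quoting rule in IMPALA-14005 is a simple dispatch on the target database. A rough Python sketch under stated assumptions: `quote_column` and `project_columns` are hypothetical helpers, the database is identified by a plain string rather than by the JDBC driver class, and escaping of quote characters embedded in a column name is not handled.

```python
# Databases that use backticks, per the commit message.
BACKTICK_DBS = {"hive", "impala", "mysql"}


def quote_column(name, db_type):
    """Wrap a column name in the quote character for the target database."""
    quote = "`" if db_type.lower() in BACKTICK_DBS else '"'
    return f"{quote}{name}{quote}"


def project_columns(columns, db_type):
    """Build the projection list of a generated SELECT statement."""
    return ", ".join(quote_column(c, db_type) for c in columns)
```

For example, `project_columns(["select", "Order"], "mysql")` yields a projection in which the reserved word `select` and the mixed-case `Order` are both backtick-quoted, while any database outside the backtick set gets double quotes.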
Daniel Vanko	6a97109551	IMPALA-12520: Create all Iceberg test tables under /test-warehouse This patch modifies the creation of Iceberg tables in 5 test files. Previously these tables were created outside of /test-warehouse, which could lead to issues because we only clear the /test-warehouse directory in bin/jenkins/release_cloud_resources.sh. This means subsequent executions might see data from earlier runs. Change-Id: I97ce512db052b6e7499187079a184c1525692592 Reviewed-on: http://gerrit.cloudera.org:8080/23188 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>	2025-08-12 11:37:47 +00:00
Steve Carlin	922443da46	IMPALA-14165: Type coercion code accidentally omitted from analysis On the first cut of creating the Calcite planner, the Calcite planner was standalone and ran its own JniFrontend. In the current version, the parsing, validating, and single node planning is called from the Impala framework. There is some code in the first cut regarding the "ImpalaTypeCoercionFactory" class which handles deriving the correct data type for various expressions, for instance (found in exprs.test): select count(*) from alltypesagg where 10.1 in (tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col) Without this patch, the query returns the following error: UDF ERROR: Decimal expression overflowed This code can be found in CalciteValidator.java, but was accidentally omitted from CalciteAnalysisDriver. Change-Id: I74c4c714504400591d1ec6313f040191613c25d9 Reviewed-on: http://gerrit.cloudera.org:8080/23039 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Steve Carlin <scarlin@cloudera.com>	2025-08-10 17:00:54 +00:00
Riza Suminto	1cead45114	IMPALA-13947: Test local catalog mode by default Local catalog mode has been the default and works well in downstream Impala for over 5 years. This patch turn on local catalog mode by default (--catalog_topic_mode=minimal and --use_local_catalog=true) as preferred mode going forward. Implemented LocalCatalog.setIsReady() to facilitate using local catalog mode for FE tests. Some FE tests fail due to behavior differences in local catalog mode like IMPALA-7539. This is probably OK since Impala now largely hand over FileSystem permission check to Apache Ranger. The following custom cluster tests are pinned to evaluate under legacy catalog mode because their behavior changed in local catalog mode: TestCalcitePlanner.test_calcite_frontend TestCoordinators.test_executor_only_lib_cache TestMetadataReplicas TestTupleCacheCluster TestWorkloadManagementSQLDetailsCalcite.test_tpcds_8_decimal At TestHBaseHmsColumnOrder.test_hbase_hms_column_order, set --use_hms_column_order_for_hbase_tables=true flag for both impalad and catalogd to get consistent column order in either local or legacy catalog mode. Changed TestCatalogRpcErrors.test_register_subscriber_rpc_error assertions to be more fine grained by matching individual query id. Move most of test methods from TestRangerLegacyCatalog to TestRangerLocalCatalog, except for some that do need to run in legacy catalog mode. Also renamed TestRangerLocalCatalog to TestRangerDefaultCatalog. Table ownership issue in local catalog mode remains unresolved (see IMPALA-8937). Testing: Pass exhaustive tests. Change-Id: Ie303e294972d12b98f8354bf6bbc6d0cb920060f Reviewed-on: http://gerrit.cloudera.org:8080/23080 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-08-06 21:42:24 +00:00
Joe McDonnell	535b72e674	IMPALA-13945: Change hash trace to show each node's individual contribution Currently, the hash trace accumulates up the plan tree and is displayed only for tuple cache nodes. This means that tuple cache nodes high in a large plan can have hundreds of lines of hash trace output without an indication of which contributions came from which nodes. This changes the hash trace in two ways: 1. It displays each plan node's individual contribution to the hash trace. This only contains a summary of the hash contributed by the child, so the hash trace does not accumulate up the plan tree. Since each node is displaying its own contribution, the tuple cache node does not display the hash trace itself. 2. This adds structure to the hash trace to include a comment for each contribution to the hash trace. This allows a cleaner display of the individual pieces of a node's hash trace. It also gives extra information about the specific contributions into the hash. It should be possible to trace the contribution through the plan tree. This also changes the output to only display the hash trace with explain_level=EXTENDED or higher (i.e. it won't be displayed with STANDARD). 
Example output: tuple cache hash trace: TupleDescriptor 0: TTupleDescriptor(id:0, byteSize:0, numNullBytes:0, tableId:1, tuplePath:[]) Table: TTableName(db_name:functional, table_name:alltypes) PlanNode: [TPlanNode(node_id:0, node_type:HDFS_SCAN_NODE, num_children:0, limit:-1, row_tuples:[0], nullable_tu] [ples:[false], disable_codegen:false, pipelines:[], hdfs_scan_node:THdfsScanNode(tuple_id:0, random_r] [eplica:false, use_mt_scan_node:false, is_partition_key_scan:false, file_formats:[]), resource_profil] [e:TBackendResourceProfile(min_reservation:0, max_reservation:0))] Query options hash: TQueryOptionsHash(hi:-2415313890045961504, lo:-1462668909363814466) Testing: - Modified TupleCacheInfoTest and TupleCacheTest to use the new hash trace Change-Id: If53eda24e7eba264bc2d2f212b63eab9dc97a74c Reviewed-on: http://gerrit.cloudera.org:8080/23017 Reviewed-by: Yida Wu <wydbaggio000@gmail.com> Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-07-23 16:15:30 +00:00
Daniel Vanko	365ce0b12f	IMPALA-11512: Add tests for BINARY type support in Iceberg This patch adds tests for the binary type in Iceberg tables. Change-Id: I9221050a4bee57b8fbb85280478304e5b28efd21 Reviewed-on: http://gerrit.cloudera.org:8080/23167 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-07-21 16:34:41 +00:00
Daniel Vanko	ee69ed1d03	IMPALA-13625: Allow reading Parquet int32/int64 as decimal without logical types This patch allows reading columns with integer logical type as decimals. This can occur when we're trying to read files that were written as INT but the column was altered to a suitable DECIMAL. In this case the precision is based on physical type and equals 9 and 18, for int32 and int64 respectively. Test: * add new e2e tests Change-Id: I56006eb3cca28c81ec8467d77b35005fbf669680 Reviewed-on: http://gerrit.cloudera.org:8080/22922 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-07-21 16:34:33 +00:00


2024 Commits