Commit Graph

11383 Commits

Author SHA1 Message Date
dependabot[bot]
521b17918c Bump jinja2 from 2.11.3 to 3.1.4 in /infra/python/deps
Bumps [jinja2](https://github.com/pallets/jinja) from 2.11.3 to 3.1.4.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/2.11.3...3.1.4)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-05-06 16:45:25 +00:00
Michael Smith
0d01f5e829 IMPALA-13053: Update test to use ORC files
Updates test_max_nesting_depth to use precreated ORC files, like the
Parquet version, rather than using Hive to generate ORC files from the
Parquet files. This shortens each test run by almost 3 minutes.

Change-Id: I2f5bdbb86af0e651d189217a18882d5eda1098d5
Reviewed-on: http://gerrit.cloudera.org:8080/21391
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-05-04 11:19:02 +00:00
Daniel Becker
f75745e9bb IMPALA-13035: Querying metadata tables from non-Iceberg tables throws IllegalArgumentException
When attempting to query a metadata table of a non-Iceberg table the
analyzer throws 'IllegalArgumentException'.

The problem is that 'IcebergMetadataTable.isIcebergMetadataTable()'
doesn't actually check whether the given path belongs to a valid
metadata table, it only checks whether the path could syntactically
refer to one. This is because it is called in
'Path.getCandidateTables()', at which point analysis has not been done
yet.

However, 'IcebergMetadataTable.isIcebergMetadataTable()' is also called
in 'Analyzer.getTable()'. If 'isIcebergMetadataTable()' returns true,
'getTable()' tries to instantiate an 'IcebergMetadataTable' object with
the table ref of the base table. If that table is not an Iceberg table,
a precondition check fails.

This change renames 'isIcebergMetadataTable()' to
'canBeIcebergMetadataTable()' and adds a new 'isIcebergMetadataTable()'
function, which also takes an 'Analyzer' as a parameter. With the help
of the 'Analyzer' it is possible to determine whether the base table is
an Iceberg table. 'Analyzer.getTable()' then uses this new
'isIcebergMetadataTable()' function instead of
'canBeIcebergMetadataTable()'.

The constructor of 'IcebergMetadataTable' is also modified to take an
'FeIcebergTable' as the parameter for the base table instead of a
general 'FeTable'.

Testing:
 - Added a test query in iceberg-metadata-tables.test.

Change-Id: Ia7c25ed85a8813011537c73f0aaf72db1501f9ef
Reviewed-on: http://gerrit.cloudera.org:8080/21361
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Peter Rozsa <prozsa@cloudera.com>
2024-05-03 10:59:45 +00:00
Peter Rozsa
7ad9400656 IMPALA-13044: Upgrade bouncycastle to 1.78
This patch upgrades bouncycastle to 1.78. As of bouncycastle:1.71, the
*-jdk15on artifact is no longer available; the artifact has changed to
*-jdk18on.

Tests:
 - core tests ran

Change-Id: I8372916ab79b863e7a07d22e8333abd54492fa29
Reviewed-on: http://gerrit.cloudera.org:8080/21371
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-05-03 00:09:15 +00:00
Yida Wu
5b70e48ebb IMPALA-13031: Enhancing logging for spilling configuration with local buffer directory details
The patch adds logging of the local buffer directory when using
remote scratch space. The printed log looks like
"Using local buffer directory for scratch space
/tmp/test/impala-scratch on disk 8 limit: 500.00 MB,
priority: 2147483647".

Tests:
Manually tested that the logging works as described.

Change-Id: I8fb357016d72a363ee5016f7881b0f6b0426aff5
Reviewed-on: http://gerrit.cloudera.org:8080/21350
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-05-02 04:34:23 +00:00
wzhou-code
08f8a30025 IMPALA-12910: Support running TPCH/TPCDS queries for JDBC tables
This patch adds a script to create external JDBC tables for the TPCH
and TPCDS datasets, and adds unit tests to run TPCH and TPCDS queries
against external JDBC tables with Impala-Impala federation. Note that
JDBC tables are mapping tables; they don't take additional disk space.
It fixes a race condition in the caching of SQL DataSource objects by
using a new DataSourceObjectCache class, which checks the reference
count before closing a SQL DataSource.
Adds a new query option 'clean_dbcp_ds_cache' with a default value of
true. When it is set to false, a SQL DataSource object is not closed
when its reference count reaches 0; instead it is kept in the cache
until the SQL DataSource has been idle for more than 5 minutes. The
flag 'dbcp_data_source_idle_timeout_s' is added to make this duration
configurable.
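
A rough sketch of the caching behaviour described above (illustrative
Python only, with hypothetical names; the real class is the Java
DataSourceObjectCache):

  import time

  class RefCountedCache:
      # Illustrative reference-counted cache with an idle timeout.
      def __init__(self, idle_timeout_s=300, clean_on_release=True):
          self.idle_timeout_s = idle_timeout_s      # cf. dbcp_data_source_idle_timeout_s
          self.clean_on_release = clean_on_release  # cf. clean_dbcp_ds_cache
          self.entries = {}  # key -> [datasource, refcount, last_used]

      def acquire(self, key, create_fn):
          entry = self.entries.get(key)
          if entry is None:
              entry = self.entries[key] = [create_fn(), 0, time.time()]
          entry[1] += 1
          return entry[0]

      def release(self, key, close_fn):
          entry = self.entries[key]
          entry[1] -= 1
          entry[2] = time.time()
          # Close immediately only when the clean-on-release behaviour is on.
          if entry[1] == 0 and self.clean_on_release:
              close_fn(entry[0])
              del self.entries[key]

      def evict_idle(self, close_fn):
          # Called periodically by a cleanup thread.
          now = time.time()
          for key in list(self.entries):
              ds, refs, last_used = self.entries[key]
              if refs == 0 and now - last_used > self.idle_timeout_s:
                  close_fn(ds)
                  del self.entries[key]
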
java.sql.Connection.close() sometimes fails to remove a closed
connection from the connection pool, which causes JDBC worker threads
to wait a long time for available connections from the pool. The
workaround is to call the BasicDataSource.invalidateConnection() API
to close a connection.
Two flag variables are added for the DBCP configuration properties
'maxTotal' and 'maxWaitMillis'. Note that the 'maxActive' and 'maxWait'
properties were renamed to 'maxTotal' and 'maxWaitMillis' respectively
in apache.commons.dbcp v2.
Fixes a bug in database type comparison: the type strings specified by
the user could be lower case or a mix of upper/lower case, but the code
compared the types against upper-case strings.
Fixes an issue where the SQL DataSource object was not closed in
JdbcDataSource.open() and JdbcDataSource.getNext() when errors were
returned from DBCP APIs or JDBC drivers.

testdata/bin/create-tpc-jdbc-tables.py supports creating JDBC tables
for Impala-Impala, Postgres and MySQL.
The following sample commands create TPCDS JDBC tables for Impala-Impala
federation with a remote coordinator running at 10.19.10.86, and for a
Postgres server running at 10.19.10.86:
  ${IMPALA_HOME}/testdata/bin/create-tpc-jdbc-tables.py \
    --jdbc_db_name=tpcds_jdbc --workload=tpcds \
    --database_type=IMPALA --database_host=10.19.10.86 --clean

  ${IMPALA_HOME}/testdata/bin/create-tpc-jdbc-tables.py \
    --jdbc_db_name=tpcds_jdbc --workload=tpcds \
    --database_type=POSTGRES --database_host=10.19.10.86 \
    --database_name=tpcds --clean

TPCDS tests for JDBC tables run only for release/exhaustive builds.
TPCH tests for JDBC tables run for core and exhaustive builds, except
Dockerized builds.

Remaining Issues:
 - tpcds-decimal_v2-q80a failed with returned rows not matching expected
   results for some decimal values. This will be fixed in IMPALA-13018.

Testing:
 - Passed core tests.
 - Passed query_test/test_tpcds_queries.py in release/exhaustive build.
 - Manually verified that only one SQL DataSource object was created for
   test_tpcds_queries.py::TestTpcdsQueryForJdbcTables since query option
   'clean_dbcp_ds_cache' was set as false, and the SQL DataSource object
   was closed by cleanup thread.

Change-Id: I44e8c1bb020e90559c7f22483a7ab7a151b8f48a
Reviewed-on: http://gerrit.cloudera.org:8080/21304
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-05-02 02:14:20 +00:00
Joe McDonnell
d09c502490 IMPALA-13049: Add dependency management for log4j2 to use 2.18.0
Currently, there is no dependency management for the log4j2
version. Impala itself doesn't use log4j2. However, recently
we encountered a case where one dependency brought in
log4j-core 2.18.0 and another brought in log4j-api 2.17.1.
log4j-core 2.18.0 relies on the existence of the ServiceLoaderUtil
class from log4j-api 2.18.0. log4j-api 2.17.1 doesn't have this
class, which causes class not found exceptions.

This uses dependency management to set the log4j2 version to 2.18.0
for log4j-core and log4j-api to avoid any mismatch.

Testing:
 - Ran a local build and verified that both log4j-core and log4j-api
   are using 2.18.0.

Change-Id: Ib4f8485adadb90f66f354a5dedca29992c6d4e6f
Reviewed-on: http://gerrit.cloudera.org:8080/21379
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-05-01 02:37:49 +00:00
Michael Smith
20f908b1ab IMPALA-13046: Update Iceberg mixed format deletes test
Updates iceberg-mixed-format-position-deletes.test for HIVE-28069. Newer
versions of Hive will now remove a data file if a delete would negate
all rows in the data file to reduce the number of small files produced.
The test now ensures every data file it expects to produce will have a
row remaining after the delete (or circumvents the merge logic by using
different formats).

Change-Id: I87c23cc541983223c6b766372f4e582c33ae6836
Reviewed-on: http://gerrit.cloudera.org:8080/21373
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-30 22:18:12 +00:00
Michael Smith
b35aa81965 IMPALA-13045: Wait for impala_query_live to exist
Waits for creation of 'sys.impala_query_live' in tests to ensure it has
been registered with HMS.

Change-Id: I5cc3fa3c43be7af9a5f097359a0d4f20d057a207
Reviewed-on: http://gerrit.cloudera.org:8080/21372
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2024-04-30 17:15:57 +00:00
Joe McDonnell
56f35ad40a IMPALA-12684: Enable IMPALA_COMPRESSED_DEBUG_INFO by default
IMPALA_COMPRESSED_DEBUG_INFO was introduced in IMPALA-11511
and reduces Impala binary sizes by >50%. Debug tools like
gdb and our minidump processing scripts handle compressed
debug information properly. There are slightly higher link
times and additional overhead when doing debugging.

Overall, the reduction in binary sizes seems worth it given
the modest overhead. Compressing the debug information also
avoids concerns that adding debug information to toolchain
components would increase binary sizes.

This changes the default for IMPALA_COMPRESSED_DEBUG_INFO to
true.

Testing:
 - Ran pstack on a Centos 7 machine running tests with
   IMPALA_COMPRESSED_DEBUG_INFO=true and verified that
   the symbols work properly
 - Forced the production of minidumps for a job using
   IMPALA_COMPRESSED_DEBUG_INFO=true and verified it is
   processed properly.
 - Used this locally for development for several months

Change-Id: I31640f1453d351b11644bb46af3d2158b22af5b3
Reviewed-on: http://gerrit.cloudera.org:8080/20871
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-29 21:09:05 +00:00
Michael Smith
712a37bce4 IMPALA-12997: Use graceful shutdown for query log tests
Uses graceful shutdown for all tests that might insert into
'sys.impala_query_log' to avoid leaving the table locked in HMS by a
SIGTERM. That's primarily any test that lowers
'query_log_write_interval_s' or 'query_log_max_queued'.

Lowers grace period on test_query_log_table_flush_on_shutdown because
ShutdownWorkloadManagement() is not started until the grace period ends.

Updates "Adding/Removing local backend" to only apply to executors. It
was only added for executors, but would be removed on dedicated
coordinators as well (resulting in a DFATAL message during graceful
shutdown).

Change-Id: Ia123c53a952a77ff4a9c02736b5717ccaa3566dc
Reviewed-on: http://gerrit.cloudera.org:8080/21345
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-27 03:19:39 +00:00
Michael Smith
ba32d70891 IMPALA-13012: Lower default query_log_max_queued
Sets the query_log_max_queued default such that

  query_log_max_queued * num_columns(49) < statement_expression_limit

to avoid triggering e.g.

  AnalysisException: Exceeded the statement expression limit (250000)
  Statement has 370039 expressions.
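
As a quick sanity check of the bound (plain arithmetic, not code from
the patch):

  statement_expression_limit = 250000
  num_columns = 49

  # Largest queue size whose flush INSERT stays under the expression limit.
  max_queued = (statement_expression_limit - 1) // num_columns
  print(max_queued)                  # 5102
  print(max_queued * num_columns)    # 249998, below the 250000 limit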

Also increases statement_expression_limit for insertion to avoid an
error if query_log_max_queued is changed.

Logs time taken to write to the queries table for help with debugging
and adds histogram "impala-server.completed-queries.write-durations".

Fixes InternalServer so it uses 'default_query_options'.

Change-Id: I6535675307d88cb65ba7d908f3c692e0cf3259d7
Reviewed-on: http://gerrit.cloudera.org:8080/21351
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2024-04-26 23:45:56 +00:00
Michael Smith
73a9ef9c4c IMPALA-13005: Create Query Live table in HMS
Creates the 'sys.impala_query_live' table in HMS using a 'CREATE TABLE'
command similar to the one for 'sys.impala_query_log'. Updates the
frontend to identify a
System Table based on the '__IMPALA_SYSTEM_TABLE' property. Tables
improperly marked with '__IMPALA_SYSTEM_TABLE' will error when
attempting to scan them because no relevant scanner will be available.

Creating the table in HMS simplifies supporting 'SHOW CREATE TABLE' and
'DESCRIBE EXTENDED', so both are allowed for parity with Query Log.
Explicitly disables 'COMPUTE STATS' on system tables as it doesn't work
correctly.

Makes System Tables work with local catalog mode, fixing

  LocalCatalogException: Unknown table type for table sys.impala_query_live

Updates workload management implementation to rely more on
SystemTables.thrift definition, and adds DCHECKs to verify completeness
and ordering.

Testing:
- adds additional test cases for changes to introspection commands
- passes existing test_query_live and test_query_log suites

Change-Id: Idf302ee54a819fdee2db0ae582a5eeddffe4a5b4
Reviewed-on: http://gerrit.cloudera.org:8080/21302
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-26 23:21:56 +00:00
Riza Suminto
29e4186793 IMPALA-13024: Ignore slots if using default pool and empty group
Slot-based admission should not be enabled when using the default pool.
There is a bug where a coordinator-only query still goes through
slot-based admission because the executor group name is set to
ClusterMembershipMgr::EMPTY_GROUP_NAME ("empty group (using coordinator
only)"). This patch adds a check to recognize coordinator-only queries
in the default pool and skip slot-based admission for them.

Testing:
- Add BE test AdmissionControllerTest.CanAdmitRequestSlotsDefault.
- In test_executor_groups.py, split test_coordinator_concurrency to
  test_coordinator_concurrency_default and
  test_coordinator_concurrency_two_exec_group_cluster to show the
  behavior change.
- Pass core tests in ASAN build.

Change-Id: I0b08dea7ba0c78ac6b98c7a0b148df8fb036c4d0
Reviewed-on: http://gerrit.cloudera.org:8080/21340
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-26 21:18:42 +00:00
Daniel Becker
457ab9831a IMPALA-12973,IMPALA-11491,IMPALA-12651: Support BINARY nested in complex types in select list
Binary fields in complex types are currently not supported at all for
regular tables (an error is returned). For Iceberg metadata tables,
IMPALA-12899 added a temporary workaround to allow queries that contain
these fields to succeed by NULLing them out. This change adds support
for displaying them with base64 encoding for both regular and Iceberg
metadata tables.

Complex types are displayed in JSON format, so simply inserting the
bytes of the binary fields is not acceptable as it would produce invalid
JSON. Base64 is a widely used encoding that allows representing
arbitrary binary information using only a limited set of ASCII
characters.

This change also adds support for top level binary columns in Iceberg
metadata tables. However, these are not base64 encoded but are returned
in raw byte format - this is consistent with how top level binary
columns from regular (non-metadata) tables are handled.
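
To illustrate why base64 is used, a small sketch (not Impala code) of
embedding binary bytes in JSON output:

  import base64, json

  raw = bytes([0x00, 0xff, 0x10, 0x80])  # an arbitrary binary field value
  # Inserting the raw bytes into JSON would produce invalid JSON; base64
  # keeps the value within a limited set of ASCII characters.
  encoded = base64.b64encode(raw).decode("ascii")
  print(json.dumps({"item": {"bin_field": encoded}}))
  # {"item": {"bin_field": "AP8QgA=="}}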

Testing:
 - added test queries in iceberg-metadata-tables.test referencing both
   nested and top level binary fields; also updated existing queries
 - moved relevant tests (queries extracting binary fields from within
   complex types) from nested-types-scanner-basic.test to a new
   binary-in-complex-type.test file and also added a query that selects
   the containing complex types; this new test file is run from
   test_scanners.py::TestBinaryInComplexType::\
     test_binary_in_complex_type
 - moved negative tests in AnalyzerTest.TestUnsupportedTypes() to
   AnalyzeStmtsTest.TestComplexTypesInSelectList() and converted them to
   positive tests (expecting success); a negative test already in
   AnalyzeStmtsTest.TestComplexTypesInSelectList() was also converted

Change-Id: I7b1d7fa332a901f05a46e0199e13fb841d2687c2
Reviewed-on: http://gerrit.cloudera.org:8080/21269
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2024-04-26 13:18:54 +00:00
Steve Carlin
b39cd79ae8 IMPALA-12872: Use Calcite for optimization - part 1: simple queries
This is the first commit to use the Calcite library to parse,
analyze, and optimize queries.

The hook for the planner is through an override of the JniFrontend. The
CalciteJniFrontend class is the driver that walks through each of the
Calcite steps which are as follows:

CalciteQueryParser: Takes the string query and outputs an AST in the
form of Calcite's SqlNode object.

CalciteMetadataHandler: Iterates through the SqlNode from the previous step
and makes sure all essential table metadata is retrieved from catalogd.

CalciteValidator: Validates the SqlNode tree, akin to the Impala Analyzer.

CalciteRelNodeConverter: Changes the AST into a logical plan. In this first
commit, the only logical nodes used are LogicalTableScan and LogicalProject.
The LogicalTableScan will serve as the node that reads from an HDFS table and
the LogicalProject will only project out the columns used in the query. In
later versions, the LogicalProject will also handle function changes.

CalciteOptimizer: This step optimizes the query. In this cut, it is a no-op,
but in later versions it will perform logical optimizations via Calcite's
rule mechanism.

CalcitePhysPlanCreator: Converts the Calcite RelNode logical tree into
Impala's PlanNode physical tree

ExecRequestCreator: Implements the existing Impala steps that turn a Single
Node Plan into a Distributed Plan. It also creates the TExecRequest object
needed by the runtime server.

Only some very basic queries will work with this commit. These include:
select * from tbl <-- only needs the LogicalTableScan
select c1 from tbl <-- Also uses the LogicalProject

In the CalciteJniFrontend, there are some basic checks to make sure only
select statements get processed. Any non-query statement reverts to the
current Impala planner.

In this iteration, any query besides the minimal ones listed above results
in a caught exception, and the query is then run through the current
Impala planner. The tests that do work can be found in calcite.test and
are run through the custom cluster test test_experimental_planner.py.

This iteration should support all types with the exception of complex
types. Calcite does not have a STRING type, so the string type is
represented as VARCHAR(MAXINT) similar to how Hive represents their
STRING type.

The ImpalaTypeConverter file is used to convert the Impala Type object
to corresponding Calcite objects.

Authorization is not yet working with this current commit. A Jira has been
filed (IMPALA-13011) to deal with this.

Change-Id: I453fd75b7b705f4d7de1ed73c3e24cafad0b8c98
Reviewed-on: http://gerrit.cloudera.org:8080/21109
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2024-04-25 20:09:09 +00:00
Daniel Becker
4f033c7750 IMPALA-12950: Improve error message in case of out-of-range numeric conversions
IMPALA-12035 introduced checks for numeric conversions that are unsafe
and can fail (if the target type cannot store the value, the behaviour
is undefined):
 - from floating-point types to integer types
 - from double to float

However, it can be difficult to trace which part of the query caused
this based on the error message. This change adds the source type, the
destination type and the value to be converted to the error message.
Unfortunately, at this point in the BE, the original SQL is not
available, so we cannot reference that.

Testing:
 - extended existing tests in expr-test.cc.

Change-Id: Ieeed52e25f155818c35c11a8a6821708476ffb32
Reviewed-on: http://gerrit.cloudera.org:8080/21331
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-25 16:26:17 +00:00
Abhishek Rawat
f620e5d5c0 IMPALA-13015: Dataload fails due to concurrency issue with test.jceks
Moves the 'hadoop credential' command used for creating test.jceks to
testdata/bin/create-load-data.sh. Earlier it was in bin/load-data.py,
which is called in parallel and was causing failures due to race
conditions.

Testing:
- Ran JniFrontendTest#testGetSecretFromKeyStore after data loading and
test ran clean.

Change-Id: I7fbeffc19f2b78c19fee9acf7f96466c8f4f9bcd
Reviewed-on: http://gerrit.cloudera.org:8080/21346
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-23 11:09:47 +00:00
Zoltan Borok-Nagy
ec73b542c5 IMPALA-13002: Iceberg V2 tables with Avro delete files aren't read properly
If the Iceberg table has Avro delete files (e.g. by setting
'write.delete.format.default'='avro'), Impala won't be able to read
the contents of the delete files properly. This is because the Avro
schema is not set properly for the virtual delete table.

Testing:
 * added e2e tests with position delete files of all kinds

Change-Id: Iff13198991caf32c51cd9e0ace4454fd00216cf6
Reviewed-on: http://gerrit.cloudera.org:8080/21301
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
2024-04-23 09:02:12 +00:00
Riza Suminto
850709cece IMPALA-12777: Fix tpcds/tpcds-q66.test
PlannerTest/tpcds/tpcds-q66.test was mistakenly a copy of
PlannerTest/tpcds/tpcds-q61.test with different predicate values. This
patch replaces the wrong test file with the correct TPC-DS Q66 query.

Testing:
- Pass FE test TpcdsPlannerTest#testQ66.

Change-Id: I5b886f5dc1da213d25f33bd7b01dacca53eaef1b
Reviewed-on: http://gerrit.cloudera.org:8080/21344
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-23 04:50:05 +00:00
Noemi Pap-Takacs
9b05a205fe IMPALA-13000: Document OPTIMIZE TABLE
Document OPTIMIZE TABLE syntax and behaviour.

Testing:
 - built docs locally

Change-Id: I851669686ed4da610dcac97c9b88ff23b0a4a647
Reviewed-on: http://gerrit.cloudera.org:8080/21320
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
2024-04-22 10:40:44 +00:00
Riza Suminto
93278cccf0 IMPALA-12543: Detect self-events before finishing DDL
test_iceberg_self_events has been flaky because tbls_refreshed_before
does not always equal tbls_refreshed_after in between query
executions.

Further investigation revealed a concurrency bug: the db/table-level
lock is not taken during the db/table self-events check (IMPALA-12461
part 1). The order of an ALTER TABLE operation is as follows:

1. alter table starts in CatalogOpExecutor
2. table level lock is taken
3. HMS RPC starts (CatalogOpExecutor.applyAlterTable())
4. HMS generates the event
5. HMS RPC returns
6. table is reloaded
7. catalog version is added to inflight event list
8. table level lock is released

Meanwhile, the event processor thread can fetch the new event after
step 4 and before step 7. Because of IMPALA-12461 (part 1), it can also
finish self-events checking before step 7 is reached. Before
IMPALA-12461, the self-events check would have had to wait for step 8.
Note that this issue is only relevant for table-level events, as
self-events checking for partition-level events still takes the table
lock.

This patch fixes the issue by adding newCatalogVersion to the table's
in-flight event list before updating HMS, using the helper class
InProgressTableModification. If the HMS update does not complete (i.e.,
an exception is thrown), the newCatalogVersion that was added is then
removed.
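
The ordering fix follows an add-before/roll-back-on-failure pattern; a
minimal sketch in Python with hypothetical names (the real code is the
Java helper class InProgressTableModification):

  def alter_table_with_self_event(table, hms_client, new_catalog_version):
      # Register the in-flight version BEFORE the HMS RPC, so the event
      # processor can match the self-event even if it races ahead.
      table.in_flight_events.append(new_catalog_version)
      try:
          hms_client.apply_alter_table(table)  # HMS generates the event here
      except Exception:
          # Roll back the registration if the HMS update did not complete.
          table.in_flight_events.remove(new_catalog_version)
          raise
      table.reload()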

This patch also fixes a few smaller issues, including:
- Avoid incrementing EVENTS_SKIPPED_METRIC if numFilteredEvents == 0 in
  MetastoreEventFactory.getFilteredEvents().
- Increment EVENTS_SKIPPED_METRIC in
  MetastoreTableEvent.reloadTableFromCatalog() if the table is already in
  the middle of reloading (revealed through flaky
  test_skipping_older_events).
- Rephrase misleading log message in
  MetastoreEventProcessor.getNextMetastoreEvents().

Testing:
- Add TestEventProcessingWithImpala, run it with debug_action
  and sync_ddl dimensions.
- Pass exhaustive tests.

Change-Id: I8365c934349ad21a4d9327fc11594d2fc3445f79
Reviewed-on: http://gerrit.cloudera.org:8080/21029
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-21 10:16:29 +00:00
stiga-huang
db09d58ef7 IMPALA-12933: Avoid fetching unnecessary events of unwanted types
There are several places where catalogd will fetch all events of a
specific type on a table. E.g. in TableLoader#load(), if the table has
an old createEventId, catalogd will fetch all CREATE_TABLE events after
that createEventId on the table.

Fetching the list of events is expensive since the filtering is done on
the client side, i.e. catalogd fetches all events and filters them
locally based on the event type and table name. This could take hours
if there are lots of events (e.g. 1M) in HMS.

This patch sets the eventTypeSkipList to the complement of the wanted
types, so the get_next_notification RPC can filter out some events on
the HMS side. To avoid adding too much computation overhead to HMS's
underlying RDBMS in evaluating EVENT_TYPE != 'xxx' predicates, rare
event types (e.g. DROP_ISCHEMA) are not added to the list. A new flag,
common_hms_event_types, is added to specify the common HMS event types.

Once HIVE-28146 is resolved, we can set the wanted types directly in the
HMS RPC and this approach can be simplified.

UPDATE_TBL_COL_STAT_EVENT and UPDATE_PART_COL_STAT_EVENT are the most
common unused events for Impala. They are also added to the default
skip list. A new flag, default_skipped_hms_event_types, is added to
configure this list.
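
A rough sketch of how the skip list could be assembled (illustrative
Python with an assumed subset of event type names, not the catalogd
code):

  # Common HMS event types (cf. the common_hms_event_types flag).
  COMMON_EVENT_TYPES = {
      "CREATE_TABLE", "DROP_TABLE", "ALTER_TABLE", "ADD_PARTITION",
      "DROP_PARTITION", "ALTER_PARTITION", "INSERT", "RELOAD",
  }
  # Events Impala never consumes (cf. default_skipped_hms_event_types).
  DEFAULT_SKIPPED = {"UPDATE_TBL_COL_STAT_EVENT", "UPDATE_PART_COL_STAT_EVENT"}

  def build_skip_list(wanted_types):
      # Send the complement of the wanted types so HMS can filter events
      # server-side; rare types are deliberately left out of the predicate.
      return (COMMON_EVENT_TYPES - set(wanted_types)) | DEFAULT_SKIPPED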

This patch also fixes an issue that events of the non-default catalog
are not filtered out.

In a local perf test, I generated 100K RELOAD events after creating a
table in Hive, then used the table in Impala to trigger metadata loading
on it, which fetches the latest CREATE_TABLE event by polling all
events after the last known CREATE_TABLE event. Before this patch,
fetching the events took 1s779ms. Now it takes only 395.377ms. Note
that in production environments the event messages are usually larger,
so the speedup could be even bigger.

Tests:
 - Added an FE test
 - Ran CORE tests

Change-Id: Ieabe714328aa2cc605cb62b85ae8aa4bd537dbe9
Reviewed-on: http://gerrit.cloudera.org:8080/21186
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-20 19:25:54 +00:00
David Rorke
25a8d70664 IMPALA-12657: Improve ProcessingCost of ScanNode and NonGroupingAggregator
This patch improves the accuracy of the CPU ProcessingCost estimates for
several of the CPU intensive operators by basing the costs on benchmark
data. The general approach for a given operator was to run a set of queries
that exercised the operator under various conditions (e.g. large vs small
row sizes and row counts, varying NDV, different file formats, etc) and
capture the CPU time spent per unit of work (the unit of work might be
measured as some number of rows, some number of bytes, some number of
predicates evaluated, or some combination of these). The data was then
analyzed in an attempt to fit a simple model that would allow us to
predict CPU consumption of a given operator based on information available
at planning time.

For example, the CPU ProcessingCost for a Parquet scan is estimated as:
TotalCost = (0.0144 * BytesMaterialized) + (0.0281 * Rows * Predicate Count)

The coefficients (0.0144 and 0.0281) are derived from benchmarking
scans under a variety of conditions. Similar cost functions and coefficients
were derived for all of the benchmarked operators. The coefficients for all
the operators are normalized such that a single unit of cost equates to
roughly 100 nanoseconds of CPU time on an r5d.4xlarge instance. So we would
predict that an operator with a cost of 10,000,000 would complete in roughly
one second on a single core.
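
The formula above can be evaluated directly; a quick illustration in
Python using the quoted coefficients:

  def parquet_scan_cost(bytes_materialized, rows, predicate_count):
      # TotalCost = 0.0144 * BytesMaterialized + 0.0281 * Rows * PredicateCount
      return 0.0144 * bytes_materialized + 0.0281 * rows * predicate_count

  # One cost unit is roughly 100ns of CPU on an r5d.4xlarge, so
  # cost / 1e7 approximates single-core seconds.
  cost = parquet_scan_cost(1 << 30, rows=10_000_000, predicate_count=2)
  print(cost, "units, ~%.1f single-core seconds" % (cost / 1e7))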

Limitations:
* Costing only addresses CPU time spent and doesn't account for any IO
  or other wait time.
* Benchmarking scenarios didn't provide comprehensive coverage of the
  full range of data types, distributions, etc. More thorough
  benchmarking could improve the costing estimates further.
* This initial patch only covers a subset of the operators, focusing
  on those that are most common and most CPU intensive. Specifically
  the following operators are covered by this patch. All others
  continue to use the previous ProcessingCost code:
  AggregationNode
  DataStreamSink (exchange sender)
  ExchangeNode
  HashJoinNode
  HdfsScanNode
  HdfsTableSink
  NestedLoopJoinNode
  SortNode
  UnionNode

Benchmark-based costing of the remaining operators will be covered by
a future patch.

Future patches will automate the collection and analysis of the benchmark
data and the computation of the cost coefficients to simplify maintenance
of the costing as performance changes over time.

Change-Id: Icf1edd48d4ae255b7b3b7f5b228800d7bac7d2ca
Reviewed-on: http://gerrit.cloudera.org:8080/21279
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-20 06:48:58 +00:00
Riza Suminto
d437334e53 IMPALA-12988: Calculate an unbounded version of CpuAsk
The planner calculates CpuAsk through a recursive call beginning at
Planner.computeBlockingAwareCores(), which is called after
Planner.computeEffectiveParallelism(). It does blocking operator
analysis over the selected degree of parallelism that was decided during
the computeEffectiveParallelism() traversal. That selected degree of
parallelism, however, is already bounded by the min and max parallelism
config, derived from the PROCESSING_COST_MIN_THREADS and
MAX_FRAGMENT_INSTANCES_PER_NODE options respectively.

This patch calculates an unbounded version of CpuAsk that is not bounded
by min and max parallelism config. It is purely based on the fragment's
ProcessingCost and query plan relationship constraint (for example, the
number of JOIN BUILDER fragments should equal the number of destination
JOIN fragments for partitioned join).

The frontend receives both bounded and unbounded CpuAsk values from
TQueryExecRequest on each executor group set selection round. The
unbounded CpuAsk is then scaled down once using an nth-root-based
sublinear function, controlled by the total CPU count of the smallest
executor group set and the bounded CpuAsk number. Another linear scaling
is then applied to both the bounded and unbounded CpuAsk using the
QUERY_CPU_COUNT_DIVISOR option. The frontend then compares the unbounded
CpuAsk after scaling against CpuMax to avoid assigning a query to a
small executor group set too soon. The last executor group set stays as
the "catch-all" executor group set.
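
One plausible shape of such a sublinear scaling, purely as an
illustration (the actual function used by the planner may differ):

  import math

  def scale_cpu_ask(unbounded, bounded, smallest_group_cpu_count,
                    cpu_count_divisor=1.0):
      # Hypothetical: compress only the part of the unbounded ask that
      # exceeds the bounded ask, using an nth root keyed off the smallest
      # executor group set's CPU count, then apply the linear divisor.
      if unbounded > bounded > 0:
          root = max(2.0, math.log2(max(smallest_group_cpu_count, 2)))
          unbounded = bounded * (unbounded / bounded) ** (1.0 / root)
      return unbounded / cpu_count_divisor, bounded / cpu_count_divisor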

After this patch, setting COMPUTE_PROCESSING_COST=True will show
the following changes in the query profile:
- The "max-parallelism" fields in the query plan will all be set to
  maximum parallelism based on ProcessingCost.
- The CpuAsk counter is changed to show the unbounded CpuAsk after
  scaling.
- A new counter CpuAskBounded shows the bounded CpuAsk after scaling. If
  QUERY_CPU_COUNT_DIVISOR=1 and PLANNER_CPU_ASK slot counting strategy
  is selected, this CpuAskBounded is also the minimum total admission
  slots given to the query.
- A new counter MaxParallelism shows the unbounded CpuAsk before
  scaling.
- The EffectiveParallelism counter remains unchanged,
  showing bounded CpuAsk before scaling.

Testing:
- Update and pass FE test TpcdsCpuCostPlannerTest and
  PlannerTest#testProcessingCost.
- Pass EE test tests/query_test/test_tpcds_queries.py
- Pass custom cluster test tests/custom_cluster/test_executor_groups.py

Change-Id: I5441e31088f90761062af35862be4ce09d116923
Reviewed-on: http://gerrit.cloudera.org:8080/21277
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-20 00:28:53 +00:00
Michael Smith
5e7d720257 IMPALA-12938: add-opens for platform.cgroupv1
Adds '--add-opens=jdk.internal.platform.cgroupv1' for Java 11 with
ehcache, covering Impala daemons and frontend tests. Fixes
InaccessibleObjectException detected by test_banned_log_messages.py.

Change-Id: I312ae987c17c6f06e1ffe15e943b1865feef6b82
Reviewed-on: http://gerrit.cloudera.org:8080/21334
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-19 22:39:23 +00:00
Riza Suminto
9a41dfbdc7 IMPALA-13016: Fix ambiguous row_regex that check for no-existence
There are few row_regex patterns used in EE test files that are
ambiguous on whether a pattern does not exist in all parts of the
results/runtime profile or at least one row does not have that pattern.
These were caught by grepping the following pattern:

$ git grep -n "row_regex: (?\!"

This patch replaces them with either with !row_regex or VERIFY_IS_NOT_IN
comment.

Testing:
- Run and pass modified tests.

Change-Id: Ic81de34bf997dfaf1c199b1fe1b05346b55ff4da
Reviewed-on: http://gerrit.cloudera.org:8080/21333
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-19 21:00:18 +00:00
Riza Suminto
6abfdbc56c IMPALA-12980: Translate CpuAsk into admission control slots
Impala has a concept of "admission control slots" - the amount of
parallelism that should be allowed on an Impala daemon. This defaults to
the number of processors per executor and can be overridden with the
--admission_control_slots flag.

Admission control slot accounting is described in IMPALA-8998. It
computes 'slots_to_use' for each backend based on the maximum number of
instances of any fragment on that backend. This can lead to slot
underestimation and query overadmission. For example, assume an executor
node with 48 CPU cores and configured with --admission_control_slots=48.
It is assigned 4 non-blocking query fragments, each with 12 instances
scheduled on this executor. The IMPALA-8998 algorithm will request the
maximum instance count (12) as slots rather than the sum of all
non-blocking fragment instances (48). With the 36 remaining slots free,
the executor can still admit another fragment from a different query but
will potentially have CPU contention with the one that is currently
running.

When COMPUTE_PROCESSING_COST is enabled, the planner generates a CpuAsk
number that represents the CPU requirement of the query over a
particular executor group set. This number is an estimate of the
largest number of query fragment instances that can run in parallel
without waiting, given by the blocking operator analysis. Therefore, the
fragment trace that sums into that CpuAsk number can be translated into
'slots_to_use' as well, which more closely resembles the maximum
parallel execution of fragment instances.

This patch adds a new query option called SLOT_COUNT_STRATEGY to control
which admission control slot accounting to use. There are two possible
values:
- LARGEST_FRAGMENT, which is the original algorithm from IMPALA-8998.
  This is still the default value for the SLOT_COUNT_STRATEGY option.
- PLANNER_CPU_ASK, which follows the fragment trace that contributes
  towards the CpuAsk number. This strategy schedules more admission
  control slots than, or as many as, the LARGEST_FRAGMENT strategy.

For the PLANNER_CPU_ASK strategy, the planner marks fragments that
contribute to CpuAsk as dominant fragments. It also passes the
max_slot_per_executor information it knows about the executor group
set to the scheduler.

AvgAdmissionSlotsPerExecutor counter is added to describe what Planner
thinks the average 'slots_to_use' per backend will be, which follows
this formula:

  AvgAdmissionSlotsPerExecutor = ceil(CpuAsk / num_executors)

Actual 'slots_to_use' in each backend may differ from
AvgAdmissionSlotsPerExecutor, depending on what is scheduled on that
backend. 'slots_to_use' is shown as the 'AdmissionSlots' counter under
each executor profile node.
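
As a quick illustration of the formula (not planner code):

  import math

  def avg_admission_slots_per_executor(cpu_ask, num_executors):
      # AvgAdmissionSlotsPerExecutor = ceil(CpuAsk / num_executors)
      return math.ceil(cpu_ask / num_executors)

  print(avg_admission_slots_per_executor(cpu_ask=100, num_executors=8))  # 13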

Testing:
- Update test_executors.py with AvgAdmissionSlotsPerExecutor assertion.
- Pass test_tpcds_queries.py::TestTpcdsQueryWithProcessingCost.
- Add EE test test_processing_cost.py.
- Add FE test PlannerTest#testProcessingCostPlanAdmissionSlots.

Change-Id: I338ca96555bfe8d07afce0320b3688a0861663f2
Reviewed-on: http://gerrit.cloudera.org:8080/21257
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-18 21:58:13 +00:00
Yida Wu
6a079be290 IMPALA-13004: Fix heap-use-after-free error in ExprTest AiFunctionsTest
The issue is that the code previously used a std::string_view to
hold data that is actually owned by a rapidjson::Document. However,
the rapidjson::Document object is destroyed after the std::string_view
is created. This meant the std::string_view referenced memory that was
no longer valid, leading to a heap-use-after-free error.

This patch fixes this issue by modifying the function to
return a std::string instead of a std::string_view. When the
function returns a string, it creates a copy of the
data from rapidjson::Document. This ensures the returned
string has its own memory allocation and doesn't rely on
the destroyed rapidjson::Document.

Tests:
Reran the ASAN build and it passed.

Change-Id: I3bb9dcf9d72cce7ad37d5bc25821cf6ee55a8ab5
Reviewed-on: http://gerrit.cloudera.org:8080/21315
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-18 18:58:41 +00:00
Yida Wu
cc4d0a58ea IMPALA-12874: Identify active and standby catalog and statestore in the web debug endpoint
This patch adds support for displaying the HA status of the catalog and
statestore on the root web page. The status is presented
as "Catalog Status: Active" or "Statestore Status: Standby"
based on the values retrieved from the metrics
catalogd-server.active-status and statestore.active-status.

If the catalog or statestore is standalone, it will show active as
the status, which is the same as the metric.

Tests:
Ran core tests.
Manually tested the web page and verified that the status display is
correct. Also checked that when a failover happens, the current
'standby' status changes to 'active'.

Change-Id: Ie9435ba7a9549ea56f9d080a9315aecbcc630cd2
Reviewed-on: http://gerrit.cloudera.org:8080/21294
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-18 18:30:54 +00:00
Michael Smith
5f49cc44b4 IMPALA-12998: Add SHOW_METADATA_TABLES to ignored DDL
Adds SHOW_METADATA_TABLES to the list of ignored DDL in workload
management. Fixes DCHECK failure when running Impala's full test suite
with 'enable_workload_mgmt'.

Change-Id: I69f7de9756aa730d70cd9187c9f869d5bcf67fce
Reviewed-on: http://gerrit.cloudera.org:8080/21290
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-18 00:05:50 +00:00
Daniel Becker
652c9dd0de IMPALA-13008: test_metadata_tables failed in Ubuntu 20 build
TestIcebergV2Table::test_metadata_tables failed in Ubuntu 20 build in a
release candidate because the file sizes in some queries didn't match
the expected ones. As Impala writes its version into the Parquet files
it writes, the file sizes can change with the release (especially as
SNAPSHOT or RELEASE is part of the full version, and their lengths
differ).

This change updates the failing tests to take regexes for the file sizes
instead of concrete values.

Change-Id: Iad8fd0d9920034e7dbe6c605bed7579fbe3b5b1f
Reviewed-on: http://gerrit.cloudera.org:8080/21317
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-17 23:39:24 +00:00
Csaba Ringhofer
541fc5ee9e IMPALA-12990: Fix impala-shell handling of unset rows_deleted
The issue occurred in Python 3 when 0 rows were deleted from an Iceberg
table. It could also happen in other DMLs with older Impala servers where
TDmlResult.rows_deleted was not set. See the Jira for details of
the error.
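
A defensive pattern along these lines handles the unset field
(illustrative only; the actual impala-shell change may differ):

  def format_delete_summary(dml_result):
      # TDmlResult.rows_deleted may be absent (older servers) or None;
      # treat both as 0 instead of failing under Python 3.
      rows_deleted = getattr(dml_result, "rows_deleted", None) or 0
      return "Deleted %d row(s)" % rows_deleted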

Testing:
Extended shell tests for Kudu DML reporting to also cover Iceberg.

Change-Id: I5812b8006b9cacf34a7a0dbbc89a486d8b454438
Reviewed-on: http://gerrit.cloudera.org:8080/21284
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-17 18:52:25 +00:00
Michael Smith
bbe3303ded IMPALA-13003: Handle Iceberg AlreadyExistsException
When multiple coordinators attempt to create the same table concurrently
with "if not exists", we still see

  AlreadyExistsException: Table was created concurrently: my_iceberg_tbl

Iceberg throws its own version of AlreadyExistsException, but we avoid
most code paths that would throw it because we first check HMS to see if
the table exists before trying to create it.

Updates createIcebergTable to handle Iceberg's AlreadyExistsException
identically to the HMS AlreadyExistsException.

Adds a test using DebugAction to simulate concurrent table creation.

Change-Id: I847eea9297c9ee0d8e821fe1c87ea03d22f1d96e
Reviewed-on: http://gerrit.cloudera.org:8080/21312
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-17 14:03:32 +00:00
Kurt Deschler
06bbbea257 IMPALA-12679: Improve test_rows_sent_counters assert
This patch changes the assert for failed test test_rows_sent_counters so
that the actual RPC count is displayed in the assert output. The root
cause of the failure will be addressed once sufficient data is collected
with the new output.

Testing:
  Ran test_rows_sent_counters with modified expected RPC count range to
simulate failure.

Change-Id: Ic6b48cf4039028e749c914ee60b88f04833a0069
Reviewed-on: http://gerrit.cloudera.org:8080/21310
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-17 05:00:01 +00:00
Noemi Pap-Takacs
fc07880b8a IMPALA-13006: Restrict Iceberg tables to Parquet
Iceberg test tables/views are restricted to the Parquet file format
in functional/schema_constraints.csv. The following two were
unintentionally left out:
iceberg_query_metadata
iceberg_view

Added the constraint for these tables too.

Testing:
- executed data load for the functional dataset

Change-Id: I2590d7a70fe6aaf1277b19e6b23015d39d2935cb
Reviewed-on: http://gerrit.cloudera.org:8080/21306
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-16 17:34:04 +00:00
Saurabh Katiyal
0606fc760f IMPALA-11495: Add glibc version and effective locale to the Web UI
Added a new section "Other Info" to the Web UI root page,
displaying the effective locale and glibc version.

Change-Id: Ia69c4d63df4beae29f5261691a8dcdd04b931de7
Reviewed-on: http://gerrit.cloudera.org:8080/21252
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-16 14:09:32 +00:00
Michael Smith
74ff59b913 IMPALA-12963: Return parent PID when children spawned
Returns the original PID for a command rather than any children that may
be active. This happens during graceful shutdown in UBSAN tests. Also
updates 'kill' to use the version of 'get_pid' that logs details to help
with debugging.

Moves try block in test_query_log.py to after client2 has been
initialized. Removes 'drop table' on unique_database, since test suite
already handles cleanup.

Change-Id: I214e79507c717340863d27f68f6ea54c169e4090
Reviewed-on: http://gerrit.cloudera.org:8080/21278
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-15 22:00:29 +00:00
stiga-huang
61ceb16d88 IMPALA-12999: Add log4j.properties to the DEB/RPM packages
log4j.properties is required to configure log4j before its logs are
redirected to glog (done in GlogAppender#Install()). This is crucial for
showing error logs during initialization, especially while launching the
JVM. See the JIRA description for an example.

This copies log4j.properties from fe/src/test/resources directly since
it hasn't changed for years.

Change-Id: Iee0b9699ef313aa8e94bd351fa51fad3ea0cdf57
Reviewed-on: http://gerrit.cloudera.org:8080/21293
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-14 10:10:47 +00:00
Daniel Becker
f2f0d798df IMPALA-12996: Add support for DATE in Iceberg metadata tables
DATE fields in Iceberg metadata tables were NULLed out before this
change. This change adds support for displaying their actual values.

DATE fields are stored as 32-bit integers (storing the number of days
since the epoch), so they are handled similarly to INTS, but if they are
out of the valid DATE range, their value is set to
DateValue::INVALID_DAYS_SINCE_EPOCH.
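
The 32-bit value maps to a calendar date in the usual way; a small
illustration (not backend code):

  from datetime import date, timedelta

  def days_since_epoch_to_date(days):
      # DATE fields store the number of days since 1970-01-01.
      return date(1970, 1, 1) + timedelta(days=days)

  print(days_since_epoch_to_date(0))      # 1970-01-01
  print(days_since_epoch_to_date(19821))  # 2024-04-08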

Tests:
 - added a test query and adjusted existing ones in
   iceberg-metadata-tables.test

Change-Id: Ib2223385f90555b1f9b22f3e27fa0e2489c3b9b5
Reviewed-on: http://gerrit.cloudera.org:8080/21292
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-12 19:03:32 +00:00
Xiang Yang
050805d21b IMPALA-12362: (part-4/4) Refactor linux packaging related cmake files.
Moves the Linux packaging related content into its own
package/CMakeLists.txt to make it clearer.

This patch also adds the LICENSE and NOTICE files to the final package.

Testing:
 - Manually deployed the package on Ubuntu 22.04 and verified it.

Change-Id: If3914dcda69f81a735cdf70d76c59fa09454777b
Reviewed-on: http://gerrit.cloudera.org:8080/20263
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-12 14:48:00 +00:00
Zoltan Borok-Nagy
0334f83704 IMPALA-12810: Simplify IcebergDeleteNode and IcebergDeleteBuilder
Now that we have the DIRECTED distribution mode, some parts of
IcebergDeleteNode and IcebergDeleteBuilder became dead code. It is
time to simplify the above classes.

IcebergDeleteBuilder and KrpcDataStreamSender now also tolerate
NULL file paths, which are not an error in hash join mode either.

Change-Id: I3ba02b33433990950b49628f11e732e01ed8a34d
Reviewed-on: http://gerrit.cloudera.org:8080/21258
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-11 21:27:22 +00:00
Daniel Becker
ef6dad694d IMPALA-12986: Base64Encode fails if the 'out_len' output parameter is passed with certain values
The Base64Encode function in coding-util.h with signature

  bool Base64Encode(const char* in, int64_t in_len, int64_t out_max,
      char* out, int64_t* out_len);

fails if '*out_len', when passed to the function, contains a value that
does not fit in a 32 bit integer.

Internally we use the

  int sasl_encode64(const char *in, unsigned inlen, char *out, unsigned
      outmax, unsigned *outlen);

function and explicitly cast 'out_len' to 'unsigned*'.

The error is that the called sasl_encode64() function only sets the four
lower bytes of '*out_len' (assuming that 'unsigned' is a 32 bit
integer), and if the upper bytes are not all zero, the resulting value
of '*out_len' will be incorrect.

This change changes the type of 'out_len' from 'int64_t*' to 'unsigned*'
to match the type that sasl_encode64() expects.

Base64Decode() is also updated to use 'unsigned*'. Before this change it
used an intermediate 32 bit local variable to avoid this issue.

Testing:
 - added a regression test in coding-util-test.cc

Change-Id: I35ae59fc9b3280f89ea4f7d95d27d2f21751001f
Reviewed-on: http://gerrit.cloudera.org:8080/21271
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-11 18:24:53 +00:00
Zoltan Borok-Nagy
94ed30d9fa IMPALA-12991: Eliminate unnecessary SORT for Iceberg DELETEs
Since we are using IcebergBufferedDeleteSink, which sorts the data
before flushing, there is no need to add a SORT node before the sink.

Testing:
 * updated planner tests

Change-Id: I94a691e7990228a1ec2de03e6ad90ebb97931581
Reviewed-on: http://gerrit.cloudera.org:8080/21285
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-11 17:35:07 +00:00
Yida Wu
9837637d93 IMPALA-12920: Support ai_generate_text built-in function for OpenAI's chat completion API
Added support for following built-in functions:
- ai_generate_text_default(prompt)
- ai_generate_text(ai_endpoint, prompt, ai_model,
  ai_api_key_jceks_secret, additional_params)

'ai_endpoint', 'ai_model' and 'ai_api_key_jceks_secret' are flagfile
options. The 'ai_generate_text_default(prompt)' syntax expects all of
these to be set to proper values. The other syntax will try to use the
provided input parameter values, but falls back to the instance-level
values if the inputs are NULL or empty.

Only public OpenAI (api.openai.com) and Azure OpenAI (openai.azure.com)
API endpoints are currently supported.

Exposed these functions in FunctionContext so that they can also be
called from UDFs:
- ai_generate_text_default(context, model)
- ai_generate_text(context, ai_endpoint, prompt, ai_model,
  ai_api_key_jceks_secret, additional_params)

Testing:
- Added unit tests for AiGenerateTextInternal function
- Added fe test for JniFrontend::getSecretFromKeyStore
- Ran manual tests to make sure Impala can talk with OpenAI LLMs using
'ai_generate_text' built-in function. Example sql:
select ai_generate_text("https://api.openai.com/v1/chat/completions",
"hello", "gpt-3.5-turbo", "open-ai-key",
'{"temperature": 0.9, "model": "gpt-4"}')
- Tested using standalone UDF SDK and made sure that the UDFs can invoke
  BuiltInFunctions (ai_generate_text and ai_generate_text_default)

Change-Id: Id4446957f6030bab1f985fdd69185c3da07d7c4b
Reviewed-on: http://gerrit.cloudera.org:8080/21168
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-11 07:25:50 +00:00
Laszlo Gaal
408c119f7d IMPALA-12564: Prevent Hive loading libfesupport.so in the minicluster during TSAN runs
During TSAN runs all Impala binaries (including libfesupport.so) are
built with TSAN options, which include a reference to the external
symbol __tsan_init. This causes a problem for libfesupport.so when it is
loaded into Hive during minicluster startup, because the Java VM running
Hive's code cannot supply this symbol (the stock JVM is obviously not
built with TSAN).

Unfortunately this symbol resolution failure causes Hive's JVM simply to
abort on Red Hat 8 (or later) and on Ubuntu 20.04 (or later).
On earlier versions of the same platforms the JVM turned the same
failure into an UnsatisfiedLinkError exception, which is actually
handled by Hive.

This patch prevents libfesupport.so from being loaded into Hive for TSAN
runs so that the minicluster can actually be started. This is achieved
by not adding the directory containing libfesupport.so to
JAVA_LIBRARY_PATH, preventing the JVM from finding it.

Change-Id: Ie030d9876c297d6e9dae80eba37e525ee2bccb20
Reviewed-on: http://gerrit.cloudera.org:8080/21191
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-10 20:37:02 +00:00
Gabor Kaszab
df7aac9517 IMPALA-12970: Fix ConcurrentModificationException for Iceberg table scans
When a table is partitioned, IcebergScanNode sorts the file descriptors
for better scheduling. However, the list of file descriptors comes from
IcebergContentFileStore and is shared between different SELECT queries
on the table. When another query iterates the list of file
descriptors while IcebergScanNode is sorting them, we get
a ConcurrentModificationException.
To solve this, IcebergScanNode now creates its own copy of the file
descriptor list so that it does not interfere with other queries.
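
The fix is the usual copy-then-sort pattern for shared collections; a
minimal Python analogue:

  # Shared, cached list of file descriptors (analogue of the list held by
  # IcebergContentFileStore).
  shared_fds = ["fd3", "fd1", "fd2"]

  # Bad: sorting the shared list in place mutates it while another query
  # may be iterating it (ConcurrentModificationException in Java).
  # shared_fds.sort()

  # Good: each scan node sorts its own copy, leaving the shared list alone.
  my_fds = sorted(shared_fds)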

Manual testing:
300-400 SELECT * Iceberg queries were sent to Impala in a loop, which
reliably reproduced the original issue. With the fix the issue is
gone.
The queries used for the repro:
1:
select *
from functional_parquet.iceberg_v2_partitioned_position_deletes_orc a,
functional_parquet.iceberg_partitioned_orc_external b
where a.action = b.action and b.id=3;
2:
select *
from functional_parquet.iceberg_v2_equality_delete_schema_evolution;

Change-Id: Iafe57f05ffa0fa6a0875c141cfafd5ee1607a5c3
Reviewed-on: http://gerrit.cloudera.org:8080/21267
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-10 17:47:32 +00:00
Csaba Ringhofer
8ff51fbf74 IMPALA-5323: Support BINARY columns in Kudu tables
The patch adds read and write support for BINARY columns in Kudu
tables.

Predicate push down is implemented, but is incomplete:
a constant binary argument will only be pushed down if
constant folding never encounters non-ASCII strings.
Examples:
 - cast(unhex(hex("aa")) as binary) can be pushed down
 - cast(hex(unhex("aa")) as binary) can't be pushed
   down as unhex("aa") is not ASCII (even though the
   final result is ASCII)
See IMPALA-10349 for more details on this limitation.

The patch also changes casting BINARY <-> STRING from a no-op
to calling an actual function. While this may add some small
overhead, it allows the backend to know whether an expression
returns STRING or BINARY.
Change-Id: Iff701a4b3a09ce7b6982c5d238e65f3d4f3d1151
Reviewed-on: http://gerrit.cloudera.org:8080/18868
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-10 16:17:15 +00:00
Michael Smith
6121c4f7d6 IMPALA-12905: Disk-based tuple caching
This implements on-disk caching for the tuple cache. The
TupleCacheNode uses the TupleFileWriter and TupleFileReader
to write and read back tuples from local files. The file format
uses RowBatch's standard serialization used for KRPC data streams.

The TupleCacheMgr is the daemon-level structure that coordinates
the state machine for cache entries, including eviction. When a
writer is adding an entry, it inserts an IN_PROGRESS entry before
starting to write data. This does not count towards cache capacity,
because the total size is not known yet. This IN_PROGRESS entry
prevents other writers from concurrently writing the same entry.
If the write is successful, the entry transitions to the COMPLETE
state and updates the total size of the entry. If the write is
unsuccessful and a new execution might succeed, then the entry is
removed. If the write is unsuccessful and won't succeed later
(e.g. if the total size of the entry exceeds the max size of an
entry), then it transitions to the TOMBSTONE state. TOMBSTONE
entries avoid the overhead of trying to write entries that are
too large.

Given these states, when a TupleCacheNode is doing its initial
Lookup() call, one of three things can happen:
 1. It can find a COMPLETE entry and read it.
 2. It can find an IN_PROGRESS/TOMBSTONE entry, which means it
    cannot read or write the entry.
 3. It finds no entry and inserts its own IN_PROGRESS entry
    to start a write.
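
A compact sketch of the entry states and the three lookup outcomes above
(illustrative Python, not the TupleCacheMgr implementation):

  from enum import Enum

  class EntryState(Enum):
      IN_PROGRESS = 1  # being written; not yet counted against capacity
      COMPLETE = 2     # readable; size known and charged to the cache
      TOMBSTONE = 3    # write failed permanently (e.g. entry too large)

  def lookup(cache, key):
      state = cache.get(key)
      if state == EntryState.COMPLETE:
          return "read"                    # 1. read the cached tuples
      if state in (EntryState.IN_PROGRESS, EntryState.TOMBSTONE):
          return "pass-through"            # 2. cannot read or write the entry
      cache[key] = EntryState.IN_PROGRESS  # 3. claim the entry and start a write
      return "write"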

The tuple cache is configured using the tuple_cache parameter,
which is a combination of the cache directory and the capacity
similar to the data_cache parameter. For example, /data/0:100GB
uses directory /data/0 for the cache with a total capacity of
100GB. This currently supports a single directory, but it can
be expanded to multiple directories later if needed. The cache
eviction policy can be specified via the tuple_cache_eviction_policy
parameter, which currently supports LRU or LIRS. The tuple_cache
parameter cannot be specified if allow_tuple_caching=false.

This contains contributions from Michael Smith, Yida Wu,
and Joe McDonnell.

Testing:
 - This adds basic custom cluster tests for the tuple cache.

Change-Id: I13a65c4c0559cad3559d5f714a074dd06e9cc9bf
Reviewed-on: http://gerrit.cloudera.org:8080/21171
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
2024-04-10 03:11:49 +00:00
Riza Suminto
4764b91f42 IMPALA-12965: Add debug query option RUNTIME_FILTER_IDS_TO_SKIP
Runtime filter still have negative effect on certain scenario such as
long wait time that delays scan and cascading runtime filter chain that
prevents parallel execution of fragments. Having debug query option to
simply skip a runtime filter id from being scheduled can help us
investigate and test a solution early before implementing the
improvement code.

This patch add RUNTIME_FILTER_IDS_TO_SKIP option to do that. This patch
also improve parsing of multi-value query options to not split at ','
char that is within two double quotes and ignore empty/whitespace value
if exist.
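
The quoted-comma handling can be pictured with a small parser sketch
(illustrative Python; the real parsing lives in the backend query-option
code):

  def split_multi_value_option(value):
      # Split on ',' unless it is inside double quotes; drop empty items.
      items, current, in_quotes = [], [], False
      for ch in value:
          if ch == '"':
              in_quotes = not in_quotes
              current.append(ch)
          elif ch == ',' and not in_quotes:
              items.append(''.join(current))
              current = []
          else:
              current.append(ch)
      items.append(''.join(current))
      return [item.strip() for item in items if item.strip()]

  print(split_multi_value_option('1, 2, "a,b", , 3'))  # ['1', '2', '"a,b"', '3']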

Testing:
- Add BE test in query-options-test.cc
- Add FE test in runtime-filter-query-options.test

Change-Id: I897e37685dd1ec279989b55560ec7616a00d2280
Reviewed-on: http://gerrit.cloudera.org:8080/21230
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-09 21:35:53 +00:00