Hive uses URL encoding to format partition strings when creating the
partition folders, e.g. "00:00:00" will be encoded into "00%3A00%3A00".
When you create a partition with "00:00:00" as the value of a
string-type partition column "p", the underlying partition folder is
"p=00%3A00%3A00".
When parsing the partition folders, Impala URL-decodes the partition
folder names to get the correct partition values. This is correct for
the ALTER TABLE RECOVER PARTITIONS command, which gets the partition
strings from the file paths. However, for partition strings that come
from HMS events, Impala shouldn't URL-decode them since they are not
URL encoded and are the original partition values. This caused HMS
events on partitions that have percent signs in the value strings to be
matched to the wrong partitions.
This patch fixes the issue by only URL-decoding the partition strings
that come from file paths.
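For illustration, Python's standard library shows the round trip
involved (a sketch of the concept only; Hive escapes a specific
character set and Impala's actual code is not Python):
```
from urllib.parse import quote, unquote

value = "00:00:00"
folder = "p=" + quote(value, safe="")             # "p=00%3A00%3A00" on disk
assert unquote(folder.split("=", 1)[1]) == value  # decoding file paths is right

hms_value = "a%20b"                 # literal value as it arrives in an HMS event
assert unquote(hms_value) == "a b"  # decoding here would corrupt the value
```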
Tests:
- Ran tests/metadata/test_recover_partitions.py
- Added custom-cluster test.
Change-Id: I7ba7fbbed47d39b02fa0b1b86d27dcda5468e344
Reviewed-on: http://gerrit.cloudera.org:8080/22388
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
AES (Advanced Encryption Standard) is a widely recognized and respected
encryption algorithm used to protect sensitive data. It operates by
transforming plaintext into ciphertext using a symmetric key, ensuring
confidentiality and integrity. The standard specifies the Rijndael
algorithm, a symmetric block cipher that processes data blocks of 128
bits, using cipher keys with lengths of 128 and 256 bits. The patch
makes use of the EVP_*() functions from the OpenSSL library.
The patch includes:
1. AES-GCM, AES-CTR, and AES-CFB encryption functionalities and
AES-GCM, AES-ECB, AES-CTR, and AES-CFB decryption functionalities.
2. Support for both 128-bit and 256-bit key sizes for GCM and ECB modes.
3. Enhancements to EncryptionKey class to accommodate various AES modes.
The aes_encrypt() and aes_decrypt() functions serve as entry
points for encryption and decryption operations, based on
user-provided keys, AES modes, and initialization vectors (IVs).
The implementation includes key length validation and IV size
checks to ensure data integrity and confidentiality.
Multiple AES modes: GCM, CFB, CTR for encryption, and GCM, CFB, CTR
and ECB for decryption are supported to provide flexibility and
compatibility with various use cases and OpenSSL features. AES-GCM
is set as the default mode due to its strong security properties.
AES-CTR and AES-CFB are provided as fallbacks for environments where
AES-GCM may not be supported. Note that AES-GCM is not available in
OpenSSL versions prior to 1.0.1, so having multiple methods ensures
broader compatibility.
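As a rough illustration of the GCM round trip described above (a sketch
using Python's third-party 'cryptography' package; Impala's
implementation uses OpenSSL's EVP_*() API in C++):
```
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # 128-bit keys are also supported
iv = os.urandom(12)                        # 96-bit IV is typical for GCM
ciphertext = AESGCM(key).encrypt(iv, b"sensitive data", None)
assert AESGCM(key).decrypt(iv, ciphertext, None) == b"sensitive data"
```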
Testing: The patch is thoroughly tested and the tests are included in
exprs.test.
Change-Id: I3902f2b1d95da4d06995cbd687e79c48e16190c9
Reviewed-on: http://gerrit.cloudera.org:8080/20447
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
ImpalaTestSuite.client is always initialized as a beeswax client, and
many tests use it directly rather than going through helper methods
such as execute_query().
This patch adds a default_test_protocol parameter to conftest.py. It
controls whether ImpalaTestSuite.client is initialized as
'beeswax_client', 'hs2_client', or 'hs2_http_client'. This parameter
still defaults to 'beeswax'.
This patch also adds helper methods 'default_client_protocol_dimension',
'beeswax_client_protocol_dimension' and 'hs2_client_protocol_dimension'
for convenience and traceability.
Reduced occurrences where test methods manually override
ImpalaTestSuite.client. They are replaced by a combination of
ImpalaTestSuite.create_impala_clients and
ImpalaTestSuite.close_impala_clients.
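A hypothetical sketch of how a suite might opt into one of these
dimensions (the helper names come from this commit; the call site and
signatures are assumptions):
```
from tests.common.impala_test_suite import ImpalaTestSuite


class TestMySuite(ImpalaTestSuite):

  @classmethod
  def add_test_dimensions(cls):
    super(TestMySuite, cls).add_test_dimensions()
    # Assumed: run every test in this suite through the HS2 protocol.
    cls.ImpalaTestMatrix.add_dimension(cls.hs2_client_protocol_dimension())
```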
Testing:
- Pass core tests.
Change-Id: I9165ea220b2c83ca36d6e68ef3b88b128310af23
Reviewed-on: http://gerrit.cloudera.org:8080/22336
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch extends the existing AdminFnStmt to support operations on
EventProcessor. E.g. to pause the EventProcessor:
impala-shell> :event_processor('pause');
to resume the EventProcessor:
impala-shell> :event_processor('start');
Or to resume the EventProcessor on a given event id (1000):
impala-shell> :event_processor('start', 1000);
Admin can also resume the EventProcessor at the latest event id by using
-1:
impala-shell> :event_processor('start', -1);
Supported command actions in this patch: pause, start, status.
The command output of all actions will show the latest status of
EventProcessor, including
- EventProcessor status:
PAUSED / ACTIVE / ERROR / NEEDS_INVALIDATE / STOPPED / DISABLED.
- LastSyncedEventId: The last HMS event id which we have synced to.
- LatestEventId: The event id of the latest event in HMS.
Example output:
[localhost:21050] default> :event_processor('pause');
+--------------------------------------------------------------------------------+
| summary |
+--------------------------------------------------------------------------------+
| EventProcessor status: PAUSED. LastSyncedEventId: 34489. LatestEventId: 34489. |
+--------------------------------------------------------------------------------+
Fetched 1 row(s) in 0.01s
If authorization is enabled, only admin users that have ALL privilege on
SERVER can run this command.
Note that there is a restriction in MetastoreEventsProcessor#start(long)
that resuming EventProcessor back to a previous event id is only allowed
when it's not in the ACTIVE state. This patch aims to expose the control
of EventProcessor to the users so MetastoreEventsProcessor is not
changed. We can investigate the restriction and see if we want to relax
it.
Note that resuming EventProcessor at a newer event id can be done in any
state. Admins can use this to manually resolve the lag of HMS event
processing, after they have made sure all (or important) tables are
manually invalidated/refreshed.
A new catalogd RPC, SetEventProcessorStatus, is added for coordinators
to control the status of EventProcessor.
Tests
- Added e2e tests
Change-Id: I5a19f67264cfe06a1819a22c0c4f0cf174c9b958
Reviewed-on: http://gerrit.cloudera.org:8080/22250
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this change, Puffin stats were only read from the current
snapshot. Now we also consider older snapshots, and for each column we
choose the most recent available stats. Note that this means that the
stats for different columns may come from different snapshots.
In case there are both HMS and Puffin stats for a column, the more
recent one will be used - for HMS stats we use the
'impala.lastComputeStatsTime' table property, and for Puffin stats we
use the snapshot timestamp to determine which is more recent.
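A sketch of the per-column selection rule described above (illustrative
only; the names and types are not Impala's actual code):
```
def choose_ndv(hms_time, hms_ndv, puffin_time, puffin_ndv):
    # Prefer whichever stats source is more recent; fall back when one
    # of the two sources is absent for this column.
    if hms_ndv is None:
        return puffin_ndv
    if puffin_ndv is None or hms_time >= puffin_time:
        return hms_ndv
    return puffin_ndv
```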
This commit also renames the startup flag 'disable_reading_puffin_stats'
to 'enable_reading_puffin_stats' and the table property
'impala.iceberg_disable_reading_puffin_stats' to
'impala.iceberg_read_puffin_stats' to make them more intuitive. The
default values are flipped to keep the same behaviour as before.
The documentation of Puffin reading is updated in
docs/topics/impala_iceberg.xml
Testing:
- updated existing test cases and added new ones in
test_iceberg_with_puffin.py
- reorganised the tests in TestIcebergTableWithPuffinStats in
test_iceberg_with_puffin.py: tests that modify table properties and
other state that other tests rely on are now run separately to
provide a clean environment for all tests.
Change-Id: Ia37abe8c9eab6d91946c8f6d3df5fb0889704a39
Reviewed-on: http://gerrit.cloudera.org:8080/22177
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Enables tuple caching on aggregates directly above scan nodes. Caching
aggregates requires that their children are also eligible for caching,
so this excludes aggregates above an exchange, union, or hash join.
Testing:
- Adds Planner tests for different aggregate cases to confirm they have
stable tuple cache keys and are valid for caching.
- Adds custom cluster tests that cached aggregates are used, and can be
re-used in slightly different statements.
Change-Id: I9bd13c2813c90d23eb3a70f98068fdcdab97a885
Reviewed-on: http://gerrit.cloudera.org:8080/22322
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
To support killing queries programmatically, this patch adds a new
type of SQL statement, the KILL QUERY statement, to cancel and
unregister a query on any coordinator in the cluster.
A KILL QUERY statement looks like
```
KILL QUERY '123:456';
```
where `123:456` is the query id of the query we want to kill. We follow
syntax from HIVE-17483. For backward compatibility, 'KILL' and 'QUERY'
are added as "unreserved keywords", like 'DEFAULT'. This allows the
three keywords to be used as identifiers.
A user is authorized to kill a query only if the user is an admin or is
the owner of the query. KILL QUERY statements are not affected by
admission control.
Implementation:
Since we don't know in advance which impalad is the coordinator of the
query we want to kill, we need to broadcast the kill request to all the
coordinators in the cluster. Upon receiving a kill request, each
coordinator checks whether it is the coordinator of the query:
- If yes, it cancels and unregisters the query,
- If no, it reports "Invalid or unknown query handle".
Currently, a KILL QUERY statement is not interruptible. IMPALA-13663 is
created for this.
For authorization, this patch adds a custom handler of
AuthorizationException for each statement to allow the exception to be
handled by the backend. This is because we don't know whether the user
is the owner of the query until we reach its coordinator.
To support cancelling child queries, this patch changes
ChildQuery::Cancel() to bypass the HS2 layer so that the session of the
child query will not be added to the connection used to execute the
KILL QUERY statement.
Testing:
- A new ParserTest case is added to test using "unreserved keywords" as
identifiers.
- New E2E test cases are added for the KILL QUERY statement.
- Added a new dimension in TestCancellation to use the KILL QUERY
statement.
- Added file tests/common/cluster_config.py and made
CustomClusterTestSuite.with_args() composable so that common cluster
configs can be reused in custom cluster tests.
Change-Id: If12d6e47b256b034ec444f17c7890aa3b40481c0
Reviewed-on: http://gerrit.cloudera.org:8080/21930
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Amend log verification in Impala to exclude partition id info.
Partition ids assigned by catalogd may not be in serial order, so it is
best to avoid checking partition ids in the logs.
Change-Id: I27cdeb2a4bed8afa30a27d05c7399c78af5bcebb
Reviewed-on: http://gerrit.cloudera.org:8080/22198
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
getPerInstanceNdvForCpuCosting is a method to estimate the number of
distinct values of exprs per fragment instance while accounting for the
likelihood of duplicate keys across fragment instances. It borrows the
probabilistic model described in IMPALA-2945 and is used exclusively by
AggregationNode.
getPerInstanceNdvForCpuCosting runs the probabilistic formula
individually for each grouping expression and then multiplies the
results together. That matches how we estimated group NDV in the past,
where we simply multiplied the NDVs of the grouping expressions.
Recently, we added tuple-based analysis to lower the cardinality
estimate for all kinds of aggregation nodes (IMPALA-13045, IMPALA-13465,
IMPALA-13086). All of the bounding happens in
AggregationNode.computeStats(), where we call the estimateNumGroups()
function that returns the globalNdv estimate for a specific aggregation
class.
To take advantage of that more precise globalNdv, this patch replaces
getPerInstanceNdvForCpuCosting() with estimatePreaggCardinality(), which
applies the probabilistic formula over this single globalNdv number
rather than the old way, which often returned an overestimate from the
NDV multiplication method. Its use is still limited to calculating
ProcessingCost. Using it for the preagg output cardinality will be done
by IMPALA-2945.
estimatePreaggCardinality is skipped if the data partition of the input
is a subset of the grouping expressions.
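For context, a common form of such a duplicate-keys model can be
sketched as follows (an assumed, textbook-style formula given for
illustration; Impala's exact implementation may differ):
```
import math

def per_instance_ndv(global_ndv, input_cardinality, num_instances):
    # Each distinct key appears about cardinality/ndv times; a given
    # instance misses a key with probability (1 - 1/k)^(rows per key).
    if global_ndv <= 0 or num_instances <= 1:
        return global_ndv
    rows_per_key = float(input_cardinality) / global_ndv
    p_seen = 1.0 - math.pow(1.0 - 1.0 / num_instances, rows_per_key)
    return global_ndv * p_seen
```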
Testing:
- Run and pass PlannerTest with COMPUTE_PROCESSING_COST=True.
  ProcessingCost changes, but all cardinality numbers stay the same.
- Add CardinalityTest#testEstimatePreaggCardinality.
- Update test_executor_groups.py. Enable v2 profile as well for easier
runtime profile debugging.
Change-Id: Iddf75833981558fe0188ea7475b8d996d66983c1
Reviewed-on: http://gerrit.cloudera.org:8080/22320
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
TestAdmissionControllerStress::test_mem_limit is flaky again. One
fragment instance that was expected to stay alive until the query
submission loop ends actually finished early, even though clients are
only fetching 1 row every 0.5 seconds. This patch attempts to address
the flakiness in two ways.
First is lowering batch_size to 10. A lower batch size is expected to
keep all running fragment instances running until the query admission
loop finishes.
Second is lowering num_queries from 50 to 40 if exploration_strategy is
exhaustive. This shortens the query submission loop, especially when
submission_delay_ms is high (150 seconds). This is OK because, based on
the assertions, the test framework will only retain at most 15 active
queries and 10 in-queue queries once the query submission loop ends.
This patch also refactors SubmitQueryThread. Set
long_polling_time_ms=100 for all queries to get a faster initial
response. The lock is removed and replaced with threading.Event to
signal the end of the test. The thread's client and query_handle scope
is made local within the run() method for proper cleanup. Set a timeout
for wait_for_admission_control instead of waiting indefinitely.
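A minimal sketch of the threading.Event pattern (illustrative, not the
actual test code):
```
import threading

test_ended = threading.Event()

def run():
    # Worker loop: client and query handle would be local to this method.
    while not test_ended.is_set():
        pass  # submit a query, poll admission state, fetch rows, etc.

# At the end of the test, the main thread signals all workers at once:
test_ended.set()
```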
impala_connection.py is refactored so that BeeswaxConnection has
logging functionality matching ImpylaHS2Connection. Changed
ImpylaHS2Connection._collect_profile_and_log initialization for the
possibility that the experimental Calcite planner may be able to pull
the query profile and log from the Impala backend.
Testing:
- Run and pass test_mem_limit in both TestAdmissionControllerStress and
TestAdmissionControllerStressWithACService in exhaustive exploration
10 times.
- Run and pass the whole TestAdmissionControllerStress and
TestAdmissionControllerStressWithACService in exhaustive exploration.
Change-Id: I706e3dedce69e38103a524c64306f39eac82fac3
Reviewed-on: http://gerrit.cloudera.org:8080/22351
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
TestDecimalFuzz.test_decimal_ops and TestDecimalFuzz.test_width_bucket
each call execute_scalar 10000 times. This patch speeds them up by
breaking each into 10 parallel test runs where each run calls
execute_scalar 1000 times.
This patch also makes execute_scalar and execute_scalar_expect_success
run queries with long_polling_time_ms=100 if no query_options are
specified. Adds an assertion in execute_scalar_expect_success that the
result is indeed only a single row.
Slightly changes exists_func to avoid an unused-argument warning.
Testing:
- From tests/ run and pass the following command
./run-tests.py query_test/test_decimal_fuzz.py
Change-Id: Ic12b51b50739deff7792e2640764bd75e8b8922d
Reviewed-on: http://gerrit.cloudera.org:8080/22328
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Client connections can drop without an explicit close. This can
happen if the client machine resets or there is a network disruption.
Some load balancers have an idle timeout that results in the
connection becoming invalid without an explicit teardown. With
short idle timeouts (e.g. AWS LB has a timeout of 350 seconds),
this can impact many connections.
This adds startup options to enable / tune TCP keepalive settings for
client connections:
client_keepalive_probe_period_s - idle time before doing keepalive probes
If set to > 0, keepalive is enabled.
client_keepalive_retry_period_s - time between keepalive probes
client_keepalive_retry_count - number of keepalive probes
These startup options mirror the startup options for Kudu's
equivalent functionality.
Thrift has preexisting support for turning on keepalive, but that
support uses the OS defaults for keepalive settings. To add the
ability to tune the keepalive settings, this implements a wrapper
around the Thrift socket (both TLS and non-TLS) and manually sets
the keepalive options on the socket (mirroring code from Kudu's
Socket::SetTcpKeepAlive).
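A sketch of what setting those socket options looks like (Python for
brevity; the Linux-specific TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT
constants map to the three startup options above):
```
import socket

def set_tcp_keepalive(sock, probe_period_s, retry_period_s, retry_count):
    # Enable keepalive, then tune when and how often probes are sent.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, probe_period_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, retry_period_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, retry_count)
```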
This does not enable keepalive by default to make it easy to backport.
A separate patch will turn keepalive on by default.
Testing:
- Added a custom cluster test that connects with impala-shell
and verifies that the socket has the keepalive timer.
Verified that it works on Ubuntu 20, Centos 7, and Redhat 8.
- Used iptables to manually test cases where the client is unreachable
and verified that the server detects that and closes the connection.
Change-Id: I9e50f263006c456bc0797b8306aa4065e9713450
Reviewed-on: http://gerrit.cloudera.org:8080/22254
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
ImpalaTestSuite.cleanup_db(cls, db_name, sync_ddl=1) is used to drop the
entire db. Currently it uses cls.client and sets sync_ddl based on the
parameter. This clears all the query options of cls.client and makes it
always run with the same sync_ddl value unless the test code explicitly
sets query options again.
This patch changes cleanup_db() to use a dedicated client. Tested with
some tests that use this method and observed performance improvements:
TestName Before After
metadata/test_explain.py::TestExplainEmptyPartition 52s 9s
query_test/test_insert_behaviour.py::TestInsertBehaviour::test_insert_select_with_empty_resultset 62s 15s
metadata/test_metadata_query_statements.py::TestMetadataQueryStatements::test_describe_db (exhaustive) 160s 25s
Change-Id: Icb01665bc18d24e2fce4383df87c4607cf4562f1
Reviewed-on: http://gerrit.cloudera.org:8080/22286
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds OAuth support with the following functionality:
* Load and parse the OAuth JWKS from a configured JSON file or URL.
* Read the OAuth access token from the HTTP header, which has the
same format as a JWT Authorization Bearer token.
* Verify the OAuth token's signature with a public key in the JWKS
(see the sketch after this list).
* Get the username out of the payload of the OAuth access token.
* If Kerberos or LDAP is enabled, then both JWT and OAuth are
supported together; otherwise only one of JWT or OAuth is
supported. This is the pre-existing policy for JWT, and OAuth
follows the same policy.
* Impala shell side changes: OAuth options -a and --oauth_cmd
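A hypothetical sketch of that verification flow (using the Python PyJWT
package purely for illustration; Impala's implementation is in C++, and
the username claim name here is an assumption):
```
import jwt
from jwt import PyJWKClient

def verify_bearer_token(token, jwks_url):
    # Fetch the JWKS, pick the key matching the token's 'kid' header,
    # verify the signature (and expiry), then read a username claim.
    signing_key = PyJWKClient(jwks_url).get_signing_key_from_jwt(token)
    claims = jwt.decode(token, signing_key.key, algorithms=["RS256"],
                        options={"verify_aud": False})
    return claims["sub"]  # assumed claim; the real claim may differ
```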
Testing:
- Added 3 custom cluster BE tests in test_shell_jwt_auth.py:
- test_oauth_auth_valid: authenticate with a valid token.
- test_oauth_auth_expired: authentication failure with an
expired token.
- test_oauth_auth_invalid_jwk: authentication failure with a
valid signature but an expired token.
- Added 1 custom cluster FE test in JwtWebserverTest.java
- testWebserverOAuthAuth: basic tests for OAuth
- Added 1 custom cluster FE test in LdapHS2Test.java
- testHiveserver2JwtAndOAuthAuth: tests all combinations of
JWT and OAuth token verification with separate JWKS keys.
- Manually tested with valid, invalid and expired OAuth
access tokens.
- Passed core run.
Change-Id: I65dc8db917476b0f0d29b659b9fa51ebaf45b7a6
Reviewed-on: http://gerrit.cloudera.org:8080/21728
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For a partition-level event, isOlderEvent() in catalogd needs to check
whether the corresponding partition is reloaded after the event. This
should be done while holding the table read lock. Otherwise,
EventProcessor could hit a ConcurrentModificationException error when
there are concurrent DDLs/DMLs modifying the partition list.
Note: Created IMPALA-13650 for a cleaner solution to clear the in-flight
events list for partitioned table events.
Testing:
- Added an end-to-end stress test to verify the above scenario
Change-Id: I26933f98556736f66df986f9440ebb64be395bc1
Reviewed-on: http://gerrit.cloudera.org:8080/21663
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Iceberg supports multiple writers with optimistic concurrency.
Each writer can write new files which are then added to the table
after a validation check to ensure that the commit does not conflict
with other modifications made during the execution.
When there is a conflicting change that cannot be resolved, the newly
written files cannot be committed to the table, so they used to become
orphan files on the file system. Orphan files can accumulate over time,
taking up a lot of storage space. They do not belong to the table
because they are not referenced by any snapshot, and therefore they
can't be removed by expiring snapshots.
This change introduces automatic cleanup of uncommitted files
after an unsuccessful DML operation to prevent creating orphan files.
No cleanup is done if Iceberg throws CommitStateUnknownException
because the update success or failure is unknown in this case.
Testing:
- E2E test: Injected ValidationException with debug option.
- stress test: Added a method to check that no orphan files were
created after failed conflicting commits.
Change-Id: Ibe59546ebf3c639b75b53dfa1daba37cef50eb21
Reviewed-on: http://gerrit.cloudera.org:8080/22189
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Several test vectors were ignored in test_scanners.py. This causes
repetition of the same test without actually varying the test's
exec_option or debug_action.
This patch fixes it by:
- Using execute_query() instead of client.execute().
- Passing vector.get_value('exec_option') when executing test queries.
Repurposes ImpalaTestMatrix.embed_independent_exec_options to deepcopy
the 'exec_option' dimension during vector generation, so each test
execution has a unique copy of 'exec_option' for itself (see the sketch
below).
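Why a shared exec_option dict is a problem, in miniature (illustrative
only):
```
import copy

base = {'batch_size': 0, 'mt_dop': 0}

shared = [base for _ in range(3)]   # every vector aliases one dict
shared[0]['batch_size'] = 1         # ...so this leaks into all three

isolated = [copy.deepcopy(base) for _ in range(3)]
isolated[0]['batch_size'] = 1       # affects only the first vector
assert isolated[1]['batch_size'] == 0
```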
This patch also adds flake8-unused-arguments plugin into
critique-gerrit-review.py and py3-requirements.txt so we can catch this
issue during code review. impala-flake8 is also updated to use
impala-python3-common.sh. Adds flake8==3.9.2 in py3-requirements.txt,
which is the highest version that has compatible dependencies with
pylint==2.10.2.
Drop unused 'dryrun' parameter in get_catalog_compatibility_comments
method of critique-gerrit-review.py.
Testing:
- Run impala-flake8 against test_scanners.py and confirm there is no
more unused variable.
- Run and pass test_scanners.py in core exploration.
Change-Id: I3b78736327c71323d10bcd432e162400b7ed1d9d
Reviewed-on: http://gerrit.cloudera.org:8080/22301
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds INSERT * and UPDATE SET * language elements for
WHEN NOT MATCHED and WHEN MATCHED clauses. INSERT * enumerates all
source expressions from the source table/subquery and analyzes the
clause similarly to the regular WHEN NOT MATCHED THEN INSERT case.
UPDATE SET * creates assignments for each target table column by
enumerating the table columns and assigning source expressions by
index.
If the target column count and the source expression count mismatch, or
the types mismatch, both clauses report analysis errors.
Tests:
- parser tests added
- analyzer tests added
- E2E tests added
Change-Id: I31cb771f2355ba4acb0f3b9f570ec44fdececdf3
Reviewed-on: http://gerrit.cloudera.org:8080/22051
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch refreshes compute_table_stats.py script with the following
changes:
- Limit parallelism to IMPALA_BUILD_THREADS at maximum if --parallelism
argument is not set.
- Change its default connection to hs2, leveraging existing
ImpylaHS2Connection.
- Change OptionParser to ArgumentParser.
- Use impala-python3 to run the script.
- Add --exclude_table_names to skip running COMPUTE STATS on certain
tables/views.
- continue_on_error is False by default.
This patch also improves query handle logging in ImpylaHS2Connection.
collect_profile_and_log argument is added to control whether to pull
logs and runtime profile at the end of __fetch_results(). The default
behavior remains unchanged.
Skip COMPUTE STATS for functional_kudu.alltypesagg and
functional_kudu.manynulls because it is invalid to run COMPUTE STATS
over a view.
Customized hive-site.xml to set datanucleus.connectionPool.maxPoolSize
to 30 and hikaricp.connectionTimeout to 60000 ms. Also set hive.log.dir
to ${IMPALA_CLUSTER_LOGS_DIR}/hive.
Testing:
Repeatedly ran compute-table-stats.sh from a cold state and confirmed
that no errors occur. This is the script to do so from an active
minicluster:
cd $IMPALA_HOME
./bin/start-impala-cluster.py --kill
./testdata/bin/kill-hive-server.sh
./testdata/bin/run-hive-server.sh
./bin/start-impala-cluster.py
./testdata/bin/compute-table-stats.sh > /tmp/compute-stats.txt 2>&1
grep error /tmp/compute-stats.txt
Core tests ran and passed.
Change-Id: I1ebf02f95b957e7dda3a30622b87e8fca3197699
Reviewed-on: http://gerrit.cloudera.org:8080/22231
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The debug action HDFS_SCANNER_THREAD_CHECK_SOFT_MEM_LIMIT exists in the
HdfsScanNode (hdfs-scan-node.cc) code path. MT_DOP > 0 executes using
HdfsScanNodeMt (hdfs-scan-node-mt.cc) rather than HdfsScanNode, and
always starts a single scanner thread per scan node. Thus, there is no
need to exercise the HDFS_SCANNER_THREAD_CHECK_SOFT_MEM_LIMIT and
MT_DOP > 0 combination.
This patch adds scan_multithread_constraint in test_scanners.py where
the 'mt_dop' exec option dimension is declared. This reduces the core
test vector combinations from 1254 to 1138 and the exhaustive test
vector combinations from 7530 to 6774.
Testing:
- Run and pass test_scanners.py in core exploration.
Change-Id: I77c2e6f9bbd4bc1825fa1f006a22ee1a6ea5a606
Reviewed-on: http://gerrit.cloudera.org:8080/22300
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
There are some false positive warnings reported by
critique-gerrit-review.py when adding a new thrift struct that has
required fields. This patch leverages pyparsing to analyze thrift file
changes so we can identify whether a new required field is added to an
existing struct.
thrift_parser.py adds a simple thrift grammar parser to parse a thrift
file into an AST. It basically consists of pyparsing.ParseResults and
some customized classes to inject the line number, i.e.
thrift_parser.ThriftField and thrift_parser.ThriftEnumItem.
thrift_parser is imported to parse the current version of a thrift file
and the old version of it before the commit. critique-gerrit-review.py
then compares the structs and enums to report these warnings:
- A required field is deleted in an existing struct.
- A new required field is added in an existing struct.
- An existing field is renamed.
- The qualifier (required/optional) of a field is changed.
- The type of a field is changed.
- An enum item is removed.
- Enum items are reordered.
Only thrift files used in both catalogd and impalad are checked. This is
the same as the current behavior. We can further improve this by
analyzing all RPCs used between impalad and catalogd to get all thrift
structs/enums used in them.
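A toy pyparsing grammar in the spirit of thrift_parser.py: just enough
to pull the id, qualifier, type, and name out of a simple struct field
(the real parser is more complete; this is illustrative only):
```
import pyparsing as pp

ident = pp.Word(pp.alphas + "_", pp.alphanums + "_.")
field = (pp.Word(pp.nums)("id") + pp.Suppress(":")
         + pp.Optional(pp.oneOf("required optional"))("qualifier")
         + ident("type") + ident("name")
         + pp.Suppress(pp.Optional(pp.oneOf("; ,"))))

r = field.parseString("1: required string name;")
print(r["id"], r["qualifier"], r["type"], r["name"])
```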
Warning examples for commit e48af8c04:
"common/thrift/StatestoreService.thrift": [
{
"message": "Renaming field 'sequence' to 'catalogd_version' in TUpdateCatalogdRequest might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 345,
"side": "REVISION"
}
]
Warning examples for commit 595212b4e:
"common/thrift/CatalogObjects.thrift": [
{
"message": "Adding a required field 'type' in TIcebergPartitionField might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 612,
"side": "REVISION"
}
]
Warning examples for commit c57921225:
"common/thrift/CatalogObjects.thrift": [
{
"message": "Renaming field 'partition_id' to 'spec_id' in TIcebergPartitionSpec might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 606,
"side": "REVISION"
}
],
"common/thrift/CatalogService.thrift": [
{
"message": "Changing field 'iceberg_data_files_fb' from required to optional in TIcebergOperationParam might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 215,
"side": "REVISION"
},
{
"message": "Adding a required field 'operation' in TIcebergOperationParam might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 209,
"side": "REVISION"
}
],
"common/thrift/Query.thrift": [
{
"message": "Renaming field 'spec_id' to 'iceberg_params' in TFinalizeParams might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 876,
"side": "REVISION"
}
]
Warning example for commit 2b2cf8d96:
"common/thrift/CatalogService.thrift": [
{
"message": "Enum item FUNCTION_NOT_FOUND=3 changed to TABLE_NOT_LOADED=3 in CatalogLookupStatus. This might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 381,
"side": "REVISION"
}
]
Warning example for commit c01efd096:
"common/thrift/JniCatalog.thrift": [
{
"message": "Removing the enum item TAlterTableType.SET_OWNER=15 might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 107,
"side": "PARENT"
}
]
Warning example for commit 374783c55:
"common/thrift/Query.thrift": [
{
"message": "Changing type of field 'enabled_runtime_filter_types' from PlanNodes.TEnabledRuntimeFilterTypes to set<PlanNodes.TRuntimeFilterType> in TQueryOptions might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 449,
"side": "REVISION"
}
Tests
- Add tests in tests/infra/test_thrift_parser.py
- Verified the script with all (1260) commits of common/thrift.
Change-Id: Ia1dc4112404d0e7c5df94ee9f59a4fe2084b360d
Reviewed-on: http://gerrit.cloudera.org:8080/22264
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala has some metrics that reflect the state of the JVM. Some of these
metrics have names that are partly composed of the names of the
MemoryPoolMXBean objects in the Java virtual machine. In JDK 8 these
are names like "Code Cache" and "PS Eden Space". In JDK 11 these names
include apostrophe characters, for example "CodeHeap 'profiled
nmethods'". The derived metric names work OK for Impala in both the
web UI and in JSON output. However, the apostrophe character is illegal
in Prometheus metric names per
https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels
and these metrics cannot be consumed by Prometheus. Fix this by adding
the apostrophe to the list of characters that are mapped to underscores
when we translate the metric names for Prometheus metrics.
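A sketch of that kind of name translation (Python for brevity; Impala's
real implementation is in C++): map every character outside the
Prometheus-legal set to '_':
```
import re

def prometheus_name(name):
    # Legal Prometheus metric names match [a-zA-Z_:][a-zA-Z0-9_:]*
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", name)
    if re.match(r"[0-9]", name):
        name = "_" + name
    return name

print(prometheus_name("jvm.CodeHeap 'profiled nmethods'.max-usage-bytes"))
# jvm_CodeHeap__profiled_nmethods__max_usage_bytes
```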
TESTING:
Extended the test_prometheus_metrics test to parse all generated
Prometheus metrics. Ran the test with JDK 11, where it failed without
the server fix.
Change-Id: I557b123c075dff0b14ac527de08bc6177bd2a3f6
Reviewed-on: http://gerrit.cloudera.org:8080/22295
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
TestTpcdsInsert creates a temporary table to test insert functionality.
It has three problems:
1. It does not use the unique_database parameter, so the temporary table
is not cleaned up after the test finishes.
2. It ignores file_format from the test vector, causing inconsistency in
the temporary table's file format: str_insert is always in PARQUET
format, while store_sales_insert is always in TEXTFILE format.
3. The text file_format dimension is never exercised, because
--workload_exploration_strategy in run-all-tests.sh does not
explicitly list the tpcds-insert workload.
This patch fixes all three problems and a few flake8 warnings in
test_tpcds_queries.py.
Testing:
- Run bin/run-all-tests.sh with
EXPLORATION_STRATEGY=exhaustive
EE_TEST=true
EE_TEST_FILES="query_test/test_tpcds_queries.py::TestTpcdsInsert"
Verified that the temporary table format follows file_format
dimension.
Change-Id: Iea621ec1d6a53eba9558b0daa3a4cc97fbcc67ae
Reviewed-on: http://gerrit.cloudera.org:8080/22291
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
Without IMPALA-12832, the Event Processor (EP) goes into the error
state when there is an issue while obtaining a table write lock,
because the finally-clause of releaseWriteLock() is always invoked even
if the lock is not held by the current thread. This patch addresses the
problem by checking whether the table holds the write lock before
releasing it.
Note: With IMPALA-12832, the EP invalidates the table when an error is
encountered, which is still an overhead. With this patch the EP neither
goes into the error state nor invalidates the table when this issue is
encountered.
Testing:
- Added an end-to-end test to verify the same.
Change-Id: Ib2e4c965796dd515ab8549efa616f72510ca447f
Reviewed-on: http://gerrit.cloudera.org:8080/22080
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Waits for prior test queries to complete before starting query
cancellation tests that expect no in-flight queries.
Removes redundant asserts that num_in_flight_queries reached the
expected value, as try_until will fail the test if it times out.
Change-Id: I683d8b25dc0ec40bc2deb7aa11f79c6bc1a837c3
Reviewed-on: http://gerrit.cloudera.org:8080/22292
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Previously, some table metrics, such as the estimated memory usage
and the number of files, were only updated when a "FULL" Thrift object
of the table is requested. As a result, if a user ran a DESCRIBE
command on a table, and then tried to find the table on the Top-N page
of the web UI, the user would not find it.
This patch fixes the issue by updating the table metrics as soon as
an HDFS table is loaded. With this, no matter what Thrift object type of
the table is requested, the metrics will always be updated and
displayed on the web UI.
Testing:
- Added two custom cluster tests in test_web_pages.py to make sure that
table stats can be viewed on the web UI after DESCRIBE, for both
legacy and local catalog modes.
Change-Id: I6e2eb503b0f61b1e6403058bc5dc78d721e7e940
Reviewed-on: http://gerrit.cloudera.org:8080/22014
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds EE tests in test_parquet_byte_stream_split_encoding.py
that check that Impala returns the correct error message when it
encounters a table that contains a parquet file with Byte Stream Split
encoding.
To regenerate the test files, run the parquet_files_generator.py
script in the testdata/parquet_byte_stream_split_encoding/ folder.
Change-Id: If5eff8bf51fe246a9d0250e38c470b821fec75d9
Reviewed-on: http://gerrit.cloudera.org:8080/22124
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds support for a new MERGE clause that covers the
condition when the source statement's rows do not match the target
table's rows. Example: MERGE INTO target t USING source s ON t.id = s.id
WHEN NOT MATCHED BY SOURCE THEN UPDATE SET t.column = "a";
This change also adds support for using WHEN NOT MATCHED BY TARGET
explicitly, which is equivalent to WHEN NOT MATCHED.
Tests:
- Parser tests for the new language elements.
- Analyzer and planner test for WHEN NOT MATCHED BY SOURCE/TARGET
clauses.
- E2E tests for WHEN NOT MATCHED BY SOURCE clause.
Change-Id: Ia0e0607682a616ef6ad9eccf499dc0c5c9278c5f
Reviewed-on: http://gerrit.cloudera.org:8080/21988
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
TestEventProcessingError.test_event_processor_error_global_invalidate
creates a partitioned table with partition year=2024. The test expects
the output of DESCRIBE FORMATTED to contain the string "2024". It
happened to work in year 2024 since there is a "CreateTime" field in the
output that has "2024" in the timestamp. Now that we are in year 2025,
the test always fails.
This fixes the test to check the output of SHOW PARTITIONS instead.
Tests
- Verified the test locally.
Change-Id: I0b17fd1f90a9bc00d027527661ff675e61ba0b1a
Reviewed-on: http://gerrit.cloudera.org:8080/22287
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
Reviewed-by: Andrew Sherman <asherman@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
TPC-DS v2.11.0, section 2.4.7, renames column
customer.c_last_review_date to customer.c_last_review_date_sk to align
with other surrogate key columns. impala-tpcds-kit has been modified to
reflect this column name change in
086d7113c8
However, the tpcds dataset schema in Impala test data remained
unchanged. This patch does the rename to align more closely with TPC-DS
v2.11.0. It contains no data type adjustments because such adjustments
require larger changes.
customer_multiblock_page_index.parquet added by IMPALA-10310 is
regenerated to follow the new schema of table customer. The SQL used to
create the file now orders more specifically over both
c_current_cdemo_sk and c_customer_sk columns. The associated test
assertion in parquet-page-index.test is also updated.
A workaround in test_file_parser.py added by IMPALA-13543 is now removed
after this change is applied.
Testing:
- Pass core tests.
Change-Id: Ie446b3c534cb8f6f54265cd9b2f705cad91dd4ac
Reviewed-on: http://gerrit.cloudera.org:8080/22223
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Python 3.8 removed the platform.linux_distribution() function which is
currently used to detect Redhat. This switches to using the 'distro'
package, which implements the same functionality across different
Python versions. Since Redhat 6 is no longer supported, this removes
the detection of Redhat 6 and associated skip logic.
Testing:
- Ran a core job
Change-Id: I0dfaf798c0239f6068f29adbd2eafafdbbfd66c3
Reviewed-on: http://gerrit.cloudera.org:8080/22073
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some tests were written to use the builtin 'unittest' package.
In testing on Python 3, these tests failed with an error
like "RuntimeError: generator raised StopIteration". Since Impala
tests are standardized on pytests, this converts those locations
to use our regular pytest base classes.
This required restructuring the test_redaction.py custom cluster
test to use the pytest setup and teardown methods. It also simplifies
the test cases so that each attempted startup gets its own test
rather than doing multiple startup attempts in a single test.
Testing:
- Ran exhaustive job
Change-Id: I89e854f64e424a75827929a4f6841066024390e9
Reviewed-on: http://gerrit.cloudera.org:8080/21475
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
This introduces the IMPALA_USE_PYTHON3_TESTS environment variable
to select whether to run tests using the toolchain Python 3.
This is an experimental option, so it defaults to false,
continuing to run tests with Python 2.
This fixes a first batch of Python 2 vs 3 issues:
- Deciding whether to open a file in bytes mode or text mode
- Adapting to APIs that operate on bytes in Python 3 (e.g. codecs)
- Eliminating 'basestring' and 'unicode' locations in tests/ by using
the recommendations from future
( https://python-future.org/compatible_idioms.html#basestring and
https://python-future.org/compatible_idioms.html#unicode )
- Uses impala-python3 for bin/start-impala-cluster.py
All fixes leave the Python 2 path working normally.
Testing:
- Ran an exhaustive run with Python 2 to verify nothing broke
- Verified that the new environment variable works and that
it uses Python 3 from the toolchain when specified
Change-Id: I177d9b8eae9b99ba536ca5c598b07208c3887f8c
Reviewed-on: http://gerrit.cloudera.org:8080/21474
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Workload management skips recording successful trivial DDLs but does
not skip recording failed trivial DDLs. The test_query_live.py tests
run a describe DDL in the setup_method(). Normally, this describe
succeeds immediately and thus is not recorded in the workload
management tables. However, in very rare instances, the describe DDL
will fail the first time and succeed the second time. These cases
result in an extra query recorded in the sys.impala_query_live table
and test assertions that rely on a certain number of records being in
this table fail because there are extra records.
This change modifies the method used to determine if the
sys.impala_query_live table is available. The test_query_live.py
tests now check the coordinator's catalog cache and wait until the
sys.impala_query_live table appears. DDLs are no longer executed.
All tests in the test_query_live.py file passed locally and in
Jenkins builds.
Change-Id: I767a2bd7b068ab3fdeddb7cb3c4b307844d2d279
Reviewed-on: http://gerrit.cloudera.org:8080/22210
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Most of the workload management tests verify that the workload
management process has successfully completed. Part of this
verification ensures a catalog update has propagated the workload
management changes to the coordinators by determining the catalog
version, from the catalogd logs, that contains the workload
management table changes and ensuring that version is in the
coordinator logs.
The test flakiness occurs when multiple catalogd versions are
combined into a later version. Specifically, tests were failing
because the coordinator logs were checked for catalog version X but
the actual version in the coordinator logs was X+1.
The fix for the test flakiness is to allow for the expected catalog
version or any later version.
Change-Id: I9f20a149ab1f45ee3506f098f8594965a24a89d3
Reviewed-on: http://gerrit.cloudera.org:8080/22200
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The custom cluster tests for Impala shell JWT authentication all
contain magic numbers for the expected count of RPCs for the hs2-http
protocol. Thus, any time the rpcs are modified, these tests have the
potential to fail.
Since the JWT tests are focused on all JWT authentications either
succeeding or failing, the actual number of rpcs is not relevant. The
tests now use existing metrics to determine the expected rpc count.
Additionally, the tests use existing metrics to determine when the
assertions can run instead of relying on a sleep statement.
The modified tests passed locally and in Jenkins.
Change-Id: Icf0eebd74e1ce10ad24055b7fab4b1901ce61e03
Reviewed-on: http://gerrit.cloudera.org:8080/22201
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Document the feature introduced in IMPALA-12345. Add a few more tests to
the QuotaExamples test which demonstrate the examples used in the
docs.
Clarify in docs and code the behavior when a user is a member of more
than one group for which there are rules. In this case the least
restrictive rule applies.
Also document the '--max_hs2_sessions_per_user' flag introduced in
IMPALA-12264.
Change-Id: I82e044adb072a463a1e4f74da71c8d7d48292970
Reviewed-on: http://gerrit.cloudera.org:8080/22100
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Several problems with the workload management code and
test_workload_mgmt_init.py tests have been uncovered by the Ozone
tests.
* test_create_on_version_1_0_0 - Test comment said it ran on 10
nodes, test configuration specified 1 node. Fix was to modify
the test configuration.
* test_create_on_version_1_1_0 - Test comment said it ran on 10
nodes, test configuration specified 1 node. Fix was to modify
the test configuration.
* test_invalid_* - All four of these tests run the same internal
function to execute the test. This internal function was not
waiting long enough for the expected failure to appear. The
fixed internal function waits longer for the expected failure.
Additionally, the @CustomClusterTestSuite annotation has a new option
named 'log_symlinks', which, if set to True, will resolve all daemon
log symlinks and output their actual paths to the log. Failed tests
can then be easily traced to the exact log files for that test.
The existing workload management tests in testdata have been expanded
to also assert the expected table properties are present.
Modified tests passed on Ozone builds both with and without erasure
coding enabled.
Change-Id: Ie3f34088d1d925f30abb63471387e6fdb62b95a7
Reviewed-on: http://gerrit.cloudera.org:8080/22119
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When running an Impala cluster with catalogd HA enabled, the standby
catalogd would go into a loop waiting for the first catalog update to
arrive repeatedly logging the same error and never joining the server
thread defined in catalogd-main.cc.
Before this patch, when the standby daemon became active, the first
catalogd update was finally received, and the workload management
initialization process ran a second time in the newly active daemon
because this daemon saw that it was active.
This patch modifies the catalogd workload management initialization
code so it waits until the active catalogd has been determined. At
that point, the standby daemon skips workload management
initialization while the active daemon runs it after it receives the
first catalog update.
Testing was accomplished by modifying the workload management
initialization custom cluster tests to assert that the init process
is not re-run when a catalogd switches from standby to active and
also to remove the assumption that the first catalogd was active. The
test_catalog_ha test was deleted since all its assertions are handled
by the setup_method of the new TestWorkloadManagementCatalogHA class.
Ozone tests with and without erasure coding were also run and passed.
Change-Id: Id3797a0a9cf0b8ae844d9b7d46b607d93824f69a
Reviewed-on: http://gerrit.cloudera.org:8080/22118
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Commit 8e71f5ec86 has changed the Python
environment for the gerrit-auto-critic script from Python2 to Python3.
Unfortunately the change missed a few Python3-related updates, so the
script started failing in the pre-commit environment.
This patch adds the following Python3-related updates:
- changes the virtualenv implementation from virtualenv to the venv
module offered by default in Python3.
- adds pip3 and system_site_packages=True to the venv creation (see
the sketch after this list)
- bumps the flake8 module to a newer version, as it doesn't have to be
compatible with Python2 any longer.
- extends Popen calls with universal_newlines=True wherever these were
missing.
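A sketch of that venv creation using Python 3's built-in venv module
(the path is illustrative):
```
import venv

# system_site_packages=True keeps system-installed modules visible
# inside the environment; with_pip=True bootstraps pip into it.
builder = venv.EnvBuilder(system_site_packages=True, with_pip=True)
builder.create("/tmp/critique_venv")
```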
The patch also fixes a regex search string in test_kudu.py (changes the
regex pattern string to a raw Python string). This is somewhat
unrelated to the Python script change, but it was discovered during
testing because it made flake8 emit a badly formatted warning message.
The python3-venv and python3-wheel packages were installed manually on
jenkins.impala.io during testing. These were necessary to eliminate
errors during the script's initial virtualenv-setup steps.
Tests:
- ran the new script locally
- ran the new script through the precommit process using a test copy of
the gerrit-auto-critic job, test-gerrit-auto-critic.
Change-Id: I5efa035fae38bd42cc3b07f479da2b3983f68252
Reviewed-on: http://gerrit.cloudera.org:8080/22191
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
TestEventProcessingCustomConfigs#test_invalidate_stale_partition_on_reload()
is flaky because it does not find the log lines it is looking for. This
happens because the ParallelFileMetadataLoader running in the background
thread misses the current catalog delta request, hence the log lines
are not visible in the Impala logs. The next catalog delta request
writes the required log lines to the Impala logs, but the test fails by
that time. Adding a 5s sleep lets the file metadata reload finish, and
the timeout for verifying the Impala log is increased from 6s to 15s so
that the next catalog delta request can capture the required content.
Testing:
- Looped the test several hundred times locally to verify that there is
no flakiness.
Change-Id: I62170aa6ed8ae122482a03212fec9c4fe843ce03
Reviewed-on: http://gerrit.cloudera.org:8080/22084
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When Impala fails to flush lineage events, audit events or profiles,
only the log file name is logged: "Could not open log file: filename".
Now we will also log the reason, e.g.
"Could not open log file: filename, cause: Permission denied".
Testing:
- added custom cluster tests in test_logging.py
Change-Id: I5b281d807e47aad98fc256af4e0c2a9dd417c7ac
Reviewed-on: http://gerrit.cloudera.org:8080/22070
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When Trino writes Puffin stats for a column, it includes the NDV as a
property (with key "ndv") in the "statistics" section of the
metadata.json file, in addition to the Theta sketch in the Puffin file.
When we are only reading the stats and not writing/updating them, it is
enough to read this property if it is present.
After this change, Impala only opens and reads a Puffin stats file if it
contains stats for at least one column for which the "ndv" property is
not set in the metadata.json file.
Testing:
- added a test in test_iceberg_with_puffin.py that verifies that the
  Puffin stats file is not read if the metadata.json file contains
  the NDV property. It uses the newly added stats file with corrupt
  datasketches: 'metadata_ndv_ok_sketches_corrupt.stats'.
Change-Id: I5e92056ce97c4849742db6309562af3b575f647b
Reviewed-on: http://gerrit.cloudera.org:8080/21959
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The tpcds_partitioned dataset is a fully-partitioned version of the
tpcds dataset (the latter only partitions the store_sales table). It
does not have the default text format database like the tpcds dataset.
Instead, it relies on the pre-existing text format tpcds database,
which is then INSERT OVERWRITE'd into the equivalent tpcds_partitioned
database. It does not have its own query set, but is instead symlinked
to share testdata/workloads/tpcds/queries. It also has a slightly
different schema from the tpcds dataset, namely column
"c_last_review_date" in the tpcds dataset is "c_last_review_date_sk" in
tpcds_partitioned (TPC-DS v2.11.0, section 2.4.7). These reasons made
tpcds_partitioned ineligible for perf-AB-test (single_node_perf_run.py).
This patch updates single_node_perf_run.py and related scripts to make
tpcds_partitioned eligible as a benchmark dataset. It adds an initial
step to load the text database from the tpcds dataset at the selected
scale before running the load script for the tpcds_partitioned dataset.
The compute stats step is also limited to run one query at a time so as
not to over-admit the cluster with concurrent COMPUTE STATS queries.
Created helper function build_replacement_params() inside
generate-schema-statements.py for common functionality.
Testing
- Run perf-AB-test-ub2004 with this commit included and confirm
benchmark works with tpcds_partitioned dataset.
- Run normal data loading. Pass FE tests, and
query_test/test_tpcds_queries.py.
Change-Id: I4b6f435705dcf873696ffd151052ebeab35d9898
Reviewed-on: http://gerrit.cloudera.org:8080/22061
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Allow using CustomClusterTestSuite with a single cluster for the whole
class. This speeds up tests by letting us group together multiple test
cases on the same cluster configuration and only starting the cluster
once.
Updates tuple cache tests as an example of how this can be used. Reduces
test_tuple_cache execution time from 100s to 60s.
Change-Id: I7a08694edcf8cc340d89a0fb33beb8229163b356
Reviewed-on: http://gerrit.cloudera.org:8080/22006
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Allow administrators to configure per user limits on queries that can
run in the Impala system.
In order to do this, there are two parts. Firstly we must track the
total counts of queries in the system on a per-user basis. Secondly
there must be a user model that allows rules that control per-user
limits on the number of queries that can be run.
In a Kerberos environment the user names that are used for both the user
model and at runtime are short user names, e.g. testuser when the
Kerberos principal is testuser/scm@EXAMPLE.COM
TPoolStats (the data that is shared between Admission Control instances)
is extended to include a map from user name to a count of queries
running. This (along with some derived data structures) is updated when
queries are queued and when they are released from Admission Control.
This lifecycle is slightly different from other TPoolStats data which
usually tracks data about queries that are running. Queries can be
rejected because of user quotas at submission time. This is done for
two reasons: (1) queries can only be admitted from the front of the
queue and we do not want to block other queries due to quotas, and
(2) it is easy for users to understand what is going on when queries
are rejected at submission time.
Note that when running in configurations without an Admission Daemon
then Admission Control does not have perfect information about the
system and over-admission is possible for User-Level Admission Quotas
in the same way that it is for other Admission Control controls.
The User Model is implemented by extending the format of the
fair-scheduler.xml file. The rules controlling the per-user limits are
specified in terms of user or group names.
Two new elements ‘userQueryLimit’ and ‘groupQueryLimit’ can be added to
the fair-scheduler.xml file. These elements can be placed on the root
configuration, which applies to all pools, or the pool configuration.
The ‘userQueryLimit’ element has 2 child elements: "user"
and "totalCount". The 'user' element contains the short names of users,
and can be repeated, or have the value "*" for a wildcard name which
matches all users. The ‘groupQueryLimit’ element has 2 child
elements: "group" and "totalCount". The 'group' element contains group
names.
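A hypothetical fair-scheduler.xml fragment using these elements (the
element names follow the text above; nesting details beyond what is
stated are assumptions):
```
<allocations>
  <!-- Root-level rule: any single user may run at most 3 queries. -->
  <userQueryLimit>
    <user>*</user>
    <totalCount>3</totalCount>
  </userQueryLimit>
  <queue name="poolA">
    <!-- Pool-level rule: members of group 'dev' may run at most 5. -->
    <groupQueryLimit>
      <group>dev</group>
      <totalCount>5</totalCount>
    </groupQueryLimit>
  </queue>
</allocations>
```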
The root level rules and pool level rules must both pass for a new
query to be queued. The rules dictate a maximum number of queries that
a user can run. When evaluating rules at either the root level or the
pool level, once a rule matches a user, no further evaluation is done.
To support reading the ‘userQueryLimit’ and ‘groupQueryLimit’ fields the
RequestPoolService is enhanced.
If user quotas are enabled for a pool then a list of the users with
running or queued queries in that pool is visible on the coordinator
webui admission control page.
More comprehensive documentation of the user model will be provided in
IMPALA-12943
TESTING
New end-to-end tests are added to test_admission_controller.py, and
admission-controller-test is extended to provide unit tests for the
user model.
Change-Id: I4c33f3f2427db57fb9b6c593a4b22d5029549b41
Reviewed-on: http://gerrit.cloudera.org:8080/21616
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch unifies duplicated exec summary code used by Python beeswax
clients: one copy used by the shell in impala_shell.py and one used by
tests in impala_beeswax.py. The most advanced copy is the one in
shell/impala_client.py, which is the one that can print a correct exec
summary table for MT_DOP>0 queries. It is made into a dedicated
build_exec_summary_table function in impala_client.py, and
impala_beeswax.py then imports it from impala_client.py.
This patch also fixes several flake8 issues around the modified files.
Testing:
- Manually run TPC-DS Q74 in impala-shell and then type "summary"
command. Confirm that plan tree is displayed properly.
- Run single_node_perf_run.py over branches that produce different
  TPC-DS Q74 plan trees. Confirm that the plan trees are displayed
  correctly in performance_result.txt
Change-Id: Ica57c90dd571d9ac74d76d9830da26c7fe20c74f
Reviewed-on: http://gerrit.cloudera.org:8080/22060
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
The tests
TestIcebergWithPuffinStatsStartupFlag::test_[dis|en]able_reading_puffin
queried an Iceberg table that is created during normal dataload from
existing non-filesystem-specific metadata and data files. Therefore the
path of the Puffin stats file that is present in the metadata.json file
does not contain any filesystem-specific prefix, so Puffin reading does
not work on Ozone.
Note that reading Puffin stats for tables that are created normally
does work on Ozone.
This change modifies the tests to create the table on the fly, modifying
the file path to include the filesystem-specific prefix.
Change-Id: I7afec1c70d7b43bae98289d65749b01ca720e7f7
Reviewed-on: http://gerrit.cloudera.org:8080/22008
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch makes Impala create only one Ranger policy for the GRANT
statement when there are multiple columns specified to reduce the number
of policies created on the Ranger server.
Note that this patch relies on RANGER-4585 and RANGER-4638.
Testing:
- Manually verified that Impala's catalog daemon only sends one
GrantRevokeRequest to the Ranger plug-in and that the value of the
key 'column' is a comma-separated list of column names involved in
the GRANT statement.
- Added an end-to-end test to verify only one Ranger policy will be
created in a multi-column GRANT statement.
Change-Id: I2b0ebba256c7135b4b0d2160856202292d720c6d
Reviewed-on: http://gerrit.cloudera.org:8080/21940
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>