Closing the transports could hang in TAcceptQueueServer if there was
an error during the SSL handshake. As the TSSLSocket is wrapped in
TBufferedTransport and TBufferedTransport::close() calls flush(),
TSSLSocket::flush() was also called, which led to retrying the
handshake in an unclean state. This led to hanging indefinitely with
OpenSSL 3.2. Another potential error is that if flush() throws an
exception then the underlying TTransport's close() won't be called.
Ideally this would be solved in Thrift (THRIFT-5846). As a quick
fix this change adds a subclass of TBufferedTransport that doesn't
call flush(). This is safe to do as generated TProcessor
subclasses call flush() every time the client/server sends
a message.
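As an illustration, a minimal sketch of such a subclass (the class name is
hypothetical and the member access assumes Thrift's protected transport_;
the actual Impala code may differ):
```
#include <thrift/transport/TBufferTransports.h>

using apache::thrift::transport::TBufferedTransport;

// Hypothetical sketch: a buffered transport whose close() skips flush(),
// so a failed SSL handshake cannot be retried during teardown.
class TNoFlushBufferedTransport : public TBufferedTransport {
 public:
  using TBufferedTransport::TBufferedTransport;

  void close() override {
    // Skip flush(); close the wrapped transport directly. Generated
    // TProcessor code flushes after every message, so no data is lost.
    transport_->close();
  }
};
```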
Testing:
- the issue was caught by thrift-server-test/KerberosOnAndOff
and TestClientSsl::test_ssl hanging until killed
Change-Id: I4879a1567f7691711d73287269bf87f2946e75d2
Reviewed-on: http://gerrit.cloudera.org:8080/22368
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Hive uses URL encoding to format the partition strings when creating the
partition folders, e.g. "00:00:00" will be encoded into "00%3A00%3A00".
When you create a partition with a string-type partition column "p"
using "00:00:00" as the partition value, the underlying partition folder
is "p=00%3A00%3A00".
When parsing the partition folders, Impala URL-decodes the partition
folder names to get the correct partition values. This is correct in
the ALTER TABLE RECOVER PARTITIONS command, which gets the partition
strings from the file paths. However, for partition strings that come
from HMS events, Impala shouldn't URL-decode them since they are not
URL-encoded and are the original partition values. This caused HMS
events on partitions that have percent signs in the value strings to be
matched to the wrong partitions.
This patch fixes the issue by only URL-decoding the partition strings
that come from file paths.
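For illustration, decoding a partition folder name amounts to expanding
%XX escapes; a minimal sketch (not Impala's actual decoder):
```
#include <string>

// Minimal URL-decoder for partition folder names, assuming %XX escapes.
std::string UrlDecode(const std::string& in) {
  std::string out;
  for (size_t i = 0; i < in.size(); ++i) {
    if (in[i] == '%' && i + 2 < in.size()) {
      // Interpret the two characters after '%' as a hex byte.
      out += static_cast<char>(std::stoi(in.substr(i + 1, 2), nullptr, 16));
      i += 2;
    } else {
      out += in[i];
    }
  }
  return out;
}
// UrlDecode("00%3A00%3A00") yields "00:00:00".
```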
Tests:
- Ran tests/metadata/test_recover_partitions.py
- Added custom-cluster test.
Change-Id: I7ba7fbbed47d39b02fa0b1b86d27dcda5468e344
Reviewed-on: http://gerrit.cloudera.org:8080/22388
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
AES (Advanced Encryption Standard) is a widely recognized and
respected encryption standard used to protect sensitive data. It
operates by transforming plaintext data into ciphertext using a
symmetric key, ensuring confidentiality and integrity. The standard
specifies the Rijndael algorithm, a symmetric block cipher that can
process data blocks of 128 bits, using cipher keys with lengths of
128 and 256 bits. The patch makes use of the EVP_*() APIs from the
OpenSSL library.
The patch includes:
1. AES-GCM, AES-CTR, and AES-CFB encryption functionalities and
AES-GCM, AES-ECB, AES-CTR, and AES-CFB decryption functionalities.
2. Support for both 128-bit and 256-bit key sizes for GCM and ECB modes.
3. Enhancements to EncryptionKey class to accommodate various AES modes.
The aes_encrypt() and aes_decrypt() functions serve as entry
points for encryption and decryption operations, handling
encryption and decryption based on user-provided keys, AES modes,
and initialization vectors (IVs). The implementation includes key
length validation and IV vector size checks to ensure data
integrity and confidentiality.
Multiple AES modes: GCM, CFB, CTR for encryption, and GCM, CFB, CTR
and ECB for decryption are supported to provide flexibility and
compatibility with various use cases and OpenSSL features. AES-GCM
is set as the default mode due to its strong security properties.
AES-CTR and AES-CFB are provided as fallbacks for environments where
AES-GCM may not be supported. Note that AES-GCM is not available in
OpenSSL versions prior to 1.0.1, so having multiple methods ensures
broader compatibility.
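As a rough illustration of the EVP-based flow, a simplified AES-256-GCM
encryption sketch (error handling omitted; not the exact Impala code):
```
#include <openssl/evp.h>

// Simplified sketch of AES-256-GCM encryption with the EVP API.
// Every call should be checked for errors in real code.
int GcmEncrypt(const unsigned char* key, const unsigned char* iv, int iv_len,
               const unsigned char* plaintext, int len,
               unsigned char* ciphertext, unsigned char* tag) {
  EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
  EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), nullptr, nullptr, nullptr);
  EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_SET_IVLEN, iv_len, nullptr);
  EVP_EncryptInit_ex(ctx, nullptr, nullptr, key, iv);
  int out_len = 0, final_len = 0;
  EVP_EncryptUpdate(ctx, ciphertext, &out_len, plaintext, len);
  EVP_EncryptFinal_ex(ctx, ciphertext + out_len, &final_len);
  // GCM produces a 16-byte authentication tag used to verify integrity.
  EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);
  EVP_CIPHER_CTX_free(ctx);
  return out_len + final_len;
}
```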
Testing: The patch is thoroughly tested and the tests are included in
exprs.test.
Change-Id: I3902f2b1d95da4d06995cbd687e79c48e16190c9
Reviewed-on: http://gerrit.cloudera.org:8080/20447
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Two permission issues caused this dataload step to fail:
- Lack of X permission on the home directory (seems Linux-specific).
- The LOAD statement has no right to use \tmp for some reason; using
\LOAD instead solves this. I don't know what postgres/configuration
change caused this.
Testing:
- dataload and ext-data-source related tests passed on Rocky Linux 9.5
Change-Id: I3829116f4c6d6f6cba2da824cd9f31259a15ca1b
Reviewed-on: http://gerrit.cloudera.org:8080/22383
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
The real type was being treated as a float. An E2E test can be found in
exprs.test where there is a cast to real. Specifically, this test...
select count(*) from alltypesagg
where double_col >= 20.2
and cast(double_col as double) = cast(double_col as real)
... was casting double_col as a double and returning the wrong result
prior to this commit.
Change-Id: I5f3cc0e50a4dfc0e28f39d81b591c1b458fd59ce
Reviewed-on: http://gerrit.cloudera.org:8080/22087
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
On Rocky Linux 9.5 a few checks started to fail because error 503
was expected to cause a non-zero return value but it was not treated
as an error. The difference is likely caused by the newer curl version;
curl's documentation seems unclear about the return value of
auth-related status codes.
The fix is to check the specific HTTP status code instead of curl's
return value.
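A minimal sketch of checking the HTTP status directly with libcurl (the
helper name is hypothetical):
```
#include <curl/curl.h>

// Sketch: treat any non-2xx HTTP status as a failure, regardless of
// whether curl's own return code flags auth-related statuses as errors.
bool FetchOk(CURL* curl) {
  if (curl_easy_perform(curl) != CURLE_OK) return false;
  long http_code = 0;
  curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &http_code);
  return http_code >= 200 && http_code < 300;
}
```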
Testing:
- webserver tests passed on Rocky Linux 9.5
Change-Id: I354aa87a1b6283aa617f0298861bd5e79d03efc7
Reviewed-on: http://gerrit.cloudera.org:8080/22379
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
ImpalaTestSuite.client is always initialized as a beeswax client, and
many tests use it directly rather than going through helper methods such
as execute_query().
This patch adds a default_test_protocol parameter to conftest.py. It
controls whether ImpalaTestSuite.client is initialized as
'beeswax_client', 'hs2_client', or 'hs2_http_client'. This parameter
still defaults to 'beeswax'.
This patch also adds the helper methods 'default_client_protocol_dimension',
'beeswax_client_protocol_dimension' and 'hs2_client_protocol_dimension'
for convenience and traceability.
It also reduces the occurrences where test methods manually override
ImpalaTestSuite.client. These are replaced by combinations of
ImpalaTestSuite.create_impala_clients and
ImpalaTestSuite.close_impala_clients.
Testing:
- Pass core tests.
Change-Id: I9165ea220b2c83ca36d6e68ef3b88b128310af23
Reviewed-on: http://gerrit.cloudera.org:8080/22336
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch extends the existing AdminFnStmt to support operations on
EventProcessor. E.g. to pause the EventProcessor:
impala-shell> :event_processor('pause');
to resume the EventProcessor:
impala-shell> :event_processor('start');
Or to resume the EventProcessor on a given event id (1000):
impala-shell> :event_processor('start', 1000);
Admins can also resume the EventProcessor at the latest event id by
using -1:
impala-shell> :event_processor('start', -1);
Supported command actions in this patch: pause, start, status.
The command output of all actions will show the latest status of
EventProcessor, including
- EventProcessor status:
PAUSED / ACTIVE / ERROR / NEEDS_INVALIDATE / STOPPED / DISABLED.
- LastSyncedEventId: The last HMS event id which we have synced to.
- LatestEventId: The event id of the latest event in HMS.
Example output:
[localhost:21050] default> :event_processor('pause');
+--------------------------------------------------------------------------------+
| summary |
+--------------------------------------------------------------------------------+
| EventProcessor status: PAUSED. LastSyncedEventId: 34489. LatestEventId: 34489. |
+--------------------------------------------------------------------------------+
Fetched 1 row(s) in 0.01s
If authorization is enabled, only admin users that have ALL privilege on
SERVER can run this command.
Note that there is a restriction in MetastoreEventsProcessor#start(long)
that resuming EventProcessor back to a previous event id is only allowed
when it's not in the ACTIVE state. This patch aims to expose the control
of EventProcessor to the users so MetastoreEventsProcessor is not
changed. We can investigate the restriction and see if we want to relax
it.
Note that resuming the EventProcessor at a newer event id can be done in
any state. Admins can use this to manually resolve the lag of HMS event
processing, after they have made sure all (or important) tables are
manually invalidated/refreshed.
A new catalogd RPC, SetEventProcessorStatus, is added for coordinators
to control the status of EventProcessor.
Tests
- Added e2e tests
Change-Id: I5a19f67264cfe06a1819a22c0c4f0cf174c9b958
Reviewed-on: http://gerrit.cloudera.org:8080/22250
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this change, Puffin stats were only read from the current
snapshot. Now we also consider older snapshots, and for each column we
choose the most recent available stats. Note that this means that the
stats for different columns may come from different snapshots.
In case there are both HMS and Puffin stats for a column, the more
recent one will be used - for HMS stats we use the
'impala.lastComputeStatsTime' table property, and for Puffin stats we
use the snapshot timestamp to determine which is more recent.
This commit also renames the startup flag 'disable_reading_puffin_stats'
to 'enable_reading_puffin_stats' and the table property
'impala.iceberg_disable_reading_puffin_stats' to
'impala.iceberg_read_puffin_stats' to make them more intuitive. The
default values are flipped to keep the same behaviour as before.
The documentation of Puffin reading is updated in
docs/topics/impala_iceberg.xml
Testing:
- updated existing test cases and added new ones in
test_iceberg_with_puffin.py
- reorganised the tests in TestIcebergTableWithPuffinStats in
test_iceberg_with_puffin.py: tests that modify table properties and
other state that other tests rely on are now run separately to
provide a clean environment for all tests.
Change-Id: Ia37abe8c9eab6d91946c8f6d3df5fb0889704a39
Reviewed-on: http://gerrit.cloudera.org:8080/22177
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Enables tuple caching on aggregates directly above scan nodes. Caching
aggregates requires that their children are also eligible for caching,
so this excludes aggregates above an exchange, union, or hash join.
Testing:
- Adds Planner tests for different aggregate cases to confirm they have
stable tuple cache keys and are valid for caching.
- Adds custom cluster tests that cached aggregates are used, and can be
re-used in slightly different statements.
Change-Id: I9bd13c2813c90d23eb3a70f98068fdcdab97a885
Reviewed-on: http://gerrit.cloudera.org:8080/22322
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
To support killing queries programmatically, this patch adds a new
type of SQL statement, called the KILL QUERY statement, to cancel and
unregister a query on any coordinator in the cluster.
A KILL QUERY statement looks like
```
KILL QUERY '123:456';
```
where `123:456` is the query id of the query we want to kill. We follow
syntax from HIVE-17483. For backward compatibility, 'KILL' and 'QUERY'
are added as "unreserved keywords", like 'DEFAULT'. This allows the
three keywords to be used as identifiers.
A user is authorized to kill a query only if the user is an admin or is
the owner of the query. KILL QUERY statements are not affected by
admission control.
Implementation:
Since we don't know in advance which impalad is the coordinator of the
query we want to kill, we need to broadcast the kill request to all the
coordinators in the cluster. Upon receiving a kill request, each
coordinator checks whether it is the coordinator of the query:
- If yes, it cancels and unregisters the query,
- If no, it reports "Invalid or unknown query handle".
Currently, a KILL QUERY statement is not interruptible. IMPALA-13663 is
created for this.
For authorization, this patch adds a custom handler of
AuthorizationException for each statement to allow the exception to be
handled by the backend. This is because we don't know whether the user
is the owner of the query until we reach its coordinator.
To support cancelling child queries, this patch changes
ChildQuery::Cancel() to bypass the HS2 layer so that the session of the
child query will not be added to the connection used to execute the
KILL QUERY statement.
Testing:
- A new ParserTest case is added to test using "unreserved keywords" as
identifiers.
- New E2E test cases are added for the KILL QUERY statement.
- Added a new dimension in TestCancellation to use the KILL QUERY
statement.
- Added file tests/common/cluster_config.py and made
CustomClusterTestSuite.with_args() composable so that common cluster
configs can be reused in custom cluster tests.
Change-Id: If12d6e47b256b034ec444f17c7890aa3b40481c0
Reviewed-on: http://gerrit.cloudera.org:8080/21930
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Impala with an external frontend hit a DCHECK in
RuntimeProfile::EventSequence::Start(int64_t start_time_ns) because the
frontend reported a remote_submit_time that is more than 3ns ahead of
the Coordinator's time. This can happen if there is clock skew between
the Frontend node and the Impala Coordinator node.
This patch fixes the issue by taking the minimum of the given
remote_submit_time and the Coordinator's MonotonicStopWatch::Now().
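A minimal sketch of the clamping (names are illustrative):
```
#include <algorithm>
#include <cstdint>

// Sketch: never let a frontend-reported submit time run ahead of the
// coordinator's own monotonic clock, which would trip the DCHECK.
int64_t ClampSubmitTime(int64_t remote_submit_time_ns,
                        int64_t coordinator_now_ns) {
  return std::min(remote_submit_time_ns, coordinator_now_ns);
}
```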
Change-Id: If6e04219c515fddff07bfbee43bb93babb3d307b
Reviewed-on: http://gerrit.cloudera.org:8080/22360
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Amend log verification in Impala to exclude partition id info.
The partition_id assigned by catalogd may not be in serial order, so it
is best to avoid checking partition ids in the logs.
Change-Id: I27cdeb2a4bed8afa30a27d05c7399c78af5bcebb
Reviewed-on: http://gerrit.cloudera.org:8080/22198
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-13154 added the method getFileMetadataStats() to
HdfsPartition.java that would return the file metadata statistics. The
method requires the corresponding HdfsPartition instance to have a
non-null field of 'fileMetadataStats_'.
This patch revises two existing constructors of HdfsPartition to provide
a non-null value for 'fileMetadataStats_'. This makes it easier for a
third-party extension to set up and update the field
'fileMetadataStats_'. A third-party extension has to update
'fileMetadataStats_' if it would like to use this field to get the size
of the partition, since all three fields in 'fileMetadataStats_' default
to 0.
A new constructor was also added for HdfsPartition that allows a
third-party extension to provide its own FileMetadataStats when
instantiating an HdfsPartition. To facilitate instantiating a
FileMetadataStats, a new constructor was added for FileMetadataStats
that takes a List of FileDescriptors.
Change-Id: I7e690729fcaebb1e380cc61f2b746783c86dcbf7
Reviewed-on: http://gerrit.cloudera.org:8080/22340
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch makes the inclusion of the Iceberg position fields
conditional: they are included only if UPDATE or DELETE merge clauses
are listed in the MERGE statement or the target table has existing
delete files. These fields can be omitted when there is no delete file
creation at the sink of the MERGE statement and the table has no
existing delete files.
Additionally, this change disables MERGE for Iceberg target tables
that contain equality delete files, see IMPALA-13674.
Tests:
- iceberg-merge-insert-only planner test added
Change-Id: Ib62c78dab557625fa86988559b3732591755106f
Reviewed-on: http://gerrit.cloudera.org:8080/21931
Reviewed-by: Noemi Pap-Takacs <npaptakacs@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
AggregationNode.computeStats() estimates cardinality under a
single-node assumption. This can be an underestimation in the
preaggregation node case because the same grouping key may exist in
multiple nodes during preaggregation.
This patch adjusts the cardinality estimate using the following model
for the number of distinct values in a random sample of k rows,
previously used to calculate the ProcessingCost model by IMPALA-12657
and IMPALA-13644.
Assume we are picking k rows from an infinite population with NDV
distinct values, with the values uniformly distributed. The probability
of a given value not appearing in the sample, in that case, is
((NDV - 1) / NDV) ^ k
This is because we are making k choices, and each of them has a
(NDV - 1) / NDV chance of not being our value. Therefore the
probability of a given value appearing in the sample is:
1 - ((NDV - 1) / NDV) ^ k
And the number of distinct values in the sample is:
(1 - ((NDV - 1) / NDV) ^ k) * NDV
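For a concrete feel of the model, a small sketch of the formula with a
worked number (illustrative code, not the planner's implementation):
```
#include <cmath>

// Expected distinct values in a sample of k rows drawn from a population
// with 'ndv' uniformly distributed distinct values.
double SampleNdv(double ndv, double k) {
  return (1.0 - std::pow((ndv - 1.0) / ndv, k)) * ndv;
}
// For example, SampleNdv(1000, 1000) is roughly 632: with as many samples
// as distinct values, about 1 - 1/e of the values are expected to appear.
```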
Query option ESTIMATE_DUPLICATE_IN_PREAGG is added to control whether to
use the new estimation logic or not.
Testing:
- Pass core tests.
Change-Id: I04c563e59421928875b340cb91654b9d4bc80b55
Reviewed-on: http://gerrit.cloudera.org:8080/22047
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
getPerInstanceNdvForCpuCosting is a method to estimate the number of
distinct values of exprs per fragment instance while accounting for the
likelihood of duplicate keys across fragment instances. It borrows the
probabilistic model described in IMPALA-2945. This method is used
exclusively by AggregationNode.
getPerInstanceNdvForCpuCosting runs the probabilistic formula
individually for each grouping expression and then multiplies the
results together. That matches how we estimated group NDV in the past,
where we simply multiplied the NDVs of the grouping expressions.
Recently, we added tuple-based analysis to lower cardinality estimates
for all kinds of aggregation nodes (IMPALA-13045, IMPALA-13465,
IMPALA-13086). All of the bounding happens in
AggregationNode.computeStats(), where we call the estimateNumGroups()
function that returns the globalNdv estimate for a specific aggregation
class.
To take advantage of that more precise globalNdv, this patch replaces
getPerInstanceNdvForCpuCosting() with estimatePreaggCardinality(), which
applies the probabilistic formula over this single globalNdv number
rather than the old way, which often returned an overestimate from the
NDV multiplication method. Its use is still limited to calculating
ProcessingCost. Using it for the preagg output cardinality will be done
by IMPALA-2945.
estimatePreaggCardinality is skipped if the data partition of the input
is a subset of the grouping expressions.
Testing:
- Run and pass PlannerTest that set COMPUTE_PROCESSING_COST=True.
ProcessingCost changes, but all cardinality number stays.
- Add CardinalityTest#testEstimatePreaggCardinality.
- Update test_executor_groups.py. Enable v2 profile as well for easier
runtime profile debugging.
Change-Id: Iddf75833981558fe0188ea7475b8d996d66983c1
Reviewed-on: http://gerrit.cloudera.org:8080/22320
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
TestAdmissionControllerStress::test_mem_limit is flaky again. One
fragment instance that was expected to stay alive until the query
submission loop ends actually finished early, even though clients are
only fetching 1 row every 0.5 seconds. This patch attempts to address
the flakiness in two ways.
First is lowering batch_size to 10. A lower batch size is expected to
keep all running fragment instances running until the query admission
loop finishes.
Second is lowering num_queries from 50 to 40 if exploration_strategy is
exhaustive. This shortens the query submission loop, especially when
submission_delay_ms is high (150 seconds). This is OK because, based on
the assertions, the test framework will only retain at most 15 active
queries and 10 in-queue queries once the query submission loop ends.
This patch also refactors SubmitQueryThread. It sets
long_polling_time_ms=100 for all queries to get a faster initial
response. The lock is removed and replaced with threading.Event to
signal the end of the test. The thread client and query_handle scopes
are made local within the run() method for proper cleanup. A timeout is
set for wait_for_admission_control instead of waiting indefinitely.
impala_connection.py is refactored so that BeeswaxConnection has logging
functionality matching ImpylaHS2Connection. Changed
ImpylaHS2Connection._collect_profile_and_log initialization for the
possibility that the experimental Calcite planner may be able to pull
the query profile and log from the Impala backend.
Testing:
- Run and pass test_mem_limit in both TestAdmissionControllerStress and
TestAdmissionControllerStressWithACService in exhaustive exploration
10 times.
- Run and pass the whole TestAdmissionControllerStress and
TestAdmissionControllerStressWithACService in exhaustive exploration.
Change-Id: I706e3dedce69e38103a524c64306f39eac82fac3
Reviewed-on: http://gerrit.cloudera.org:8080/22351
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
TestDecimalFuzz.test_decimal_ops and TestDecimalFuzz.test_width_bucket
each execute an execute_scalar query 10000 times. This patch speeds them
up by breaking each into 10 parallel test runs where each run executes
an execute_scalar query 1000 times.
This patch also makes execute_scalar and execute_scalar_expect_success
run queries with long_polling_time_ms=100 if no query_options are
specified, and adds an assertion in execute_scalar_expect_success that
the result is indeed only a single row.
Slightly changes exists_func to avoid an unused-argument warning.
Testing:
- From tests/ run and pass the following command
./run-tests.py query_test/test_decimal_fuzz.py
Change-Id: Ic12b51b50739deff7792e2640764bd75e8b8922d
Reviewed-on: http://gerrit.cloudera.org:8080/22328
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently only OpenAI sites are allowed for ai_generate_text();
this patch adds support for general AI platforms to
the ai_generate_text function. It introduces a new flag,
ai_additional_platforms, allowing Impala to access additional
AI platforms. For these general AI platforms, only the OpenAI
standard is supported, and the default API credential serves as
the API token for general platforms.
The ai_api_key_jceks_secret parameter has been renamed to
auth_credential to support passing both plain text and jceks
encrypted secrets.
A new impala_options parameter is added to ai_generate_text() to
enable future extensions. Adds the api_standard option to
impala_options, with "openai" as the only supported standard.
Adds the credential_type option to impala_options to allow
plain text as the token; by default it is set to jceks.
Adds the payload option to impala_options for customized
payload input. If set, the request will use the provided
customized payload directly, and the response will follow the
openai standard for parsing. The customized payload size must not
exceed 5MB.
Adding the impala_options parameter to ai_generate_text() should
be fine for backward compatibility, as this is a relatively new
feature.
Example:
1. Add the site to ai_additional_platforms, like:
ai_additional_platforms='new_ai.site,new_ai.com'
2. Example SQL:
select ai_generate_text("https://new_ai.com/v1/chat/completions",
"hello", "model-name", "ai-api-token", "platform params",
'{"api_standard":"openai", "credential_type":"plain",
"payload":"payload content"}')
Tests:
Added a new test, AiFunctionsTestAdditionalSites.
Manually tested the example with the Cloudera AI platform.
Passed core and asan tests.
Change-Id: I4ea2e1946089f262dda7ace73d5f7e37a5c98b14
Reviewed-on: http://gerrit.cloudera.org:8080/22130
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Client connections can drop without an explicit close. This can
happen if the client machine resets or there is a network disruption.
Some load balancers have an idle timeout that results in the
connection becoming invalid without an explicit teardown. With
short idle timeouts (e.g. AWS LB has a timeout of 350 seconds),
this can impact many connections.
This adds startup options to enable / tune TCP keepalive settings for
client connections:
client_keepalive_probe_period_s - idle time before doing keepalive probes
If set to > 0, keepalive is enabled.
client_keepalive_retry_period_s - time between keepalive probes
client_keepalive_retry_count - number of keepalive probes
These startup options mirror the startup options for Kudu's
equivalent functionality.
Thrift has preexisting support for turning on keepalive, but that
support uses the OS defaults for keepalive settings. To add the
ability to tune the keepalive settings, this implements a wrapper
around the Thrift socket (both TLS and non-TLS) and manually sets
the keepalive options on the socket (mirroring code from Kudu's
Socket::SetTcpKeepAlive).
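For reference, a minimal sketch of what such per-socket tuning looks like
on Linux (not the exact Impala wrapper; TCP_KEEPIDLE, TCP_KEEPINTVL and
TCP_KEEPCNT are Linux-specific):
```
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Enable keepalive and tune probe timing on a connected socket.
// Returns 0 on success, -1 on the first failing setsockopt call.
int SetTcpKeepAlive(int fd, int idle_s, int interval_s, int count) {
  int on = 1;
  if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0)
    return -1;
  // Idle time before the first probe.
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) != 0)
    return -1;
  // Interval between subsequent probes.
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s,
                 sizeof(interval_s)) != 0)
    return -1;
  // Number of unanswered probes before the connection is dropped.
  return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));
}
```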
This does not enable keepalive by default to make it easy to backport.
A separate patch will turn keepalive on by default.
Testing:
- Added a custom cluster test that connects with impala-shell
and verifies that the socket has the keepalive timer.
Verified that it works on Ubuntu 20, Centos 7, and Redhat 8.
- Used iptables to manually test cases where the client is unreachable
and verified that the server detects that and closes the connection.
Change-Id: I9e50f263006c456bc0797b8306aa4065e9713450
Reviewed-on: http://gerrit.cloudera.org:8080/22254
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Setting an IV with a non-default length before setting the length
is not correct. With newer OpenSSL (3.2) this led to failing
AES-GCM encryption
(likely since https://github.com/openssl/openssl/pull/22590).
The fix is to call EVP_(En/De)cryptInit_ex first without the IV,
then set the IV length and call EVP_EncryptInit_ex again with the IV
(but without the mode).
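A minimal sketch of the corrected call order for GCM decryption
(illustrative, error handling omitted):
```
#include <openssl/evp.h>

// Corrected order: select the cipher first, then set the IV length,
// then supply key and IV in a second init call (cipher left as null).
void InitGcmDecrypt(EVP_CIPHER_CTX* ctx, const unsigned char* key,
                    const unsigned char* iv, int iv_len) {
  EVP_DecryptInit_ex(ctx, EVP_aes_256_gcm(), nullptr, nullptr, nullptr);
  EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_SET_IVLEN, iv_len, nullptr);
  EVP_DecryptInit_ex(ctx, nullptr, nullptr, key, iv);
}
```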
Change-Id: I243f1d487d8ba5dc44b5cc361e041c83598d83c1
Reviewed-on: http://gerrit.cloudera.org:8080/22337
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
When IcebergMergeImpl created the table sink it didn't set
'inputIsClustered' to true. Therefore HdfsTableSink expected
random input and kept the output writers open for every partition,
which resulted in high memory consumption and potentially a
Memory Limit Exceeded error when the number of partitions is high.
Since we actually sort the rows before the sink we can set
'inputIsClustered' to true, which means HdfsTableSink can write
files one by one, because whenever it gets a row that belongs
to a new partition it knows that it can close the current output
writer, and open a new one.
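To illustrate why clustered input needs only one open writer at a time,
here is a simplified sketch with stand-in types (not the actual
HdfsTableSink code):
```
#include <string>
#include <utility>
#include <vector>

// Sketch: with clustered input, rows of a partition arrive contiguously,
// so only one output writer needs to be open at any time.
void WriteClustered(
    const std::vector<std::pair<std::string, std::string>>& rows) {
  std::string current_partition;
  for (const auto& [partition, row] : rows) {
    if (partition != current_partition) {
      // A new partition key means the previous partition is finished.
      if (!current_partition.empty()) { /* close writer for it */ }
      current_partition = partition;   // open writer for the new partition
    }
    // append 'row' to the single open writer
  }
}
```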
Testing:
- e2e regression test
Change-Id: I7bad0310e96eb482af9d09ba0d41e44c07bf8e4d
Reviewed-on: http://gerrit.cloudera.org:8080/22332
Reviewed-by: Peter Rozsa <prozsa@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
ImpalaTestSuite.cleanup_db(cls, db_name, sync_ddl=1) is used to drop the
entire db. Currently it uses cls.client and sets sync_ddl based on the
parameter. This clears all the query options of cls.client and makes it
always run with the same sync_ddl value unless the test code explicitly
sets query options again.
This patch changes cleanup_db() to use a dedicated client. Tested with
some tests that use this method and saw performance improvements:
TestName Before After
metadata/test_explain.py::TestExplainEmptyPartition 52s 9s
query_test/test_insert_behaviour.py::TestInsertBehaviour::test_insert_select_with_empty_resultset 62s 15s
metadata/test_metadata_query_statements.py::TestMetadataQueryStatements::test_describe_db (exhaustive) 160s 25s
Change-Id: Icb01665bc18d24e2fce4383df87c4607cf4562f1
Reviewed-on: http://gerrit.cloudera.org:8080/22286
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds OAuth support with the following functionality:
* Load and parse the OAuth JWKS from a configured JSON file or URL.
* Read the OAuth access token from the HTTP header, which has
the same format as a JWT Authorization Bearer token.
* Verify the OAuth token's signature with the public key in the JWKS.
* Get the username out of the payload of the OAuth access token.
* If kerberos or ldap is enabled, then both JWT and OAuth are
supported together; else only one of JWT or OAuth is supported.
This has been a pre-existing flow for JWT, so OAuth follows
the same policy.
* Impala Shell side changes: OAuth options -a and --oauth_cmd
Testing:
- Added 3 custom cluster be test in test_shell_jwt_auth.py:
- test_oauth_auth_valid: authenticate with valid token.
- test_oauth_auth_expired: authentication failure with
expired token.
- test_oauth_auth_invalid_jwk: authentication failure with
valid signature but expired.
- Added 1 custom cluster fe test in JwtWebserverTest.java
- testWebserverOAuthAuth: Basic tests for OAuth
- Added 1 custom cluster fe test in LdapHS2Test.java
- testHiveserver2JwtAndOAuthAuth: tests all combinations of
jwt and oauth token verification with separate jwks keys.
- Manually tested with a valid, invalid and expired oauth
access token.
- Passed core run.
Change-Id: I65dc8db917476b0f0d29b659b9fa51ebaf45b7a6
Reviewed-on: http://gerrit.cloudera.org:8080/21728
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change enables MERGE statements with source expressions containing
subqueries that require rewrite. The change adds implementations of the
reset methods for each merge case, and properly handles resets for
MergeStmt and IcebergMergeImpl.
Tests:
- Planner test added with a merge query that requires a rewrite
- Analyzer test modified
Change-Id: I26e5661274aade3f74a386802c0ed20e5cb068b5
Reviewed-on: http://gerrit.cloudera.org:8080/22039
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Pick up a new binary build of the current toolchain version for ARM.
The toolchain version is identical; the only difference is that the new
build added binaries for Rocky/RHEL 9 to the already supported OS
versions, reaching the same level of Impala build support as
Rocky/RHEL 8.
Tested by building Impala for RHEL9 for Intel and ARM both on private
infrastructure.
Change-Id: I5fd2e8c3187cb7829de55d6739cf5d68a09a2ed3
Reviewed-on: http://gerrit.cloudera.org:8080/22323
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-13620 increased datanucleus.connectionPool.maxPoolSize of HMS
from 10 to 30. When running all tests on a single node, this seems to
exhaust all 100 of PostgreSQL's max_connections and interfere with
authorization/test_ranger.py and query_test/test_ext_data_sources.py.
This patch lowers datanucleus.connectionPool.maxPoolSize to 20.
Testing:
- Pass exhaustive tests in single node.
Change-Id: I98eb27cbd141d5458a26d05d1decdbc7f918abd4
Reviewed-on: http://gerrit.cloudera.org:8080/22326
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When IcebergUpdateImpl created the table sink it didn't set
'inputIsClustered' to true. Therefore HdfsTableSink expected
random input and kept the output writers open for every partition,
which resulted in high memory consumption and potentially an
OOM error when the number of partitions is high.
Since we actually sort the rows before the sink we can set
'inputIsClustered' to true, which means HdfsTableSink can write
files one by one, because whenever it gets a row that belongs
to a new partition it knows that it can close the current output
writer, and open a new one.
Testing:
- e2e regression test
Change-Id: I9bad335cc946364fc612e8aaf90858eaabd7c4af
Reviewed-on: http://gerrit.cloudera.org:8080/22325
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
ALTER_TABLE events
IMPALA-12487 adds an optimization that if an ALTER_TABLE event has
trivial changes in StorageDescriptor (e.g. removing optional field
'storedAsSubDirectories'=false which defaults to false), file
metadata reload will be skipped, no matter what changes are in the
table properties. This is problematic since some HMS clients (e.g.
Spark) could modify both the table properties and StorageDescriptor.
If there are non-trivial changes in the table properties (e.g. a
'location' change), we shouldn't skip reloading file metadata.
Testing:
- Added a unit test to verify the same
Change-Id: Ia969dd32385ac5a1a9a65890a5ccc8cd257f4b97
Reviewed-on: http://gerrit.cloudera.org:8080/21971
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For a partition-level event, isOlderEvent() in catalogd needs to check
whether the corresponding partition was reloaded after the event. This
should be done while holding the table read lock. Otherwise,
EventProcessor could hit a ConcurrentModificationException error when
there are concurrent DDLs/DMLs modifying the partition list.
Note: created IMPALA-13650 for a cleaner solution to clear the in-flight
events list for partitioned table events.
Testing:
- Added an end-to-end stress test to verify the above scenario
Change-Id: I26933f98556736f66df986f9440ebb64be395bc1
Reviewed-on: http://gerrit.cloudera.org:8080/21663
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Red Hat 8 and 9 as well as their variants (e.g. Rocky Linux) preinstall
the curl-minimal package as a prerequisite for their package manager.
Unfortunately this conflicts with the installation of the full-blown
curl package when the Impala daemon Docker images are built during a
dockerised test run. The failure is caused by the two packages having
slightly different version numbers.
Fix this the same way as in bootstrap_system.sh: add the --allowerasing
flag to the yum command line to let yum/DNF substitute the full curl
version for the preinstalled curl-minimal package.
Tested by executing dockerised tests on Rocky Linux 9.2
Change-Id: I30fa0f13a77ef2a939a1b754014a78c171443c71
Reviewed-on: http://gerrit.cloudera.org:8080/21944
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Iceberg supports multiple writers with optimistic concurrency.
Each writer can write new files which are then added to the table
after a validation check to ensure that the commit does not conflict
with other modifications made during the execution.
When there was a conflicting change that could not be resolved, the
newly written files could not be committed to the table, so they used
to become orphan files on the file system. Orphan files can accumulate
over time, taking up a lot of storage space. They do not belong to the
table because they are not referenced by any snapshot, and therefore
they can't be removed by expiring snapshots.
This change introduces automatic cleanup of uncommitted files
after an unsuccessful DML operation to prevent creating orphan files.
No cleanup is done if Iceberg throws CommitStateUnknownException
because the update success or failure is unknown in this case.
Testing:
- E2E test: Injected ValidationException with debug option.
- stress test: Added a method to check that no orphan files were
created after failed conflicting commits.
Change-Id: Ibe59546ebf3c639b75b53dfa1daba37cef50eb21
Reviewed-on: http://gerrit.cloudera.org:8080/22189
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Several test vectors were ignored in test_scanners.py. This caused
repetition of the same test without actually varying the test's
exec_option or debug_action.
This patch fixes it by:
- Using execute_query() instead of client.execute()
- Passing vector.get_value('exec_option') when executing test queries.
It repurposes ImpalaTestMatrix.embed_independent_exec_options to deepcopy
the 'exec_option' dimension during vector generation. Therefore, each
test execution will have a unique copy of 'exec_option' for itself.
This patch also adds the flake8-unused-arguments plugin to
critique-gerrit-review.py and py3-requirements.txt so we can catch this
issue during code review. impala-flake8 is also updated to use
impala-python3-common.sh. Adds flake8==3.9.2 to py3-requirements.txt,
which is the highest version that has dependencies compatible with
pylint==2.10.2.
Drop unused 'dryrun' parameter in get_catalog_compatibility_comments
method of critique-gerrit-review.py.
Testing:
- Run impala-flake8 against test_scanners.py and confirm there are no
more unused variables.
- Run and pass test_scanners.py in core exploration.
Change-Id: I3b78736327c71323d10bcd432e162400b7ed1d9d
Reviewed-on: http://gerrit.cloudera.org:8080/22301
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds INSERT * and UPDATE SET * language elements for
WHEN NOT MATCHED and WHEN MATCHED clauses. INSERT * enumerates all
source expressions from the source table/subquery and analyzes the
clause similarly to the regular WHEN NOT MATCHED THEN INSERT case.
UPDATE SET * creates assignments for each target table column by
enumerating the table columns and assigning source expressions by index.
If the target column count and the source expression count mismatch, or
the types mismatch, both clauses report analysis errors.
Tests:
- parser tests added
- analyzer tests added
- E2E tests added
Change-Id: I31cb771f2355ba4acb0f3b9f570ec44fdececdf3
Reviewed-on: http://gerrit.cloudera.org:8080/22051
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch refreshes the compute_table_stats.py script with the
following changes:
- Limit parallelism to IMPALA_BUILD_THREADS at maximum if the
--parallelism argument is not set.
- Change its default connection to hs2, leveraging existing
ImpylaHS2Connection.
- Change OptionParser to ArgumentParser.
- Use impala-python3 to run the script.
- Add --exclude_table_names to skip running COMPUTE STATS on certain
tables/views.
- continue_on_error is False by default.
This patch also improves query handle logging in ImpylaHS2Connection.
collect_profile_and_log argument is added to control whether to pull
logs and runtime profile at the end of __fetch_results(). The default
behavior remains unchanged.
Skip COMPUTE STATS for functional_kudu.alltypesagg and
functional_kudu.manynulls because it is invalid to run COMPUTE STATS
over a view.
Customized hive-site.xml to set datanucleus.connectionPool.maxPoolSize
to 30 and hikaricp.connectionTimeout to 60000 ms. Also set hive.log.dir
to ${IMPALA_CLUSTER_LOGS_DIR}/hive.
Testing:
Repeatedly run compute-table-stats.sh from a cold state and confirm
that no error occurs. This is the script to do so from an active
minicluster:
cd $IMPALA_HOME
./bin/start-impala-cluster.py --kill
./testdata/bin/kill-hive-server.sh
./testdata/bin/run-hive-server.sh
./bin/start-impala-cluster.py
./testdata/bin/compute-table-stats.sh > /tmp/compute-stats.txt 2>&1
grep error /tmp/compute-stats.txt
Core tests ran and passed.
Change-Id: I1ebf02f95b957e7dda3a30622b87e8fca3197699
Reviewed-on: http://gerrit.cloudera.org:8080/22231
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
NDV of a grouping column can be reduced if there is a predicate over
that column. If the predicate is a constant equality predicate or
is-null predicate, then the NDV must be equal to 1. If the predicate is
a simple in-list predicate, the NDV must be the number of items in the
list.
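As a small illustration of the reduction rule (the function name and
inputs are hypothetical, not Impala's actual planner code):
```
#include <algorithm>
#include <cstdint>

// Sketch: cap a grouping column's NDV using predicate analysis. A
// constant equality or IS NULL predicate pins the NDV to 1; a simple
// IN list caps it at the number of list items.
int64_t ReduceNdvForPredicate(int64_t ndv, bool is_eq_or_is_null,
                              int64_t in_list_size /* 0 if not IN */) {
  if (is_eq_or_is_null) return std::min<int64_t>(ndv, 1);
  if (in_list_size > 0) return std::min(ndv, in_list_size);
  return ndv;
}
```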
This patch adds such consideration by leveraging existing analysis in
HdfsScanNode.computeStatsTupleAndConjuncts(). It memoizes the first
ScanNode/UnionNode that produces a TupleId in the Analyzer, registered
during Init()/computeStats() of the PlanNode. At the AggregationNode, it
looks up the PlanNode that produces a TupleId. If the origin PlanNode is
an HdfsScanNode, it analyzes whether any grouping expression is listed
in statsOriginalConjuncts_ and reduces them accordingly. If
HdfsScanNode.computeStatsTupleAndConjuncts() can be made generic for all
ScanNode implementations in the future, we can apply this same analysis
to all kinds of ScanNode and achieve the same reduction.
In terms of tracking producer PlanNode, this patch made an exception for
Iceberg PlanNodes that handle positional or equality deletion. In that
scenario, it is possible to have two ScanNodes sharing the same TupleId
to force UnionNode passthrough. Therefore, the UnionNode will be
acknowledged as the first producer of that TupleId.
This patch also removes some redundant operations in HdfsScanNode.
Fixed a typo in the method name MathUtil.saturatingMultiplyCardinalities().
Testing:
- Add new test cases in aggregation.test
- Pass core tests.
Change-Id: Ia840d68f1c4f126d4e928461ec5c44545dbf25f8
Reviewed-on: http://gerrit.cloudera.org:8080/22032
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The message of a COMMIT_TXN event just contains the transaction id
(txnid). In the logs of top-10 expensive events and top-10 targets that
contribute to the lag, we show the target as CLUSTER_WIDE.
However, when processing the events, catalogd actually finds
the involved tables and reloads them. It'd be helpful to show the names
of the tables involved in the transaction.
This patch overrides the getTargetName() method in CommitTxnEvent to
show the table names. They are collected after the event is processed.
Tests:
- Add tests in MetastoreEventsProcessorTest
Change-Id: I4a7cb5e716453290866a4c3e74c0d269f621144f
Reviewed-on: http://gerrit.cloudera.org:8080/22036
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Sai Hemanth Gantasala <saihemanth@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
The debug action HDFS_SCANNER_THREAD_CHECK_SOFT_MEM_LIMIT exists inside
the HdfsScanNode (hdfs-scan-node.cc) code path. MT_DOP > 0 executes
using HdfsScanNodeMt (hdfs-scan-node-mt.cc) rather than HdfsScanNode,
which always starts a single scanner thread per ScanNode. Thus, there is
no need to exercise the HDFS_SCANNER_THREAD_CHECK_SOFT_MEM_LIMIT and
MT_DOP > 0 combination.
This patch adds scan_multithread_constraint in test_scanners.py where
the 'mt_dop' exec option dimension is declared. This reduces the core
test vector combinations from 1254 to 1138 and the exhaustive test
vector combinations from 7530 to 6774.
Testing:
- Run and pass test_scanners.py in core exploration.
Change-Id: I77c2e6f9bbd4bc1825fa1f006a22ee1a6ea5a606
Reviewed-on: http://gerrit.cloudera.org:8080/22300
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
There are some false-positive warnings reported by
critique-gerrit-review.py when adding a new thrift struct that has
required fields. This patch leverages pyparsing to analyze the thrift
file changes so we can identify whether a new required field is added
to an existing struct.
thrift_parser.py adds a simple thrift grammar parser to parse a thrift
file into an AST. It basically consists of pyparsing.ParseResults and
some customized classes to inject the line number, i.e.
thrift_parser.ThriftField and thrift_parser.ThriftEnumItem.
Import thrift_parser to parse the current version of a thrift file and
the old version of it before the commit. critique-gerrit-review.py
then compares the structs and enums to report these warnings:
- A required field is deleted in an existing struct.
- A new required field is added in an existing struct.
- An existing field is renamed.
- The qualifier (required/optional) of a field is changed.
- The type of a field is changed.
- An enum item is removed.
- Enum items are reordered.
Only thrift files used in both catalogd and impalad are checked. This is
the same as the current version. We can further improve this by
analyzing all RPCs used between impalad and catalogd to get all thrift
struct/enums used in them.
Warning examples for commit e48af8c04:
"common/thrift/StatestoreService.thrift": [
{
"message": "Renaming field 'sequence' to 'catalogd_version' in TUpdateCatalogdRequest might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 345,
"side": "REVISION"
}
]
Warning examples for commit 595212b4e:
"common/thrift/CatalogObjects.thrift": [
{
"message": "Adding a required field 'type' in TIcebergPartitionField might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 612,
"side": "REVISION"
}
]
Warning examples for commit c57921225:
"common/thrift/CatalogObjects.thrift": [
{
"message": "Renaming field 'partition_id' to 'spec_id' in TIcebergPartitionSpec might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 606,
"side": "REVISION"
}
],
"common/thrift/CatalogService.thrift": [
{
"message": "Changing field 'iceberg_data_files_fb' from required to optional in TIcebergOperationParam might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 215,
"side": "REVISION"
},
{
"message": "Adding a required field 'operation' in TIcebergOperationParam might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 209,
"side": "REVISION"
}
],
"common/thrift/Query.thrift": [
{
"message": "Renaming field 'spec_id' to 'iceberg_params' in TFinalizeParams might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 876,
"side": "REVISION"
}
]
Warning example for commit 2b2cf8d96:
"common/thrift/CatalogService.thrift": [
{
"message": "Enum item FUNCTION_NOT_FOUND=3 changed to TABLE_NOT_LOADED=3 in CatalogLookupStatus. This might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 381,
"side": "REVISION"
}
]
Warning example for commit c01efd096:
"common/thrift/JniCatalog.thrift": [
{
"message": "Removing the enum item TAlterTableType.SET_OWNER=15 might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 107,
"side": "PARENT"
}
]
Warning example for commit 374783c55:
"common/thrift/Query.thrift": [
{
"message": "Changing type of field 'enabled_runtime_filter_types' from PlanNodes.TEnabledRuntimeFilterTypes to set<PlanNodes.TRuntimeFilterType> in TQueryOptions might break the compatibility between impalad and catalogd/statestore during upgrade",
"line": 449,
"side": "REVISION"
}
Tests
- Add tests in tests/infra/test_thrift_parser.py
- Verified the script with all (1260) commits of common/thrift.
Change-Id: Ia1dc4112404d0e7c5df94ee9f59a4fe2084b360d
Reviewed-on: http://gerrit.cloudera.org:8080/22264
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-13405 does tuple analysis to lower AggregationNode cardinality.
It begins by focusing on the simple column SlotRef, but we can improve
this further by tracing the origin TupleId across views and intermediate
aggregation tuples. This patch implements deeper TupleId tracing to
achieve further cardinality reduction. With this deeper TupleId
resolution, it is now possible to narrow the TupleId search to
children ScanNodes and UnionNodes only.
Note that this optimization still runs ONLY IF there are at
least two grouping expressions that refer to the same TupleId. There is
a benefit to running the same optimization even when there is only a
single expression per TupleId, but we defer that work until we can
provide faster TupleId-to-PlanNode mapping without repeating the plan
tree traversal.
This patch also makes tuple-based reduction more conservative by capping
at the input cardinality/limit, or using the output cardinality if the
producer node is a UnionNode or has hard estimates. aggInputCardinality
is still indirectly influenced by the predicates and limits of child
nodes.
The following PlannerTest (under
testdata/workloads/functional-planner/queries/PlannerTest/) revert their
cardinality estimation to their state prior to IMPALA-13405:
tpcds/tpcds-q19.test
tpcds/tpcds-q55.test
tpcds_cpu_cost/tpcds-q03.test
tpcds_cpu_cost/tpcds-q31.test
tpcds_cpu_cost/tpcds-q47.test
tpcds_cpu_cost/tpcds-q52.test
tpcds_cpu_cost/tpcds-q57.test
tpcds_cpu_cost/tpcds-q89.test
Several other planner tests have increased cardinality after this
change, but the numbers are still below pre-IMPALA-13405.
Removed the nested-view planner test in agg-node-max-mem-estimate.test
that was first added by IMPALA-13405. That same test has been duplicated
by IMPALA-13480 in aggregation.test.
Testing:
- Pass core tests.
Change-Id: I11f59ccc469c24c1800abaad3774c56190306944
Reviewed-on: http://gerrit.cloudera.org:8080/21955
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala has some metrics that reflect the state of the JVM. Some of these
metrics have names that are partly composed of the names of the
MemoryPoolMXBean objects in the Java virtual machine. In JDK 8 these
are names like "Code Cache" and "PS Eden Space". In JDK 11 these names
include apostrophe characters, for example "CodeHeap 'profiled
nmethods'". The derived metric names work OK for Impala in both the
webui and in JSON output. However, the apostrophe character is illegal
in Prometheus metric names per
https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels
and these metrics cannot be consumed by Prometheus. Fix this by adding
the apostrophe to the list of characters that are mapped to underscores
when we translate the metric names for Prometheus metrics.
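A simplified sketch of this kind of character mapping (illustrative, not
the exact Impala implementation):
```
#include <string>

// Map characters that are illegal in Prometheus metric names to '_'.
// Prometheus allows [a-zA-Z0-9_:]; apostrophes from JDK 11 memory pool
// names must therefore be mapped as well.
std::string ToPrometheusName(const std::string& name) {
  std::string out = name;
  for (char& c : out) {
    bool ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9') || c == '_' || c == ':';
    if (!ok) c = '_';
  }
  return out;
}
```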
TESTING:
Extended the test_prometheus_metrics test to parse all generated
Prometheus metrics. Ran the test with JDK 11, where it failed without
the server fix.
Change-Id: I557b123c075dff0b14ac527de08bc6177bd2a3f6
Reviewed-on: http://gerrit.cloudera.org:8080/22295
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
ParquetUncompressedBytesReadPerColumn and
ParquetCompressedBytesReadPerColumn exist in the runtime profile even
when no Parquet file is read (e.g. all scans read text files). This
patch lazily initializes those counters only if
HdfsScanNodeBase::bytes_read_per_col_ is not empty.
Testing:
- Run and pass TestParquet::test_bytes_read_per_column.
- Run TestTpcdsInsert and confirm no Parquet specific counters exist
when reading TEXTFILE table.
Change-Id: I8ba767b69b8c432f0eb954aa54f86876b329160c
Reviewed-on: http://gerrit.cloudera.org:8080/22297
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
TestTpcdsInsert creates a temporary table to test insert functionality.
It has three problems:
1. It does not use the unique_database parameter, so the temporary table
is not cleaned up after the test finishes.
2. It ignores file_format from the test vector, causing inconsistency in
the temporary table's file format: str_insert is always in PARQUET
format, while store_sales_insert is always in TEXTFILE format.
3. The text file_format dimension is never exercised, because
--workload_exploration_strategy in run-all-tests.sh does not
explicitly list the tpcds-insert workload.
This patch fixes all three problems and a few flake8 warnings in
test_tpcds_queries.py.
Testing:
- Run bin/run-all-tests.sh with
EXPLORATION_STRATEGY=exhaustive
EE_TEST=true
EE_TEST_FILES="query_test/test_tpcds_queries.py::TestTpcdsInsert"
Verified that the temporary table format follows file_format
dimension.
Change-Id: Iea621ec1d6a53eba9558b0daa3a4cc97fbcc67ae
Reviewed-on: http://gerrit.cloudera.org:8080/22291
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
IMPALA-13405 adds a new tuple-analysis algorithm in AggregationNode to
lower cardinality estimation when planning multi-column grouping. This
patch adds a query option ENABLE_TUPLE_ANALYSIS_IN_AGGREGATE that allows
users to enable/disable the algorithm if necessary. Default is True.
Testing:
- Add testAggregationNoTupleAnalysis.
This test is based on TpcdsPlannerTest#testQ19 but with
ENABLE_TUPLE_ANALYSIS_IN_AGGREGATE set to false.
Change-Id: Iabd8daa3d9414fc33d232643014042dc20530514
Reviewed-on: http://gerrit.cloudera.org:8080/22294
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>