3727 Commits

Gabor Kaszab
9c5ed44d51 IMPALA-13739: Part1: Move FileDescriptor, FileBlock and BlockReplica
Move the above classes out of HdfsPartition into their own top-level classes.

Change-Id: I2eb713619d4d231cec65a58255c3ced7b12d1880
Reviewed-on: http://gerrit.cloudera.org:8080/22441
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Gabor Kaszab <gaborkaszab@cloudera.com>
2025-02-06 07:22:56 +00:00
stiga-huang
59616d7bd1 IMPALA-13722: Ranger request should have no null values in the resources map
Each Ranger privilege grant/revoke request has a resources map to
describe the target resource, e.g. db, table, url, etc. The keys of the
resources map are predefined strings like "database", "table", "column",
"url", etc. The values are the actual resource name or "*" as a wildcard
for all such kinds of resources.

Our FE tests unintentionally set null values on "url" when the resources
map doesn't have such a key. See AuthorizationTestBase#updateUri().
Using null as the value is undefined behavior, and it causes a
NullPointerException in newer versions of Ranger when granting/revoking
privileges (possibly due to RANGER-4638, which adds code to split the
string value).

This patch fixes AuthorizationTestBase#updateUri() to only update the
URL when it's not null and starts with "/".
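
The guard can be sketched in Python (the actual fix is in the Java test
class AuthorizationTestBase#updateUri(); this helper is hypothetical):

```python
def update_uri(resources, new_uri):
    # Only rewrite the "url" resource when one is present and looks like
    # a path; never put a null value into the Ranger resources map.
    old = resources.get("url")
    if old is not None and old.startswith("/"):
        resources["url"] = new_uri
    return resources
```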

Tests:
 - Verified the fix in a downstream build with a newer Ranger version
   that hits the issue.

Change-Id: Ibed929bcd25ffdf54fa5f0fc848a0cc13c1fb0a2
Reviewed-on: http://gerrit.cloudera.org:8080/22435
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-03 16:05:52 +00:00
stiga-huang
fb45c786e9 IMPALA-13691: Partition values from HMS events don't need URL decoding
Hive uses URL encoding to format the partition strings when creating the
partition folders, e.g. "00:00:00" will be encoded into "00%3A00%3A00".
When you create a partition of string type partition column "p" and
using "00:00:00" as the partition value, the underlying partition folder
is "p=00%3A00%3A00".
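
The round trip can be illustrated with Python's urllib; Hive's exact
escape set differs slightly from urllib's default, so this is only a
sketch:

```python
from urllib.parse import quote, unquote

def partition_folder(col, value):
    # Hive percent-encodes special characters (e.g. ':') in the
    # partition value when naming the partition directory.
    return "%s=%s" % (col, quote(value, safe=""))

def value_from_folder(folder):
    # Decoding is only correct for strings parsed from file paths;
    # values from HMS events are already the original partition values.
    _, encoded = folder.split("=", 1)
    return unquote(encoded)
```

Decoding a raw HMS-event value such as `50%25` would wrongly turn it
into `50%`, which is exactly the mismatch this patch avoids.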

When parsing the partition folders, Impala will URL-decode the partition
folder names to get the correct partition values. This is correct in
ALTER TABLE RECOVER PARTITIONS command that gets the partition strings
from the file paths. However, Impala shouldn't URL-decode partition
strings that come from HMS events, since those are not URL-encoded;
they are already the original partition values. Decoding them causes
HMS events on partitions that have percent signs in their value strings
to be matched to the wrong partitions.

This patch fixes the issue by only URL-decoding the partition strings
that come from file paths.

Tests:
 - Ran tests/metadata/test_recover_partitions.py
 - Added custom-cluster test.

Change-Id: I7ba7fbbed47d39b02fa0b1b86d27dcda5468e344
Reviewed-on: http://gerrit.cloudera.org:8080/22388
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-27 22:52:43 +00:00
stiga-huang
2e59bbae37 IMPALA-12785: Add commands to control event-processor status
This patch extends the existing AdminFnStmt to support operations on
EventProcessor. E.g. to pause the EventProcessor:
  impala-shell> :event_processor('pause');
to resume the EventProcessor:
  impala-shell> :event_processor('start');
Or to resume the EventProcessor on a given event id (1000):
  impala-shell> :event_processor('start', 1000);
Admin can also resume the EventProcessor at the latest event id by using
-1:
  impala-shell> :event_processor('start', -1);

Supported command actions in this patch: pause, start, status.

The command output of all actions will show the latest status of
EventProcessor, including
 - EventProcessor status:
   PAUSED / ACTIVE / ERROR / NEEDS_INVALIDATE / STOPPED / DISABLED.
 - LastSyncedEventId: The last HMS event id which we have synced to.
 - LatestEventId: The event id of the latest event in HMS.

Example output:
[localhost:21050] default> :event_processor('pause');
+--------------------------------------------------------------------------------+
| summary                                                                        |
+--------------------------------------------------------------------------------+
| EventProcessor status: PAUSED. LastSyncedEventId: 34489. LatestEventId: 34489. |
+--------------------------------------------------------------------------------+
Fetched 1 row(s) in 0.01s

If authorization is enabled, only admin users that have ALL privilege on
SERVER can run this command.

Note that there is a restriction in MetastoreEventsProcessor#start(long)
that resuming EventProcessor back to a previous event id is only allowed
when it's not in the ACTIVE state. This patch aims to expose the control
of EventProcessor to the users so MetastoreEventsProcessor is not
changed. We can investigate the restriction and see if we want to relax
it.

Note that resuming EventProcessor at a newer event id can be done in
any state. Admins can use this to manually resolve the lag of HMS event
processing, after they have made sure all (or important) tables are
manually invalidated/refreshed.

A new catalogd RPC, SetEventProcessorStatus, is added for coordinators
to control the status of EventProcessor.

Tests
 - Added e2e tests

Change-Id: I5a19f67264cfe06a1819a22c0c4f0cf174c9b958
Reviewed-on: http://gerrit.cloudera.org:8080/22250
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-24 04:10:14 +00:00
Daniel Becker
c5b474d3f5 IMPALA-13594: Read Puffin stats also from older snapshots
Before this change, Puffin stats were only read from the current
snapshot. Now we also consider older snapshots, and for each column we
choose the most recent available stats. Note that this means that the
stats for different columns may come from different snapshots.

In case there are both HMS and Puffin stats for a column, the more
recent one will be used - for HMS stats we use the
'impala.lastComputeStatsTime' table property, and for Puffin stats we
use the snapshot timestamp to determine which is more recent.
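
The per-column choice can be sketched as follows; the dict shapes
(column -> (ndv, timestamp)) are hypothetical, not Impala's actual data
structures:

```python
def pick_stats(hms_stats, puffin_stats):
    # For each column, pick the NDV whose source (HMS compute-stats time
    # vs. Puffin snapshot timestamp) is more recent.
    chosen = {}
    for col in set(hms_stats) | set(puffin_stats):
        hms = hms_stats.get(col)
        puffin = puffin_stats.get(col)
        if puffin is None or (hms is not None and hms[1] >= puffin[1]):
            chosen[col] = hms[0]
        else:
            chosen[col] = puffin[0]
    return chosen
```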

This commit also renames the startup flag 'disable_reading_puffin_stats'
to 'enable_reading_puffin_stats' and the table property
'impala.iceberg_disable_reading_puffin_stats' to
'impala.iceberg_read_puffin_stats' to make them more intuitive. The
default values are flipped to keep the same behaviour as before.

The documentation of Puffin reading is updated in
docs/topics/impala_iceberg.xml

Testing:
 - updated existing test cases and added new ones in
   test_iceberg_with_puffin.py
 - reorganised the tests in TestIcebergTableWithPuffinStats in
   test_iceberg_with_puffin.py: tests that modify table properties and
   other state that other tests rely on are now run separately to
   provide a clean environment for all tests.

Change-Id: Ia37abe8c9eab6d91946c8f6d3df5fb0889704a39
Reviewed-on: http://gerrit.cloudera.org:8080/22177
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-23 15:25:59 +00:00
Michael Smith
4a645105f9 IMPALA-13658: Enable tuple caching aggregates
Enables tuple caching on aggregates directly above scan nodes. Caching
aggregates requires that their children are also eligible for caching,
so this excludes aggregates above an exchange, union, or hash join.

Testing:
- Adds Planner tests for different aggregate cases to confirm they have
  stable tuple cache keys and are valid for caching.
- Adds custom cluster tests that cached aggregates are used, and can be
  re-used in slightly different statements.

Change-Id: I9bd13c2813c90d23eb3a70f98068fdcdab97a885
Reviewed-on: http://gerrit.cloudera.org:8080/22322
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-22 23:26:21 +00:00
Xuebin Su
d7ee509e93 IMPALA-12648: Add KILL QUERY statement
To support killing queries programmatically, this patch adds a new
type of SQL statement, the KILL QUERY statement, to cancel and
unregister a query on any coordinator in the cluster.

A KILL QUERY statement looks like
```
KILL QUERY '123:456';
```
where `123:456` is the query id of the query we want to kill. We follow
syntax from HIVE-17483. For backward compatibility, 'KILL' and 'QUERY'
are added as "unreserved keywords", like 'DEFAULT'. This allows the
three keywords to be used as identifiers.

A user is authorized to kill a query only if the user is an admin or is
the owner of the query. KILL QUERY statements are not affected by
admission control.

Implementation:

Since we don't know in advance which impalad is the coordinator of the
query we want to kill, we need to broadcast the kill request to all the
coordinators in the cluster. Upon receiving a kill request, each
coordinator checks whether it is the coordinator of the query:
- If yes, it cancels and unregisters the query,
- If no, it reports "Invalid or unknown query handle".
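
The fan-out can be sketched as follows; Coordinator and try_kill are
hypothetical stand-ins, not Impala's actual RPC interface:

```python
from dataclasses import dataclass

@dataclass
class Coordinator:
    # Hypothetical stand-in for an impalad coordinator; 'queries' holds
    # the ids of the queries this coordinator owns.
    name: str
    queries: set

    def try_kill(self, query_id):
        if query_id in self.queries:
            self.queries.discard(query_id)  # cancel + unregister
            return True, "ok"
        return False, "Invalid or unknown query handle"

def kill_query(coordinators, query_id):
    # Broadcast the kill request; at most one coordinator owns the
    # query and handles it, the rest report an unknown handle.
    for coord in coordinators:
        ok, _ = coord.try_kill(query_id)
        if ok:
            return "Killed %s on %s" % (query_id, coord.name)
    return "Invalid or unknown query handle"
```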

Currently, a KILL QUERY statement is not interruptible. IMPALA-13663 is
created for this.

For authorization, this patch adds a custom handler of
AuthorizationException for each statement to allow the exception to be
handled by the backend. This is because we don't know whether the user
is the owner of the query until we reach its coordinator.

To support cancelling child queries, this patch changes
ChildQuery::Cancel() to bypass the HS2 layer so that the session of the
child query will not be added to the connection used to execute the
KILL QUERY statement.

Testing:
- A new ParserTest case is added to test using "unreserved keywords" as
  identifiers.
- New E2E test cases are added for the KILL QUERY statement.
- Added a new dimension in TestCancellation to use the KILL QUERY
  statement.
- Added file tests/common/cluster_config.py and made
  CustomClusterTestSuite.with_args() composable so that common cluster
  configs can be reused in custom cluster tests.

Change-Id: If12d6e47b256b034ec444f17c7890aa3b40481c0
Reviewed-on: http://gerrit.cloudera.org:8080/21930
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
2025-01-22 22:22:54 +00:00
Fang-Yu Rao
5371e0c6df IMPALA-13666: Provide a non-null fileMetadataStats for HdfsPartition
IMPALA-13154 added the method getFileMetadataStats() to
HdfsPartition.java that would return the file metadata statistics. The
method requires the corresponding HdfsPartition instance to have a
non-null field of 'fileMetadataStats_'.

This patch revises two existing constructors of HdfsPartition to provide
a non-null value for 'fileMetadataStats'. This makes it easier for a
third party extension to set up and update the field of
'fileMetadataStats_'. A third party extension has to update the field of
'fileMetadataStats_' if it would like to use this field to get the size
of the partition since all three fields in 'fileMetadataStats_' are
defaulted to 0.

A new constructor was also added for HdfsPartition that allows a
third-party extension to provide its own FileMetadataStats when
instantiating an HdfsPartition. To facilitate this, a new constructor
was added for FileMetadataStats that takes a List of FileDescriptors.

Change-Id: I7e690729fcaebb1e380cc61f2b746783c86dcbf7
Reviewed-on: http://gerrit.cloudera.org:8080/22340
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-20 23:37:28 +00:00
Peter Rozsa
d1a0c2ac14 IMPALA-13205: Do not include Iceberg position fields for MERGE statements with INSERT merge clauses
This patch makes the inclusion of the Iceberg position fields
conditional: they are included only if the MERGE statement lists UPDATE
or DELETE merge clauses or the target table has existing delete files.
These fields can be omitted when there is no delete file creation at
the sink of the MERGE statement and the table has no existing delete
files.

Additionally, this change disables MERGE for Iceberg target tables
that contain equality delete files, see IMPALA-13674.

Tests:
 - iceberg-merge-insert-only planner test added

Change-Id: Ib62c78dab557625fa86988559b3732591755106f
Reviewed-on: http://gerrit.cloudera.org:8080/21931
Reviewed-by: Noemi Pap-Takacs <npaptakacs@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-20 17:04:59 +00:00
Riza Suminto
3118e41c26 IMPALA-2945: Account for duplicate keys on multiple nodes preAgg
AggregationNode.computeStats() estimates cardinality under a
single-node assumption. This can be an underestimation in the
preaggregation node case because the same grouping key may exist on
multiple nodes during preaggregation.

This patch adjusts the cardinality estimate using the following model
for the number of distinct values in a random sample of k rows,
previously used for the ProcessingCost model by IMPALA-12657 and
IMPALA-13644.

Assume we are picking k rows from an infinite population with ndv
distinct values, with the values uniformly distributed. The probability
of a given value not appearing in the sample, in that case, is

((NDV - 1) / NDV) ^ k

This is because we are making k choices, and each of them has
(ndv - 1) / ndv chance of not being our value. Therefore the
probability of a given value appearing in the sample is:

1 - ((NDV - 1) / NDV) ^ k

And the number of distinct values in the sample is:

(1 - ((NDV - 1) / NDV) ^ k) * NDV
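
In Python, the model reads as follows (a standalone sketch, not
Impala's planner code):

```python
def estimate_sample_ndv(ndv, k):
    # Expected number of distinct values observed in k uniform draws
    # from a domain of ndv values: (1 - ((ndv - 1) / ndv) ** k) * ndv.
    if ndv <= 0 or k <= 0:
        return 0.0
    return (1.0 - ((ndv - 1.0) / ndv) ** k) * ndv
```

For k = 1 the estimate is exactly 1, and for k much larger than ndv it
approaches ndv, matching the intuition that a large enough sample sees
every value.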

A query option, ESTIMATE_DUPLICATE_IN_PREAGG, is added to control
whether to use the new estimation logic.

Testing:
- Pass core tests.

Change-Id: I04c563e59421928875b340cb91654b9d4bc80b55
Reviewed-on: http://gerrit.cloudera.org:8080/22047
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-17 20:22:03 +00:00
Riza Suminto
c298c54262 IMPALA-13644: Generalize and move getPerInstanceNdvForCpuCosting
getPerInstanceNdvForCpuCosting is a method to estimate the number of
distinct values of exprs per fragment instance when accounting for the
likelihood of duplicate keys across fragment instances. It borrows the
probabilistic model described in IMPALA-2945. This method is used
exclusively by AggregationNode.

getPerInstanceNdvForCpuCosting runs the probabilistic formula
individually for each grouping expression and then multiplies the
results together. That matches how we estimated group NDV in the past,
where we simply multiplied the NDVs of the grouping expressions.

Recently, we added tuple-based analysis to lower the cardinality
estimate for all kinds of aggregation nodes (IMPALA-13045, IMPALA-13465,
IMPALA-13086). All of the bounding happens in
AggregationNode.computeStats(), where we call the estimateNumGroups()
function that returns the globalNdv estimate for a specific aggregation
class.

To take advantage of that more precise globalNdv, this patch replaces
getPerInstanceNdvForCpuCosting() with estimatePreaggCardinality(), which
applies the probabilistic formula over this single globalNdv number
rather than the old way, which often returned an overestimate from the
NDV multiplication method. Its use is still limited to calculating
ProcessingCost. Using it for the preagg output cardinality will be done
by IMPALA-2945.

estimatePreaggCardinality is skipped if the data partition of the input
is a subset of the grouping expressions.

Testing:
- Run and pass PlannerTest that set COMPUTE_PROCESSING_COST=True.
  ProcessingCost changes, but all cardinality number stays.
- Add CardinalityTest#testEstimatePreaggCardinality.
- Update test_executor_groups.py. Enable v2 profile as well for easier
  runtime profile debugging.

Change-Id: Iddf75833981558fe0188ea7475b8d996d66983c1
Reviewed-on: http://gerrit.cloudera.org:8080/22320
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-17 10:05:09 +00:00
Noemi Pap-Takacs
55d7498b24 IMPALA-13656: MERGE redundantly accumulates memory in HDFS WRITER
When IcebergMergeImpl created the table sink it didn't set
'inputIsClustered' to true. Therefore HdfsTableSink expected
random input and kept the output writers open for every partition,
which resulted in high memory consumption and potentially a
Memory Limit Exceeded error when the number of partitions is high.

Since we actually sort the rows before the sink we can set
'inputIsClustered' to true, which means HdfsTableSink can write
files one by one, because whenever it gets a row that belongs
to a new partition it knows that it can close the current output
writer, and open a new one.
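
The clustered write pattern can be sketched as follows; open_writer is
a hypothetical writer factory, not HdfsTableSink's actual API:

```python
def write_clustered(rows, open_writer):
    # With clustered (partition-sorted) input, at most one output writer
    # needs to be open: close it whenever the partition key changes.
    current_key = object()  # sentinel that never equals a real key
    writer = None
    for partition_key, row in rows:
        if partition_key != current_key:
            if writer is not None:
                writer.close()
            writer = open_writer(partition_key)
            current_key = partition_key
        writer.write(row)
    if writer is not None:
        writer.close()
```

With random (unclustered) input the sink would instead have to keep one
writer open per partition seen so far, which is the memory growth this
patch avoids.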

Testing:
 - e2e regression test

Change-Id: I7bad0310e96eb482af9d09ba0d41e44c07bf8e4d
Reviewed-on: http://gerrit.cloudera.org:8080/22332
Reviewed-by: Peter Rozsa <prozsa@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-15 18:20:00 +00:00
gaurav1086
c3cbd79b56 IMPALA-13288: OAuth AuthN Support for Impala
This patch adds OAuth support with the following functionality:
 * Load and parse the OAuth JWKS from a configured JSON file or URL.
 * Read the OAuth access token from the HTTP header, which has
   the same format as a JWT Authorization Bearer token.
 * Verify the OAuth token's signature with the public key in the JWKS.
 * Get the username out of the payload of the OAuth access token.
 * If Kerberos or LDAP is enabled, then both JWT and OAuth are
   supported together; otherwise only one of JWT or OAuth is
   supported. This has been the pre-existing flow for JWT, so OAuth
   follows the same policy.
 * Impala shell side changes: OAuth options -a and --oauth_cmd

Testing:
 - Added 3 custom cluster BE tests in test_shell_jwt_auth.py:
   - test_oauth_auth_valid: authenticate with valid token.
   - test_oauth_auth_expired: authentication failure with
     expired token.
   - test_oauth_auth_invalid_jwk: authentication failure with a
     valid signature but an expired token.
 - Added 1 custom cluster FE test in JwtWebserverTest.java
   - testWebserverOAuthAuth: Basic tests for OAuth
 - Added 1 custom cluster FE test in LdapHS2Test.java
   - testHiveserver2JwtAndOAuthAuth: tests all combinations of
     JWT and OAuth token verification with separate JWKS keys.
 - Manually tested with valid, invalid, and expired OAuth
   access tokens.
 - Passed core run.

Change-Id: I65dc8db917476b0f0d29b659b9fa51ebaf45b7a6
Reviewed-on: http://gerrit.cloudera.org:8080/21728
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-15 03:32:57 +00:00
Peter Rozsa
ff2f2ba77e IMPALA-13324: Enable statement rewrite for merge queries for IcebergMergeImpl
This change enables MERGE statements with source expressions containing
subqueries that require rewrite. The change adds implementation for
reset methods for each merge case, and properly handles resets for
MergeStmt and IcebergMergeImpl.

Tests:
 - Planner test added with a merge query that requires a rewrite
 - Analyzer test modified

Change-Id: I26e5661274aade3f74a386802c0ed20e5cb068b5
Reviewed-on: http://gerrit.cloudera.org:8080/22039
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-14 14:23:42 +00:00
Riza Suminto
12a2a04bc8 IMPALA-13664: Lower datanucleus.connectionPool.maxPoolSize to 20
IMPALA-13620 increased the datanucleus.connectionPool.maxPoolSize of
HMS from 10 to 30. When running all tests on a single node, this seems
to exhaust all 100 of PostgreSQL's max_connections and interferes with
authorization/test_ranger.py and query_test/test_ext_data_sources.py.

This patch lowers datanucleus.connectionPool.maxPoolSize to 20.

Testing:
- Pass exhaustive tests in single node.

Change-Id: I98eb27cbd141d5458a26d05d1decdbc7f918abd4
Reviewed-on: http://gerrit.cloudera.org:8080/22326
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-10 21:14:49 +00:00
Noemi Pap-Takacs
fa570e8ea7 IMPALA-13655: UPDATE redundantly accumulates memory in HDFS WRITER
When IcebergUpdateImpl created the table sink it didn't set
'inputIsClustered' to true. Therefore HdfsTableSink expected
random input and kept the output writers open for every partition,
which resulted in high memory consumption and potentially an
OOM error when the number of partitions is high.

Since we actually sort the rows before the sink we can set
'inputIsClustered' to true, which means HdfsTableSink can write
files one by one, because whenever it gets a row that belongs
to a new partition it knows that it can close the current output
writer, and open a new one.

Testing:
 - e2e regression test

Change-Id: I9bad335cc946364fc612e8aaf90858eaabd7c4af
Reviewed-on: http://gerrit.cloudera.org:8080/22325
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-10 16:02:53 +00:00
Sai Hemanth Gantasala
1f7b9601e5 IMPALA-13403: Refactor the checks of skip reloading file metadata for
ALTER_TABLE events

IMPALA-12487 adds an optimization that if an ALTER_TABLE event has
trivial changes in StorageDescriptor (e.g. removing optional field
'storedAsSubDirectories'=false which defaults to false), file
metadata reload will be skipped, no matter what changes are in the
table properties. This is problematic since some HMS clients (e.g.
Spark) could modify both the table properties and StorageDescriptor.
If there is a non-trivial change in the table properties (e.g. a
'location' change), we shouldn't skip reloading file metadata.

Testing:
- Added a unit test to verify the same

Change-Id: Ia969dd32385ac5a1a9a65890a5ccc8cd257f4b97
Reviewed-on: http://gerrit.cloudera.org:8080/21971
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-10 15:21:27 +00:00
Sai Hemanth Gantasala
46115e9a8e IMPALA-13126: Obtain table read lock in EP to process partitioned event
For a partition-level event, isOlderEvent() in catalogd needs to check
whether the corresponding partition is reloaded after the event. This
should be done after holding the table read lock. Otherwise,
EventProcessor could hit ConcurrentModificationException error when
there are concurrent DDLs/DMLs modifying the partition list.

Note: Created IMPALA-13650 for a cleaner solution to clear the
in-flight events list for partitioned table events.

Testing:
- Added an end-to-end stress test to verify the above scenario

Change-Id: I26933f98556736f66df986f9440ebb64be395bc1
Reviewed-on: http://gerrit.cloudera.org:8080/21663
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-10 07:29:43 +00:00
Noemi Pap-Takacs
c518d3c818 IMPALA-13501: Clean up uncommitted Iceberg files after validation check failure
Iceberg supports multiple writers with optimistic concurrency.
Each writer can write new files which are then added to the table
after a validation check to ensure that the commit does not conflict
with other modifications made during the execution.

When there is a conflicting change that cannot be resolved, the newly
written files cannot be committed to the table, so they used to become
orphan files on the file system. Orphan files
can accumulate over time, taking up a lot of storage space. They do
not belong to the table because they are not referenced by any snapshot
and therefore they can't be removed by expiring snapshots.

This change introduces automatic cleanup of uncommitted files
after an unsuccessful DML operation to prevent creating orphan files.
No cleanup is done if Iceberg throws CommitStateUnknownException
because the update success or failure is unknown in this case.
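
The cleanup policy can be sketched as follows; the exception classes
are stand-ins for Iceberg's ValidationException and
CommitStateUnknownException:

```python
class ValidationFailed(Exception):
    """Stand-in for Iceberg's ValidationException (unresolvable conflict)."""

class CommitStateUnknown(Exception):
    """Stand-in for Iceberg's CommitStateUnknownException."""

def commit_with_cleanup(commit, delete_files, written_files):
    # Delete freshly written files only when we know the commit failed;
    # if the outcome is unknown, the files may already be referenced by
    # a snapshot, so they must be kept.
    try:
        commit()
    except CommitStateUnknown:
        raise
    except ValidationFailed:
        delete_files(written_files)
        raise
```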

Testing:
- E2E test: Injected ValidationException with debug option.
- stress test: Added a method to check that no orphan files were
  created after failed conflicting commits.

Change-Id: Ibe59546ebf3c639b75b53dfa1daba37cef50eb21
Reviewed-on: http://gerrit.cloudera.org:8080/22189
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-09 18:52:40 +00:00
Michael Smith
740ee28eb1 IMPALA-13618: Move to commons-lang3
Updates from commons-lang (2.6) to commons-lang3.

Switches getFullStackTrace to getStackTrace. getFullStackTrace is not
present in lang3, and https://issues.apache.org/jira/browse/LANG-904
suggests that getFullStackTrace existed for handling chained exceptions
in older Java runtimes.

Change-Id: Ie16af2692858f6a571cc1e5b85ecba3806da8d7e
Reviewed-on: http://gerrit.cloudera.org:8080/22228
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-09 07:36:39 +00:00
Peter Rozsa
d23ba87d46 IMPALA-13361: Add INSERT * and UPDATE SET * syntax for MERGE statement
This change adds INSERT * and UPDATE SET * language elements for
WHEN NOT MATCHED and WHEN MATCHED clauses. INSERT * enumerates all
source expressions from source table/subquery and analyzes the clause
similarly to the regular WHEN NOT MATCHED THEN INSERT case. UPDATE SET
* creates assignments for each target table column by enumerating the
table columns and assigning source expressions by index.
If the target column count and the source expression count mismatch,
or the types mismatch, both clauses report analysis errors.
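
The by-index assignment for UPDATE SET * can be sketched as follows (a
standalone sketch; type checking is omitted):

```python
def expand_update_set_star(target_columns, source_exprs):
    # UPDATE SET *: pair each target column with the source expression
    # at the same index; a count mismatch is an analysis error.
    if len(target_columns) != len(source_exprs):
        raise ValueError(
            "column count %d does not match source expression count %d"
            % (len(target_columns), len(source_exprs)))
    return list(zip(target_columns, source_exprs))
```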

Tests:
 - parser tests added
 - analyzer tests added
 - E2E tests added

Change-Id: I31cb771f2355ba4acb0f3b9f570ec44fdececdf3
Reviewed-on: http://gerrit.cloudera.org:8080/22051
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-08 17:03:57 +00:00
Riza Suminto
01b8b45252 IMPALA-13620: Refresh compute_table_stats.py script
This patch refreshes the compute_table_stats.py script with the following
changes:
- Limit parallelism to IMPALA_BUILD_THREADS at maximum if --parallelism
  argument is not set.
- Change its default connection to hs2, leveraging existing
  ImpylaHS2Connection.
- Change OptionParser to ArgumentParser.
- Use impala-python3 to run the script.
- Add --exclude_table_names to skip running COMPUTE STATS on certain
  tables/views.
- continue_on_error is False by default.

This patch also improves query handle logging in ImpylaHS2Connection.
A collect_profile_and_log argument is added to control whether to pull
logs and the runtime profile at the end of __fetch_results(). The default
behavior remains unchanged.

Skip COMPUTE STATS for functional_kudu.alltypesagg and
functional_kudu.manynulls because it is invalid to run COMPUTE STATS
over a view.

Customized hive-site.xml to set datanucleus.connectionPool.maxPoolSize
to 30 and hikaricp.connectionTimeout to 60000 ms. Also set hive.log.dir
to ${IMPALA_CLUSTER_LOGS_DIR}/hive.

Testing:
Repeatedly run compute-table-stats.sh from a cold state and confirm
that no errors occur. This is the script to do so from an active
minicluster:

cd $IMPALA_HOME
./bin/start-impala-cluster.py --kill
./testdata/bin/kill-hive-server.sh
./testdata/bin/run-hive-server.sh
./bin/start-impala-cluster.py
./testdata/bin/compute-table-stats.sh > /tmp/compute-stats.txt 2>&1
grep error /tmp/compute-stats.txt

Core tests ran and passed.

Change-Id: I1ebf02f95b957e7dda3a30622b87e8fca3197699
Reviewed-on: http://gerrit.cloudera.org:8080/22231
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-08 07:49:31 +00:00
Riza Suminto
617e99981e IMPALA-13086: Lower AggregationNode estimate using stats predicate
NDV of a grouping column can be reduced if there is a predicate over
that column. If the predicate is a constant equality predicate or
is-null predicate, then the NDV must be equal to 1. If the predicate is
a simple in-list predicate, the NDV must be the number of items in the
list.
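
The reduction rules can be sketched as follows; the predicate encoding
is hypothetical:

```python
def reduce_ndv(ndv, predicate):
    # Cap a grouping column's NDV based on a simple predicate on it:
    # a constant equality or IS NULL predicate pins it to 1, an IN
    # list caps it at the list length.
    kind, args = predicate
    if kind in ("eq_const", "is_null"):
        return min(ndv, 1)
    if kind == "in_list":
        return min(ndv, len(args))
    return ndv
```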

This patch adds such consideration by leveraging the existing analysis
in HdfsScanNode.computeStatsTupleAndConjuncts(). It records the first
ScanNode/UnionNode that produces a TupleId in the Analyzer, registered
during Init()/computeStats() of the PlanNode. At the AggregationNode, it
looks up the PlanNode that produces a TupleId. If the origin PlanNode is
an HdfsScanNode, it analyzes whether any grouping expression is listed
in statsOriginalConjuncts_ and reduces it accordingly. If
HdfsScanNode.computeStatsTupleAndConjuncts() can be made generic for all
ScanNode implementations in the future, we can apply this same analysis
to all kinds of ScanNodes and achieve the same reduction.

In terms of tracking producer PlanNode, this patch made an exception for
Iceberg PlanNodes that handle positional or equality deletion. In that
scenario, it is possible to have two ScanNodes sharing the same TupleId
to force UnionNode passthrough. Therefore, the UnionNode will be
acknowledged as the first producer of that TupleId.

This patch also removes some redundant operations in HdfsScanNode and
fixes a typo in the method name MathUtil.saturatingMultiplyCardinalities().

Testing:
- Add new test cases in aggregation.test
- Pass core tests.

Change-Id: Ia840d68f1c4f126d4e928461ec5c44545dbf25f8
Reviewed-on: http://gerrit.cloudera.org:8080/22032
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-07 23:13:12 +00:00
stiga-huang
7bdd7e1010 IMPALA-13518: Show target name of COMMIT_TXN events in logs
The message of a COMMIT_TXN event just contains the transaction id
(txnid). In the logs of top-10 expensive events and top-10 targets that
contribute to the lag, we show the target as CLUSTER_WIDE.

However, when processing the events, catalogd actually finds
the involved tables and reloads them. It'd be helpful to show the names
of the tables involved in the transaction.

This patch overrides the getTargetName() method in CommitTxnEvent to
show the table names. They are collected after the event is processed.

Tests:
 - Add tests in MetastoreEventsProcessorTest

Change-Id: I4a7cb5e716453290866a4c3e74c0d269f621144f
Reviewed-on: http://gerrit.cloudera.org:8080/22036
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Sai Hemanth Gantasala <saihemanth@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
2025-01-07 17:37:18 +00:00
Riza Suminto
ce6be49c08 IMPALA-13465: Trace TupleId further to reduce Agg cardinality
IMPALA-13405 does tuple analysis to lower AggregationNode cardinality.
It begins by focusing on the simple column SlotRef, but we can improve
this further by tracing the origin TupleId across views and intermediate
aggregation tuples. This patch implements deeper TupleId tracing to
achieve further cardinality reduction. With this deeper TupleId
resolution, it is possible now to narrow down the TupleId search across
children ScanNodes and UnionNodes only.

Note that this optimization is still limited to running ONLY IF there
are at least two grouping expressions that refer to the same TupleId.
There would be a benefit to running the same optimization even when
there is only a single expression per TupleId, but we defer that work
until we can provide a faster TupleId-to-PlanNode mapping without
repeating the plan tree traversal.

This patch also makes tuple-based reduction more conservative by capping
it at the input cardinality/limit, or using the output cardinality if
the producer node is a UnionNode or has hard estimates.
aggInputCardinality is still indirectly influenced by the predicates and
limits of child nodes.

The following PlannerTests (under
testdata/workloads/functional-planner/queries/PlannerTest/) revert their
cardinality estimation to their state prior to IMPALA-13405:
tpcds/tpcds-q19.test
tpcds/tpcds-q55.test
tpcds_cpu_cost/tpcds-q03.test
tpcds_cpu_cost/tpcds-q31.test
tpcds_cpu_cost/tpcds-q47.test
tpcds_cpu_cost/tpcds-q52.test
tpcds_cpu_cost/tpcds-q57.test
tpcds_cpu_cost/tpcds-q89.test

Several other planner tests have increased cardinality after this
change, but the numbers are still below pre-IMPALA-13405.

Removed the nested-view planner test in agg-node-max-mem-estimate.test
that was first added by IMPALA-13405. The same test has been duplicated
by IMPALA-13480 in aggregation.test.

Testing:
- Pass core tests.

Change-Id: I11f59ccc469c24c1800abaad3774c56190306944
Reviewed-on: http://gerrit.cloudera.org:8080/21955
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-07 00:43:22 +00:00
Riza Suminto
23edbde7c7 IMPALA-13637: Add ENABLE_TUPLE_ANALYSIS_IN_AGGREGATE option
IMPALA-13405 adds a new tuple-analysis algorithm in AggregationNode to
lower cardinality estimation when planning multi-column grouping. This
patch adds a query option ENABLE_TUPLE_ANALYSIS_IN_AGGREGATE that allows
users to enable/disable the algorithm if necessary. Default is True.

Testing:
- Add testAggregationNoTupleAnalysis.
  This test is based on TpcdsPlannerTest#testQ19 but with
  ENABLE_TUPLE_ANALYSIS_IN_AGGREGATE set to false.

Change-Id: Iabd8daa3d9414fc33d232643014042dc20530514
Reviewed-on: http://gerrit.cloudera.org:8080/22294
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-06 21:32:45 +00:00
Sai Hemanth Gantasala
e7c97439d1 IMPALA-12141: EP shouldn't fail while releasing write lock if the lock
is not held previously

Without IMPALA-12832, the Event Processor (EP) goes into an error state
when there is an issue while obtaining a table write lock, because the
finally-clause of releaseWriteLock() is always invoked even if the lock
is not held by the current thread. This patch addresses the problem by
checking whether the table's write lock is held before releasing it.

Note: With IMPALA-12832, the EP invalidates the table when an error is
encountered, which is still an overhead. With this patch the EP neither
goes into an error state nor invalidates the table when this issue is
encountered.

Testing:
- Added an end-to-end test to verify the fix.

Change-Id: Ib2e4c965796dd515ab8549efa616f72510ca447f
Reviewed-on: http://gerrit.cloudera.org:8080/22080
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-06 20:42:07 +00:00
Xuebin Su
731c16c73a IMPALA-13154: Update metrics when loading an HDFS table
Previously, some table metrics, such as the estimated memory usage
and the number of files, were only updated when a "FULL" Thrift object
of the table was requested. As a result, if a user ran a DESCRIBE
command on a table and then tried to find the table on the Top-N page
of the web UI, the user would not find it.

This patch fixes the issue by updating the table metrics as soon as
an HDFS table is loaded. With this, no matter which Thrift object type
is requested for the table, the metrics will always be updated and
displayed on the web UI.

Testing:
- Added two custom cluster tests in test_web_pages.py to make sure that
  table stats can be viewed on the web UI after DESCRIBE, for both
  legacy and local catalog modes.

Change-Id: I6e2eb503b0f61b1e6403058bc5dc78d721e7e940
Reviewed-on: http://gerrit.cloudera.org:8080/22014
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-02 19:31:18 +00:00
Peter Rozsa
19110b490d IMPALA-13362: Implement WHEN NOT MATCHED BY SOURCE syntax for MERGE statement
This change adds support for a new MERGE clause that covers the
condition where the source statement's rows do not match the target
table's rows. Example: MERGE INTO target t USING source s ON t.id = s.id
WHEN NOT MATCHED BY SOURCE THEN UPDATE SET t.column = "a";

This change also adds support for using WHEN NOT MATCHED BY TARGET
explicitly; this is equivalent to WHEN NOT MATCHED.
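
A hedged SQL sketch of the equivalence described above (table and column
names are hypothetical):

```sql
-- WHEN NOT MATCHED BY TARGET is the explicit spelling of WHEN NOT MATCHED:
-- source rows with no matching target row are inserted into the target.
MERGE INTO target t USING source s ON t.id = s.id
WHEN NOT MATCHED BY TARGET THEN INSERT VALUES (s.id, s.column);
```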

Tests:
 - Parser tests for the new language elements.
 - Analyzer and planner test for WHEN NOT MATCHED BY SOURCE/TARGET
   clauses.
 - E2E tests for WHEN NOT MATCHED BY SOURCE clause.

Change-Id: Ia0e0607682a616ef6ad9eccf499dc0c5c9278c5f
Reviewed-on: http://gerrit.cloudera.org:8080/21988
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-02 15:02:20 +00:00
Andrew Sherman
6d715fe7dc IMPALA-13596: Add warnings and exceptions to reading of fair-scheduler file.
The fair-scheduler file contains part of the configuration for Admission
Control. This change adds some better error handling to the parsing of
this file. Where it is safe to do so, new exceptions are thrown; this
will cause Impala to refuse to start. This is consistent with other
serious configuration errors. Where new exceptions might cause problems
with existing configurations, or for less dangerous faults, new warnings
are written to the server log.

For the recently added User Quota configuration (IMPALA-12345) throw an
exception when a duplicate snippet of configuration is found.

New warning log messages are added for these cases:
- when a user quota at the leaf level is completely ignored because of
  a user quota at the root level
- when there is no user ACL on a leaf level queue. This prevents any
  queries from being submitted to the queue.

Change-Id: Idcd50442ce16e7c4346c6da1624216d694f6f44d
Reviewed-on: http://gerrit.cloudera.org:8080/22209
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-12-31 22:13:44 +00:00
Riza Suminto
b1628d8644 IMPALA-13622: Fix negative cardinality bug in AggregationNode.java
An incomplete COMPUTE STATS during data loading revealed a bug in
AggregationNode.java where estimateNumGroups() can return a value less
than -1.

This patch fixes the bug by implementing
PlanNode.smallestValidCardinality() and
MathUtil.saturatingMultiplyCardinalities(). Both functions validate that
their arguments are valid cardinality numbers.
smallestValidCardinality() correctly compares two cardinality numbers
and returns the smallest valid one. It generalizes and replaces the
static function PlanNode.capCardinalityAtLimit().
saturatingMultiplyCardinalities() adds validation and normalization on
top of MathUtil.saturatingMultiply().

The tuple-based estimation logic from IMPALA-13405 is also reordered so
that negative estimates are handled properly.

Testing:
- Added more preconditions in AggregationNode.java.
- Added CardinalityTest.testSmallestValidCardinality and
  MathUtilTest.testSaturatingMultiplyCardinality.
- Added test in resource-requirements.test that will consistently fail
  without this fix.
- Pass testResourceRequirement.

Change-Id: Ib862a010b2094daa2cbdd5d555e46443009672ad
Reviewed-on: http://gerrit.cloudera.org:8080/22235
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-12-31 07:36:19 +00:00
Riza Suminto
818057b875 IMPALA-13526: Fix Agg node creation order in DistributedPlanner
Within DistributedPlanner.java, there are several places where the
planner needs to insert an extra merge aggregation node. This requires
transferring HAVING conjuncts from the preaggregation node to the merge
aggregation node, unsetting the limit, and recomputing the stats of the
preaggregation node. However, the stats recomputation was not done
consistently, and an inefficient recomputation could occur.

This patch fixes the AggregationNode creation order in
DistributedPlanner.java so that stats recomputation is done consistently
and efficiently.

Testing:
- Pass core tests.

Change-Id: Ica8227fdc46a1ef59bef5ae5424ba3907827411d
Reviewed-on: http://gerrit.cloudera.org:8080/22046
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2024-12-18 23:15:49 +00:00
Riza Suminto
2828e47371 IMPALA-13480: VALIDATE_CARDINALITY in some aggregation tests
This patch enables the VALIDATE_CARDINALITY test option in several
planner tests that touch the aggregation node. Enabling it revealed
three bugs.

First, in IMPALA-13405, the cardinality estimate of the MERGE phase
aggregation was not capped against the output cardinality of the
EXCHANGE node. This patch fixes it by adding such a cap.

Second, the tuple-based optimization from IMPALA-13405 can cause
cardinality underestimation if a HAVING predicate exists. This is due to
the default selectivity of 10% applied for each HAVING predicate. This
patch skips the tuple-based optimization if AggregationNode.conjuncts_
is ever non-empty. It stays skipped on stats recomputation, even if
conjuncts_ is transferred into the next merge AggregationNode above in
the plan. Skipping the optimization causes the following PlannerTests
(under testdata/workloads/functional-planner/queries/PlannerTest/) to
revert their cardinality estimation to their state prior to
IMPALA-13405:
- tpcds/tpcds-q39a.test
- tpcds/tpcds-q39b.test
- tpcds_cpu_cost/tpcds-q39a.test
- tpcds_cpu_cost/tpcds-q39b.test
In the future, we should consider raising the default selectivity for
HAVING predicates and undoing this skipping logic (IMPALA-13542).

Third, a stats recomputation is missing after the conjunct transfer in
multi-phase aggregation. This will be fixed separately by IMPALA-13526.

Testing:
- Enable cardinality validation in testMultipleDistinct*
- Update aggregation.test to reflect current PlannerTest output.
  Added some test cases in aggregation.test.
- Run and pass TpcdsPlannerTest and TpcdsCpuPlannerTest.
- Selectively run some more planner tests that touch AggregationNode and
  pass them.

Change-Id: Iadb4af9fd65fdb85b66fae1e403ccec8ca5eb102
Reviewed-on: http://gerrit.cloudera.org:8080/22184
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-12-13 20:26:05 +00:00
Pranav Lodha
907c1738a0 IMPALA-12789: Fix unit-test code JdbcDataSourceTest.java
The unit test `JdbcDataSourceTest.java` was originally
implemented using the H2 database, which is no longer
available in Impala's environment. The test code was
also outdated and erroneous.

This commit fixes the failure of JdbcDataSourceTest.java and rewrites
it against Postgres, which ensures compatibility with Impala's current
environment and aligns with the JDBC and external data source APIs.
Please note, this test is moved to the fe folder to fix the
"BackendConfig instance not initialized" error.

To test this file, run the following command:
pushd fe && mvn -fae test -Dtest=JdbcDataSourceTest

Please note that the tests in JdbcDataSourceTest have a dependency on
previous tests, so individual tests cannot be run separately for this
class.

Change-Id: Ie07173d256d73c88f5a6c041f087db16b6ff3127
Reviewed-on: http://gerrit.cloudera.org:8080/21805
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-12-12 11:34:45 +00:00
Zoltan Borok-Nagy
d086babdbd IMPALA-13598: OPTIMIZE redundantly accumulates memory in HDFS WRITER
When OptimizeStmt created the table sink it didn't set
'inputIsClustered' to true. Therefore HdfsTableSink expected
random input and kept the output writers open for every partition,
which resulted in high memory consumption and potentially an
OOM error when the number of partitions is high.

Since we actually sort the rows before the sink, we can set
'inputIsClustered' to true, which means HdfsTableSink can write
files one by one: whenever it gets a row that belongs to a new
partition, it knows it can close the current output writer and open a
new one.

Testing:
 * added e2e test

Change-Id: I8d451c50c4b6dff9433ab105493051bee106bc63
Reviewed-on: http://gerrit.cloudera.org:8080/22192
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-12-11 22:19:10 +00:00
Andrew Sherman
2280c1362e IMPALA-12943: Document Admission Control User Quotas.
Document the feature introduced in IMPALA-12345. Add a few more tests to
the QuotaExamples test which demonstrate the examples used in the
docs.

Clarify in docs and code the behavior when a user is a member of more
than one group for which there are rules. In this case the least
restrictive rule applies.

Also document the '--max_hs2_sessions_per_user' flag introduced in
IMPALA-12264.

Change-Id: I82e044adb072a463a1e4f74da71c8d7d48292970
Reviewed-on: http://gerrit.cloudera.org:8080/22100
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-12-11 02:18:18 +00:00
Mihaly Szjatinya
74c192c969 IMPALA-13544: Addendum: fixed assert message.
Change-Id: I77a0c8b0612845596f7c0e153453d3f45b5efb61
Reviewed-on: http://gerrit.cloudera.org:8080/22190
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-12-10 19:32:13 +00:00
Mihaly Szjatinya
79eb48e9f6 IMPALA-13544: Expose TRANSLATED_TO_EXTERNAL property in SHOW CREATE
TABLE

Exposes the TRANSLATED_TO_EXTERNAL property from HMS in the result of a
SHOW CREATE TABLE request by removing it from ToSqlUtils' hidden
properties list.

Change-Id: I0048a041d50f5e520f5286a613a428393397bc4d
Reviewed-on: http://gerrit.cloudera.org:8080/22108
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-12-09 15:07:45 +00:00
Riza Suminto
8e71f5ec86 IMPALA-13535: Add script to restore stats on PlannerTest
Impala has several PlannerTests that validate against the EXTENDED
profile and validate cardinality. At the EXTENDED level, the profile
displays stored table stats from HMS such as 'numRows' and 'totalSize',
which can vary between data loads. These stats are not validated by
PlannerTest, but frequent changes to these lines can disturb the code
review process because they are mostly noise.

This patch provides a python script, restore-stats-on-planner-tests.py,
to fix the table stats information in selected .test files. The test
files to check and the fixed table stats are declared inside the script.
It currently focuses on tests under
functional-planner/queries/PlannerTest/tpcds/ and some that test against
the tpcds_partitioned_parquet_snap table. critique-gerrit-review.py is
updated to run with python3, trigger restore-stats-on-planner-tests.py,
and warn if any unnecessary table stats change is detected.

This patch also fixes the table size for tests under
functional-planner/queries/PlannerTest/tpcds_cpu_cost/ because all tests
there run with synthetic stats declared in stats-3TB.json. Before the
patch, the table stats printed in the plan were the real stats from HMS.
After this patch, the displayed table stats are calculated from
stats-3TB.json. See IMPALA-12726 for more details on large-scale planner
test simulation.

Testing:
- Manually ran the script and confirmed that the stats lines are
  replaced correctly.
- Ran the affected PlannerTests and all passed.

Change-Id: I27bab7cee93880cd59f01b9c2d1614dfcabdc682
Reviewed-on: http://gerrit.cloudera.org:8080/22045
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-12-03 03:03:26 +00:00
Mihaly Szjatinya
81f2673883 IMPALA-889: Add trim() function matching ANSI SQL definition
As agreed in the JIRA discussions, this patch extends the existing TRIM
functionality with support for the SQL-standard TRIM-FROM syntax:
TRIM({[LEADING / TRAILING / BOTH] | [STRING characters]} FROM expr).
It is implemented based on the existing LTRIM / RTRIM / BTRIM family of
functions prepared earlier in IMPALA-6059 and extended for UTF-8 in
IMPALA-12718. It is also partly based on the abandoned PR
https://gerrit.cloudera.org/#/c/4474 and the similar EXTRACT-FROM
functionality from
https://github.com/apache/impala/commit/543fa73f3a846f0e4527514c993cb0985912b06c.

Supported syntaxes:
Syntax #1 TRIM(<where> FROM <string>);
Syntax #2 TRIM(<charset> FROM <string>);
Syntax #3 TRIM(<where> <charset> FROM <string>);

"where": Case-insensitive trim direction. Valid options are "leading",
  "trailing", and "both". "leading" means trimming characters from the
  start; "trailing" means trimming characters from the end; "both" means
  trimming characters from both sides. For Syntax #2, since no "where"
  is specified, the option "both" is implied by default.

"charset": Case-sensitive characters to be removed. This argument is
  treated as a set of characters to remove. The occurrence order of
  each character doesn't matter and duplicated instances of the same
  character are ignored. A NULL argument implies " " (standard space)
  by default. An empty argument ("" or '') makes TRIM return the string
  untouched. For Syntax #1, since no "charset" is specified, it trims
  " " (standard space) by default.

"string": Case-sensitive target string to trim. This argument can be
  NULL.
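
A few hedged examples of the three syntaxes (results follow the
LTRIM / RTRIM / BTRIM semantics described above):

```sql
SELECT trim(leading FROM '  abc  ');     -- Syntax #1: 'abc  '
SELECT trim('xy' FROM 'xyabcyx');        -- Syntax #2: 'abc'
SELECT trim(trailing 'x' FROM 'abcxx');  -- Syntax #3: 'abc'
```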

The UTF8_MODE query option is honored by TRIM-FROM, similarly to
existing TRIM().

UTF8_TRIM-FROM can be used to force UTF8 mode regardless of the query
option.

Design Notes:
1. No-BE. Since the existing LTRIM / RTRIM / BTRIM functions fully cover
all needed use-cases, no backend logic is required. This differs from
similar EXTRACT-FROM.

2. Syntax wrapper. TrimFromExpr class was introduced as a syntax
wrapper around FunctionCallExpr, which instantiates one of the regular
LTRIM / RTRIM / BTRIM functions. TrimFromExpr's role is to maintain
the integrity of the "phantom" TRIM-FROM built-in function.

3. No TRIM keyword. Following EXTRACT-FROM, no "TRIM" keyword was
added to the language. Although a keyword would generally allow easier
and better parsing, on the negative side it restricts the token's usage
in general contexts. However, leading/trailing/both, which were
previously reserved words, are now added as keywords so they can be
used without escaping.

Change-Id: I3c4fa6d0d8d0684c4b6d8dac8fd531d205e4f7b4
Reviewed-on: http://gerrit.cloudera.org:8080/21825
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
2024-12-02 15:15:15 +00:00
Andrew Sherman
8416432cc3 IMPALA-13560: Second attempt at fixing test_admission_controller_with_quota_configs test.
The test test_admission_controller_with_quota_configs() is designed to
be a similar test to test_admission_controller_with_configs() which
uses the 'queueB' queue. The newer test uses a newly added
queue 'queueF'. Because Admission Control configuration is split across
two files, and because of user stupidity, the queue timeout
configuration for 'queueB' was not copied when the new test was
created. This causes queued queries to be timed out while waiting for
admission, which confuses the test.

Set pool-queue-timeout-ms.root to 600000 for queueF in
llama-site-test2.xml.

Change-Id: I1378cd4e42ed5629b92b1c16dd17d4d16ec4a19d
Reviewed-on: http://gerrit.cloudera.org:8080/22126
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-26 22:46:18 +00:00
Gabor Kaszab
ce35e81bca IMPALA-11265: Part2: Store Iceberg file descriptors in encoded format
The file descriptors in HdfsPartition are cached as byte arrays to keep
the memory footprint low. They are transformed into actual
FileDescriptor objects once queried.
This patch changes IcebergContentFileStore to similarly use byte arrays
as an internal representation for file descriptors. Note, file
descriptors for Iceberg tables have 2 components: one is the same as in
HdfsPartition and the other stores Iceberg specific file metadata in an
additional byte array.

Measurements and observations:
 - I have a test table that has 110k data files. For this table the JVM
   memory usage in the catalogd was reduced from 80MB to 65MB.
 - Both HdfsPartition.FileDescriptor and IcebergContentFileStore use
   flatbuffers and in turn byte arrays to represent file descriptors and
   these byte arrays are shared between these 2 places. As a result
   there is no redundancy in storing the file descriptors both for the
   Iceberg and the Hdfs table.
 - There is no measurable difference in planning times with this patch.

Change-Id: I9d7794df999bdaf118158eace26cea610f911c0a
Reviewed-on: http://gerrit.cloudera.org:8080/21869
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
2024-11-25 14:14:59 +00:00
Daniel Becker
e5919f13f9 IMPALA-13370: Read Puffin stats from metadata.json property if available
When Trino writes Puffin stats for a column, it includes the NDV as a
property (with key "ndv") in the "statistics" section of the
metadata.json file, in addition to the Theta sketch in the Puffin file.
When we are only reading the stats and not writing/updating them, it is
enough to read this property if it is present.

After this change, Impala only opens and reads a Puffin stats file if it
contains stats for at least one column for which the "ndv" property is
not set in the metadata.json file.

Testing:
 - added a test in test_iceberg_with_puffin.py that verifies that the
   Puffin stats file is not read if the metadata.json file contains
   the NDV property. It uses the newly added stats file with corrupt
   datasketches: 'metadata_ndv_ok_sketches_corrupt.stats'.

Change-Id: I5e92056ce97c4849742db6309562af3b575f647b
Reviewed-on: http://gerrit.cloudera.org:8080/21959
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-23 16:04:06 +00:00
Andrew Sherman
201e7becad IMPALA-13560: Fix test_admission_controller_with_quota_configs test.
The intention of test_admission_controller_with_quota_configs() is to
run a workload with a variety of outcomes in a pool that has Admission
Control User Quotas Configured. The idea was that the User Quotas
configuration would not affect the workload that is run by
run_admission_test(). The configuration for 'queueF' limits the number
of concurrent queries that can be run by any user to 30. In the test
there is only one user, and the number of queries that are run is 50,
so there is potential for the User Quotas configuration to affect the
operation of the test. Fix this by bumping the Quota limit to 50.

TESTING

I ran tests in a similar environment to that where failures were
observed. Without the fix I saw a failure, and with the fix there were
no failures. This isn't sufficient to prove this fix is all that is
needed, but the change is safe and isolated.

Change-Id: Ie2cc81a5b95d07154b73d32daf67617c79283ac8
Reviewed-on: http://gerrit.cloudera.org:8080/22096
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-22 09:01:49 +00:00
Andrew Sherman
de6b902581 IMPALA-12345: Add user quotas to Admission Control
Allow administrators to configure per user limits on queries that can
run in the Impala system.

In order to do this, there are two parts. Firstly we must track the
total counts of queries in the system on a per-user basis. Secondly
there must be a user model that allows rules that control per-user
limits on the number of queries that can be run.

In a Kerberos environment the user names that are used for both the user
model and at runtime are short user names, e.g. testuser when the
Kerberos principal is testuser/scm@EXAMPLE.COM

TPoolStats (the data that is shared between Admission Control instances)
is extended to include a map from user name to a count of queries
running. This (along with some derived data structures) is updated when
queries are queued and when they are released from Admission Control.
This lifecycle is slightly different from other TPoolStats data which
usually tracks data about queries that are running. Queries can be
rejected because of user quotas at submission time. This is done for
two reasons: (1) queries can only be admitted from the front of the
queue and we do not want to block other queries due to quotas, and
(2) it is easy for users to understand what is going on when queries
are rejected at submission time.

Note that when running in configurations without an Admission Daemon
then Admission Control does not have perfect information about the
system and over-admission is possible for User-Level Admission Quotas
in the same way that it is for other Admission Control limits.

The User Model is implemented by extending the format of the
fair-scheduler.xml file. The rules controlling the per-user limits are
specified in terms of user or group names.

Two new elements ‘userQueryLimit’ and ‘groupQueryLimit’ can be added to
the fair-scheduler.xml file. These elements can be placed on the root
configuration, which applies to all pools, or the pool configuration.
The ‘userQueryLimit’ element has 2 child elements: "user"
and "totalCount". The 'user' element contains the short names of users,
and can be repeated, or have the value "*" for a wildcard name which
matches all users. The ‘groupQueryLimit’ element has 2 child
elements: "group" and "totalCount". The 'group' element contains group
names.
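
A hypothetical fair-scheduler.xml snippet illustrating these elements
(queue names, group names, and the exact nesting are assumptions based
on the description above):

```xml
<allocations>
  <!-- Root-level rule: any user may run at most 20 concurrent queries. -->
  <userQueryLimit>
    <user>*</user>
    <totalCount>20</totalCount>
  </userQueryLimit>
  <queue name="queueA">
    <!-- Pool-level rule: members of group 'dev' are capped at 5 here. -->
    <groupQueryLimit>
      <group>dev</group>
      <totalCount>5</totalCount>
    </groupQueryLimit>
  </queue>
</allocations>
```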

The root-level rules and pool-level rules must both pass for a new
query to be queued. The rules dictate a maximum number of queries that
can be run by a user. When evaluating rules at either the root level or
the pool level, once a rule matches a user, no further evaluation is
done.

To support reading the ‘userQueryLimit’ and ‘groupQueryLimit’ fields the
RequestPoolService is enhanced.

If user quotas are enabled for a pool then a list of the users with
running or queued queries in that pool is visible on the coordinator
webui admission control page.

More comprehensive documentation of the user model will be provided in
IMPALA-12943.

TESTING

New end-to-end tests are added to test_admission_controller.py, and
admission-controller-test is extended to provide unit tests for the
user model.

Change-Id: I4c33f3f2427db57fb9b6c593a4b22d5029549b41
Reviewed-on: http://gerrit.cloudera.org:8080/21616
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-16 06:38:38 +00:00
Joe McDonnell
68c42a5d66 IMPALA-13179: Make non-deterministic functions ineligible for tuple caching
Non-deterministic functions should make a location ineligible
for caching. Unlike existing definitions of non-determinism
like FunctionCallExpr.isNondeterministicBuiltinFn(),
the non-determinism needs to apply over time and across query
boundaries, so it is a broader list of functions.

The following are considered non-deterministic in this change:
 1. Random functions like rand/random/uuid
 2. Current time functions like now/current_timestamp
 3. Session/system information like current_user/pid/coordinator
 4. AI functions
 5. UDFs

With enable_expr_rewrites=true, constant folding can replace
some of these with a single constant (e.g. now() becomes a specific
timestamp). This is not a correctness problem for tuple caching,
because the specific value is incorporated into the cache key.

Testing:
 - Added test cases to TupleCacheTest

Change-Id: I9601dba87b3c8f24cbe42eca0d8070db42b50488
Reviewed-on: http://gerrit.cloudera.org:8080/22011
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-15 03:01:53 +00:00
Fang-Yu Rao
4255926b12 IMPALA-12554: Create one Ranger policy for multi-column GRANT
This patch makes Impala create only one Ranger policy for the GRANT
statement when there are multiple columns specified to reduce the number
of policies created on the Ranger server.

Note that this patch relies on RANGER-4585 and RANGER-4638.

Testing:
 - Manually verified that Impala's catalog daemon only sends one
   GrantRevokeRequest to the Ranger plug-in and that the value of the
   key 'column' is a comma-separated list of column names involved in
   the GRANT statement.
 - Added an end-to-end test to verify only one Ranger policy will be
   created in a multi-column GRANT statement.

Change-Id: I2b0ebba256c7135b4b0d2160856202292d720c6d
Reviewed-on: http://gerrit.cloudera.org:8080/21940
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-08 11:48:45 +00:00
Joe McDonnell
d0423a83ef IMPALA-13495: Make exceptions from the Calcite planner easier to classify
This makes several changes to the Calcite planner to improve the
generated exceptions when there are errors:
1. When the Calcite parser produces SqlParseException, this is converted
   to Impala's regular ParseException.
2. When the Calcite validation fails, it produces a CalciteContextException,
   which is a wrapper around the real cause. This converts these validation
   errors into AnalysisExceptions.
3. This produces UnsupportedFeatureException for non-HDFS table types like
   Kudu, HBase, Iceberg, and views. It also produces
   UnsupportedFeatureException for HDFS tables with complex types (which
   otherwise would hit ClassCastException).
4. This changes exception handling in CalciteJniFrontend.java so it does
   not convert exceptions to InternalException. The JNI code will print
   the stacktrace for exceptions, so this drops the existing call to
   print the exception stack trace.

Testing:
 - Ran some end-to-end tests with a mode that continues past failures
   and examined the output.

Change-Id: I6702ceac1d1d67c3d82ec357d938f12a6cf1c828
Reviewed-on: http://gerrit.cloudera.org:8080/21989
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2024-11-07 23:25:13 +00:00
Steve Carlin
dea4e99be6 IMPALA-13494: Calcite planner group_concat with distinct failing
The following query is failing.

select sum(len_orderkey), sum(len_comment)
from (
  select
    length(group_concat(distinct cast(l_orderkey as string))) len_orderkey,
    length(group_concat(distinct(l_comment))) len_comment
    from tpch.lineitem
    group by l_comment
  ) v

There is code in AggregationFunction for group_concat that ignores an
implicit cast. The 'isImplicitCast_' member was being used directly in
this function, but the variable is overridden by the isImplicitCast()
method in the Calcite planner. A small change was needed to call the
isImplicitCast() method rather than use the member variable.

Change-Id: Idec41597b40a533bc0774b4ff2ab5059c7f324e2
Reviewed-on: http://gerrit.cloudera.org:8080/22025
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-07 20:58:32 +00:00
Riza Suminto
24581704d8 IMPALA-13512: Print .test file name if PlannerTest fail
This patch improves PlannerTest by printing the path to the .test file
that failed. It also skips printing the VERBOSE plan if
PlannerTestOption.EXTENDED_EXPLAIN is specified, since the EXTENDED
level already contains sufficient details, including tuples, sizes, and
cardinality.

This patch also changes the target path for saving the updated
end-to-end test file when the --update_results parameter is set. For
example:

Before:
$EE_TEST_LOGS_DIR/tpcds-decimal_v2-q98.test

After:
$EE_TEST_LOGS_DIR/impala_updated_results/tpcds/queries/tpcds-decimal_v2-q98.test

It also ensures that the updated test file ends with a newline.

Testing:
- Manually ran a modified PlannerTest that fails with the
  EXTENDED_EXPLAIN test option set. Verified that the .test file name
  is printed and the VERBOSE plan is not.
- Manually ran TestTpcdsDecimalV2Query with the --update_results
  parameter and confirmed that the updated test file path is correct.

Change-Id: I5e15af93d9016d78ac0575c433146c8513a11949
Reviewed-on: http://gerrit.cloudera.org:8080/22030
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-07 10:10:59 +00:00