1515 Commits

Author SHA1 Message Date
Joe McDonnell
c5a0ec8bdf IMPALA-11980 (part 1): Put all thrift-generated python code into the impala_thrift_gen package
This puts all of the thrift-generated python code into the
impala_thrift_gen package. This is similar to what Impyla
does for its thrift-generated python code, except that it
uses the impala_thrift_gen package rather than impala._thrift_gen.
This is a preparatory patch for fixing the absolute import
issues.

This patches all of the thrift files to add the python namespace.
It also includes code to apply the same patching to the thirdparty
thrift files (hive_metastore.thrift, fb303.thrift).

Putting all the generated python into a package makes it easier
to understand where imports come from. When the
subsequent change rearranges the shell code, the thrift-generated
code can stay in a separate directory.

This uses isort to sort the imports for the affected Python files
with the provided .isort.cfg file. This also adds an impala-isort
shell script to make it easy to run.

Testing:
 - Ran a core job

Change-Id: Ie2927f22c7257aa38a78084efe5bd76d566493c0
Reviewed-on: http://gerrit.cloudera.org:8080/20169
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-04-15 17:03:02 +00:00
Steve Carlin
706e1f026c IMPALA-13657: Connect Calcite planner to Impala Frontend framework
This commit adds the plumbing created by IMPALA-13653. The Calcite
planner is now called from Impala's Frontend code via 4 hooks which
are:

- CalciteCompilerFactory: the factory class that creates
    the implementations of the parser, analysis, and single node
    planner hooks.
- CalciteParsedStatement: the class which holds the Calcite SqlNode
    AST.
- CalciteAnalysisDriver: the class that does the validation of the
    SqlNode AST.
- CalciteSingleNodePlanner: the class that converts the AST to a
    logical plan, optimizes it, and converts it into an Impala
    PlanNode physical plan.

To run on Calcite, one needs to do two things:

1) set the USE_CALCITE_PLANNER env variable to true before starting
the cluster. This adds the jar file to the classpath in
bin/setclasspath.sh; it is not included there by default at the time
of this commit.
2) set the use_calcite_planner query option to true.
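
A minimal sketch of these two steps:

  export USE_CALCITE_PLANNER=true   # before starting the cluster
  SET use_calcite_planner=true;     # query option, set in the session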

This commit makes the CalciteJniFrontend class obsolete. Once the
test cases are moved out of there, that class and others can be
removed.

Change-Id: I3b30571beb797ede827ef4d794b8daefb130ccb1
Reviewed-on: http://gerrit.cloudera.org:8080/22319
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-04-09 23:55:15 +00:00
Zoltan Borok-Nagy
bd3486c051 IMPALA-13586: Initial support for Iceberg REST Catalogs
This patch adds initial support for Iceberg REST Catalogs. This means
now it's possible to run an Impala cluster without the Hive Metastore,
and without the Impala CatalogD. Impala Coordinators can directly
connect to an Iceberg REST server and fetch metadata for databases and
tables from there. The support is read-only, i.e. DDL and DML statements
are not supported yet.

This was initially developed in the context of a company Hackathon
program, i.e. it was a team effort that I squashed into a single commit
and polished the code a bit.

The Hackathon team members were:
* Daniel Becker
* Gabor Kaszab
* Kurt Deschler
* Peter Rozsa
* Zoltan Borok-Nagy

The Iceberg REST Catalog support can be configured via a Java properties
file, whose location can be specified via:
 --catalog_config_dir: Directory of configuration files

Currently only one configuration file can be in the directory, as we only
support a single Catalog at a time. The following properties are mandatory
in the config file:
* connector.name=iceberg
* iceberg.catalog.type=rest
* iceberg.rest-catalog.uri

The first two properties can only be 'iceberg' and 'rest' for now; they
are needed for extensibility in the future.
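
A sketch of such a config file (the file name and URI value are
illustrative; only the three properties listed above are mandatory):

  # <catalog_config_dir>/iceberg-rest.properties
  connector.name=iceberg
  iceberg.catalog.type=rest
  iceberg.rest-catalog.uri=http://iceberg-rest-host:8181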

Moreover, Impala Daemons need to specify the following flags to connect
to an Iceberg REST Catalog:
 --use_local_catalog=true
 --catalogd_deployed=false

Testing
* e2e test added to test basic functionality against a custom-built
  Iceberg REST server that delegates to HadoopCatalog under the hood
* Further testing, e.g. Ranger tests are expected in subsequent
  commits

TODO:
* manual testing against Polaris / Lakekeeper; we could add automated
  tests in a later patch

Change-Id: I1722b898b568d2f5689002f2b9bef59320cb088c
Reviewed-on: http://gerrit.cloudera.org:8080/22353
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-04-02 20:04:12 +00:00
Daniel Becker
d714798904 IMPALA-13609: Store Iceberg snapshot id for COMPUTE STATS
Currently, when COMPUTE STATS is run from Impala, we set the
'impala.lastComputeStatsTime' table property. Iceberg Puffin stats, on
the other hand, store the snapshot id for which the stats were
calculated. Although it is possible to retrieve the timestamp of a
snapshot, comparing these two values is error-prone, e.g. in the
following situation:

 - COMPUTE STATS calculation is running on snapshot N
 - snapshot N+1 is committed at time T
 - COMPUTE STATS finishes and sets 'impala.lastComputeStatsTime' at time
   T + Delta
 - some engine writes Puffin statistics for snapshot N+1

After this, HMS stats will appear to be more recent even though they
were calculated on snapshot N, while we have Puffin stats for snapshot
N+1.

To make comparisons easier, after this change, COMPUTE STATS sets a new
table property, 'impala.computeStatsSnapshotIds'. This property stores
the snapshot id for which stats have been computed, for each column. It
is a comma-separated list of values of the form
"fieldIdRangeStart[-fieldIdRangeEndIncl]:snapshotId". The fieldId part
may be a single value or a contiguous, inclusive range.
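
For example (field ids and snapshot ids are illustrative), a value of

  1-4:6183921407218991873,6:7290184511839022145

means that stats for the columns with field ids 1 through 4 were computed
on the first snapshot, and stats for the column with field id 6 on the
second.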

Storing the snapshot ids on a per-column basis is needed because COMPUTE
STATS can be set to calculate stats for only a subset of the columns,
and then a different subset in a subsequent run. The recency of the
stats will then be different for each column.

Storing the Iceberg field ids instead of column names makes the format
easier to handle as we do not need to take care of escaping special
characters.

The 'impala.computeStatsSnapshotIds' table property is deleted after
DROP STATS.

Note that this change does not yet modify how Impala chooses between
Puffin and HMS stats: that will be done in a separate change.

Testing:
 - Added tests in iceberg-compute-stats.test checking that
   'impala.computeStatsSnapshotIds' is set correctly and is deleted
   after DROP STATS
 - added unit tests in IcebergUtilTest.java that check the parsing and
   serialisation of the table property

Change-Id: Id9998b84c4fd20d1cf5e97a34f3553832ec70ae7
Reviewed-on: http://gerrit.cloudera.org:8080/22339
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-04-01 13:53:43 +00:00
Riza Suminto
7a5adc15ca IMPALA-13333: Limit memory estimation if PlanNode can spill
SortNode, AggregationNode, and HashJoinNode (the build side) can spill
to disk. However, their memory estimation does not consider this
capability and assumes they will hold all rows in memory. This causes
memory overestimation if cardinality is also overestimated. In reality,
query execution on a single host is often subject to a much lower memory
upper bound that it is not allowed to exceed.

This upper-bound is dictated by, but not limited to:
- MEM_LIMIT
- MEM_LIMIT_COORDINATORS
- MEM_LIMIT_EXECUTORS
- MAX_MEM_ESTIMATE_FOR_ADMISSION
- impala.admission-control.max-query-mem-limit.<pool_name>
  from admission control.

This patch adds a SpillableOperator interface that defines an alternative
to either PlanNode.computeNodeResourceProfile() or
DataSink.computeResourceProfile() when a lower memory upper bound can be
derived from the configs mentioned above. This interface is applied
to SortNode, AggregationNode, HashJoinNode, and JoinBuildSink.

The in-memory vs spill-to-disk bias is controlled through the
MEM_ESTIMATE_SCALE_FOR_SPILLING_OPERATOR option, a scale in
[0.0,1.0] that controls the estimated peak memory of query operators that
have spill-to-disk capabilities. Setting the value closer to 1.0 makes the
Planner bias towards keeping as many rows as possible in memory, while
setting the value closer to 0.0 makes the Planner bias towards spilling
rows to disk under memory pressure. Note that lowering
MEM_ESTIMATE_SCALE_FOR_SPILLING_OPERATOR can make a query that was
previously rejected by the Admission Controller become admittable, but it
may also spill to disk more and has a higher risk of exhausting scratch
space than before.
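
For example, to bias the planner halfway between the two extremes (the
value is illustrative):

  SET MEM_ESTIMATE_SCALE_FOR_SPILLING_OPERATOR=0.5;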

There are some caveats to this memory bounding patch:
- It checks if spill-to-disk is enabled in the coordinator, but
  individual backend executors might not have it configured. A mismatch of
  spill-to-disk configs between the coordinator and a backend executor,
  however, is rare and can be considered a misconfiguration.
- It does not check the actual total scratch space available to
  spill-to-disk. However, query execution will be forced to spill anyway
  if memory usage exceeds the three memory configs above. Raising
  MEM_LIMIT / MEM_LIMIT_EXECUTORS option can help increase memory
  estimation and increase the likelihood for the query to get assigned
  to a larger executor group set, which usually has a bigger total
  scratch space.
- The memory bound is divided evenly among all instances of a fragment
  kind on a single host. But in theory, they should be able to share
  and grow their memory usage independently beyond the memory estimate as
  long as the max memory reservation is not set.
- This does not consider other memory-related configs such as
  clamp_query_mem_limit_backend_mem_limit or disable_pool_mem_limits
  flag. But the admission controller will still enforce them if set.

Testing:
- Pass FE and custom cluster tests with core exploration.

Change-Id: I290c4e889d4ab9e921e356f0f55a9c8b11d0854e
Reviewed-on: http://gerrit.cloudera.org:8080/21762
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-03-13 18:38:27 +00:00
Peter Rozsa
167ced7844 IMPALA-13674: Enable MERGE statement for Iceberg tables with equality deletes
This change fixes the delete expression calculation for
IcebergMergeImpl. When an Iceberg table contains equality deletes, the
merge implementation now includes the data sequence number in the result
expressions, as the underlying tuple descriptor also includes it
implicitly. Without including this field, the row evaluation fails
because of the mismatched number of evaluators and slot descriptors.

Tests:
 - manually validated on an Iceberg table that contains equality deletes
 - e2e test added

Change-Id: I60e48e2731a59520373dbb75104d75aae39a94c1
Reviewed-on: http://gerrit.cloudera.org:8080/22423
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-06 19:20:52 +00:00
Zoltan Borok-Nagy
d928815d1a IMPALA-13654: Tolerate missing data files of Iceberg tables
Before this patch we got a TableLoadingException for missing data files.
This left the IcebergTable in an incomplete state in Impala's
memory, so we were not able to do any operation on it.

We should continue table loading in such cases, and only throw an
exception for queries that are about to read the missing data files.

This way ROLLBACK / DROP PARTITION, and some SELECT statements should
still work.

If Impala is running in strict mode via the CatalogD flag
--iceberg_allow_datafiles_in_table_location_only, and an Iceberg table
has data files outside of the table location, we still raise an exception
and leave the table in an unloaded state. To retain this behavior, the
IOException we threw is replaced with a TableLoadingException, which fits
logic errors better anyway.

Testing
 * added e2e tests

Change-Id: If753619d8ee1b30f018e90157ff7bdbe5d7f1525
Reviewed-on: http://gerrit.cloudera.org:8080/22367
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-03 17:50:04 +00:00
Noemi Pap-Takacs
79f51b018f IMPALA-12588: Don't UPDATE rows that already have the desired value
When UPDATEing an Iceberg or Kudu table, we should change as few rows
as possible. In the case of Iceberg tables this means writing as few new
data records and delete records as possible.
Therefore, if rows already have the new values we should just ignore
them. One way to achieve this is to add extra predicates, e.g.:

  UPDATE tbl SET k = 3 WHERE i > 4;
    ==>
  UPDATE tbl SET k = 3 WHERE i > 4 AND k IS DISTINCT FROM 3;

So we won't write new data/delete records for the rows that already have
the desired value.

Explanation on how to create extra predicates to filter out these rows:

If there are multiple assignments in the SET list, we can only skip
updating a row if all the mentioned values are already equal.
If any of the values needs to be updated, the entire row does.
Therefore we can think of the SET list as predicates connected with AND,
and all of them need to be taken into consideration.
To negate this SET list, we have to negate the individual SET
assignments and connect them with OR.
Then this new compound predicate is added to the original WHERE
predicates with an AND (if there were none, a WHERE predicate is created
from it).

                AND
              /     \
      original        OR
 WHERE predicate    /    \
                  !a       OR
                         /    \
                       !b     !c

This simple graph illustrates how the where predicate is rewritten.
(Considering an UPDATE statement that sets 3 columns.)
'!a', '!b' and '!c' are the negations of the individual assignments in
the SET list. So the extended WHERE predicate is:
(original WHERE predicate) AND (!a OR !b OR !c)
To handle NULL values correctly, we use IS DISTINCT FROM instead of
simply negating the assignment with operator '!='.

If the assignments contain UDFs, the result might be inconsistent
because of possible non-deterministic values or state in the UDFs, so
in that case we do not rewrite the WHERE predicate at all.

Evaluating expressions can be expensive, so this optimization
can be limited or switched off entirely using the query option
SKIP_UNNEEDED_UPDATES_COL_LIMIT. By default, there is no filtering
if more than 10 assignments are in the SET list.
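
For example, to keep the rewrite active for wider SET lists (the value is
illustrative; per the default above, SET lists with more than 10
assignments are not filtered):

  SET SKIP_UNNEEDED_UPDATES_COL_LIMIT=20;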

-------------------------------------------------------------------
Some performance measurements on a tpch lineitem table:

- predicates in HASH join, all updates can be skipped
Q1/[Q2] (Similar, but Q2 adds 4 extra items to the SET list):
update t set t.l_suppkey = s.l_suppkey,
[ t.l_partkey=s.l_partkey,
  t.l_quantity=s.l_quantity,
  t.l_returnflag=s.l_returnflag,
  t.l_shipmode=s.l_shipmode ]
from ice_lineitem t join ice_lineitem s
on t.l_orderkey=s.l_orderkey and t.l_linenumber=s.l_linenumber;

- predicates in HASH join, all rows need to be updated
Q3: update t set
 t.l_suppkey = s.l_suppkey,
 t.l_partkey=s.l_partkey,
 t.l_quantity=s.l_quantity,
 t.l_returnflag=s.l_returnflag,
 t.l_shipmode=concat(s.l_shipmode,' ')
 from ice_lineitem t join ice_lineitem s
 on t.l_orderkey=s.l_orderkey and t.l_linenumber=s.l_linenumber;

- predicates pushed down to the scanner, all rows updated
Q4/[Q5] (Similar, but Q5 adds 8 extra items to the SET list):
update ice_lineitem set
[ l_suppkey = l_suppkey + 0,
  l_partkey=l_partkey + 0,
  l_quantity=l_quantity,
  l_returnflag=l_returnflag,
  l_tax = l_tax,
  l_discount= l_discount,
  l_comment = l_comment,
  l_receiptdate = l_receiptdate, ]
l_shipmode=concat(l_shipmode,' ');

+=======+============+==========+======+
| Query | unfiltered | filtered | diff |
+=======+============+==========+======+
| Q1    |       4.1s |     1.9s | -54% |
+-------+------------+----------+------+
| Q2    |       4.2s |     2.1s | -50% |
+-------+------------+----------+------+
| Q3    |       4.3s |     4.7s | +10% |
+-------+------------+----------+------+
| Q4    |       3.0s |     3.0s | +0%  |
+-------+------------+----------+------+
| Q5    |       3.1s |     3.1s | +0%  |
+-------+------------+----------+------+

The results show that in the best case (we can skip all rows)
this change can bring a significant perf improvement of ~50%, since
0 rows were written. See Q1 and Q2.
If the predicates are evaluated in the join operator, but there were
no matches (worst case scenario), we can lose about 10% (Q3).
If all the predicates can be pushed down to the scanners, the change
does not seem to cause a significant difference (~0% in Q4 and Q5)
even if all rows have to be updated.

Testing:
 - Analysis
 - Planner
 - E2E
 - Kudu
 - Iceberg
 - testing the new query option: SKIP_UNNEEDED_UPDATES_COL_LIMIT
Change-Id: I926c80e8110de5a4615a3624a81a330f54317c8b
Reviewed-on: http://gerrit.cloudera.org:8080/22407
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-28 05:52:21 +00:00
stiga-huang
b410a90b47 IMPALA-12152: Add query options to wait for HMS events sync up
It's a common scenario to run Impala queries after the dependent
external changes are done. E.g. running COMPUTE STATS on a table after
Hive/Spark jobs ingest some new data to it. Currently, it's unsafe to
run the Impala queries immediately after the Hive/Spark jobs finish
since EventProcessor might have a long lag in applying the HMS events.
Note that running REFRESH/INVALIDATE on the table can also solve the
problem. But one of the motivations of EventProcessor is to get rid of
such Impala-specific commands.

This patch adds a mechanism to let query planning wait until the
metadata is synced up. Two new query options are added:
 - SYNC_HMS_EVENTS_WAIT_TIME_S configures the timeout in seconds for
   waiting. It's 0 by default, which disables the waiting mechanism.
 - SYNC_HMS_EVENTS_STRICT_MODE controls the behavior if we can't wait
   for metadata to be synced up, e.g. when the waiting times out or
   EventProcessor is in ERROR state. It defaults to false (non-strict
   mode). In the strict mode, the coordinator will fail the query. In the
   non-strict mode, the coordinator will start planning with a warning
   message in the profile (and in client outputs if the client consumes
   the get_log results, e.g. in impala-shell).

Example usage - query the table after inserting into dynamic partitions
in Hive. We don't know what partitions are modified so running REFRESH
in Impala is inefficient since it reloads all partitions.
  hive> insert into tbl partition(p) select * from tbl2;
  impala> set sync_hms_events_wait_time_s=300;
  impala> select * from tbl;
With this new feature, catalogd reloads the updated partitions based
on HMS events, which is more efficient than REFRESH. The wait time can
be set to the largest lag of event processing that has been observed in
the cluster. Note the lag of event processing is shown as the "Lag time"
in the /events page of the catalogd WebUI and as
"events-processor.lag-time" in the /metrics page. Users can monitor it to
get a sense of the lag.

Some timeline items are added in query profile for this waiting
mechanism, e.g.
A succeeded wait:
    Query Compilation: 937.279ms
       - Synced events from Metastore: 909.162ms (909.162ms)
       - Metadata of all 1 tables cached: 911.005ms (1.843ms)
       - Analysis finished: 919.600ms (8.595ms)

A failed wait:
    Query Compilation: 1s321ms
       - Continuing without syncing Metastore events: 40.883ms (40.883ms)
       - Metadata load started: 41.618ms (735.633us)

Added a histogram metric, impala-server.wait-for-hms-event-durations-ms,
to track the duration of this waiting.

--------
Implementation

A new catalogd RPC, WaitForHmsEvent, is added to the CatalogService API so
that the coordinator can wait until catalogd processes the latest event
when this RPC is triggered. Query planning starts or fails after this RPC
returns. The RPC request contains the potential dbs/tables that are
required by the query. Catalogd records the latest event id when it
receives this RPC. When the last synced event id reaches this id, catalogd
returns the catalog updates to the coordinator in the RPC response.
Before that, the RPC thread is in a waiting loop that sleeps in a
configurable interval. It's configured by a hidden flag,
hms_event_sync_sleep_interval_ms (defaults to 100).

Entry-point functions
 - Frontend#waitForHmsEvents()
 - CatalogServiceCatalog#waitForHmsEvent()

Some statements don't need to wait for HMS events, e.g. CREATE/DROP ROLE
statements. This patch adds an overridden method, requiresHmsMetadata(),
in each Statement to mark whether it can skip the HMS event sync.

Test side changes:
 - Some test code uses EventProcessorUtils.wait_for_event_processing()
   to wait for HMS events to be synced up before running a query. It
   is now updated to just set these new query options in the query.
 - Note that we still need wait_for_event_processing() in test code
   that verifies metrics after HMS events are synced up.

--------
Limitation

Currently, UPDATE_TBL_COL_STAT_EVENT, UPDATE_PART_COL_STAT_EVENT,
OPEN_TXN events are ignored by the event processor. If the latest event
happens to be of one of these types and there are no further events, the
last synced event id can never reach the latest event id. We need to fix
the last synced event id to also consider ignored events (IMPALA-13623).

The current implementation waits for the event id when the
WaitForHmsEvent RPC is received at catalogd side. We can improve it
by leveraging HIVE-27499 to efficiently detect whether the given
dbs/tables have unsynced events and just wait for the *largest* id
of them. Dbs/tables without unsynced events don't need to block query
planning. However, this only works for non-transactional tables.
Transactional tables might be modified by COMMIT_TXN or ABORT_TXN events
which don't have the table names. So even with HIVE-27499, we can't
determine whether a transactional table has pending events. IMPALA-13684
will target improving this for non-transactional tables.

Tests
 - Add test to verify planning waits until catalogd is synced with HMS
   changes.
 - Add test on the error handling when HMS event processing is disabled

Change-Id: I36ac941bb2c2217b09fcfa2eb567b011b38efa2a
Reviewed-on: http://gerrit.cloudera.org:8080/20131
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-22 18:14:37 +00:00
Michael Smith
1b6395b8db IMPALA-13627: Handle legacy Hive timezone conversion
After HIVE-12191, Hive has 2 different methods of calculating timestamp
conversion from UTC to local timezone. When Impala has
convert_legacy_hive_parquet_utc_timestamps=true, it assumes times
written by Hive are in UTC and converts them to local time using tzdata,
which matches the newer method introduced by HIVE-12191.

Some dates convert differently between the two methods, such as
Asia/Kuala_Lumpur or Singapore prior to 1982 (also seen in HIVE-24074).
After HIVE-25104, Hive writes 'writer.zone.conversion.legacy' to
distinguish which method is being used. As a result there are three
different cases we have to handle:
1. Hive prior to 3.1 used what’s now called “legacy conversion” using
   SimpleDateFormat.
2. Hive 3.1.2 (with HIVE-21290) used a new Java API that’s based on
   tzdata and added metadata to identify the timezone.
3. Hive 4 supports both, and added new file metadata to identify which
   is used.

Adds handling for Hive files (identified by created_by=parquet-mr) where
we can infer the correct handling from Parquet file metadata:
1. if writer.zone.conversion.legacy is present (Hive 4), use it to
   determine whether to use a legacy conversion method compatible with
   Hive's legacy behavior, or convert using tzdata.
2. if writer.zone.conversion.legacy is not present but writer.time.zone
   is, we can infer it was written by Hive 3.1.2+ using new APIs.
3. otherwise it was likely written by an earlier Hive version.

Adds a new CLI and query option - use_legacy_hive_timestamp_conversion -
to select what conversion method to use in the 3rd case above, when
Impala determines that the file was written by Hive older than 3.1.2.
Defaults to false to minimize changes in Impala's behavior and because
going through JNI is ~50x slower even when the results would not differ;
Hive defaults to true for its equivalent setting:
hive.parquet.timestamp.legacy.conversion.enabled.
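
For example, to match Hive's legacy behavior in that third case:

  SET use_legacy_hive_timestamp_conversion=true;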

Hive legacy-compatible conversion uses a Java method that would be
complicated to mimic in C++, doing roughly

  // timezone_string and date_time_string are the inputs
  DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  formatter.setTimeZone(TimeZone.getTimeZone(timezone_string));
  java.util.Date date = formatter.parse(date_time_string);
  formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
  return formatter.format(date);

IMPALA-9385 added a check against a Timezone pointer in
FromUnixTimestamp. That dominates the time in FromUnixTimeNanos,
overriding any benchmark gains from IMPALA-7417. Moves FromUnixTime to
allow inlining, and switches to using UTCPTR in the benchmark - as
IMPALA-9385 did in most other code - to restore benchmark results.

Testing:
- Adds JVM conversion method to convert-timestamp-benchmark.
- Adds tests for several cases from Hive conversion tests.

Change-Id: I1271ed1da0b74366ab8315e7ec2d4ee47111e067
Reviewed-on: http://gerrit.cloudera.org:8080/22293
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2025-02-18 16:33:39 +00:00
jasonmfehr
aac67a077e IMPALA-13201: System Table Queries Execute When Admission Queues are Full
Queries that run only against in-memory system tables are currently
subject to the same admission control process as all other queries.
Since these queries do not use any resources on executors, admission
control does not need to consider the state of executors when
deciding to admit these queries.

This change adds a boolean configuration option 'onlyCoordinators'
to the fair-scheduler.xml file for specifying that a request pool
applies only to the coordinators. When a query is submitted to a
coordinator-only request pool, no executors are required to be
running. Instead, all fragment instances are executed exclusively on
the coordinators.

A new member was added to the ClusterMembershipMgr::Snapshot struct
to hold the ExecutorGroup of all coordinators. This object is kept up
to date by processing statestore messages and is used when executing
queries that either require the coordinators (such as queries against
sys.impala_query_live) or that use a coordinator-only request pool.

Testing was accomplished by:
1. Adding cluster membership manager ctests to assert cluster
   membership manager correctly builds the list of non-quiescing
   coordinators.
2. RequestPoolService JUnit tests to assert the new optional
   <onlyCoords> config in the fair scheduler xml file is correctly
   parsed.
3. ExecutorGroup ctests modified to assert the new function.
4. Custom cluster admission controller tests to assert queries with a
   coordinator only request pool only run on the active coordinators.

Change-Id: I5e0e64db92bdbf80f8b5bd85d001ffe4c8c9ffda
Reviewed-on: http://gerrit.cloudera.org:8080/22249
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-14 04:27:11 +00:00
Yida Wu
80a45014ea IMPALA-13703: Cancel running queries before shutdown deadline
Currently, when the graceful shutdown deadline is reached, Impala
daemon exits immediately, leaving any running queries unfinished.
This approach is not quite graceful, as it may result in unreleased
resources, such as scratch files in remote storage.

This patch adds a new state to the graceful shutdown process.
Before reaching the shutdown deadline, the Impala daemon will try to
cancel any remaining running queries within a configurable time limit
set by the flag shutdown_query_cancel_period_s. If this time limit
exceeds 20% of the total shutdown deadline, it will be automatically
capped at that value. The idea is to cancel queries only near the
end of the graceful shutdown deadline; the 20% threshold allows us to
take a more aggressive approach there to ensure a graceful
shutdown.

If all queries are successfully canceled within this period, the
server shuts down immediately. Otherwise, it shuts down once the
deadline is reached, with queries still running.
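
A rough usage sketch (the flag values are illustrative; :shutdown() is
the existing impala-shell command that initiates a graceful shutdown):

  # 10-minute deadline; cancel remaining queries in the last 2 minutes
  impalad ... --shutdown_deadline_s=600 --shutdown_query_cancel_period_s=120
  # later, from impala-shell, trigger the graceful shutdown
  :shutdown();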

Tests:
Passed core tests.
Added testcases test_shutdown_coordinator_cancel_query and
test_shutdown_executor_with_query_cancel_period and
test_shutdown_coordinator_and_executor_cancel_query.
Manually tested shutting down a coordinator or an executor with
long-running queries, and they were canceled.

Change-Id: I1cac2e100d329644e21fdceb0b23901b08079130
Reviewed-on: http://gerrit.cloudera.org:8080/22422
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-07 10:12:54 +00:00
Andrew Sherman
35c6a0b76d IMPALA-13726 Add admission control slots to /queries page in webui
When tracking resource utilization it is useful to see how many
admission control slots are being used by queries. Add the slots used
by coordinator and executors to the webui queries tables. For
implementation reasons this entails also adding these fields to the
query history and live query tables.

The executor admission control slots are calculated by looking at a
single executor backend. In theory this single number could be
misleading but in practice queries are expected to have symmetrical
slots across executors.

Bump the schema number for the query history schema, and add some new
tests.

Change-Id: I057493b7767902a417dfeb75cdaeffd452d66789
Reviewed-on: http://gerrit.cloudera.org:8080/22443
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-05 03:53:18 +00:00
pranavyl
a61b90f860 IMPALA-13039: AES Encryption/ Decryption Support in Impala
AES (Advanced Encryption Standard) crypto functions are a
widely recognized and respected family of encryption primitives used to
protect sensitive data. They operate by transforming plaintext data into
ciphertext using a symmetric key, ensuring confidentiality and
integrity. The standard specifies the Rijndael algorithm, a symmetric
block cipher that can process data blocks of 128 bits, using cipher
keys with lengths of 128 and 256 bits. The patch makes use of the
EVP_*() algorithms from the OpenSSL library.

The patch includes:
1. AES-GCM, AES-CTR, and AES-CFB encryption functionalities and
AES-GCM, AES-ECB, AES-CTR, and AES-CFB decryption functionalities.
2. Support for both 128-bit and 256-bit key sizes for GCM and ECB modes.
3. Enhancements to EncryptionKey class to accommodate various AES modes.

The aes_encrypt() and aes_decrypt() functions serve as entry
points for encryption and decryption operations, handling
encryption and decryption based on user-provided keys, AES modes,
and initialization vectors (IVs). The implementation includes key
length validation and IV vector size checks to ensure data
integrity and confidentiality.

Multiple AES modes: GCM, CFB, CTR for encryption, and GCM, CFB, CTR
and ECB for decryption are supported to provide flexibility and
compatibility with various use cases and OpenSSL features. AES-GCM
is set as the default mode due to its strong security properties.
AES-CTR and AES-CFB are provided as fallbacks for environments where
AES-GCM may not be supported. Note that AES-GCM is not available in
OpenSSL versions prior to 1.0.1, so having multiple methods ensures
broader compatibility.

Testing: The patch is thoroughly tested and the tests are included in
exprs.test.

Change-Id: I3902f2b1d95da4d06995cbd687e79c48e16190c9
Reviewed-on: http://gerrit.cloudera.org:8080/20447
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-27 22:16:37 +00:00
stiga-huang
2e59bbae37 IMPALA-12785: Add commands to control event-processor status
This patch extends the existing AdminFnStmt to support operations on
EventProcessor. E.g. to pause the EventProcessor:
  impala-shell> :event_processor('pause');
to resume the EventProcessor:
  impala-shell> :event_processor('start');
Or to resume the EventProcessor on a given event id (1000):
  impala-shell> :event_processor('start', 1000);
Admin can also resume the EventProcessor at the latest event id by using
-1:
  impala-shell> :event_processor('start', -1);

Supported command actions in this patch: pause, start, status.

The command output of all actions will show the latest status of
EventProcessor, including
 - EventProcessor status:
   PAUSED / ACTIVE / ERROR / NEEDS_INVALIDATE / STOPPED / DISABLED.
 - LastSyncedEventId: The last HMS event id which we have synced to.
 - LatestEventId: The event id of the latest event in HMS.

Example output:
[localhost:21050] default> :event_processor('pause');
+--------------------------------------------------------------------------------+
| summary                                                                        |
+--------------------------------------------------------------------------------+
| EventProcessor status: PAUSED. LastSyncedEventId: 34489. LatestEventId: 34489. |
+--------------------------------------------------------------------------------+
Fetched 1 row(s) in 0.01s

If authorization is enabled, only admin users that have ALL privilege on
SERVER can run this command.

Note that there is a restriction in MetastoreEventsProcessor#start(long)
that resuming EventProcessor back to a previous event id is only allowed
when it's not in the ACTIVE state. This patch aims to expose the control
of EventProcessor to the users so MetastoreEventsProcessor is not
changed. We can investigate the restriction and see if we want to relax
it.

Note that resuming EventProcessor at a newer event id can be done on any
states. Admins can use this to manually resolve the lag of HMS event
processing, after they have made sure all (or important) tables are
manually invalidated/refreshed.

A new catalogd RPC, SetEventProcessorStatus, is added for coordinators
to control the status of EventProcessor.

Tests
 - Added e2e tests

Change-Id: I5a19f67264cfe06a1819a22c0c4f0cf174c9b958
Reviewed-on: http://gerrit.cloudera.org:8080/22250
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-24 04:10:14 +00:00
Daniel Becker
c5b474d3f5 IMPALA-13594: Read Puffin stats also from older snapshots
Before this change, Puffin stats were only read from the current
snapshot. Now we also consider older snapshots, and for each column we
choose the most recent available stats. Note that this means that the
stats for different columns may come from different snapshots.

In case there are both HMS and Puffin stats for a column, the more
recent one will be used - for HMS stats we use the
'impala.lastComputeStatsTime' table property, and for Puffin stats we
use the snapshot timestamp to determine which is more recent.

This commit also renames the startup flag 'disable_reading_puffin_stats'
to 'enable_reading_puffin_stats' and the table property
'impala.iceberg_disable_reading_puffin_stats' to
'impala.iceberg_read_puffin_stats' to make them more intuitive. The
default values are flipped to keep the same behaviour as before.

The documentation of Puffin reading is updated in
docs/topics/impala_iceberg.xml

Testing:
 - updated existing test cases and added new ones in
   test_iceberg_with_puffin.py
 - reorganised the tests in TestIcebergTableWithPuffinStats in
   test_iceberg_with_puffin.py: tests that modify table properties and
   other state that other tests rely on are now run separately to
   provide a clean environment for all tests.

Change-Id: Ia37abe8c9eab6d91946c8f6d3df5fb0889704a39
Reviewed-on: http://gerrit.cloudera.org:8080/22177
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-23 15:25:59 +00:00
Xuebin Su
d7ee509e93 IMPALA-12648: Add KILL QUERY statement
To support killing queries programmatically, this patch adds a new
type of SQL statement, the KILL QUERY statement, to cancel and
unregister a query on any coordinator in the cluster.

A KILL QUERY statement looks like
```
KILL QUERY '123:456';
```
where `123:456` is the query id of the query we want to kill. We follow
syntax from HIVE-17483. For backward compatibility, 'KILL' and 'QUERY'
are added as "unreserved keywords", like 'DEFAULT'. This allows the
three keywords to be used as identifiers.

A user is authorized to kill a query only if the user is an admin or is
the owner of the query. KILL QUERY statements are not affected by
admission control.

Implementation:

Since we don't know in advance which impalad is the coordinator of the
query we want to kill, we need to broadcast the kill request to all the
coordinators in the cluster. Upon receiving a kill request, each
coordinator checks whether it is the coordinator of the query:
- If yes, it cancels and unregisters the query,
- If no, it reports "Invalid or unknown query handle".

Currently, a KILL QUERY statement is not interruptible. IMPALA-13663 is
created for this.

For authorization, this patch adds a custom handler of
AuthorizationException for each statement to allow the exception to be
handled by the backend. This is because we don't know whether the user
is the owner of the query until we reach its coordinator.

To support cancelling child queries, this patch changes
ChildQuery::Cancel() to bypass the HS2 layer so that the session of the
child query will not be added to the connection used to execute the
KILL QUERY statement.

Testing:
- A new ParserTest case is added to test using "unreserved keywords" as
  identifiers.
- New E2E test cases are added for the KILL QUERY statement.
- Added a new dimension in TestCancellation to use the KILL QUERY
  statement.
- Added file tests/common/cluster_config.py and made
  CustomClusterTestSuite.with_args() composable so that common cluster
  configs can be reused in custom cluster tests.

Change-Id: If12d6e47b256b034ec444f17c7890aa3b40481c0
Reviewed-on: http://gerrit.cloudera.org:8080/21930
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
2025-01-22 22:22:54 +00:00
Riza Suminto
3118e41c26 IMPALA-2945: Account for duplicate keys on multiple nodes preAgg
AggregationNode.computeStats() estimates cardinality under a single-node
assumption. This can be an underestimation in the preaggregation node
case because the same grouping key may exist on multiple nodes during
preaggregation.

This patch adjusts the cardinality estimate using the following model for
the number of distinct values in a random sample of k rows, previously
used to calculate the ProcessingCost model by IMPALA-12657 and
IMPALA-13644.

Assumes we are picking k rows from an infinite sample with ndv distinct
values, with the value uniformly distributed. The probability of a given
value not appearing in a sample, in that case is

((NDV - 1) / NDV) ^ k

This is because we are making k choices, and each of them has
(ndv - 1) / ndv chance of not being our value. Therefore the
probability of a given value appearing in the sample is:

1 - ((NDV - 1) / NDV) ^ k

And the number of distinct values in the sample is:

(1 - ((NDV - 1) / NDV) ^ k) * NDV
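
For example, with NDV = 1000 and k = 1000 sampled rows,
((NDV - 1) / NDV) ^ k = (999/1000)^1000 is roughly 0.368, so the expected
number of distinct values in the sample is about (1 - 0.368) * 1000,
roughly 632 rather than the full 1000.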

Query option ESTIMATE_DUPLICATE_IN_PREAGG is added to control whether to
use the new estimation logic or not.

Testing:
- Pass core tests.

Change-Id: I04c563e59421928875b340cb91654b9d4bc80b55
Reviewed-on: http://gerrit.cloudera.org:8080/22047
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-17 20:22:03 +00:00
Yida Wu
702131b677 IMPALA-13565: Add general AI platform support to ai_generate_text
Currently only OpenAI sites are allowed for ai_generate_text();
this patch adds support for general AI platforms to
the ai_generate_text function. It introduces a new flag,
ai_additional_platforms, allowing Impala to access additional
AI platforms. For these general AI platforms, only the openai
standard is supported, and the default api credential serves as
the api token for general platforms.

The ai_api_key_jceks_secret parameter has been renamed to
auth_credential to support passing both plain text and jceks
encrypted secrets.

A new impala_options parameter is added to ai_generate_text() to
enable future extensions. Adds the api_standard option to
impala_options, with "openai" as the only supported standard.
Adds the credential_type option to impala_options to allow
plain text as the token; by default it is set to jceks.
Adds the payload option to impala_options for customized
payload input. If set, the request will use the provided
customized payload directly, and the response will follow the
openai standard for parsing. The customized payload size must not
exceed 5MB.

Adding the impala_options parameter to ai_generate_text() should
be fine for backward compatibility, as this is a relatively new
feature.

Example:
1. Add the site to ai_additional_platforms, like:
ai_additional_platforms='new_ai.site,new_ai.com'
2. Example sql:
select ai_generate_text("https://new_ai.com/v1/chat/completions",
"hello", "model-name", "ai-api-token", "platform params",
'{"api_standard":"openai", "credential_type":"plain",
"payload":"payload content"}')

Tests:
Added a new test AiFunctionsTestAdditionalSites.
Manually tested the example with the Cloudera AI platform.
Passed core and asan tests.

Change-Id: I4ea2e1946089f262dda7ace73d5f7e37a5c98b14
Reviewed-on: http://gerrit.cloudera.org:8080/22130
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
2025-01-16 19:12:07 +00:00
gaurav1086
c3cbd79b56 IMPALA-13288: OAuth AuthN Support for Impala
This patch added OAuth support with the following functionality:
 * Load and parse the OAuth JWKS from a configured JSON file or url.
 * Read the OAuth access token from the HTTP header, which uses
   the same format as the JWT Authorization Bearer token.
 * Verify the OAuth token's signature with the public key in the JWKS.
 * Get the username out of the payload of the OAuth access token.
 * If kerberos or ldap is enabled, then both jwt and oauth are
   supported together. Otherwise only one of jwt or oauth is supported.
   This has been a pre-existing flow for jwt, so OAuth will follow
   the same policy.
 * Impala Shell side changes: OAuth options -a and --oauth_cmd

Testing:
 - Added 3 custom cluster be test in test_shell_jwt_auth.py:
   - test_oauth_auth_valid: authenticate with valid token.
   - test_oauth_auth_expired: authentication failure with
     expired token.
   - test_oauth_auth_invalid_jwk: authentication failure with
     valid signature but expired.
 - Added 1 custom cluster fe test in JwtWebserverTest.java
   - testWebserverOAuthAuth: Basic tests for OAuth
 - Added 1 custom cluster fe test in LdapHS2Test.java
   - testHiveserver2JwtAndOAuthAuth: tests all combinations of
     jwt and oauth token verification with separate jwks keys.
 - Manually tested with a valid, invalid and expired oauth
   access token.
 - Passed core run.

Change-Id: I65dc8db917476b0f0d29b659b9fa51ebaf45b7a6
Reviewed-on: http://gerrit.cloudera.org:8080/21728
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-15 03:32:57 +00:00
Riza Suminto
23edbde7c7 IMPALA-13637: Add ENABLE_TUPLE_ANALYSIS_IN_AGGREGATE option
IMPALA-13405 adds a new tuple-analysis algorithm in AggregationNode to
lower cardinality estimation when planning multi-column grouping. This
patch adds a query option ENABLE_TUPLE_ANALYSIS_IN_AGGREGATE that allows
users to enable/disable the algorithm if necessary. Default is True.
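
For example, to fall back to the previous estimation behavior:

  SET ENABLE_TUPLE_ANALYSIS_IN_AGGREGATE=false;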

Testing:
- Add testAggregationNoTupleAnalysis.
  This test is based on TpcdsPlannerTest#testQ19 but with
  ENABLE_TUPLE_ANALYSIS_IN_AGGREGATE set to false.

Change-Id: Iabd8daa3d9414fc33d232643014042dc20530514
Reviewed-on: http://gerrit.cloudera.org:8080/22294
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-06 21:32:45 +00:00
Peter Rozsa
19110b490d IMPALA-13362: Implement WHEN NOT MATCHED BY SOURCE syntax for MERGE statement
This change adds support for a new MERGE clause that covers the
condition when the source statement's rows do not match the target
table's rows. Example: MERGE INTO target t USING source s ON t.id = s.id
WHEN NOT MATCHED BY SOURCE THEN UPDATE SET t.column = "a";

This change also adds support for using WHEN NOT MATCHED BY TARGET
explicitly; this is equivalent to WHEN NOT MATCHED.

Tests:
 - Parser tests for the new language elements.
 - Analyzer and planner test for WHEN NOT MATCHED BY SOURCE/TARGET
   clauses.
 - E2E tests for WHEN NOT MATCHED BY SOURCE clause.

Change-Id: Ia0e0607682a616ef6ad9eccf499dc0c5c9278c5f
Reviewed-on: http://gerrit.cloudera.org:8080/21988
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-02 15:02:20 +00:00
Andrew Sherman
de6b902581 IMPALA-12345: Add user quotas to Admission Control
Allow administrators to configure per user limits on queries that can
run in the Impala system.

In order to do this, there are two parts. Firstly we must track the
total counts of queries in the system on a per-user basis. Secondly
there must be a user model that allows rules that control per-user
limits on the number of queries that can be run.

In a Kerberos environment the user names that are used for both the user
model and at runtime are short user names, e.g. testuser when the
Kerberos principal is testuser/scm@EXAMPLE.COM

TPoolStats (the data that is shared between Admission Control instances)
is extended to include a map from user name to a count of queries
running. This (along with some derived data structures) is updated when
queries are queued and when they are released from Admission Control.
This lifecycle is slightly different from other TPoolStats data which
usually tracks data about queries that are running. Queries can be
rejected because of user quotas at submission time. This is done for
two reasons: (1) queries can only be admitted from the front of the
queue and we do not want to block other queries due to quotas, and
(2) it is easy for users to understand what is going on when queries
are rejected at submission time.

Note that when running in configurations without an Admission Daemon
then Admission Control does not have perfect information about the
system and over-admission is possible for User-Level Admission Quotas
in the same way that it is for other Admission Control controls.

The User Model is implemented by extending the format of the
fair-scheduler.xml file. The rules controlling the per-user limits are
specified in terms of user or group names.

Two new elements ‘userQueryLimit’ and ‘groupQueryLimit’ can be added to
the fair-scheduler.xml file. These elements can be placed on the root
configuration, which applies to all pools, or the pool configuration.
The ‘userQueryLimit’ element has 2 child elements: "user"
and "totalCount". The 'user' element contains the short names of users,
and can be repeated, or have the value "*" for a wildcard name which
matches all users. The ‘groupQueryLimit’ element has 2 child
elements: "group" and "totalCount". The 'group' element contains group
names.
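
A rough sketch of what such a configuration might look like (the
surrounding <allocations>/<queue> nesting, user, group, and pool names
are illustrative; the element names come from the description above):

  <allocations>
    <!-- Root level: applies to all pools. 'alice' may run 10 concurrent
         queries; every other user may run 5. -->
    <userQueryLimit>
      <user>alice</user>
      <totalCount>10</totalCount>
    </userQueryLimit>
    <userQueryLimit>
      <user>*</user>
      <totalCount>5</totalCount>
    </userQueryLimit>
    <queue name="root.poolA">
      <!-- Pool level: members of group 'analysts' may run at most 3
           concurrent queries in this pool. -->
      <groupQueryLimit>
        <group>analysts</group>
        <totalCount>3</totalCount>
      </groupQueryLimit>
    </queue>
  </allocations>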

The root level rules and pool level rules must both be passed for a new
query to be queued. The rules dictate a maximum number of queries that
can be run by a user. When evaluating rules at either the root level or
the pool level, once a rule matches a user no further evaluation is
done.

To support reading the ‘userQueryLimit’ and ‘groupQueryLimit’ fields the
RequestPoolService is enhanced.

If user quotas are enabled for a pool then a list of the users with
running or queued queries in that pool is visible on the coordinator
webui admission control page.

More comprehensive documentation of the user model will be provided in
IMPALA-12943

TESTING

New end-to-end tests are added to test_admission_controller.py, and
admission-controller-test is extended to provide unit tests for the
user model.

Change-Id: I4c33f3f2427db57fb9b6c593a4b22d5029549b41
Reviewed-on: http://gerrit.cloudera.org:8080/21616
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-16 06:38:38 +00:00
Noemi Pap-Takacs
88e0e4e8ba IMPALA-13334: Fix test_sort.py DCHECK hit when max_sort_run_size>0
test_sort.py declared the 'max_sort_run_size' query option, but
silently did not exercise it. Fixing the query option declaration
in IMPALA-13349 using the helper function add_exec_option_dimension()
revealed a DCHECK failure in sorter.cc. In some cases the length
of an in-memory run could exceed 'max_sort_run_size' by 1 page.

This patch fixed the DCHECK failure by strictly enforcing the
'max_sort_run_size' limit.
Memory limits were also adjusted in test_sort.py according to
the memory usage of the different sort run sizes.

Additionally, the 'MAX_SORT_RUN_SIZE' query option's valid range was
relaxed. Instead of throwing an error, negative values now also disable
the run size limitation, just like the default value '0'.

Testing:
- E2E tests in sort.py
- set test

Change-Id: I943d8edcc87df168448a174d6c9c6b46fe960eae
Reviewed-on: http://gerrit.cloudera.org:8080/21777
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-05 15:06:52 +00:00
jasonmfehr
7b6ccc644b IMPALA-12737: Query columns in workload management tables.
Adds "Select Columns", "Where Columns", "Join Columns", "Aggregate
Columns", and "OrderBy Columns" to the query profile and the workload
management active/completed queries tables. These fields are
presented as comma-separated strings containing the fully qualified
column names in the format database.table_name.column_name. Aggregate
columns include all columns in the order by and having clauses.

Since new columns are being added, the workload management init
process is also being modified to allow for one-way upgrades of the
table schemas if necessary.  Additionally, workload management can be
set up to run under a schema version that is not the latest. This
ability will be useful during troubleshooting. To enable these
upgrades, the workload management initialization that manages the
structure of the tables has been moved to the catalogd.

The changes in this patch must be backwards compatible so that Impala
clusters running previous workload management code can co-exist with
Impala clusters running this workload management code. To enable that
backwards compatibility, a new table property named
'wm_schema_version' is now used to track the schema version of the
workload management tables. Thus, the old property 'schema_version'
will always be set to '1.0.0' since modifying that property value
causes Impala running previous workload management code to error at
startup.

Testing accomplished by
* Adding/updating workload and custom cluster tests to assert the new
  columns and the workload management upgrade process.
* JUnit tests added to verify the new workload management columns are
  being correctly parsed.
* GTests added to ensure the workload management columns are
  correctly defined and in the correct order.

Change-Id: I78f3670b067c0c192ee8a212fba95466fbcb51d7
Reviewed-on: http://gerrit.cloudera.org:8080/21142
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2024-10-31 17:06:43 +00:00
Daniel Becker
b05b408f17 IMPALA-13247: Support Reading Puffin files for the current snapshot
This change adds support for reading NDV statistics from Puffin files
when they are available for the current snapshot. Puffin files or blobs
that were written for other snapshots than the current one are ignored.
Because this behaviour is different from what we have for HMS stats and
may therefore be unintuitive for users, reading Puffin stats is disabled
by default; set the "--disable_reading_puffin_stats" startup flag to
false to enable it.

When Puffin stats reading is enabled, the NDV values read from Puffin
files take precedence over NDV values stored in the HMS. This is because
we only read Puffin stats for the current snapshot, so these values are
always up-to-date, while the values in the HMS may be stale.

Note that it is currently not possible to drop Puffin stats from Impala.
For this reason, this patch also introduces two ways of disabling the
reading of Puffin stats:
  - globally, with the aforementioned "--disable_reading_puffin_stats"
    startup flag: when it is set to true, Impala will never read Puffin
    stats
  - for specific tables, by setting the
    "impala.iceberg_disable_reading_puffin_stats" table property to
    true.
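
For example, reading can be turned off for a single table with standard
ALTER TABLE syntax (the table name is illustrative):

  ALTER TABLE ice_tbl SET TBLPROPERTIES
    ('impala.iceberg_disable_reading_puffin_stats'='true');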

Note that this change is only about reading Puffin files, Impala does
not yet support writing them.

Testing:
 - created the PuffinDataGenerator tool which can generate Puffin files
   and metadata.json files for different scenarios (e.g. all stats are
   in the same Puffin file; stats for different columns are in different
   Puffin files; some Puffin files are corrupt etc.). The generated
   files are under the "testdata/ice_puffin/generated" directory.
 - The new custom cluster test class
   'test_iceberg_with_puffin.py::TestIcebergTableWithPuffinStats' uses
   the generated data to test various scenarios.
 - Added custom cluster tests that test the
   'disable_reading_puffin_stats' startup flag.

Change-Id: I50c1228988960a686d08a9b2942e01e366678866
Reviewed-on: http://gerrit.cloudera.org:8080/21605
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-19 22:14:59 +00:00
Peter Rozsa
df3a38096e IMPALA-12861: Fix mixed file format listing for Iceberg tables
This change fixes file format information collection for Iceberg
tables. Previously, all file descriptors' file formats were collected
from getSampledOrRawPartitions() in HdfsScanNode for Iceberg tables; now
the collection part is extracted as a method and overridden in
IcebergScanNode. Now, only the to-be-scanned file descriptors' file
formats are recorded, showing the correct file formats for each SCAN node
in the plan.

Tests:
 - Planner tests added for mixed file format table with partitioning.

Change-Id: Ifae900914a0d255f5a4d9b8539361247dfeaad7b
Reviewed-on: http://gerrit.cloudera.org:8080/21871
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-18 13:24:42 +00:00
Joe McDonnell
7a1fe97ae1 IMPALA-13412: Use the BYTES type for tuple cache entries-in-use-bytes
The impala.tuple-cache.entries-in-use-bytes metric currently
uses the UNIT type. This switches it to the BYTES type so that
the entry in the WebUI displays it in a clearer way (e.g. GBs).
This switches impala.codegen-cache.entries-in-use-bytes the same
way.

Testing:
 - Ran a core job

Change-Id: Ib02bbc1c7e796c8d83c58fa12d5c18012753e300
Reviewed-on: http://gerrit.cloudera.org:8080/21939
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-16 23:21:49 +00:00
Yida Wu
0767ae065a IMPALA-12908: (Addendum) use RUNTIME_FILTER_WAIT_TIME_MS for tuple cache TPC testing
When runtime filters arrive after tuple caching has occurred, they
can't filter the cached results. This can lead to larger tuple caching
result sets than expected, causing correctness check failures in TPC
tests.

While other solutions may exist, extending RUNTIME_FILTER_WAIT_TIME_MS
is a simple fix that ensures runtime filters are applied before tuple
caching.

Also sets the query option enable_tuple_cache_verification to false
by default, as the filter arrival time may affect the correctness
check. To avoid flaky tests, this change uses a more conservative
approach and only enables the correctness check when it is explicitly
specified by the test case.
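
As an illustrative sketch (option values and tables are made up), a
TPC-style test would now run along these lines:

  SET RUNTIME_FILTER_WAIT_TIME_MS=10000;     -- wait for filters before scanning
  SET ENABLE_TUPLE_CACHE_VERIFICATION=true;  -- opt in to the correctness check
  SELECT count(*) FROM store_sales ss JOIN date_dim d
    ON ss.ss_sold_date_sk = d.d_date_sk WHERE d.d_year = 2000;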

Tests:
Verified TPC tests pass correctness checks with increased runtime
filter wait time.

Change-Id: Ie70a87344c436ce8e2073575df5c5bf762ef562d
Reviewed-on: http://gerrit.cloudera.org:8080/21898
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-15 23:38:03 +00:00
Michael Smith
b6b953b48e IMPALA-13186: Tag query option scope for tuple cache
Constructs a hash of non-default query options that are relevant to
query results; by default query options are included in the hash.
Passes this hash to the frontend for inclusion in the tuple cache key
on plan leaf nodes (which will be included in parent key calculation).
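
As a rough sketch of the intended effect (the query and option are
chosen purely for illustration; whether a given option is hashed
depends on how it is tagged), two otherwise identical queries that
differ in a non-default, result-relevant option should land on
different cache entries:

  -- first run with default options populates a cache entry
  SELECT l_returnflag, count(*) FROM tpch_parquet.lineitem GROUP BY l_returnflag;
  -- a non-default, non-exempt option changes the query option hash,
  -- so the same query text maps to a different tuple cache key
  SET TIMEZONE='UTC';
  SELECT l_returnflag, count(*) FROM tpch_parquet.lineitem GROUP BY l_returnflag;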

Modifies MurmurHash3 to be re-entrant so the backend can construct a
hash incrementally. This is slightly slower but more memory efficient
than accumulating all hash inputs in a contiguous array first.

Adds TUPLE_CACHE_EXEMPT_QUERY_OPT_FN to mark query options that can be
ignored when calculating a tuple cache hash.

Adds startup flag 'tuple_cache_exempt_query_options' as a safety valve
for query options that might be important to exempt that we missed.

Removes duplicate printing logic for query options from child-query.cc
in favor of re-using TQueryOptionsToMap, which does the same thing.

Cleans up query-options.cc helpers so they're static and reduces
duplicate printing logic.

Adds test that different values for a relevant query option use
different cache entries. Adds startup flag
'tuple_cache_ignore_query_options' to omit query options for testing
certain tuple cache failure modes, where we need to use debug actions.

Change-Id: I1f4802ad9548749cd43df8848b6f46dca3739ae7
Reviewed-on: http://gerrit.cloudera.org:8080/21698
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-09 17:21:10 +00:00
Yida Wu
f11172a4a2 IMPALA-12908: Add correctness check for tuple cache
The patch adds an automated correctness check for the tuple cache.
The feature verifies the correctness of the tuple cache by comparing
cache entries with the same key across different queries.

The feature consists of two main components: cache dumping and
runtime correctness validation.

During the cache dumping phase, if a tuple cache is detected,
we retrieve the cache from the global cache and dump it to a
subdirectory as a reference file within the specified debug
dumping directory. The subdirectory is named after the cache key.
Additionally, data from the child is also read and
dumped to a separate file in the same directory. We expect
these two files to be identical, assuming the results are
deterministic. For non-deterministic cases like TOP-N or others,
we may detect them and exclude them from dumping later.
Furthermore, the cache data will be transformed into a
human-readable text format on a row-by-row basis before dumping.
This approach allows for easier investigation and later analysis.

The verification process starts by comparing the entire content of
the files sharing the same key. If the content matches, the
verification is considered successful. However, if the content
doesn't match, we enter a slower mode where we compare all the
rows individually. In the slow mode, we will create a hash map
from the reference cache file, then iterate the current cache
file row by row and search if every row exists in the hash map.
Additionally, a counter is integrated into the hash map to
handle scenarios involving duplicated rows. Once verification is
complete, if no discrepancies are found, both files will be removed.
If discrepancies are detected, the files will be kept and appended
with a '.bad' postfix.

New startup flag and query option:
Added a startup flag tuple_cache_debug_dump_dir for specifying
the directory for dumping the result caches. If
tuple_cache_debug_dump_dir is empty, the feature is disabled.

Added a query option enable_tuple_cache_verification to enable
or disable the tuple cache verification. Default is true. Only
valid when tuple_cache_debug_dump_dir is specified.
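
A hypothetical way to exercise this (directory and query are made up
for illustration):

  -- assume impalad was started with --tuple_cache_debug_dump_dir=/tmp/tuple_cache_dump
  SET ENABLE_TUPLE_CACHE_VERIFICATION=true;
  SELECT c_custkey, c_name FROM tpch_parquet.customer ORDER BY c_custkey LIMIT 10;
  -- matching dumps under /tmp/tuple_cache_dump/<cache key>/ are removed after
  -- verification; mismatches are kept with a '.bad' postfix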

Tests:
Ran the testcase test_tuple_cache_tpc_queries and caught known
inconsistencies.

Change-Id: Ied074e274ebf99fb57e3ee41a13148725775b77c
Reviewed-on: http://gerrit.cloudera.org:8080/21754
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2024-09-30 16:25:19 +00:00
Riza Suminto
93c64e7e9a IMPALA-13376: Add docs for AGG_MEM_CORRELATION_FACTOR etc
This patch adds documentation for AGG_MEM_CORRELATION_FACTOR and
LARGE_AGG_MEM_THRESHOLD option introduced in Apache Impala 4.4.0.

IMPALA-12548 fixed the behavior of AGG_MEM_CORRELATION_FACTOR. A
higher value lowers the memory estimation, while a lower value results
in a higher memory estimation. The documentation in
ImpalaService.thrift, however, says the opposite. This patch fixes the
documentation in the thrift file as well.
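
For example (values picked only to show the direction of the effect),
after this fix the documented behavior matches what the planner does:

  -- a higher correlation factor yields a lower memory estimate for the agg node
  SET AGG_MEM_CORRELATION_FACTOR=1.0;
  EXPLAIN SELECT ss_item_sk, count(*) FROM store_sales GROUP BY ss_item_sk;
  -- a lower correlation factor yields a higher memory estimate
  SET AGG_MEM_CORRELATION_FACTOR=0.0;
  EXPLAIN SELECT ss_item_sk, count(*) FROM store_sales GROUP BY ss_item_sk;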

Testing:
- Run "make plain-html" in docs/ dir and confirm the output.
- Manually check with comments in
  PlannerTest.testAggNodeMaxMemEstimate()

Change-Id: I00956a50fb7616ca3c3ea2fd75fd11239a6bcd90
Reviewed-on: http://gerrit.cloudera.org:8080/21793
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
2024-09-24 17:10:34 +00:00
wzhou-code
d2cd9b51a0 IMPALA-13388: fix unit-tests of Statestore HA for UBSAN builds
Sometimes in UBSAN builds, unit tests of Statestore HA failed due to
Thrift RPC receive timeouts. The standby statestored failed to send
heartbeats to its subscribers, so failover was not triggered.
The Thrift RPC failures still happened after increasing the TCP
timeout for Thrift RPCs between statestored and its subscribers.

This patch adds a metric for the number of subscribers that received
heartbeats from statestored in a monitoring period. Unit tests of
Statestore HA in UBSAN builds are skipped if statestored failed to
send heartbeats to more than half of the subscribers.
For other builds, an exception with an error message complaining about
Thrift RPC failure is thrown if statestored failed to send heartbeats
to more than half of the subscribers.
Also fixed a bug that calls SecondsSinceHeartbeat() but compares the
returned value with a time value in milliseconds.

Filed follow-up JIRA IMPALA-13399 to track the real root cause.

Testing:
 - Looped test_statestored_ha.py 100 times in a UBSAN build without
   any failures, though 4 of the 100 iterations had skipped test cases.
 - Verified that the issue does not happen in ASAN builds by running
   test_statestored_ha.py 100 times.
 - Passed core test.

Change-Id: Ie59d1e93c635411723f7044da52e4ab19c7d2fac
Reviewed-on: http://gerrit.cloudera.org:8080/21820
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-09-24 05:39:14 +00:00
stiga-huang
58fd45f20c IMPALA-12876: Add catalogVersion and loaded timestamp in query profiles
When debugging stale metadata, it'd be helpful to know which catalog
versions of the tables are used and when catalogd loaded those
versions. This patch exposes this info in the query profile for each
referenced table. E.g.

Original Table Versions: tpch.customer, 2249, 1726052668932, Wed Sep 11 19:04:28 CST 2024
tpch.nation, 2255, 1726052790140, Wed Sep 11 19:06:30 CST 2024
tpch.orders, 2257, 1726052803258, Wed Sep 11 19:06:43 CST 2024
tpch.lineitem, 2254, 1726052785384, Wed Sep 11 19:06:25 CST 2024
tpch.supplier, 2256, 1726052794235, Wed Sep 11 19:06:34 CST 2024

Each line consists of the table name, catalog version, loaded timestamp
and the timestamp string.

Implementation:

The loaded timestamp is updated whenever a CatalogObject updates its
catalog version in catalogd. It's passed to impalads with the
TCatalogObject broadcasted by statestore, or in DDL/DML responses.
Currently, the loaded timestamp is added for table, view, function, data
source, and hdfs cache pool in catalogd. However, only those of tables
and views are used in impalad. The loaded timestamps of the other
types can be checked in the /catalog WebUI of catalogd.

Tests:
 - Adds e2e test

Change-Id: I94b2fd59ed5aca664d6db4448c61ad21a88a4f98
Reviewed-on: http://gerrit.cloudera.org:8080/21782
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-09-16 21:51:56 +00:00
Csaba Ringhofer
c7ce233679 IMPALA-12594: Add flag to tune KrpcDataStreamSender mem estimate
The way the planner estimates mem usage for KrpcDataStreamSender
is very different than how the backend actually uses memory -
the planner assumes that batch_size number of rows are sent at
a time while the BE tries to limit it to
data_stream_sender_buffer_size_ (but doesn't consider var len data).
The Jira has more detail about differences and issues.

This change adds flag data_stream_sender_buffer_size_used_by_planner.
If this is set to 16K (data_stream_sender_buffer_size_ default)
then the estimation will work similarly to BE.

Tested that this can improve both under and overestimations:
peak mem / mem estimate of the first sender:

select distinct * from tpch_parquet.lineitem limit 100000
default:
284.04 KB        2.75 MB
--data_stream_sender_buffer_size_used_by_planner=16384:
282.04 KB      283.39 KB

select distinct l_comment from tpch_parquet.lineitem limit 100000;
default:
747.71 KB      509.94 KB
--data_stream_sender_buffer_size_used_by_planner=16384:
740.71 KB      627.46 KB

The default is not changed to avoid side effects. I would like
to change it once the BE's handling of var len data is fixed, which
is a prerequisite for using mem reservation in KrpcDataStreamSender.

Change-Id: I1e4b1db030be934cece565e3f2634ee7cbdb7c4f
Reviewed-on: http://gerrit.cloudera.org:8080/21797
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-09-14 20:24:41 +00:00
Mihaly Szjatinya
22723d0f27 IMPALA-7086: Cache timezone in *_utc_timestamp()
Added Prepare - Close routine around from/to_utc standard functions.
This gives a consistent time improvement for constant timezones.

Given sample table with 600M timestamp rows, on all-default
environment the query below gives a stable 2-3 seconds improvement.
SELECT count(*) FROM a_table
where from_utc_timestamp(ts, "a_timezone") > "a_date";

Averaged results for Release, SET MT_DOP=1, SET DISABLE_CODEGEN=TRUE:
from_utc: 16.53s -> 12.53s
to_utc: 14.02s -> 11.53s

Change-Id: Icdf5ff82c5d0554333aef1bc3bba034a4cf48230
Reviewed-on: http://gerrit.cloudera.org:8080/21735
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-09-06 18:08:26 +00:00
Peter Rozsa
a0aaf338ae IMPALA-12732: Add support for MERGE statements for Iceberg tables
MERGE statement is a DML command that allows users to perform
conditional insert, update, or delete operations on a target table based
on the results of a join with a source table. This change adds MERGE
statement parsing and an Iceberg-specific semantic analysis, planning,
and execution. The parsing grammar follows the SQL standard, it accepts
the same syntax as Hive, Spark, and Trino by supporting arbitrary number
of WHEN clauses, with conditions or without and accepting inline views
as source.

Example:
'MERGE INTO target t USING source s ON t.id = s.id
WHEN MATCHED AND t.id < 100 THEN UPDATE SET column1 = s.column1
WHEN MATCHED AND t.id > 100 THEN DELETE
WHEN MATCHED THEN UPDATE SET column1 = "value"
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.column1);'

The Iceberg-specific analysis, planning, and execution are based on a
concept that was previously used for UPDATE: The analyzer creates a
SELECT statement with all target and source columns (including
Iceberg's virtual columns) and a 'row_present' column that defines
whether the source, the target, or both rows are present in the result
set after joining the two table references by the ON clause. The join
condition should be an equi-join, as it is a FULL OUTER JOIN, and Impala
currently supports only equi-joins in this case. The join order is
forced by a query hint, which guarantees that the target table is
always on the left side.

A new node, IcebergMergeNode, is added at the planning phase; this node
does the row-level filtering for the MATCHED / NOT MATCHED cases. The
'row_present' column decides which case group is evaluated: if both
sides are available, the matched cases are evaluated; if only the
source side matches, the not matched cases and their filter
expressions are evaluated over the row. If one of the cases matches,
the execution evaluates the result expressions into the output row
batch, and an auxiliary tuple stores the merge action. The merge
action is a flag for the newly added IcebergMergeSink; this sink
routes each incoming row from IcebergMergeNode to its respective
destination. Each row could go to the delete sink, the insert sink, or
both.

Target-side duplicate records are filtered during IcebergMergeNode's
execution; if a target-side duplicate is detected, the whole
statement's execution is stopped and the error is reported back to the
user.

Added tests:
 - Parser tests
 - Analyzer tests
 - Unit test for WHEN NOT MATCHED INSERT column collation
 - Planner tests for partitioned/sorted cases
 - Authorization tests
 - E2E tests

Change-Id: I3416a79740eddc446c87f72bf1a85ed3f71af268
Reviewed-on: http://gerrit.cloudera.org:8080/21423
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-09-05 01:01:05 +00:00
Joe McDonnell
831585c2f5 IMPALA-12906: Incorporate scan range information into the tuple cache key
This change is accomplishing two things:
1. It incorporates scan range information into the tuple
   cache key.
2. It reintroduces deterministic scheduling as an option
   for mt_dop and uses it for HdfsScanNodes that feed
   into a TupleCacheNode.

The combination of these two things solves several problems:
1. When a table is modified, the list of scan ranges will
   change, and this will naturally change the cache keys.
2. When accessing a partitioned table, two queries may have
   different predicates on the partition columns. Since the
   predicates can be satisfied via partition pruning, they are
   not included at runtime. This means that two queries
   may have identical compile-time keys with only the scan
   ranges being different due to different partition pruning.
3. Each fragment instance processes different scan ranges, so
   each will have a unique cache key.

To incorporate scan range information, this introduces a new
per-fragment-instance cache key. At compile time, the planner
now keeps track of which HdfsScanNodes feed into a TupleCacheNode.
This is passed over to the runtime as a list of plan node ids
that contain scan ranges. At runtime, the fragment instance
walks through the list of plan nodes ids and hashes any scan ranges
associated with them. This hash is the per-fragment-instance
cache key. The combination of the compile-time cache key produced
by the planner and the per-fragment-instance cache key is a unique
identifier of the result.
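
As an illustration of problem 2 above (TPC-DS-style names, purely
hypothetical): the partition predicates below are fully absorbed by
pruning, so the compile-time keys can be identical and only the hashed
scan ranges keep the entries apart:

  SELECT count(*) FROM store_sales WHERE ss_sold_date_sk = 2451545;
  SELECT count(*) FROM store_sales WHERE ss_sold_date_sk = 2451546;
  -- same compile-time plan shape, different surviving partitions and scan
  -- ranges, therefore different per-fragment-instance cache keys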

Deterministic scheduling for mt_dop was removed via IMPALA-9655
with the introduction of the shared queue. This revives some of
the pre-IMPALA-9655 scheduling logic as an option. Since the
TupleCacheNode knows which HdfsScanNodes feed into it, the
TupleCacheNode turns on deterministic scheduling for all of those
HdfsScanNodes. Since this only applies to HdfsScanNodes that feed
into a TupleCacheNode, it means that any HdfsScanNode that doesn't
feed into a TupleCacheNode continues using the current algorithm.
The pre-IMPALA-9655 code is modified to make it more deterministic
about how it assigns scan ranges to instances.

Testing:
 - Added custom cluster tests for the scan range information
   including modifying a table, selecting from a partitioned
   table, and verifying that fragment instances have unique
   keys
 - Added basic frontend test to verify that deterministic scheduling
   gets set on the HdfsScanNode that feed into the TupleCacheNode.
 - Restored the pre-IMPALA-9655 backend test to cover the LPT code

Change-Id: Ibe298fff0f644ce931a2aa934ebb98f69aab9d34
Reviewed-on: http://gerrit.cloudera.org:8080/21541
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
2024-09-04 16:31:31 +00:00
Noemi Pap-Takacs
9e0649b9ce IMPALA-12867: Filter files to OPTIMIZE based on file size
The OPTIMIZE TABLE statement is currently used to rewrite the entire
Iceberg table. With the 'FILE_SIZE_THRESHOLD_MB' option, the user can
specify a file size limit to rewrite only small files.

Syntax: OPTIMIZE TABLE <table_name> [(FILE_SIZE_THRESHOLD_MB=<value>)];
The value of the threshold is the file size in MBs. It must be a
non-negative integer. Data files larger than the given limit will only
be rewritten if they are referenced from delete files.
If only 1 file is selected in a partition, it will not be rewritten.
If the threshold is 0, only the delete files and the referenced data
files will be rewritten.
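
For instance (table name hypothetical), the first statement compacts
only data files smaller than 128 MB plus any data files referenced by
delete files, while the plain form rewrites everything:

  OPTIMIZE TABLE ice_db.events (FILE_SIZE_THRESHOLD_MB=128);
  OPTIMIZE TABLE ice_db.events;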

IMPALA-12839: 'Optimizing empty table should be no-op' is also
resolved in this patch.

With the file selection option, the OPTIMIZE operation can operate
in 3 different modes:
- REWRITE_ALL: The entire table is rewritten. Either because the
  compaction was triggered by a simple 'OPTIMIZE TABLE' command
  without a specified 'FILE_SIZE_THRESHOLD_MB' parameter, or
  because all files of the table are deletes/referenced by deletes
  or are smaller than the limit.
- PARTIAL: If the value of the 'FILE_SIZE_THRESHOLD_MB' parameter is
  specified, then only the small data files without deletes are selected
  and the delete files are merged. Large data files without deletes
  are kept to avoid unnecessary resource-consuming writes.
- NOOP: When no files qualify for the selection criteria, there is
  no need to rewrite any files. This is a no-operation.

Testing:
 - Parser test
 - FE unit tests
 - E2E tests

Change-Id: Icfbb589513aacdb68a86c1aec4a0d39b12091820
Reviewed-on: http://gerrit.cloudera.org:8080/21388
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-08-28 14:19:41 +00:00
stiga-huang
db0f0dadf1 IMPALA-13240: Add gerrit comments for Thrift/FlatBuffers changes
Adds gerrit comments for changes in Thrift/FlatBuffers files that could
break the communication between impalad and catalogd/statestore during
upgrade.

Basically, only new optional fields can be added in Thrift protocol. For
Flatbuffers schemas, we should only add new fields at the end of a table
definition.

Adds a new option (--revision) for critique-gerrit-review.py to specify
the revision (HEAD or a commit, branch, etc). Also adds an option
(--base-revision) to specify the base revision for comparison.

To test the script locally, prepare a virtual env with the virtualenv
package installed:
  virtualenv venv
  source venv/bin/activate
  pip install virtualenv
Then run the script with --dryrun:
  python bin/jenkins/critique-gerrit-review.py --dryrun --revision effc9df93

Limitations
 - False positive in cases that add new Thrift structs with required
   fields and only use those new structs in new optional fields, e.g.
   effc9df93 and 72732da9d.
 - Might have false positive results on reformat changes due to simple
   string checks, e.g. 91d8a8f62.
 - Can't check incompatible changes in FlatBuffers files. Just add
   general file level comments.

We can integrate DUPCheck in the future to parse the Thrift/FlatBuffers
files to AST and compare the AST instead.
https://github.com/jwjwyoung/DUPChecker

Tests:
 - Verified incompatible commits like 012996a06 and 65094a74f.
 - Verified posting Gerrit comments from local env using my username.

Change-Id: Ib35fafa50bfd38631312d22464df14d426f55346
Reviewed-on: http://gerrit.cloudera.org:8080/21646
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Quanlong Huang <huangquanlong@gmail.com>
2024-08-15 22:04:14 +00:00
Joe McDonnell
d8a8412c2b IMPALA-13294: Add support for long polling to avoid client side wait
Currently, Impala does an execute call, then the client polls
waiting for the operation to finish (or error out). The client
sleeps between polls, and this sleep time can be a substantial
percentage of a short query's execution time.

To reduce this client side sleep, this implements long polling to
provide an option to wait for query completion on the server side.
This is controlled by the long_polling_time_ms query option. If
set to greater than zero, status RPCs will wait for query
completion for up to that amount of time. This defaults to off (0ms).
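
As a usage sketch (the value is arbitrary), a client that wants the
server to hold status RPCs open for up to one second would set:

  SET LONG_POLLING_TIME_MS=1000;
  SELECT count(*) FROM functional.alltypes;
  -- subsequent get_state/GetOperationStatus calls now block server-side for up
  -- to 1s instead of returning immediately and forcing the client to sleep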

Both Beeswax and HS2 add a wait for query completion in their
get status calls (get_state for Beeswax, GetOperationStatus for HS2).
This doesn't wait in the execute RPC calls (e.g. query for Beeswax,
ExecuteStatement for HS2), because neither includes the query status
in the response. The client will always need to do a separate status
RPC.

This modifies impala-shell and the beeswax client to avoid doing a
sleep if the get_state/GetOperationStatus calls take longer than
they would have slept. In other words, if they would have slept 50ms,
then they skip that sleep if the RPC to the server took longer than
50ms. This allows the client to maintain its sleep behavior with
older Impalas that don't use long polling while adapting properly
to systems that do have long polling. This has the added benefit
that it also adjusts for high latency to the server as well. This
does not change any of the sleep times.

Testing:
 - This adds a test case in test_hs2.py to verify the long
   polling behavior

Change-Id: I72ca595c5dd8a33b936f078f7f7faa5b3f0f337d
Reviewed-on: http://gerrit.cloudera.org:8080/19205
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-08-13 03:49:45 +00:00
Fang-Yu Rao
589dbd6f1a IMPALA-13276: Revise the documentation of 'RUNTIME_FILTER_WAIT_TIME_MS'
This patch revises the documentation of the query option
'RUNTIME_FILTER_WAIT_TIME_MS' as well as the code comment for the same
query option to make its meaning clearer.

Change-Id: Ic98e23a902a65e4fa41a628d4a3edb1894660fb4
Reviewed-on: http://gerrit.cloudera.org:8080/21644
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2024-08-09 17:49:03 +00:00
Noemi Pap-Takacs
05585c19bf IMPALA-12857: Add flag to enable merge-on-read even if tables are configured with copy-on-write
Impala can only modify an Iceberg table via 'merge-on-read'. The
'iceberg_always_allow_merge_on_read_operations' backend flag makes it
possible to execute 'merge-on-read' operations (DELETE, UPDATE, MERGE)
even if the table property is 'copy-on-write'.
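
A minimal sketch of the intended usage, assuming impalad was started
with the flag and the table (hypothetical name) is configured for
copy-on-write:

  -- table has e.g. 'write.delete.mode'='copy-on-write' in its properties;
  -- with --iceberg_always_allow_merge_on_read_operations=true Impala still
  -- executes the row-level DML via merge-on-read instead of rejecting it
  DELETE FROM ice_db.events WHERE event_date < '2023-01-01';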

Testing:
 - custom cluster test
 - negative E2E test

Change-Id: I3800043e135beeedfb655a238c0644aaa0ef11f4
Reviewed-on: http://gerrit.cloudera.org:8080/21578
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-07-22 14:13:02 +00:00
Eyizoha
ec59578106 IMPALA-12786: Optimize count(*) for JSON scans
When performing zero slots scans on a JSON table for operations like
count(*), we don't require specific data from the JSON; we only need the
number of top-level JSON objects. However, the current JSON parser based
on rapidjson still decodes and copies specific data from the JSON, even
in zero slots scans. Skipping these steps can significantly improve scan
performance.

This patch introduces a JSON skipper to conduct zero slots scans on JSON
data. Essentially, it is a simplified version of a rapidjson parser,
removing specific data decoding and copying operations, resulting in
faster parsing of the number of JSON objects. The skipper retains the
ability to recognize malformed JSON and provide specific error codes
same as the rapidjson parser. Nevertheless, as it bypasses specific
data parsing, it cannot identify string encoding errors or numeric
overflow errors. Despite this, these data errors do not impact the
counting of JSON objects, so it is acceptable to ignore them. The TEXT
scanner exhibits similar behavior.

Additionally, a new query option, disable_optimized_json_count_star, has
been added to disable this optimization and revert to the old behavior.
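
A usage sketch (table name hypothetical):

  -- zero slots scan: the new JSON skipper only counts top-level objects
  SELECT count(*) FROM json_db.events_json;
  -- revert to the full rapidjson parse path for comparison or debugging
  SET DISABLE_OPTIMIZED_JSON_COUNT_STAR=true;
  SELECT count(*) FROM json_db.events_json;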

In the performance test of TPC-DS with a format of json/none and a scale
of 10GB, the performance optimization is shown in the following tables:
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| Workload  | Query                     | File Format        | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval   |
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| TPCDS(10) | TPCDS-Q_COUNT_UNOPTIMIZED | json / none / none | 6.78   | 6.88        |   -1.46%   |   4.93%   |   3.63%        | 9     |   -1.51%       | -0.74   | -0.72  |
| TPCDS(10) | TPCDS-Q_COUNT_ZERO_SLOT   | json / none / none | 2.42   | 6.75        | I -64.20%  |   6.44%   |   4.58%        | 9     | I -177.75%     | -3.36   | -37.55 |
| TPCDS(10) | TPCDS-Q_COUNT_OPTIMIZED   | json / none / none | 2.42   | 7.03        | I -65.63%  |   3.93%   |   4.39%        | 9     | I -194.13%     | -3.36   | -42.82 |
+-----------+---------------------------+--------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+

(I) Improvement: TPCDS(10) TPCDS-Q_COUNT_ZERO_SLOT [json / none / none] (6.75s -> 2.42s [-64.20%])
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| Operator     | % of Query | Avg     | Base Avg | Delta(Avg) | StdDev(%)  | Max      | Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+
| 01:AGGREGATE | 2.58%      | 54.85ms | 58.88ms  | -6.85%     | * 14.43% * | 115.82ms | 133.11ms | -12.99%    | 3      | 3     | 3      | 1         |
| 00:SCAN HDFS | 97.41%     | 2.07s   | 6.07s    | -65.84%    |   5.87%    | 2.43s    | 6.95s    | -65.01%    | 3      | 3     | 28.80M | 143.83M   |
+--------------+------------+---------+----------+------------+------------+----------+----------+------------+--------+-------+--------+-----------+

(I) Improvement: TPCDS(10) TPCDS-Q_COUNT_OPTIMIZED [json / none / none] (7.03s -> 2.42s [-65.63%])
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+
| Operator     | % of Query | Avg   | Base Avg | Delta(Avg) | StdDev(%) | Max   | Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+
| 00:SCAN HDFS | 99.35%     | 2.07s | 6.49s    | -68.15%    |   4.83%   | 2.37s | 7.49s    | -68.32%    | 3      | 3     | 28.80M | 143.83M   |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+--------+-----------+

Testing:
- Added new test cases in TestQueriesJsonTables to verify that query
  results are consistent before and after optimization.
- Passed existing JSON scanning-related tests.

Change-Id: I97ff097661c3c577aeafeeb1518408ce7a8a255e
Reviewed-on: http://gerrit.cloudera.org:8080/21039
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-07-10 14:37:19 +00:00
stiga-huang
00d0b0dda1 IMPALA-9441,IMPALA-13170: Ops listing dbs/tables should handle db not exists
We have some operations listing the dbs/tables in the following steps:
  1. Get the db list
  2. Do something on the db which could fail if the db no longer exists
For instance, when authorization is enabled, SHOW DATABASES would need a
step-2 to get the owner of each db. This is fine in the legacy catalog
mode since the whole Db object is cached on the coordinator side.
However, in the local catalog mode, the msDb could be missing from the
local cache. The coordinator then triggers a getPartialCatalogObject
RPC to load it from catalogd. If the db no longer exists in catalogd,
that step will fail.

The same happens in GetTables HS2 requests when listing all tables in
all dbs. In step-2 we list the table names for a db. Though the db
exists when we get the db list, it could be dropped before we start
listing the table names in it.

This patch adds code to handle the exceptions caused by a db that no
longer exists. It also improves GetSchemas to not list the table
names, to get rid of the same issue.

Tests:
 - Add e2e tests

Change-Id: I2bd40d33859feca2bbd2e5f1158f3894a91c2929
Reviewed-on: http://gerrit.cloudera.org:8080/21546
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-07-01 06:56:40 +00:00
Csaba Ringhofer
a99de990b0 IMPALA-12370: Allow converting timestamps to UTC when writing to Kudu
Before this commit, only read support was implemented
(convert_kudu_utc_timestamps=true). This change adds write support:
if write_kudu_utc_timestamps=true, then timestamps are converted
from local time to UTC during DMLs to Kudu. In case of
ambiguous conversions (DST changes) the earlier possible UTC
timestamp is written.

All DMLs supported with Kudu tables are affected:
INSERT, UPSERT, UPDATE, DELETE

To be able to correctly read back Kudu tables written by Impala,
convert_kudu_utc_timestamps and write_kudu_utc_timestamps need to
have the same value. Having the same value in the two query options
is also critical for UPDATE/DELETE if the primary key contains a
timestamp column - these operations do a scan first (affected by
convert_kudu_utc_timestamps) and then use the keys from the scan to
select updated/deleted rows (affected by write_kudu_utc_timestamps).
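
A sketch of the intended usage (table and column names hypothetical);
the two options are kept in sync so scans and key lookups agree:

  SET CONVERT_KUDU_UTC_TIMESTAMPS=true;   -- UTC -> local time on read
  SET WRITE_KUDU_UTC_TIMESTAMPS=true;     -- local time -> UTC on write
  INSERT INTO kudu_db.events VALUES (1, now());
  -- for ambiguous local times (DST changes) the earlier possible UTC
  -- timestamp is written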

The conversion is implemented by adding to_utc_timestamp() to inserted
timestamp expressions during planning. This allows doing the same
conversion during the pre-insert sorting and partitioning.
Read support is implemented differently - in that case the plan is not
changed and the scanner does the conversion.

Other changes:
- Before this change, verification of tests with TIMESTAMP results
  was skipped when the file format is Kudu. This shouldn't be
  necessary, so the skipping was removed.

Change-Id: Ibb4995a64e042e7bb261fcc6e6bf7ffce61e9bd1
Reviewed-on: http://gerrit.cloudera.org:8080/21492
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Peter Rozsa <prozsa@cloudera.com>
2024-06-19 10:51:56 +00:00
Riza Suminto
5d1bd80623 IMPALA-13152: Avoid NaN, infinite, and negative ProcessingCost
TOP-N cost will turn into NaN if inputCardinality is equal to 0 due to
Math.log(inputCardinality). This patch fixes the issue by avoiding
Math.log(0) and replacing it with 0 instead.

After this patch, Instantiating BaseProcessingCost with NaN, infinite,
or negative totalCost will throw IllegalArgumentException. In
BaseProcessingCost.getDetails(), "total-cost" is renamed to "raw-cost"
to avoid confusion with "cost-total" in ProcessingCost.getDetails().

Testing:
- Add a test case that runs a TOP-N query over an empty table.
- Compute ProcessingCost in most FE and EE tests even when the
  COMPUTE_PROCESSING_COST option is not enabled, by checking whether
  RuntimeEnv.INSTANCE.isTestEnv() is true or the TEST_REPLAN option is
  enabled.
- Pass core test.

Change-Id: Ib49c7ae397dadcb2cb69fde1850d442d33cdf177
Reviewed-on: http://gerrit.cloudera.org:8080/21504
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-06-15 23:36:46 +00:00
Michael Smith
4681666e93 IMPALA-12800: Add cache for isTrueWithNullSlots() evaluation
isTrueWithNullSlots() can be expensive when it has to query the backend.
Many of the expressions will look similar, especially in large
auto-generated expressions. Adds a cache based on the nullified
expression to avoid querying the backend for expressions with identical
structure.

With DEBUG logging enabled for the Analyzer, computes and logs stats
about the null slots cache.

Adds 'use_null_slots_cache' query option to disable caching. Documents
the new option.

Change-Id: Ib63f5553284f21f775d2097b6c5d6bbb63699acd
Reviewed-on: http://gerrit.cloudera.org:8080/21484
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-06-12 12:27:05 +00:00
wzhou-code
70b7b6a78d IMPALA-13134: DDL hang with SYNC_DDL enabled when Catalogd is changed to standby status
Catalogd waits for SYNC_DDL version when it processes a DDL with
SYNC_DDL enabled. If the status of Catalogd is changed from active to
standby when CatalogServiceCatalog.waitForSyncDdlVersion() is called,
the standby catalogd does not receive catalog topic updates from
statestore, hence the catalogd thread waits indefinitely.

This patch fixes the issue by re-generating the service id when
Catalogd is changed to standby status and throwing an exception if its
service id has changed while waiting for the SYNC_DDL version.

Testing:
 - Added unit-test code for CatalogD HA that runs a DDL with SYNC_DDL
   enabled and an injected delay when waiting for the SYNC_DDL version,
   then verifies that the DDL query fails due to catalog failover.
 - Passed test_catalogd_ha.py.

Change-Id: I2dcd628cff3c10d2e7566ba2d9de0b5886a18fc1
Reviewed-on: http://gerrit.cloudera.org:8080/21480
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-06-06 01:39:36 +00:00
ttttttz
f67f5f1815 IMPALA-12705: Add /catalog_ha_info page on Statestore to show catalog HA information
This patch adds a /catalog_ha_info page on the Statestore to show
catalog HA information. The page contains the following: Active Node,
Standby Node, and a Notified Subscribers table. The Notified
Subscribers table includes the following items:
  -- Id,
  -- Address,
  -- Registration ID,
  -- Subscriber Type,
  -- Catalogd Version,
  -- Catalogd Address,
  -- Last Update Catalogd Time

Change-Id: If85f6a827ae8180d13caac588b92af0511ac35e3
Reviewed-on: http://gerrit.cloudera.org:8080/21418
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-06-05 22:19:11 +00:00