Commit Graph

62 Commits

Author SHA1 Message Date
Joe McDonnell
1913ab46ed IMPALA-14501: Migrate most scripts from impala-python to impala-python3
To remove the dependency on Python 2, existing scripts need to use
python3 rather than python. These commands find those
locations (for impala-python and regular python):
git grep impala-python | grep -v impala-python3 | grep -v impala-python-common | grep -v init-impala-python
git grep bin/python | grep -v python3

This removes or switches most of these locations by various means:
1. If a Python file has a #!/bin/env impala-python (or python) shebang
   but doesn't have a main function, this change removes the shebang and
   makes sure that the file is not executable.
2. Most scripts can simply switch from impala-python to impala-python3
   (or python to python3) with minimal changes.
3. The cm-api pypi package (which doesn't support Python 3) has been
   replaced by the cm-client pypi package and interfaces have changed.
   Rather than migrating the code (which hasn't been used in years), this
   deletes the old code and stops installing cm-api into the virtualenv.
   The code can be restored and revamped if there is any interest in
   interacting with CM clusters.
4. This switches tests/comparison over to impala-python3, but this code has
   bit-rotted. Some pieces can be run manually, but it can't be fully
   verified with Python 3. It shouldn't hold back the migration on its own.
5. This also replaces locations of impala-python in comments / documentation /
   READMEs.
6. kazoo (used for interacting with HBase) needed to be upgraded to a
   version that supports Python 3. The newest version of kazoo requires
   upgrades of other component versions, so this uses kazoo 2.8.0 to avoid
   needing other upgrades.
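
For context, the kind of ZooKeeper/HBase interaction kazoo is used for
looks roughly like this under kazoo 2.8.0 (a minimal sketch with a
placeholder address and znode path, not the actual test code):

```python
from kazoo.client import KazooClient

# kazoo 2.8.0 runs on Python 3 without forcing the other component
# upgrades mentioned above; the host and znode path are placeholders.
zk = KazooClient(hosts='localhost:2181')
zk.start()
try:
    print(zk.get_children('/hbase'))
finally:
    zk.stop()
```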

The two remaining uses of impala-python are:
 - bin/cmake_aux/create_virtualenv.sh
 - bin/impala-env-versioned-python
These will be removed separately when we drop Python 2 support
completely. In particular, these are useful for testing impala-shell
with Python 2 until we stop supporting Python 2 for impala-shell.

The docker-based tests still use /usr/bin/python, but this can be
switched over independently (and doesn't impact impala-python).

Testing:
 - Ran core job
 - Ran build + dataload on CentOS 7 and Red Hat 8
 - Manual testing of individual scripts (except some bitrotted areas like the
   random query generator)

Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc
Reviewed-on: http://gerrit.cloudera.org:8080/23468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-10-22 16:30:17 +00:00
Riza Suminto
f28a32fbc3 IMPALA-13916: Change BaseTestSuite.default_test_protocol to HS2
This is the final patch to move all Impala e2e and custom cluster tests
to use the HS2 protocol by default. Only the beeswax-specific tests still
run against the Beeswax protocol by default. We can remove them once
Impala officially removes Beeswax support.
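
A minimal, self-contained sketch of the resulting default/override
pattern (class names echo the commit title; the attribute value format
is an assumption, not the exact test-framework code):

```python
class BaseTestSuite(object):
    # After this patch the default protocol moves from 'beeswax' to 'hs2'.
    default_test_protocol = 'hs2'


class BeeswaxOnlySuite(BaseTestSuite):
    # Beeswax-specific suites keep overriding the default until Beeswax
    # support is removed from Impala.
    default_test_protocol = 'beeswax'
```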

HS2 error message formatting in impala-hs2-server.cc is adjusted a bit
to match the formatting in impala-beeswax-server.cc.

Move TestWebPageAndCloseSession from webserver/test_web_pages.py to
custom_cluster/test_web_pages.py to disable glog log buffering.

Testing:
- Pass exhaustive tests, except for some known and unrelated flaky
  tests.

Change-Id: I42e9ceccbba1e6853f37e68f106265d163ccae28
Reviewed-on: http://gerrit.cloudera.org:8080/22845
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Jason Fehr <jfehr@cloudera.com>
2025-05-20 14:32:10 +00:00
Riza Suminto
e73e2d40da IMPALA-13864: Implement ImpylaHS2ResultSet.exec_summary
This patch implements building the exec summary table for
ImpylaHS2Connection. It adds a fetch_exec_summary argument to
ImpalaConnection.execute(). If this argument is True, an exec summary
table is added to the returned result object.
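
In test code, the new argument would presumably be used along these
lines (a sketch; exec_summary comes from the commit title, the rest is
assumed):

```python
def get_exec_summary(client, query):
    # 'client' is an ImpalaConnection from the test framework; the exec
    # summary table is only built when explicitly requested.
    result = client.execute(query, fetch_exec_summary=True)
    return result.exec_summary
```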

fetch_exec_summary is also implemented for BeeswaxConnection. Thus,
BeeswaxConnection no longer fetches the exec summary by default every time.

Tests that validate the exec summary table are updated to set
fetch_exec_summary=True and migrated to test against the HS2 protocol.
TestExecutorGroup._set_query_options() is changed to set query options
through hs2_client iconfig instead of a SET query. Some flake8 issues are
addressed as well.

build_exec_summary_table is moved to a separate exec_summary.py file and
tweaked to return early if the given TExecSummary is empty.

Fixed a bug in ImpalaBeeswaxClient.fetch_results() where no fetch would
happen at all if the discard_result argument is True.

Testing:
- Run and pass affected tests locally.

Change-Id: I7d88f78e58eeda29ce21e7828884c7a129d7efe6
Reviewed-on: http://gerrit.cloudera.org:8080/22626
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-24 22:34:20 +00:00
Riza Suminto
ad124d1dba IMPALA-13872: Deflake test_query_cpu_count_divisor_default assertion
IMPALA-13333 added a test that asserts the value of the "Cluster Memory
Admitted" counter. However, this counter can have a slightly different
value depending on the target filesystem (HDFS, Ozone, S3). This causes
flakiness in test_query_cpu_count_divisor_default.

This patch removes that assertion from
test_query_cpu_count_divisor_default. The remaining assertion is
sufficient to ensure the correctness of the system under test.

Testing:
- Run and pass test_query_cpu_count_divisor_default.

Change-Id: I676ee31728de2886acc72d11b8ece14f0238814b
Reviewed-on: http://gerrit.cloudera.org:8080/22636
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-18 21:19:09 +00:00
Riza Suminto
7a5adc15ca IMPALA-13333: Limit memory estimation if PlanNode can spill
SortNode, AggregationNode, and HashJoinNode (the build side) can spill
to disk. However, their memory estimation does not consider this
capability and assumes they will hold all rows in memory. This causes
memory overestimation when cardinality is also overestimated. In reality,
the whole query execution on a single host is often subject to a much
lower memory upper bound and is not allowed to exceed it.

This upper-bound is dictated by, but not limited to:
- MEM_LIMIT
- MEM_LIMIT_COORDINATORS
- MEM_LIMIT_EXECUTORS
- MAX_MEM_ESTIMATE_FOR_ADMISSION
- impala.admission-control.max-query-mem-limit.<pool_name>
  from admission control.

This patch adds a SpillableOperator interface that defines an alternative
to either PlanNode.computeNodeResourceProfile() or
DataSink.computeResourceProfile() when a lower memory upper bound can be
derived from the configs mentioned above. This interface is applied to
SortNode, AggregationNode, HashJoinNode, and JoinBuildSink.

The in-memory vs. spill-to-disk bias is controlled through the
MEM_ESTIMATE_SCALE_FOR_SPILLING_OPERATOR option, a scale in [0.0, 1.0]
that controls the estimated peak memory of query operators with
spill-to-disk capabilities. A value closer to 1.0 biases the Planner
towards keeping as many rows as possible in memory, while a value closer
to 0.0 biases it towards spilling rows to disk under memory pressure.
Note that lowering MEM_ESTIMATE_SCALE_FOR_SPILLING_OPERATOR can make a
query that was previously rejected by the admission controller become
admittable, but it may also spill to disk more and carry a higher risk of
exhausting scratch space than before.
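
Read literally, the option can be thought of as interpolating between a
spill-friendly floor and the full in-memory estimate. The sketch below
only illustrates that reading; it is not the planner's actual Java code:

```python
def scaled_mem_estimate(in_memory_estimate, spill_floor, scale):
    # Illustration only: scale=1.0 keeps the full in-memory estimate,
    # scale=0.0 drops to a spill-to-disk floor (e.g. the operator's
    # minimum reservation). The real computation may differ in detail.
    assert 0.0 <= scale <= 1.0
    return spill_floor + scale * (in_memory_estimate - spill_floor)
```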

There are some caveats to this memory bounding patch:
- It checks whether spill-to-disk is enabled on the coordinator, but
  individual backend executors might not have it configured. A mismatch
  of spill-to-disk configs between the coordinator and a backend
  executor, however, is rare and can be considered a misconfiguration.
- It does not check the actual total scratch space available for
  spill-to-disk. However, query execution will be forced to spill anyway
  if memory usage exceeds the three memory configs above. Raising the
  MEM_LIMIT / MEM_LIMIT_EXECUTORS options can help increase the memory
  estimation and the likelihood of the query getting assigned to a
  larger executor group set, which usually has a bigger total scratch
  space.
- The memory bound is divided evenly among all instances of a fragment
  kind on a single host. But in theory, they should be able to share and
  grow their memory usage independently beyond the memory estimate as
  long as the max memory reservation is not set.
- It does not consider other memory-related configs such as the
  clamp_query_mem_limit_backend_mem_limit or disable_pool_mem_limits
  flags, but the admission controller will still enforce them if set.

Testing:
- Pass FE and custom cluster tests with core exploration.

Change-Id: I290c4e889d4ab9e921e356f0f55a9c8b11d0854e
Reviewed-on: http://gerrit.cloudera.org:8080/21762
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-03-13 18:38:27 +00:00
Riza Suminto
71feb617e4 IMPALA-13835: Remove reference to protocol-specific states
With IMPALA-13682 merged, checking for query state can be done via
wait_for_impala_state(), wait_for_any_impala_state(), and other helper
methods of ImpalaConnection. This patch removes all references to
protocol-specific states such as BeeswaxService.QueryState.
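
In test code this reads roughly as follows (a sketch; the argument order
and state values are assumptions, only the helper names come from the
commit message):

```python
def wait_until(client, handle, target_state, timeout_s=60):
    # Protocol-agnostic helper replacing direct comparisons against
    # protocol-specific enums such as BeeswaxService.QueryState.
    client.wait_for_impala_state(handle, target_state, timeout_s)

def wait_until_any(client, handle, target_states, timeout_s=60):
    client.wait_for_any_impala_state(handle, target_states, timeout_s)
```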

Also fixes flake8 errors and an unused variable in the modified test files.

Testing:
- Run and pass all affected tests.

Change-Id: Id6b56024fbfcea1ff005c34cd146d16e67cb6fa1
Reviewed-on: http://gerrit.cloudera.org:8080/22586
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-09 00:04:05 +00:00
Xuebin Su
242095ac8a IMPALA-13729: Accept error messages not starting with prompt
Previously, error_msg_expected() only accepted error messages starting
with the following error prompt:
```
Query <query_id> failed:\n
```
However, for some tests using the Beeswax protocol, the error prompt may
appear in the middle of the error message instead of at its beginning.

Therefore, this patch adapts error_msg_expected() to accept error
messages not starting with the error prompt.

The error_msg_expected() function is renamed to error_msg_startswith()
to better describe its behavior.
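
A minimal sketch of what such a check could look like (illustrative
only; the real helper lives in Impala's test utilities and its query-id
pattern may differ):

```python
import re

# Hypothetical prompt pattern for illustration.
_ERROR_PROMPT = re.compile(r"Query \S+ failed:\n")

def error_msg_startswith(actual, expected):
    # Tolerate the prompt appearing at the start or mid-message by
    # stripping it before the prefix comparison.
    return _ERROR_PROMPT.sub("", actual, count=1).startswith(expected)
```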

Change-Id: Iac3e68bcc36776f7fd6cc9c838dd8da9c3ecf58b
Reviewed-on: http://gerrit.cloudera.org:8080/22468
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-02-26 15:29:36 +00:00
Riza Suminto
c298c54262 IMPALA-13644: Generalize and move getPerInstanceNdvForCpuCosting
getPerInstanceNdvForCpuCosting is a method to estimate the number of
distinct values of exprs per fragment instance while accounting for the
likelihood of duplicate keys across fragment instances. It borrows the
probabilistic model described in IMPALA-2945. This method is used
exclusively by AggregationNode.

getPerInstanceNdvForCpuCosting runs the probabilistic formula
individually for each grouping expression and then multiplies the results
together. That matches how we estimated group NDV in the past, where we
simply multiplied the NDVs of the grouping expressions.

Recently, we added tuple-based analysis to lower the cardinality
estimate for all kinds of aggregation nodes (IMPALA-13045, IMPALA-13465,
IMPALA-13086). All of the bounding happens in
AggregationNode.computeStats(), where we call the estimateNumGroups()
function that returns a globalNdv estimate for a specific aggregation
class.

To take advantage of that more precise globalNdv, this patch replaces
getPerInstanceNdvForCpuCosting() with estimatePreaggCardinality(), which
applies the probabilistic formula over this single globalNdv number
rather than the old way, which often returned an overestimate from the
NDV multiplication method. Its use is still limited to calculating
ProcessingCost. Using it for the preagg output cardinality will be done
by IMPALA-2945.

estimatePreaggCardinality is skipped if the data partition of the input
is a subset of the grouping expressions.
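
A standard balls-and-bins reading of that probabilistic model looks
roughly like this (an illustration of the idea only; the Java
implementation may differ in detail):

```python
import math

def estimate_preagg_cardinality(global_ndv, input_rows, num_instances):
    # Probability that a particular group key shows up in one instance's
    # share of the input rows, assuming rows are distributed uniformly.
    rows_per_instance = float(input_rows) / num_instances
    p_seen = 1.0 - math.pow(1.0 - 1.0 / global_ndv, rows_per_instance)
    # Each instance pre-aggregates its local groups, so the total preagg
    # output is the per-instance distinct count summed over instances.
    return num_instances * global_ndv * p_seen
```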

Testing:
- Run and pass PlannerTests that set COMPUTE_PROCESSING_COST=True.
  ProcessingCost changes, but all cardinality numbers stay the same.
- Add CardinalityTest#testEstimatePreaggCardinality.
- Update test_executor_groups.py. Enable v2 profile as well for easier
  runtime profile debugging.

Change-Id: Iddf75833981558fe0188ea7475b8d996d66983c1
Reviewed-on: http://gerrit.cloudera.org:8080/22320
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-17 10:05:09 +00:00
Riza Suminto
4553e89ed4 IMPALA-13477: Set request_pool in QueryStateRecord for CTAS query
Resource pool information for CTAS queries is missing from the /queries
page of the WebUI. This is because a CTAS query has
TExecRequest.stmt_type = DDL. However, CTAS also has
TQueryExecRequest.stmt_type = DML and is subject to admission control.
Therefore, its request pool must be recorded in QueryStateRecord and
displayed on the /queries page of the WebUI.

Testing:
- Update assertion in test_executor_groups.py to reflect this change.
- Pass core tests.

Change-Id: I6885192f1c2d563e58670f142b3c0df528032a6e
Reviewed-on: http://gerrit.cloudera.org:8080/21975
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-26 04:54:36 +00:00
Riza Suminto
b07bd6ddeb IMPALA-13469: Deflake test_query_cpu_count_on_insert
A new test case from IMPALA-13445 revealed a pre-existing bug where
cost-based planning may increase expectedNumInputInstance beyond
inputFragment.getNumInstances(), which leads to a precondition violation.
The following conditions all held when the Precondition was hit:

1. The environment is either Erasure Coded HDFS or Ozone.
2. The source table does not have stats nor numRows table property.
3. There is only one fragment, consisting of a ScanNode, in the plan
   tree before the addition of the DML fragment.
4. Byte-based cardinality estimation logic kicks in.
5. Byte-based cardinality causes high scan cost, which leads to
   maxScanThread exceeding inputFragment.getPlanRoot().
6. expectedNumInputInstance is assigned equal to maxScanThread.
7. Precondition expectedNumInputInstance < inputFragment.getPlanRoot()
   is violated.

This scenario triggers a special condition that attempts to lower
expectedNumInputInstance. But instead of lowering
expectedNumInputInstance, the special logic increases it due to higher
byte-based cardinality estimation.

There is also a new bug where DistributedPlanner.java mistakenly passes
root.getInputCardinality() instead of root.getCardinality().

This patch fixes both issues and does minor refactoring to change
variable names to camel case. Validation of the last test case of
test_query_cpu_count_on_insert is relaxed to let it pass in Erasure
Coded HDFS and Ozone setups.

Testing:
- Make several assertions in test_executor_groups.py more verbose.
- Pass test_executor_groups.py in Erasure Coded HDFS and Ozone setup.
- Added new Planner tests with unknown cardinality estimation.
- Pass core tests in regular setup.

Change-Id: I834eb6bf896752521e733cd6b77a03f746e6a447
Reviewed-on: http://gerrit.cloudera.org:8080/21966
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-24 20:45:21 +00:00
Riza Suminto
9ca3ccb347 IMPALA-13445: Ignore num partition for unpartitioned writes
When cost-based planning is used, writer parallelism is limited by the
number of partitions. In the unpartitioned insert scenario, there is
just a single partition. That leads to only a single fs writer, which
causes slow writes.

This patch fixes the issue by distinguishing between partitioned and
unpartitioned inserts, which causes the following BEHAVIOR CHANGE if
COMPUTE_PROCESSING_COST=1:

1. If the insert is unpartitioned, use the byte-based estimate fully.
   Shuffling should only happen if the number of writers is less than
   the number of input fragment instances.
2. If the insert is partitioned, try to plan at least one writer for
   each shuffling executor node, but do not exceed the number of
   partitions. However, if the number of partitions is 1, try to force
   writer colocation with the input fragment.

Both partitioned and unpartitioned inserts still respect the
MAX_FS_WRITERS option. This patch also does minor cleanup in
DistributedPlanner.java.

Testing:
- In test_executor_groups.py, move insert tests from
  test_query_cpu_count_divisor_default into separate
  test_query_cpu_count_on_insert. Add some new insert test cases there.
- Add and pass CardinalityTest.testByteBasedNumWriters().
- Add new planner tests under TpcdsCpuCostPlannerTest.
- Pass test_executor_groups.py.

Change-Id: I51ab8fc35a5489351a88d372b28642b35449acfc
Reviewed-on: http://gerrit.cloudera.org:8080/21927
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-19 10:22:23 +00:00
Xuebin Su
ad868b9947 IMPALA-13115: Add query id to error messages
This patch adds the query id to the error messages in both

- the result of the `get_log()` RPC, and
- the error message in an RPC response

before they are returned to the client, so that the users can easily
figure out the errored queries on the client side.

To achieve this, the query id of the thread debug info is set in the
RPC handler method, and is retrieved from the thread debug info each
time the error reporting function or `get_log()` gets called.

Due to the change in the error message format, some checks in
impala-shell.py are adapted to keep them valid.
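
A rough sketch of the query-id-agnostic comparison helper
`error_msg_equal()` described under Testing below (illustrative; the
real helpers are in Impala's test utilities and may differ):

```python
import re

# Hypothetical query-id pattern for illustration only.
_QUERY_ID = re.compile(r"[0-9a-f]{16}:[0-9a-f]{16}")

def error_msg_equal(msg_a, msg_b):
    # Compare two error messages while ignoring the concrete query ids.
    def normalize(msg):
        return _QUERY_ID.sub("<query_id>", msg)
    return normalize(msg_a) == normalize(msg_b)
```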

Testing:
- Added helper function `error_msg_expected()` to check whether an
  error message is expected. It is stricter than only using the `in`
  operator.
- Added helper function `error_msg_equal()` to check if two error
  messages are equal regardless of the query ids.
- Various test cases are adapted to match the new error message format.
- `ImpalaBeeswaxException`, which is used in tests only, is simplified
  so that it has the same error message format as the exceptions for
  HS2.
- Added an assertion to the case of killing and restarting a worker
  in the custom cluster test to ensure that the query id is in
  the error message in the client log retrieved with `get_log()`.

Change-Id: I67e659681e36162cad1d9684189106f8eedbf092
Reviewed-on: http://gerrit.cloudera.org:8080/21587
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-08-08 14:11:04 +00:00
Riza Suminto
29e4186793 IMPALA-13024: Ignore slots if using default pool and empty group
Slot-based admission should not be enabled when using the default pool.
There is a bug where a coordinator-only query still does slot-based
admission because the executor group name is set to
ClusterMembershipMgr::EMPTY_GROUP_NAME ("empty group (using coordinator
only)"). This patch adds a check to recognize a coordinator-only query
in the default pool and skip slot-based admission for it.

Testing:
- Add BE test AdmissionControllerTest.CanAdmitRequestSlotsDefault.
- In test_executor_groups.py, split test_coordinator_concurrency to
  test_coordinator_concurrency_default and
  test_coordinator_concurrency_two_exec_group_cluster to show the
  behavior change.
- Pass core tests in ASAN build.

Change-Id: I0b08dea7ba0c78ac6b98c7a0b148df8fb036c4d0
Reviewed-on: http://gerrit.cloudera.org:8080/21340
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-26 21:18:42 +00:00
David Rorke
25a8d70664 IMPALA-12657: Improve ProcessingCost of ScanNode and NonGroupingAggregator
This patch improves the accuracy of the CPU ProcessingCost estimates for
several of the CPU intensive operators by basing the costs on benchmark
data. The general approach for a given operator was to run a set of queries
that exercised the operator under various conditions (e.g. large vs small
row sizes and row counts, varying NDV, different file formats, etc) and
capture the CPU time spent per unit of work (the unit of work might be
measured as some number of rows, some number of bytes, some number of
predicates evaluated, or some combination of these). The data was then
analyzed in an attempt to fit a simple model that would allow us to
predict CPU consumption of a given operator based on information available
at planning time.

For example, the CPU ProcessingCost for a Parquet scan is estimated as:
TotalCost = (0.0144 * BytesMaterialized) + (0.0281 * Rows * Predicate Count)

The coefficients (0.0144 and 0.0281) are derived from benchmarking
scans under a variety of conditions. Similar cost functions and
coefficients were derived for all of the benchmarked operators. The
coefficients for all the operators are normalized such that a single
unit of cost equates to roughly 100 nanoseconds of CPU time on an
r5d.4xlarge instance. So we would predict that an operator with a cost
of 10,000,000 would complete in roughly one second on a single core.
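
Plugging assumed numbers into that formula shows the scale of the
estimates (the coefficients are from above; the inputs are made up for
illustration):

```python
bytes_materialized = 1 * 1024 ** 3      # 1 GiB materialized from Parquet
rows = 10 * 1000 * 1000                 # 10M rows
predicate_count = 2

total_cost = 0.0144 * bytes_materialized + 0.0281 * rows * predicate_count
cpu_seconds = total_cost * 100e-9       # one cost unit ~= 100 ns of CPU
print(round(total_cost), round(cpu_seconds, 2))  # ~16023882, ~1.6 s/core
```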

Limitations:
* Costing only addresses CPU time spent and doesn't account for any IO
  or other wait time.
* Benchmarking scenarios didn't provide comprehensive coverage of the
  full range of data types, distributions, etc. More thorough
  benchmarking could improve the costing estimates further.
* This initial patch only covers a subset of the operators, focusing
  on those that are most common and most CPU intensive. Specifically
  the following operators are covered by this patch. All others
  continue to use the previous ProcessingCost code:
  AggregationNode
  DataStreamSink (exchange sender)
  ExchangeNode
  HashJoinNode
  HdfsScanNode
  HdfsTableSink
  NestedLoopJoinNode
  SortNode
  UnionNode

Benchmark-based costing of the remaining operators will be covered by
a future patch.

Future patches will automate the collection and analysis of the benchmark
data and the computation of the cost coefficients to simplify maintenance
of the costing as performance changes over time.

Change-Id: Icf1edd48d4ae255b7b3b7f5b228800d7bac7d2ca
Reviewed-on: http://gerrit.cloudera.org:8080/21279
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-20 06:48:58 +00:00
Riza Suminto
d437334e53 IMPALA-12988: Calculate an unbounded version of CpuAsk
Planner calculates CpuAsk through a recursive call beginning at
Planner.computeBlockingAwareCores(), which is called after
Planner.computeEffectiveParallelism(). It does blocking operator
analysis over the selected degree of parallelism that was decided during
the computeEffectiveParallelism() traversal. That selected degree of
parallelism, however, is already bounded by the min and max parallelism
config, derived from the PROCESSING_COST_MIN_THREADS and
MAX_FRAGMENT_INSTANCES_PER_NODE options respectively.

This patch calculates an unbounded version of CpuAsk that is not bounded
by min and max parallelism config. It is purely based on the fragment's
ProcessingCost and query plan relationship constraint (for example, the
number of JOIN BUILDER fragments should equal the number of destination
JOIN fragments for partitioned join).

The Frontend receives both bounded and unbounded CpuAsk values from
TQueryExecRequest on each executor group set selection round. The
unbounded CpuAsk is then scaled down once using an nth-root-based
sublinear function, controlled by the total CPU count of the smallest
executor group set and the bounded CpuAsk number. Another linear scaling
is then applied to both the bounded and unbounded CpuAsk using the
QUERY_CPU_COUNT_DIVISOR option. The Frontend then compares the unbounded
CpuAsk after scaling against CpuMax to avoid assigning a query to a
small executor group set too soon. The last executor group set stays as
the "catch-all" executor group set.

After this patch, setting COMPUTE_PROCESSING_COST=True will show
following changes in query profile:
- The "max-parallelism" fields in the query plan will all be set to
  maximum parallelism based on ProcessingCost.
- The CpuAsk counter is changed to show the unbounded CpuAsk after
  scaling.
- A new counter CpuAskBounded shows the bounded CpuAsk after scaling. If
  QUERY_CPU_COUNT_DIVISOR=1 and PLANNER_CPU_ASK slot counting strategy
  is selected, this CpuAskBounded is also the minimum total admission
  slots given to the query.
- A new counter MaxParallelism shows the unbounded CpuAsk before
  scaling.
- The EffectiveParallelism counter remains unchanged,
  showing bounded CpuAsk before scaling.

Testing:
- Update and pass FE test TpcdsCpuCostPlannerTest and
  PlannerTest#testProcessingCost.
- Pass EE test tests/query_test/test_tpcds_queries.py
- Pass custom cluster test tests/custom_cluster/test_executor_groups.py

Change-Id: I5441e31088f90761062af35862be4ce09d116923
Reviewed-on: http://gerrit.cloudera.org:8080/21277
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-20 00:28:53 +00:00
Riza Suminto
6abfdbc56c IMPALA-12980: Translate CpuAsk into admission control slots
Impala has a concept of "admission control slots" - the amount of
parallelism that should be allowed on an Impala daemon. This defaults to
the number of processors per executor and can be overridden with
the --admission_control_slots flag.

Admission control slot accounting is described in IMPALA-8998. It
computes 'slots_to_use' for each backend based on the maximum number of
instances of any fragment on that backend. This can lead to slot
underestimation and query overadmission. For example, assume an executor
node with 48 CPU cores and configured with --admission_control_slots=48.
It is assigned 4 non-blocking query fragments, each with 12 instances
scheduled on this executor. The IMPALA-8998 algorithm will request the
max-instance count (12) in slots rather than the sum of all non-blocking
fragment instances (48). With the 36 remaining slots free, the executor
can still admit another fragment from a different query but will
potentially have CPU contention with the one that is currently running.

When COMPUTE_PROCESSING_COST is enabled, the Planner generates a CpuAsk
number that represents the CPU requirement of that query over a
particular executor group set. This number is an estimate of the largest
number of query fragment instances that can run in parallel without
waiting, given by the blocking operator analysis. Therefore, the
fragment trace that sums into that CpuAsk number can also be translated
into 'slots_to_use', which more closely resembles the maximum parallel
execution of fragment instances.

This patch adds a new query option called SLOT_COUNT_STRATEGY to control
which admission control slot accounting to use. There are two possible
values:
- LARGEST_FRAGMENT, which is the original algorithm from IMPALA-8998.
  This is still the default value for the SLOT_COUNT_STRATEGY option.
- PLANNER_CPU_ASK, which will follow the fragment trace that contributes
  towards CpuAsk number. This strategy will schedule more or equal
  admission control slots than the LARGEST_FRAGMENT strategy.

For the PLANNER_CPU_ASK strategy, the Planner marks fragments that
contribute to CpuAsk as dominant fragments. It also passes the
max_slot_per_executor information it knows about the executor group set
to the scheduler.

The AvgAdmissionSlotsPerExecutor counter is added to describe what the
Planner thinks the average 'slots_to_use' per backend will be, which
follows this formula:

  AvgAdmissionSlotsPerExecutor = ceil(CpuAsk / num_executors)

The actual 'slots_to_use' on each backend may differ from
AvgAdmissionSlotsPerExecutor, depending on what is scheduled on that
backend. 'slots_to_use' will be shown as the 'AdmissionSlots' counter
under each executor profile node.
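
Worked through with assumed numbers (Python 3 division), the counter
behaves like this:

```python
import math

# A query whose scaled CpuAsk is 100 on a 12-executor group set averages
# ceil(100 / 12) = 9 admission slots per backend.
cpu_ask = 100
num_executors = 12
print(math.ceil(cpu_ask / num_executors))  # 9
```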

Testing:
- Update test_executors.py with AvgAdmissionSlotsPerExecutor assertion.
- Pass test_tpcds_queries.py::TestTpcdsQueryWithProcessingCost.
- Add EE test test_processing_cost.py.
- Add FE test PlannerTest#testProcessingCostPlanAdmissionSlots.

Change-Id: I338ca96555bfe8d07afce0320b3688a0861663f2
Reviewed-on: http://gerrit.cloudera.org:8080/21257
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-18 21:58:13 +00:00
Riza Suminto
ac8ffa9125 IMPALA-12654: Add query option QUERY_CPU_COUNT_DIVISOR
IMPALA-11604 added a hidden backend flag named query_cpu_count_divisor
to allow oversubscribing CPU cores beyond what is available in the
executor group set. This patch adds a query option with the same name
and function so that CPU core matching can be tuned for individual
queries. The query option takes precedence over the flag.

Testing:
- Add test case in test_executor_groups.py and query-options-test.cc

Change-Id: I34ab47bd67509a02790c3caedb3fde4d1b6eaa78
Reviewed-on: http://gerrit.cloudera.org:8080/20819
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-01-04 08:18:10 +00:00
Riza Suminto
04f31aea71 IMPALA-12567: Deflake test_75_percent_availability
TestExecutorGroups.test_75_percent_availability can fail in certain
build/test setups when the last impalad does not start within the 5s
delay injection. This patch simplifies the test by launching fewer
impalads in total (reduced from 8 to 5, excluding the coordinator) and
increases the delay injection to ensure the test query runs on all five
executors. The test is renamed to test_partial_availability accordingly.

Testing:
- Run and pass the test against HDFS and S3.

Change-Id: I2e70f1dde10045c32c2bb4f6f78e8a707c9cd97d
Reviewed-on: http://gerrit.cloudera.org:8080/20712
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-11-17 19:06:49 +00:00
Riza Suminto
89a48b80a2 IMPALA-12444: Fix minimum parallelism bug in scan fragment
The scan fragment did not follow the PROCESSING_COST_MIN_THREADS set by
the user even if the total scan ranges allowed it to. This patch fixes
the issue by exposing ScanNode.maxScannerThreads_ to
PlanFragment.adjustToMaxParallelism(). By using
ScanNode.maxScannerThreads_ as an upper bound, ScanNode does not need to
artificially lower its ProcessingCost if maxScannerThreads_ is lower
than the minimum parallelism dictated by the original ProcessingCost.
Thus, the synthetic ProcessingCost logic in the ScanNode class is
revised to only apply if the input cardinality is unknown (-1).

This patch also makes the following adjustments:
- Remove some dead code in Frontend.java and PlanFragment.java.
- Add a sanity check such that PROCESSING_COST_MIN_THREADS <=
  MAX_FRAGMENT_INSTANCES_PER_NODE.
- Tidy up test_query_cpu_count_divisor_default to reduce the number of
  SET queries.

Testing:
- Update test_query_cpu_count_divisor_default to ensure that
  PROCESSING_COST_MIN_THREADS is respected by scan fragment and error
  is returned if PROCESSING_COST_MIN_THREADS is greater than
  MAX_FRAGMENT_INSTANCES_PER_NODE.
- Pass test_executor_groups.py.

Change-Id: I69e5a80146d4ac41de5ef406fc2bdceffe3ec394
Reviewed-on: http://gerrit.cloudera.org:8080/20475
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2023-10-02 16:36:29 +00:00
Abhishek Rawat
b73bc49ea7 IMPALA-12400: Test expected executors used for planning when no executor groups are healthy
Added a custom cluster test for the number of executors used for
planning when no executor groups are healthy. The Planner should use the
number of executors from 'num_expected_executors' or
'expected_executor_group_sets' when executor groups aren't healthy.

Change-Id: Ib71ca0a5402c74d07ee875878f092d6d3827c6b7
Reviewed-on: http://gerrit.cloudera.org:8080/20419
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-09-01 00:07:46 +00:00
Riza Suminto
0c8fc997ef IMPALA-12395: Override scan cardinality for optimized count star
The cardinality estimate in HdfsScanNode.java for count queries does
not account for the fact that the count optimization only scans metadata
and not the actual columns. The optimized count star scan returns only 1
row per Parquet row group.

This patch overrides the scan cardinality with the total number of
files, which is the closest estimate to the number of row groups. A
similar override already exists in IcebergScanNode.java.

Testing:
- Add count query testcases in test_query_cpu_count_divisor_default
- Pass core tests

Change-Id: Id5ce967657208057d50bd80adadac29ebb51cbc5
Reviewed-on: http://gerrit.cloudera.org:8080/20406
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-08-28 20:32:13 +00:00
Riza Suminto
0d3fc33bbb IMPALA-12300: (addendum) Remove HDFS specific assertion
test_75_percent_availability fails against the Ozone and S3 test
environments because the test expects the string "SCAN HDFS" to be found
in the profile. Instead, there is "SCAN OZONE" or "SCAN S3" for the
Ozone and S3 test environments respectively. This patch fixes the test
by removing that assertion from test_75_percent_availability. The
remaining assertions are enough to verify that the FE planner and BE
scheduler can see the cluster membership change.

Change-Id: Id14934d2fce0f6cf03242c36c0142bc697b4180e
Reviewed-on: http://gerrit.cloudera.org:8080/20259
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-07-24 21:35:47 +00:00
Riza Suminto
0bee563073 IMPALA-12300: Turn CheckEffectiveInstanceCount to print warning
Scheduler::CheckEffectiveInstanceCount was added to check consistency
between FE planning and BE scheduling if COMPUTE_PROCESSING_COST=true.
This consistency can be broken if there is a cluster membership
change (a new executor comes online) between FE planning and BE
scheduling. Say, in an executor group of size 10 with a 90% health
threshold, the admission controller is allowed to run a query when only
9 executors are available. If the 10th executor comes online between FE
planning and BE scheduling, CheckEffectiveInstanceCount can fail and
return an error.

This patch turns two error statuses in CheckEffectiveInstanceCount into
warnings, written either to the query profile as an InfoString or to the
WARNING log. The MAX_FRAGMENT_INSTANCES_PER_NODE violation check still
returns an error.

Testing:
- Add test_75_percent_availability
- Pass test_executors.py

Change-Id: Ieaf6a46c4f12dbf8b03d1618c2f090ab4f2ac665
Reviewed-on: http://gerrit.cloudera.org:8080/20231
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-07-21 00:37:06 +00:00
Riza Suminto
c9aed342f5 IMPALA-12281: Disallow unsetting REQUEST_POOL if it is set by client
IMPALA-12056 enabled child queries to unset REQUEST_POOL if it is set
by Frontend.java as part of executor group selection. However, the
implementation missed calling setRequest_pool_set_by_frontend(false) if
REQUEST_POOL is explicitly set by the client request through the
impala-shell configuration. This caused child queries to always unset
REQUEST_POOL if the parent query was executed via impala-shell. This
patch fixes the issue by checking the query options that come from the
client.

This patch also tidies up null and empty REQUEST_POOL checking by using
StringUtils.isNotEmpty().

Testing:
- Add testcase in test_query_cpu_count_divisor_default

Change-Id: Ib5036859d51bc64f568da405f730c8f3ffebb742
Reviewed-on: http://gerrit.cloudera.org:8080/20189
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2023-07-13 22:40:41 +00:00
Riza Suminto
7a94adbc30 IMPALA-12192: Fix scaling bug in scan fragment
IMPALA-12091 has a bug where scan fragment parallelism is always
limited solely by the ScanNode cost. If the ScanNode is colocated with
other query operators that have higher processing costs, the Planner
will not scale the fragment up beyond what the ScanNode cost allows.

This patch fixes the problem in two aspects. The first is to allow a
scan fragment to scale up higher as long as it is within the total
fragment cost and the number of effective scan ranges. The second is to
add a missing Math.max() in CostingSegment.java, whose absence caused
lower fragment parallelism even when the total fragment cost is high.

The IMPALA-10287 optimization is re-enabled to reduce the regression in
TPC-DS Q78. Ideally, the broadcast vs. partitioned costing formula
during distributed planning should not rely on numInstance, but enabling
this optimization ensures a consistent query plan shape when comparing
against the MT_DOP plan. This optimization can still be disabled by
specifying USE_DOP_FOR_COSTING=false.

This patch also does some cleanup including:
- Fix "max-parallelism" value in explain string.
- Make a constant in ScanNode.rowMaterializationCost() into a backend
  flag named scan_range_cost_factor for experimental purposes.
- Replace all references to ProcessingCost.isComputeCost() with direct
  calls to queryOptions.isCompute_processing_cost().
- Add a Precondition in PlanFragment.getNumInstances() to verify that
  the fragment's instance count is not modified anymore after the
  costing algorithm finishes.

Testing:
- Manually run TPCDS Q84 over tpcds10_parquet and confirm that the
  leftmost scan fragment parallelism is raised from 12 (before the
  patch) to 18 (after the patch).
- Add test in PlannerTest.testProcessingCost that reproduces the issue.
- Update compute stats test in test_executor_groups.py to maintain test
  assertion.
- Pass core tests.

Change-Id: I7010f6c3bc48ae3f74e8db98a83f645b6c157226
Reviewed-on: http://gerrit.cloudera.org:8080/20024
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-06-24 02:24:41 +00:00
Riza Suminto
dbddb08447 IMPALA-12120: Limit output writer parallelism based on write volume
The new processing cost-based planner changes (IMPALA-11604,
IMPALA-12091) will impact output writer parallelism for insert queries,
with the potential for more small files if the processing cost-based
planning results in too many writer fragments. It can further exacerbate
a problem introduced by MT_DOP (see IMPALA-8125).

The MAX_FS_WRITERS query option can help mitigate this. But even without
the MAX_FS_WRITERS set, the default output writer parallelism should
avoid creating excessive writer parallelism for partitioned and
unpartitioned inserts.

This patch implements such a limit when using the cost-based planner. It
limits the number of writer fragments such that each writer fragment
writes at least 256MB of rows. This patch also allows CTAS (a kind of
DDL query) to be eligible for auto-scaling.

This patch also removes comments about NUM_SCANNER_THREADS added by
IMPALA-12029, since they no longer apply after IMPALA-12091.

Testing:
- Add test cases in test_query_cpu_count_divisor_default
- Add test_processing_cost_writer_limit in test_insert.py
- Pass test_insert.py::TestInsertHdfsWriterLimit
- Pass test_executor_groups.py

Change-Id: I289c6ffcd6d7b225179cc9fb2f926390325a27e0
Reviewed-on: http://gerrit.cloudera.org:8080/19880
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-26 00:08:02 +00:00
Riza Suminto
1d0b111bcf IMPALA-12091: Control scan parallelism by its processing cost
Before this patch, Impala still relied on the MT_DOP option to decide
the degree of parallelism of the scan fragment when a query ran with
COMPUTE_PROCESSING_COST=1. This patch adds the scan node's processing
cost as another consideration to raise scan parallelism beyond MT_DOP.

The scan node cost is now adjusted to also consider the number of
effective scan ranges. Each scan range is given a weight of (0.5% *
min_processing_per_thread), which roughly means that one scan node
instance can handle at most 200 scan ranges.
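
The arithmetic behind the "at most 200 scan ranges" statement (the
min_processing_per_thread value is a placeholder, not Impala's actual
default):

```python
min_processing_per_thread = 1000 * 1000
scan_range_weight = 0.005 * min_processing_per_thread  # 0.5% per scan range
print(min_processing_per_thread / scan_range_weight)   # 200.0 ranges/thread
```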

The query option MAX_FRAGMENT_INSTANCES_PER_NODE is added as an upper
bound on scan parallelism if COMPUTE_PROCESSING_COST=true. If the number
of scan ranges is fewer than the maximum parallelism allowed by the scan
node's processing cost, that processing cost will be clamped down
to (min_processing_per_thread / number of scan ranges). Lowering
MAX_FRAGMENT_INSTANCES_PER_NODE can also clamp down the scan processing
cost in a similar way. For interior fragments, a combination of
MAX_FRAGMENT_INSTANCES_PER_NODE, PROCESSING_COST_MIN_THREADS, and the
number of available cores per node is taken into account to determine
the maximum fragment parallelism per node. For the scan fragment, only
the first two are considered, to encourage the Frontend to choose a
larger executor group as needed.

Two new static states are added to exec-node.h: is_mt_fragment_ and
num_instances_per_node_. The backend code that refers to the MT_DOP
option is replaced with either is_mt_fragment_ or
num_instances_per_node_.

Two new criteria are added during effective parallelism calculation in
PlanFragment.adjustToMaxParallelism():

- If a fragment has UnionNode, its parallelism is the maximum between
  its input fragments and its collocated ScanNode's expected
  parallelism.
- If a fragment only has a single ScanNode (and no UnionNode), its
  parallelism is calculated in the same fashion as the interior fragment
  but will not be lowered anymore since it will not have any child
  fragment to compare with.

Admission control slots remain unchanged. This may cause a query to
fail admission if the Planner selects a scan parallelism that is higher
than the configured admission control slots value. Setting
MAX_FRAGMENT_INSTANCES_PER_NODE equal to or lower than the configured
admission control slots value can help lower scan parallelism and pass
the admission controller.

The previous workaround to control scan parallelism from IMPALA-12029
is now removed. This patch also disables the IMPALA-10287 optimization
if COMPUTE_PROCESSING_COST=true. This is because IMPALA-10287 relies on
a fixed number of fragment instances in DistributedPlanner.java, whereas
the effective parallelism calculation is done much later and may change
the final number of instances of the hash join fragment, rendering the
DistributionMode selected by IMPALA-10287 inaccurate.

This patch is benchmarked using single_node_perf_run.py with the
following parameters:

args="-gen_experimental_profile=true -default_query_options="
args+="mt_dop=4,compute_processing_cost=1,processing_cost_min_threads=1 "
./bin/single_node_perf_run.py --num_impalads=3 --scale=10 \
    --workloads=tpcds --iterations=5 --table_formats=parquet/none/none \
    --impalad_args="$args" \
    --query_names=TPCDS-Q3,TPCDS-Q14-1,TPCDS-Q14-2,TPCDS-Q23-1,TPCDS-Q23-2,TPCDS-Q49,TPCDS-Q76,TPCDS-Q78,TPCDS-Q80A \
    "IMPALA-12091~1" IMPALA-12091

The benchmark result is as follows:
+-----------+-------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload  | Query       | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-----------+-------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| TPCDS(10) | TPCDS-Q23-1 | parquet / none / none | 4.62   | 4.54        |   +1.92%   |   0.23%    |   1.59%        | 5     |   +2.32%       | 1.15    | 2.67  |
| TPCDS(10) | TPCDS-Q14-1 | parquet / none / none | 5.82   | 5.76        |   +1.08%   |   5.27%    |   3.89%        | 5     |   +2.04%       | 0.00    | 0.37  |
| TPCDS(10) | TPCDS-Q23-2 | parquet / none / none | 4.65   | 4.58        |   +1.38%   |   1.97%    |   0.48%        | 5     |   +0.81%       | 0.87    | 1.51  |
| TPCDS(10) | TPCDS-Q49   | parquet / none / none | 1.49   | 1.48        |   +0.46%   | * 36.02% * | * 34.95% *     | 5     |   +1.26%       | 0.58    | 0.02  |
| TPCDS(10) | TPCDS-Q14-2 | parquet / none / none | 3.76   | 3.75        |   +0.39%   |   1.67%    |   0.58%        | 5     |   -0.03%       | -0.58   | 0.49  |
| TPCDS(10) | TPCDS-Q78   | parquet / none / none | 2.80   | 2.80        |   -0.04%   |   1.32%    |   1.33%        | 5     |   -0.42%       | -0.29   | -0.05 |
| TPCDS(10) | TPCDS-Q80A  | parquet / none / none | 2.87   | 2.89        |   -0.51%   |   1.33%    |   0.40%        | 5     |   -0.01%       | -0.29   | -0.82 |
| TPCDS(10) | TPCDS-Q3    | parquet / none / none | 0.18   | 0.19        |   -1.29%   | * 15.26% * | * 15.87% *     | 5     |   -0.54%       | -0.87   | -0.13 |
| TPCDS(10) | TPCDS-Q76   | parquet / none / none | 1.08   | 1.11        |   -2.98%   |   0.92%    |   1.70%        | 5     |   -3.99%       | -2.02   | -3.47 |
+-----------+-------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

Testing:
- Pass PlannerTest.testProcessingCost
- Pass test_executor_groups.py
- Reenable test_tpcds_q51a in TestTpcdsQueryWithProcessingCost with
  MAX_FRAGMENT_INSTANCES_PER_NODE set to 5
- Pass test_tpcds_queries.py::TestTpcdsQueryWithProcessingCost
- Pass core tests

Change-Id: If948e45455275d9a61a6cd5d6a30a8b98a7c729a
Reviewed-on: http://gerrit.cloudera.org:8080/19807
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-11 22:46:31 +00:00
Riza Suminto
7ca20b3c94 Revert "IMPALA-11123: Optimize count(star) for ORC scans"
This reverts commit f932d78ad0.

The commit is reverted because it causes a significant regression for
non-optimized count star queries in Parquet format.

There are several conflicts that need to be resolved manually:
- Removed assertion against 'NumFileMetadataRead' counter that is lost
  with the revert.
- Adjust the assertions in test_plain_count_star_optimization,
  test_in_predicate_push_down, and test_partitioned_insert of
  test_iceberg.py due to the missing improvement in the Parquet
  optimized count star code path.
- Keep the "override" specifier in hdfs-parquet-scanner.h to pass
  clang-tidy.
- Keep the Python 3 style of RuntimeError instantiation in
  test_file_parser.py to pass check-python-syntax.sh.

Change-Id: Iefd8fd0838638f9db146f7b706e541fe2aaf01c1
Reviewed-on: http://gerrit.cloudera.org:8080/19843
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2023-05-06 22:55:05 +00:00
Riza Suminto
1830abcc7d IMPALA-12056: Let child queries unset REQUEST_POOL if auto-scaling
'Compute Stats' queries get scheduled on the smallest executor group
set since these queries don't do any real work. However, their child
queries also get scheduled on the smallest executor group. This may not
be ideal for cases where the child query computes NDVs and counts on a
big, wide table.

This patch lets child queries unset the REQUEST_POOL query option if
that option was set by the frontend planner rather than the client. With
REQUEST_POOL unset, the child query can select the executor group that
best fits its workload.

Testing:
- Add test in test_query_cpu_count_divisor_default

Change-Id: I6dc559aa161a27a7bd5d3034788cc6241490d3b5
Reviewed-on: http://gerrit.cloudera.org:8080/19832
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-04 15:54:39 +00:00
Riza Suminto
f0b9340983 IMPALA-12041: Select first executor group if query not auto-scalable
In a setup with multiple executor groups, some trivial queries like
"select 1;" fail admission with a "No mapping found for request" error
message. This patch fixes a bug where the Frontend does not set the
group name prefix when a query is not auto-scalable. In cases like a
trivial query run, the correct executor group name prefix is still
needed for the backend to correctly resolve the target pool.
Testing:
- Pass test_executor_groups.py

Change-Id: I89497c8f67bfd176c2b60fa1b70fe53f905bbab0
Reviewed-on: http://gerrit.cloudera.org:8080/19691
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-04-06 07:51:30 +00:00
wzhou-code
ac23deab4d IMPALA-12036: Fix Web UI to show right resource pools
The Web UI queries page shows no resource pool unless it is specified
with a query option. The Planner can set TQueryCtx.request_pool in
TQueryExecRequest when auto-scaling is enabled, but the backend ignores
the TQueryCtx.request_pool in TQueryExecRequest when getting resource
pools for the Web UI.
This patch fixes the issue in ClientRequestState::request_pool() by
checking TQueryCtx.request_pool in TQueryExecRequest. It also removes
the error path in RequestPoolService::ResolveRequestPool() when
requested_pool is an empty string.

Testing:
 - Updated TestExecutorGroups::test_query_cpu_count_divisor_default,
   TestExecutorGroups::test_query_cpu_count_divisor_two, and
   TestExecutorGroups::test_query_cpu_count_divisor_fraction to
   verify resource pools on Web queries site and Web admission site.
 - Updated expected error message in
   TestAdmissionController::test_set_request_pool.
 - Passed core test.

Change-Id: Iceacb3a8ec3bd15a8029ba05d064bbbb81e3a766
Reviewed-on: http://gerrit.cloudera.org:8080/19688
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
2023-04-05 15:45:06 +00:00
Riza Suminto
b2bc488402 IMPALA-12029: Relax scan fragment parallelism on first planning
In a setup with multiple executor group sets, the Frontend tries to
match a query with the smallest executor group set that can fit the
memory and CPU requirements of the compiled query. There are kinds of
queries where the compiled plan fits any executor group set but does not
necessarily deliver the best performance. An example of this is Impala's
COMPUTE STATS query: it does a full table scan and aggregates the stats,
has a fairly simple query plan shape, but can benefit from higher scan
parallelism.

This patch relaxes the scan fragment parallelism on the first round of
query planning. This allows the scan fragment to increase its
parallelism based on its ProcessingCost estimation. If the relaxed plan
fits in an executor group set, we replan once again with that executor
group set but with scan fragment parallelism returned to MT_DOP. This
one extra round of query planning adds a couple of milliseconds of
overhead depending on the complexity of the query plan, but it is
necessary since the backend scheduler still expects at most MT_DOP scan
fragment instances. We can remove the extra replanning in the future
once we can fully manage scan node parallelism without MT_DOP.

This patch also adds some improvements, including:
- Tune computeScanProcessingCost() to guard against scheduling too many
  scan fragments by comparing with the actual scan range count that the
  Planner knows.
- Use NUM_SCANNER_THREADS as a hint to cap the scan node cost during the
  first round of planning.
- Multiply memory-related counters by the number of executors to make
  them per group set rather than per node.
- Fix a bug in doCreateExecRequest() in the selection of the number of
  executors for planning.

Testing:
- Pass test_executor_groups.py
- Add test cases in test_min_processing_per_thread_small.
- Raised impala.admission-control.max-query-mem-limit.root.small from
  64MB to 70MB in llama-site-3-groups.xml so that the new grouping query
  can fit in root.small pool.

Change-Id: I7a2276fbd344d00caa67103026661a3644b9a1f9
Reviewed-on: http://gerrit.cloudera.org:8080/19656
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2023-04-04 17:37:05 +00:00
zhangyifan27
35fe1f37f5 IMPALA-8731: Balance queries across multiple executor groups
This patch adds a new admission control policy that attempts to
balance queries across multiple executor groups belonging to the
same request pool based on available memory and slots in each
executor group. This feature can be enabled by setting the startup
flag '-balance_queries_across_executor_groups=true'. The setting is
off by default.

Testing:
  - Add e2e tests to verify the default policy and the new policy.

Change-Id: I25e851fb57c1d820c25cef5316f4ed800e4c6ac5
Reviewed-on: http://gerrit.cloudera.org:8080/19630
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2023-03-25 21:52:18 +00:00
Riza Suminto
1eb6ed6f29 IMPALA-12023: Skip resource checking on last executor group set
This patch adds the flag
skip_resource_checking_on_last_executor_group_set. If this backend flag
is set to true, memory and CPU resource checking will be skipped when a
query is being planned against the last (largest) executor group set.
Setting it to true ensures that a query will always be admitted into the
last executor group set if it does not fit in any other group set.
Testing:
- Tune test_query_cpu_count_divisor_fraction to run two test cases:
  CPU within limit, and CPU outside limit.
- Add test_no_skip_resource_checking

Change-Id: I5848e4f67939d3dd2fb105c1ae4ca8e15f2e556f
Reviewed-on: http://gerrit.cloudera.org:8080/19649
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-24 08:32:14 +00:00
Riza Suminto
f2b01c1ddb IMPALA-12017: Skip memory and cpu limit check if REQUEST_POOL is set
Memory and CPU limit checking in executor group
selection (Frontend.java) should be skipped if the REQUEST_POOL query
option is set. Setting REQUEST_POOL means the user is specifying the
pool to run the query in, regardless of memory and CPU limits.

Testing:
- Add test cases in test_query_cpu_count_divisor_default

Change-Id: I14bf7fe71e2dda1099651b3edf62480e1fdbf845
Reviewed-on: http://gerrit.cloudera.org:8080/19645
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2023-03-23 13:01:52 +00:00
Riza Suminto
5c1693d03d IMPALA-12005: Describe executor group set selection in query profile
This patch adds new profile counters under the Frontend profile node to
describe executor group set selection during query planning. It modifies
FrontendProfile.java to allow one level of TRuntimeProfileNode nesting
under the Frontend profile node. This makes it possible to group profile
counters specific to each executor group set under consideration. The
"fragment-costs" hint is renamed to "segment-costs". A new
"cpu-comparison-result" hint is added after "segment-costs" to help
navigate how the CPU sizing decision is made.

This patch also adds some function overloading in runtime-profile.cc to
hide TotalTime and InactiveTotalTime, which are meaningless for anything
under the Frontend profile node. Additional context is also added to the
AnalysisException thrown when none of the executor group sets fits the
query requirements.

This is how the Frontend profile node looks after running
TestExecutorGroups::test_query_cpu_count_divisor_fraction:

    Frontend:
      Referenced Tables: tpcds_parquet.store_sales
       - CpuCountDivisor: 0.20
       - ExecutorGroupsConsidered: 3 (3)
      Executor group 1 (root.tiny):
        Verdict: not enough cpu cores
         - CpuAsk: 15 (15)
         - CpuMax: 2 (2)
         - EffectiveParallelism: 3 (3)
         - MemoryAsk: 36.83 MB (38617088)
         - MemoryMax: 64.00 MB (67108864)
      Executor group 2 (root.small):
        Verdict: not enough cpu cores
         - CpuAsk: 25 (25)
         - CpuMax: 16 (16)
         - EffectiveParallelism: 5 (5)
         - MemoryAsk: 36.83 MB (38624004)
         - MemoryMax: 64.00 MB (67108864)
      Executor group 3 (root.large):
        Verdict: Match
         - CpuAsk: 35 (35)
         - CpuMax: 192 (192)
         - EffectiveParallelism: 7 (7)
         - MemoryAsk: 36.84 MB (38633570)
         - MemoryMax: 8388608.00 GB (9007199254740992)

Testing:
- Pass core tests

Change-Id: I6c0ac7f5216d631e4439fe97702e21e06d2eda8a
Reviewed-on: http://gerrit.cloudera.org:8080/19628
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2023-03-23 13:01:52 +00:00
Joe McDonnell
0c7c6a335e IMPALA-11977: Fix Python 3 broken imports and object model differences
Python 3 changed some object model methods:
 - __nonzero__ was removed in favor of __bool__
 - func_dict / func_name were removed in favor of __dict__ / __name__
 - The next() method on iterators was replaced by __next__
   (Code locations should use next(iter) rather than iter.next())
 - metaclasses are specified in a different way
 - Locations that specify __eq__ should also specify __hash__

Python 3 also moved some packages around (urllib2, Queue, httplib,
etc), and this adapts the code to use the new locations (usually
handled on Python 2 via future). This also fixes the code to
avoid referencing exception variables outside the exception block
and variables outside of a comprehension. Several of these seem
like false positives, but it is better to avoid the warning.
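
As a hypothetical example (not code from the Impala repo), a class
written to satisfy these rules on both Python 2 and 3 might look like:

  class Batch(object):
      def __init__(self, rows):
          self._rows = list(rows)

      def __bool__(self):          # Python 3 truth-testing hook
          return bool(self._rows)
      __nonzero__ = __bool__       # keep the Python 2 spelling working

      def __eq__(self, other):
          return isinstance(other, Batch) and self._rows == other._rows

      def __hash__(self):          # required alongside __eq__ on Python 3
          return hash(tuple(self._rows))

  it = iter([1, 2, 3])
  print(next(it))                  # use next(iter), not iter.next()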

This fixes these pylint warnings:
bad-python3-import
eq-without-hash
metaclass-assignment
next-method-called
nonzero-method
exception-escape
comprehension-escape

Testing:
 - Ran core tests
 - Ran release exhaustive tests

Change-Id: I988ae6c139142678b0d40f1f4170b892eabf25ee
Reviewed-on: http://gerrit.cloudera.org:8080/19592
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
eb66d00f9f IMPALA-11974: Fix lazy list operators for Python 3 compatibility
Python 3 changes list operators such as range, map, and filter
to be lazy. Some code that expects the list operators to evaluate
immediately will fail, e.g.

Python 2:
range(0,5) == [0,1,2,3,4]
True

Python 3:
range(0,5) == [0,1,2,3,4]
False

The fix is to wrap these locations with list(), i.e.

Python 3:
list(range(0,5)) == [0,1,2,3,4]
True

Since the base operators are now lazy, Python 3 also removes the
old lazy versions (e.g. xrange, ifilter, izip, etc). This uses
future's builtins package to convert the code to the Python 3
behavior (i.e. xrange -> future's builtins.range).
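
A small sketch of the resulting pattern, assuming the 'future' package
is installed (as it is in Impala's virtualenv):

  # On Python 2 this pulls Python 3 semantics from the 'future' package;
  # on Python 3 it is simply the standard builtins module.
  from builtins import filter, map, range

  evens = list(filter(lambda x: x % 2 == 0, range(10)))
  squares = list(map(lambda x: x * x, range(4)))
  assert evens == [0, 2, 4, 6, 8]
  assert squares == [0, 1, 4, 9]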

Most of the changes were done via these futurize fixes:
 - libfuturize.fixes.fix_xrange_with_import
 - lib2to3.fixes.fix_map
 - lib2to3.fixes.fix_filter

This eliminates the pylint warnings:
 - xrange-builtin
 - range-builtin-not-iterating
 - map-builtin-not-iterating
 - zip-builtin-not-iterating
 - filter-builtin-not-iterating
 - reduce-builtin
 - deprecated-itertools-function

Testing:
 - Ran core job

Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f
Reviewed-on: http://gerrit.cloudera.org:8080/19589
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
82bd087fb1 IMPALA-11973: Add absolute_import, division to all eligible Python files
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
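
A sketch of the header each file now carries and the effect of the
division import:

  from __future__ import absolute_import, division, print_function

  # With 'division' in effect, / is true division even on Python 2:
  assert 7 / 2 == 3.5
  # Use // wherever an integer result is required (indices, counts):
  assert 7 // 2 == 3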

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Riza Suminto
dafc0fb7a8 IMPALA-11604 (part 2): Compute Effective Parallelism of Query
Part 1 of IMPALA-11604 implements the ProcessingCost model for each
PlanNode and DataSink. This second part builds on top of the
ProcessingCost model by adjusting the number of instances for each
fragment after considering their production-consumption ratio, and then
finally returns a number representing the ideal CPU core count required
for a query to run efficiently. A more detailed explanation of the CPU
costing algorithm can be found in the three steps below.

I. Compute the total ProcessingCost of a fragment.

The costing algorithm splits a query fragment into several segments
divided by blocking PlanNode/DataSink boundaries. Each fragment segment
is a subtree of PlanNodes/DataSinks in the fragment with a DataSink or
blocking PlanNode as root and non-blocking leaves. All other nodes in
the segment are non-blocking. PlanNodes or DataSinks that belong to the
same segment have their ProcessingCosts summed. A new CostingSegment
class is added to represent this segment.

A fragment that has a blocking PlanNode or blocking DataSink is called a
blocking fragment. Currently, only JoinBuildSink is considered a
blocking DataSink. A fragment without any blocking nodes is called a
non-blocking fragment. Step III discusses blocking and non-blocking
fragments further.

Take as an example the following fragment plan, which is blocking since
it has three blocking PlanNodes: 12:AGGREGATE, 06:SORT, and 08:TOP-N.

  F03:PLAN FRAGMENT [HASH(i_class)] hosts=3 instances=6 (adjusted from 12)
  fragment-costs=[34974657, 2159270, 23752870, 22]
  08:TOP-N [LIMIT=100]
  |  cost=900
  |
  07:ANALYTIC
  |  cost=23751970
  |
  06:SORT
  |  cost=2159270
  |
  12:AGGREGATE [FINALIZE]
  |  cost=34548320
  |
  11:EXCHANGE [HASH(i_class)]
     cost=426337

In bottom-up order, there are four segments in F03:
  Blocking segment 1: (11:EXCHANGE, 12:AGGREGATE)
  Blocking segment 2: 06:SORT
  Blocking segment 3: (07:ANALYTIC, 08:TOP-N)
  Non-blocking segment 4: DataStreamSink of F03

Therefore we have:
  PC(segment 1) = 426337+34548320
  PC(segment 2) = 2159270
  PC(segment 3) = 23751970+900
  PC(segment 4) = 22

These per-segment costs are stored in a CostingSegment tree rooted at
PlanFragment.rootSegment_, and are [34974657, 2159270, 23752870, 22]
respectively after the post-order traversal.

This is implemented in PlanFragment.computeCostingSegment() and
PlanFragment.collectCostingSegmentHelper().
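
The arithmetic for the F03 example can be sketched in a few lines of
Python (conceptual only; the real implementation is the Java code named
above):

  f03_segments = [
      [426337, 34548320],   # segment 1: 11:EXCHANGE + 12:AGGREGATE
      [2159270],            # segment 2: 06:SORT
      [23751970, 900],      # segment 3: 07:ANALYTIC + 08:TOP-N
      [22],                 # segment 4: DataStreamSink of F03
  ]
  fragment_costs = [sum(seg) for seg in f03_segments]
  assert fragment_costs == [34974657, 2159270, 23752870, 22]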

II. Compute the effective degree of parallelism (EDoP) of fragments.

The costing algorithm walks the PlanFragments of the query plan tree in
a post-order traversal. Upon visiting a PlanFragment, the costing
algorithm attempts to adjust the number of instances (effective
parallelism) of that fragment by comparing the last segment's
ProcessingCost of its child with the production-consumption rate between
its adjacent segments from step I. To simplify this initial
implementation, the parallelism of a PlanFragment containing an
EmptySetNode, UnionNode, or ScanNode remains unchanged (it follows
MT_DOP).

This step is implemented at PlanFragment.traverseEffectiveParallelism().

III. Compute the EDoP of the query.

The effective parallelism of a query is an upper bound on the number of
CPU cores that can work on the query in parallel, considering the
overlap between fragment execution and blocking operators. We compute
this in a post-order traversal similar to step II and split the query
tree into blocking fragment subtrees similar to step I. The following is
an example of a query plan from TPCDS-Q12.

  F04:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
  PLAN-ROOT SINK
  |
  13:MERGING-EXCHANGE [UNPARTITIONED]
  |
  F03:PLAN FRAGMENT [HASH(i_class)] hosts=3 instances=3 (adjusted from 12)
  08:TOP-N [LIMIT=100]
  |
  07:ANALYTIC
  |
  06:SORT
  |
  12:AGGREGATE [FINALIZE]
  |
  11:EXCHANGE [HASH(i_class)]
  |
  F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=12
  05:AGGREGATE [STREAMING]
  |
  04:HASH JOIN [INNER JOIN, BROADCAST]
  |
  |--F05:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
  |  JOIN BUILD
  |  |
  |  10:EXCHANGE [BROADCAST]
  |  |
  |  F02:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
  |  02:SCAN HDFS [tpcds10_parquet.date_dim, RANDOM]
  |
  03:HASH JOIN [INNER JOIN, BROADCAST]
  |
  |--F06:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
  |  JOIN BUILD
  |  |
  |  09:EXCHANGE [BROADCAST]
  |  |
  |  F01:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
  |  01:SCAN HDFS [tpcds10_parquet.item, RANDOM]
  |
  00:SCAN HDFS [tpcds10_parquet.web_sales, RANDOM]

A blocking fragment is a fragment that has a blocking PlanNode or
blocking DataSink in it. The costing algorithm splits the query plan
tree into blocking subtrees divided by blocking fragment boundaries.
Each blocking subtree has a blocking fragment as its root and
non-blocking fragments as intermediate or leaf nodes. From the TPCDS-Q12
example above, the query plan is divided into five blocking subtrees:
[(F05, F02), (F06, F01), F00, F03, F04].

A CoreCount is a container class that represents the CPU core
requirement of a subtree of a query or the query itself. Each blocking
subtree has its fragments' adjusted instance counts summed into a
single CoreCount. This means that all fragments within a blocking
subtree can run in parallel and should be assigned one core per fragment
instance. The CoreCount for each blocking subtree in the TPCDS-Q12
example is [4, 4, 12, 3, 1].

Upon visiting a blocking fragment, the maximum of the current
CoreCount (rooted at that blocking fragment) and the CoreCounts of the
previously visited blocking subtrees is taken, and the algorithm
continues up to the next ancestor PlanFragment. The final CoreCount for
the TPCDS-Q12 example is 12.

This step is implemented at Planner.computeBlockingAwareCores() and
PlanFragment.traverseBlockingAwareCores().
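
Conceptually, fragments within one blocking subtree contribute their
instance counts additively, while blocking subtrees contribute through a
maximum. A Python sketch of the TPCDS-Q12 numbers (illustrative only,
not the Java implementation):

  blocking_subtrees = {
      "F05+F02": [3, 1],   # join build subtree
      "F06+F01": [3, 1],   # join build subtree
      "F00":     [12],
      "F03":     [3],      # adjusted from 12
      "F04":     [1],
  }
  per_subtree = [sum(n) for n in blocking_subtrees.values()]
  assert per_subtree == [4, 4, 12, 3, 1]
  assert max(per_subtree) == 12   # final CoreCount / EDoP of the query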

The resulting CoreCount at the root PlanFragment is then taken as the
ideal CPU core count / EDoP of the query. This number is compared
against the total CPU count of an Impala executor group to determine
whether the query fits to run in that group set. A backend flag,
query_cpu_count_divisor, is added to scale the EDoP of a query down or
up if needed.

Two query options are added to control the entire computation of EDoP.
1. COMPUTE_PROCESSING_COST
   Controls whether this CPU costing algorithm is enabled.
   MT_DOP must also be set > 0 for this query option to take effect.

2. PROCESSING_COST_MIN_THREADS
   Controls the minimum number of fragment instances (threads) that the
   costing algorithm is allowed to adjust. The costing algorithm is in
   charge of increasing the fragment's instance count beyond this
   minimum number through producer-consumer rate comparison. The maximum
   number of fragment instances is the max of PROCESSING_COST_MIN_THREADS,
   MT_DOP, and the number of cores per executor.

This patch also adds three backend flags to tune the algorithm.
1. query_cpu_count_divisor
   Divides the CPU requirement of a query to fit the total available CPU
   in the executor group. For example, setting the value to 2 will fit a
   query with CPU requirement 2X into an executor group with total
   available CPU X. Note that a fractional value less than 1 effectively
   multiplies the query's CPU requirement (see the sketch after this
   list). A valid value is > 0.0. The default value is 1.

2. processing_cost_use_equal_expr_weight
   If true, all expression evaluations are weighted equally to 1 during
   a plan node's processing cost calculation. If false, the expression
   costs from IMPALA-2805 are used. Defaults to true.

3. min_processing_per_thread
   Minimum processing load (in processing cost units) that a fragment
   instance needs to work on before the planner considers increasing the
   instance count based on processing cost rather than the MT_DOP
   setting. The decision is made per fragment. Setting this to a high
   number reduces the parallelism of a fragment (more work per
   instance), while setting it to a low number increases parallelism
   (less work per instance). Actual parallelism might still be
   constrained by the total number of cores in the selected executor
   group, MT_DOP, or the PROCESSING_COST_MIN_THREADS query option. Must
   be a positive integer. Currently defaults to 10M.
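
A rough sketch of the query_cpu_count_divisor scaling referred to in
item 1 above (Python; rounding up is assumed here purely for
illustration):

  import math

  def scaled_cpu_requirement(edop, query_cpu_count_divisor=1.0):
      # query_cpu_count_divisor must be > 0.0; values < 1 inflate the ask.
      return int(math.ceil(edop / query_cpu_count_divisor))

  assert scaled_cpu_requirement(24, 2.0) == 12  # 2X requirement fits in X cores
  assert scaled_cpu_requirement(3, 0.2) == 15   # fractional divisor multiplies the ask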

As an example, the following is additional ProcessingCost information
printed to the coordinator log for Q3, Q12, and Q15, run on TPCDS 10GB
scale with 3 executors, MT_DOP=4, PROCESSING_COST_MAX_THREADS=4, and
processing_cost_use_equal_expr_weight=false.

  Q3
  CoreCount={total=12 trace=F00:12}

  Q12
  CoreCount={total=12 trace=F00:12}

  Q15
  CoreCount={total=15 trace=N07:3+F00:12}

There are a few TODOs which will be done in follow-up tasks:
1. Factor in row width in ProcessingCost calculation (IMPALA-11972).
2. Tune the individual expression cost from IMPALA-2805.
3. Benchmark and tune min_processing_per_thread with an optimal value.
4. Revisit cases where cardinality is not available (getCardinality() or
   getInputCardinality() return -1).
5. Bound SCAN and UNION fragments by ProcessingCost as well (need to
   address IMPALA-8081).

Testing:
- Add TestTpcdsQueryWithProcessingCost, which is a similar run of
  TestTpcdsQuery, but with COMPUTE_PROCESSING_COST=1 and MT_DOP=4.
  Setting log level TRACE for PlanFragment and manually running
  TestTpcdsQueryWithProcessingCost in the minicluster shows several
  fragment instance count reductions from 12 to 9, 6, or 3 in the
  coordinator log.
- Add PlannerTest#testProcessingCost
  Adjusted fragment count is indicated by "(adjusted from 12)" in the
  query profile.
- Add TestExecutorGroups::test_query_cpu_count_divisor.

Co-authored-by: Qifan Chen <qchen@cloudera.com>

Change-Id: Ibb2a796fdf78336e95991955d89c671ec82be62e
Reviewed-on: http://gerrit.cloudera.org:8080/19593
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2023-03-08 15:32:28 +00:00
zhangyifan27
253423f99c IMPALA-11891: Remove empty executor groups
This patch removes executor groups from cluster membership after they
have no executors, so that executor groups' configurations can be
updated without restarting all impalads in the cluster.

Testing:
- Added an e2e test to verify the new functionality.

Change-Id: I480b84b26a780d345216004f1a4657c7b95dda45
Reviewed-on: http://gerrit.cloudera.org:8080/19468
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2023-02-04 01:11:17 +00:00
wzhou-code
43928b190b IMPALA-11617: Pool service should be made aware of cpu core limit
IMPALA-11604 enables the planner to compute CPU usage for certain
queries and to select suitable executor groups to run. The CPU usage is
expressed as the CPU cores required to process a query.

This patch added the CPU core limit, which is the maximum number of
CPU cores available per node and coordinator for each executor group,
to the pool service.

Testing:
 - Passed core run.
 - Verified that CPU cores were shown on the admission and
   metrics pages of the Impala debug web server.

Change-Id: Id4c5ee519ce7c329b06ac821283e215a3560f525
Reviewed-on: http://gerrit.cloudera.org:8080/19366
Reviewed-by: Andrew Sherman <asherman@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-28 01:34:14 +00:00
Yida Wu
839a25c89b IMPALA-11786: Preserve memory for codegen cache
IMPALA-11470 adds support for the codegen cache; however, the admission
controller is not aware of the memory usage of the codegen cache, even
though the codegen cache actually uses memory from the query memory
quota. This could result in query failures when running heavy workloads
once the admission controller has fully admitted queries.

This patch subtracts the codegen cache capacity from the admission
memory limit during initialization, thereby reserving the memory
consumption of the codegen cache from the beginning and treating it as
separate memory, independent of the query memory reservation.

It also reduces the max codegen cache memory from 20 percent to 10
percent, and updates some test cases that failed due to the reduction of
the admission memory limit.
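
A rough arithmetic sketch of the change (Python; identifiers are
illustrative, and the 10 percent is assumed here to be taken from the
admission memory limit):

  admit_mem_limit = 8 * 1024 ** 3                   # e.g. 8 GB at startup
  codegen_cache_capacity = admit_mem_limit // 10    # max cache, now 10% (was 20%)
  admit_mem_limit -= codegen_cache_capacity         # carve the cache out up front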

Tests:
Passed exhaustive tests.

Change-Id: Iebdc04ba1b91578d74684209a11c815225b8505a
Reviewed-on: http://gerrit.cloudera.org:8080/19377
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-06 06:28:07 +00:00
Wenzhe Zhou
1b59d32eff Revert "IMPALA-11617: Pool service should be made aware of processing cost limit"
This reverts commit 1d62bddb84.

Change-Id: I1ebf5ff9685072079e18497d869d06b2c55153fe
Reviewed-on: http://gerrit.cloudera.org:8080/19139
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2022-10-14 04:01:06 +00:00
wzhou-code
1d62bddb84 IMPALA-11617: Pool service should be made aware of processing cost limit
IMPALA-11604 enables the planner to compute CPU usage for certain
queries and to select suitable executor groups to run. The CPU usage is
expressed as the total amount of data to be processed per query.

This patch added the processing cost limit, which is the total amount of
data that each executor group can handle, to the pool service.

Testing:
 - Passed core run.
 - Verified that processing costs were shown on the admission and
   metrics pages of the Impala debug web server.

Change-Id: I9bd2a7284eda47a969ef91e4be19f96d2af53449
Reviewed-on: http://gerrit.cloudera.org:8080/19121
Reviewed-by: Qifan Chen <qchen@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-10-12 23:46:57 +00:00
Michael Smith
11157a8701 IMPALA-10889: Allow extra 1MB for fragment cancellation
After queries are cancelled, it can take some time (>30s in some
instances) to fully cancel all fragment instances and fully reclaim
reserved memory. The test and query limits were exactly matched, so any
extra reservation would prevent scheduling, causing the test to
frequently time out. With the fix, 1MB of extra memory is reserved to
break the tie, thus avoiding the timeout. The extra 1MB of memory can be
seen in logs printing agg_mem_reserved.

Rather than extend timeouts and make the test run longer, add a small
buffer to the admission limit to allow for fragment instance cleanup
while the test runs.

Change-Id: Iaee557ad87d3926589b30d6dcdd850e9af9b3476
Reviewed-on: http://gerrit.cloudera.org:8080/19092
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-10-11 00:18:07 +00:00
Riza Suminto
f932d78ad0 IMPALA-11123: Optimize count(star) for ORC scans
This patch provides count(star) optimization for ORC scans, similar to
the work done in IMPALA-5036 for Parquet scans. We use the stripes'
num-rows statistics when computing the count star instead of
materializing empty rows. The aggregate function is changed from a count
to a special sum function initialized to 0.
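
A conceptual Python sketch of the optimization (illustrative names; the
actual change is in the C++ scanner and aggregate functions):

  def count_star_from_stripe_stats(stripe_num_rows):
      total = 0                    # the special SUM starts at 0
      for num_rows in stripe_num_rows:
          total += num_rows        # add each ORC stripe's row count
      return total

  assert count_star_from_stripe_stats([1000, 2500, 750]) == 4250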

This count(star) optimization is disabled for full ACID tables
because the scanner might need to read and validate the
'currentTransaction' column in the table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation in general by
serving the result just from the file's footer stats for both Parquet
and ORC. We unify the optimized count star and zero slot scan functions
into HdfsColumnarScanner.

The following table shows a performance comparison before and after the
patch. The primitive_count_star query targets the tpch10_parquet.lineitem
table (10GB scale TPC-H). Meanwhile, count_star_parq and count_star_orc
are modified primitive_count_star queries that target the
tpch_parquet.lineitem and tpch_orc_def.lineitem tables respectively.

+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload          | Query                | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| tpch_parquet      | count_star_parq      | parquet / none / none | 0.06   | 0.07        |   -10.45%  |   2.87%    | * 25.51% *     | 9     |   -1.47%       | -1.26   | -1.22 |
| tpch_orc_def      | count_star_orc       | orc / def / none      | 0.06   | 0.08        |   -22.37%  |   6.22%    | * 30.95% *     | 9     |   -1.85%       | -1.16   | -2.14 |
| TARGETED-PERF(10) | primitive_count_star | parquet / none / none | 0.06   | 0.08        | I -30.40%  |   2.68%    | * 29.63% *     | 9     | I -7.20%       | -2.42   | -3.07 |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Reviewed-on: http://gerrit.cloudera.org:8080/18327
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-04-05 13:27:10 +00:00
Qifan Chen
07a3e6e0df IMPALA-10992 Planner changes for estimate peak memory
This patch provides replan support for multiple executor group sets.
Each executor group set is associated with a distinct number of nodes
and a threshold for estimated memory per host in bytes that can be
denoted as [<group_name_prefix>:<#nodes>, <threshold>].

In the patch, a query of type EXPLAIN, QUERY or DML can be compiled
more than once. In each attempt, per-host memory is estimated and
compared with the threshold of an executor group set. If the estimated
memory is no more than the threshold, the iteration process terminates
and the final plan is determined. The executor group set with that
threshold is selected to run the query.
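
A conceptual sketch (Python) of this replan loop; the names are
illustrative and the real logic lives in the Java frontend:

  def select_executor_group_set(estimate_per_host_mem, group_sets):
      # group_sets: [(name_prefix, num_nodes, mem_threshold_bytes), ...]
      # ordered from the smallest to the largest threshold.
      for prefix, num_nodes, threshold in group_sets:
          if estimate_per_host_mem(num_nodes) <= threshold:
              return prefix        # final plan determined; stop iterating
      return None                  # no set fits (handled elsewhere)

  # Example using the artificial test setup described further below:
  sets = [("regular", 3, 64 * 1024 ** 2), ("large", 3, 8 * 1024 ** 5)]
  assert select_executor_group_set(lambda n: 300 * 1024 ** 2, sets) == "large"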

A new query option 'enable_replan', defaulting to 1 (enabled), is
added. It can be set to 0 to disable this feature and to generate the
distributed plan for the default executor group.

To avoid long compilation times, the following enhancements are
enabled. Note that 1) can be disabled when a relevant metadata change is
detected.

 1. Authorization is performed only for the 1st compilation;
 2. openTransaction() is called for transactional queries in the 1st
    compilation and the saved transactional info is used in
    subsequent compilations. Similar logic is applied to Kudu
    transactional queries.

To facilitate testing, the patch imposes an artificial two-executor-group
setup in the FE as follows.

 1. [regular:<#nodes>, 64MB]
 2. [large:<#nodes>, 8PB]

This setup is enabled when a new query option 'test_replan' is set
to 1 in backend tests, or RuntimeEnv.INSTANCE.isTestEnv() is true as
in most frontend tests. This query option is set to 0 by default.

Compilation time increases when a query is compiled in several
iterations, as shown below for several TPCDS queries. The increase
is mostly due to redundant work in either single-node plan creation
or recomputing the value transfer graph. For small queries, the
increase can be avoided if they can be compiled in a single iteration
by properly setting the smallest threshold among all executor group
sets. For example, for the set of queries listed below, the smallest
threshold can be set to 320MB to catch both q15 and q21 in one
compilation.

                              Compilation time (ms)
Queries	 Estimated Memory   2-iterations  1-iteration  Percentage of
                                                         increase
 q1         408MB              60.14         25.75       133.56%
 q11	   1.37GB             261.00        109.61       138.11%
 q10a	    519MB             139.24         54.52       155.39%
 q13	    339MB             143.82         60.08       139.38%
 q14a	   3.56GB             762.68        312.92       143.73%
 q14b	   2.20GB             522.01        245.13       112.95%
 q15	    314MB               9.73          4.28       127.33%
 q21	    275MB              16.00          8.18        95.59%
 q23a	   1.50GB             461.69        231.78        99.19%
 q23b	   1.34GB             461.31        219.61       110.05%
 q4	   2.60GB             218.05        105.07       107.52%
 q67	   5.16GB             694.59        334.24       101.82%

Testing:
 1. Almost all FE and BE tests are now run in the artificial two
    executor setup except a few where a specific cluster configuration
    is desirable;
 2. Ran core tests successfully;
 3. Added a new observability test and a new query assignment test;
 4. Disabled concurrent insert test (test_concurrent_inserts) and
    failing inserts (test_failing_inserts) test in local catalog mode
    due to flakiness. Reported both in IMPALA-11189 and IMPALA-11191.

Change-Id: I75cf17290be2c64fd4b732a5505bdac31869712a
Reviewed-on: http://gerrit.cloudera.org:8080/18178
Reviewed-by: Qifan Chen <qchen@cloudera.com>
Tested-by: Qifan Chen <qchen@cloudera.com>
2022-03-21 20:17:28 +00:00
Bikramjeet Vig
0aedf6021f IMPALA-11063: Add metrics to expose state of each executor group set
This adds metrics for each executor group set that expose the number
of executor groups, the number of healthy executor groups and the
total number of backends associated with that group set.

Testing:
Added an e2e test to verify metrics are updated correctly.

Change-Id: Ib39f940de830ef6302785aee30eeb847fa5deeba
Reviewed-on: http://gerrit.cloudera.org:8080/18142
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-01-20 09:46:32 +00:00
Bikramjeet Vig
65c3b784a2 IMPALA-11033: Add support for specifying multiple executor group sets
This patch introduces the concept of executor group sets. Each group
set specifies an executor group name prefix and an expected group
size (the number of executors in each group). Every executor group
that is a part of this set will have the same prefix which will
also be equivalent to the resource pool name that it maps to.
These sets are specified via a startup flag
'expected_executor_group_sets' which is a comma separated list in
the format <executor_group_name_prefix>:<expected_group_size>.
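
For illustration, a tiny Python sketch of parsing that flag format
(hypothetical values; the real parsing happens in the startup code):

  def parse_expected_executor_group_sets(flag_value):
      sets = []
      for entry in flag_value.split(","):
          prefix, size = entry.rsplit(":", 1)
          sets.append((prefix, int(size)))
      return sets

  assert parse_expected_executor_group_sets("root.small:2,root.large:10") == [
      ("root.small", 2), ("root.large", 10)]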

Testing:
- Added unit tests

Change-Id: I9e0a3a5fe2b1f0b7507b7c096b7a3c373bc2e684
Reviewed-on: http://gerrit.cloudera.org:8080/18093
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-01-08 09:26:37 +00:00