This patch provides count(star) optimization for ORC scans, similar to
the work done in IMPALA-5036 for Parquet scans. We use the stripe-level
num-rows statistics when computing the count star instead of
materializing empty rows. The aggregate function is changed from a count
to a special sum function initialized to 0.
This count(star) optimization is disabled for full ACID tables because
the scanner might need to read and validate the 'currentTransaction'
column in the table's special schema.
This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation in general by
serving the result just from the file's footer stats for both Parquet
and ORC. We unify the optimized count star and zero slot scan functions
into HdfsColumnarScanner.
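For illustration, the following minimal, self-contained C++ sketch shows
the idea of serving count(star) from stripe-level row counts instead of
materializing rows. The StripeStats struct and CountStarFromFooter
function are hypothetical stand-ins, not Impala's or the ORC library's
actual classes.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Hypothetical footer metadata for one ORC stripe; only the field the
    // optimization needs is modeled here.
    struct StripeStats {
      uint64_t num_rows;
    };

    // Answer count(*) for one file purely from stripe-level statistics.
    // The aggregate starts from 0 and sums row counts, so a file with no
    // stripes correctly contributes 0 rather than producing no value.
    uint64_t CountStarFromFooter(const std::vector<StripeStats>& stripes) {
      uint64_t count = 0;  // the "special sum" initialized to 0
      for (const StripeStats& s : stripes) count += s.num_rows;
      return count;
    }

    int main() {
      std::vector<StripeStats> stripes = {{1000}, {2500}, {0}};
      std::cout << CountStarFromFooter(stripes) << std::endl;  // prints 3500
      return 0;
    }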
The following table shows a performance comparison before and after the
patch. The primitive_count_star query targets the tpch10_parquet.lineitem
table (10GB scale TPC-H). The count_star_parq and count_star_orc queries
are modified versions of primitive_count_star that target the
tpch_parquet.lineitem and tpch_orc_def.lineitem tables respectively.
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload | Query | File Format | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| tpch_parquet | count_star_parq | parquet / none / none | 0.06 | 0.07 | -10.45% | 2.87% | * 25.51% * | 9 | -1.47% | -1.26 | -1.22 |
| tpch_orc_def | count_star_orc | orc / def / none | 0.06 | 0.08 | -22.37% | 6.22% | * 30.95% * | 9 | -1.85% | -1.16 | -2.14 |
| TARGETED-PERF(10) | primitive_count_star | parquet / none / none | 0.06 | 0.08 | I -30.40% | 2.68% | * 29.63% * | 9 | I -7.20% | -2.42 | -3.07 |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests
Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Reviewed-on: http://gerrit.cloudera.org:8080/18327
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch provides replan support for multiple executor group sets.
Each executor group set is associated with a distinct number of nodes
and a threshold for estimated memory per host in bytes that can be
denoted as [<group_name_prefix>:<#nodes>, <threshold>].
In the patch, a query of type EXPLAIN, QUERY or DML can be compiled
more than once. In each attempt, the per-host memory is estimated and
compared with the threshold of an executor group set. If the estimated
memory is no more than the threshold, the iteration terminates and the
final plan is determined. The executor group set with that threshold is
selected to run the query.
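For illustration only, the following minimal C++ sketch shows the shape
of this iteration. CompileForGroupSet and its fixed 500MB estimate are
hypothetical stand-ins for the planner, the group sets are assumed to be
sorted by ascending threshold, and the 64MB/8PB values mirror the test
setup described further below.

    #include <cstdint>
    #include <iostream>
    #include <optional>
    #include <string>
    #include <vector>

    // Hypothetical executor group set, denoted
    // [<group_name_prefix>:<#nodes>, <threshold>].
    struct ExecutorGroupSet {
      std::string prefix;
      int num_nodes;
      int64_t mem_threshold_bytes;  // threshold for estimated memory per host
    };

    struct Plan {
      int64_t per_host_mem_estimate_bytes;
      std::string group_prefix;  // group set the plan was compiled for
    };

    // Placeholder standing in for the real planner: compiles the query against
    // a cluster of gs.num_nodes executors and returns the per-host estimate.
    Plan CompileForGroupSet(const std::string& query, const ExecutorGroupSet& gs) {
      (void)query;
      return Plan{500LL * 1024 * 1024, gs.prefix};  // pretend estimate: 500MB
    }

    // Group sets are tried in ascending threshold order; the iteration stops at
    // the first set whose threshold covers the estimate, otherwise the last
    // compiled plan is used.
    std::optional<Plan> PlanWithReplan(const std::string& query,
                                       const std::vector<ExecutorGroupSet>& sets) {
      std::optional<Plan> plan;
      for (const ExecutorGroupSet& gs : sets) {
        plan = CompileForGroupSet(query, gs);
        if (plan->per_host_mem_estimate_bytes <= gs.mem_threshold_bytes) break;
      }
      return plan;
    }

    int main() {
      std::vector<ExecutorGroupSet> sets = {
          {"regular", 3, 64LL * 1024 * 1024},                           // 64MB
          {"large", 3, 8LL * 1024 * 1024 * 1024 * 1024 * 1024}};        // 8PB
      std::optional<Plan> plan = PlanWithReplan("select 1", sets);
      if (plan) std::cout << "run on group set: " << plan->group_prefix << "\n";
      return 0;
    }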
A new query option 'enable_replan', defaulting to 1 (enabled), is added.
It can be set to 0 to disable this feature and generate the distributed
plan for the default executor group.
To avoid long compilation times, the following enhancements are enabled.
Note that enhancement 1) is disabled when a relevant metadata change is
detected.
1. Authorization is performed only for the 1st compilation;
2. openTransaction() is called for transactional queries in the 1st
   compilation and the saved transaction info is reused in
   subsequent compilations. Similar logic is applied to Kudu
   transactional queries.
To facilitate testing, the patch imposes an artificial setup of two
executor group sets in the FE as follows.
1. [regular:<#nodes>, 64MB]
2. [large:<#nodes>, 8PB]
This setup is enabled when the new query option 'test_replan' is set
to 1 in backend tests, or when RuntimeEnv.INSTANCE.isTestEnv() is true,
as in most frontend tests. This query option defaults to 0.
Compilation time increases when a query is compiled in several
iterations, as shown below for several TPC-DS queries. The increase
is mostly due to redundant work in either the single node plan creation
or the value transfer graph recomputation phase. For small queries, the
increase can be avoided if they can be compiled in a single iteration
by properly setting the smallest threshold among all executor group
sets. For example, for the set of queries listed below, the smallest
threshold can be set to 320MB to catch both q15 and q21 in one
compilation.
Compilation time (ms)
Query   Estimated Memory   2 iterations   1 iteration   Increase
q1      408MB              60.14          25.75         133.56%
q11     1.37GB             261.00         109.61        138.11%
q10a    519MB              139.24         54.52         155.39%
q13     339MB              143.82         60.08         139.38%
q14a    3.56GB             762.68         312.92        143.73%
q14b    2.20GB             522.01         245.13        112.95%
q15     314MB              9.73           4.28          127.33%
q21     275MB              16.00          8.18          95.59%
q23a    1.50GB             461.69         231.78        99.19%
q23b    1.34GB             461.31         219.61        110.05%
q4      2.60GB             218.05         105.07        107.52%
q67     5.16GB             694.59         334.24        101.82%
Testing:
1. Almost all FE and BE tests are now run in the artificial two
executor setup except a few where a specific cluster configuration
is desirable;
2. Ran core tests successfully;
3. Added a new observability test and a new query assignment test;
4. Disabled the concurrent inserts test (test_concurrent_inserts) and
   the failing inserts test (test_failing_inserts) in local catalog mode
   due to flakiness. Reported in IMPALA-11189 and IMPALA-11191.
Change-Id: I75cf17290be2c64fd4b732a5505bdac31869712a
Reviewed-on: http://gerrit.cloudera.org:8080/18178
Reviewed-by: Qifan Chen <qchen@cloudera.com>
Tested-by: Qifan Chen <qchen@cloudera.com>
This adds metrics for each executor group set that expose the number
of executor groups, the number of healthy executor groups and the
total number of backends associated with that group set.
Testing:
Added an e2e test to verify metrics are updated correctly.
Change-Id: Ib39f940de830ef6302785aee30eeb847fa5deeba
Reviewed-on: http://gerrit.cloudera.org:8080/18142
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch introduces the concept of executor group sets. Each group
set specifies an executor group name prefix and an expected group
size (the number of executors in each group). Every executor group
that is a part of this set will have the same prefix which will
also be equivalent to the resource pool name that it maps to.
These sets are specified via a startup flag
'expected_executor_group_sets' which is a comma separated list in
the format <executor_group_name_prefix>:<expected_group_size>.
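As a hedged sketch only, the snippet below shows one way such a flag
value could be parsed; ParseExpectedGroupSets and the example prefixes
are hypothetical and not the actual Impala flag-parsing code.

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    struct ExpectedGroupSet {
      std::string prefix;
      int expected_size;
    };

    // Parse a value like "root.small:2,root.large:10" into group set entries,
    // splitting on ',' and then on the last ':' of each entry.
    std::vector<ExpectedGroupSet> ParseExpectedGroupSets(const std::string& flag) {
      std::vector<ExpectedGroupSet> sets;
      std::stringstream ss(flag);
      std::string item;
      while (std::getline(ss, item, ',')) {
        size_t colon = item.rfind(':');
        if (colon == std::string::npos) continue;  // malformed; real code would error
        sets.push_back({item.substr(0, colon), std::stoi(item.substr(colon + 1))});
      }
      return sets;
    }

    int main() {
      for (const auto& gs : ParseExpectedGroupSets("root.small:2,root.large:10")) {
        std::cout << gs.prefix << " -> " << gs.expected_size << " executors\n";
      }
      return 0;
    }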
Testing:
- Added unit tests
Change-Id: I9e0a3a5fe2b1f0b7507b7c096b7a3c373bc2e684
Reviewed-on: http://gerrit.cloudera.org:8080/18093
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds a test to verify that admission control accounting
works when using multiple coordinators and multiple executor groups
mapped to different resource pools and having different sizes.
Change-Id: If76d386d8de5730da937674ddd9a69aa1aa1355e
Reviewed-on: http://gerrit.cloudera.org:8080/17891
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds the ability to share the per-host stats for locally
admitted queries across all coordinators. This helps to get a more
consolidated view of the cluster for stats like slots_in_use and
mem_admitted when making local admission decisions.
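A minimal sketch of the aggregation idea, assuming hypothetical struct
and function names rather than Impala's actual admission controller
types:

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    // Per-host stats a single coordinator tracks for queries it admitted locally.
    struct PerHostStats {
      int64_t slots_in_use = 0;
      int64_t mem_admitted = 0;
    };

    using HostStatsMap = std::map<std::string, PerHostStats>;  // keyed by host

    // Merge the per-host stats reported by every coordinator so that a local
    // admission decision sees cluster-wide slot and memory usage.
    HostStatsMap AggregateAcrossCoordinators(const std::vector<HostStatsMap>& reports) {
      HostStatsMap total;
      for (const HostStatsMap& coord : reports) {
        for (const auto& [host, stats] : coord) {
          total[host].slots_in_use += stats.slots_in_use;
          total[host].mem_admitted += stats.mem_admitted;
        }
      }
      return total;
    }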
Testing:
Added e2e py test
Change-Id: I2946832e0a89b077d0f3bec755e4672be2088243
Reviewed-on: http://gerrit.cloudera.org:8080/17683
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
With this fix, coordinator-only queries are submitted to a pseudo
executor group named "empty group (using coordinator only)" which
is empty. This allows coordinator-only queries to run regardless
of the presence of any healthy executor groups.
Testing:
Added a custom cluster test and modified tests that relied on
coordinator-only queries being queued in the absence of executor groups.
Change-Id: I8fe098032744aa20bbbe4faddfc67e7a46ce03d5
Reviewed-on: http://gerrit.cloudera.org:8080/14183
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some tests in test_executor_groups immediately tried fetching the query
profile after executing a query asynchronously, to verify that the query
was queued. However, there is a small window between the Exec RPC
returning and the query being queued during which the query profile does
not contain any info about the query being queued. This was causing some
asserts in the test to fail.
Change-Id: I47070045250a12d86c99f9a30a956a268be5fa7e
Reviewed-on: http://gerrit.cloudera.org:8080/14810
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change improves the cluster membership snapshot we maintain in the
frontend in cases where all executors have been shut down or none have
started yet.
Prior to this change, when configuring Impala with executor groups, the
planner might see an ExecutorMembershipSnapshot that has no executors in
it. This could happen if the first executor group had not started up
yet, or if all executor groups had been shut down. If this happened, the
planner would make sub-optimal decisions, e.g. choosing a broadcast
join over a partitioned hash join.
With this change if no executors have been registered so far, the
planner will use the expected number of executors which can be set using
the -num_expected_executors flag and is 20 by default. After executors
come online, the planner will use the size of the largest healthy
executor group, and it will hold on to the group's size even if it shuts
down or becomes unhealthy. This allows the planner to work on the
assumption that a healthy executor group of the same size will
eventually come online to execute the query.
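The following simplified C++ sketch mirrors that fallback behavior under
the assumption that remembering the largest healthy group size seen so
far is a close enough approximation; the class name is hypothetical.

    #include <algorithm>
    #include <vector>

    // Hypothetical snapshot-update logic mirroring the behavior described above.
    class ExecutorCountForPlanning {
     public:
      explicit ExecutorCountForPlanning(int num_expected_executors = 20)
          : num_expected_executors_(num_expected_executors) {}

      // Called whenever cluster membership changes; 'healthy_group_sizes' holds
      // the current size of each healthy executor group.
      void OnMembershipUpdate(const std::vector<int>& healthy_group_sizes) {
        for (int size : healthy_group_sizes) {
          last_seen_largest_group_ = std::max(last_seen_largest_group_, size);
        }
      }

      // Number of executors the planner should assume.
      int NumExecutorsForPlanning() const {
        // Before any executor has registered, fall back to the expected count
        // (the -num_expected_executors flag, 20 by default).
        if (last_seen_largest_group_ == 0) return num_expected_executors_;
        // Otherwise hold on to the largest healthy group size seen so far,
        // even if that group has since shut down or become unhealthy.
        return last_seen_largest_group_;
      }

     private:
      int num_expected_executors_;
      int last_seen_largest_group_ = 0;
    };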
Change-Id: Ib6b05326c82fb3ca625c015cfcdc38f891f5d4f9
Reviewed-on: http://gerrit.cloudera.org:8080/14756
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The test was checking the incorrect invariant: the
slot mechanism only prevents more than that number of
queries from running on a backend. More queries can run on
the cluster, since a query's backends are freed up before
the query itself finishes.
It was a little tricky to pick an appropriate metric,
since there is no strong consistency between the
metrics, e.g. decrementing a metric after a backend
finishes may race with admitting the next query.
So I simply used the same metric used by the admission
controller in making decisions, which should be
strongly consistent w.r.t. admission control decisions.
Also remove the concurrency limit on the coordinator,
which seemed inconsistent with the purpose of the
test, because we only want concurrency to be limited
by the executors.
Testing:
Looped the test for a bit.
Change-Id: I910028919f248a3bf5de345e9eade9dbc4353ebd
Reviewed-on: http://gerrit.cloudera.org:8080/14606
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This integrates mt_dop with the "slots" mechanism that's used
for non-default executor groups.
The idea is simple - the degree of parallelism on a backend
determines the number of slots consumed. The effective
degree of parallelism is used, not the raw mt_dop setting.
E.g. if the query only has a single input split and executes
only a single fragment instance on a host, we don't want
to count the full mt_dop value for admission control.
--admission_control_slots is added as a new flag that
replaces --max_concurrent_queries, since the name better
reflects the concept. --max_concurrent_queries is kept
for backwards compatibility and has the same meaning
as --admission_control_slots.
The admission control logic is extended to take this into
account. We also add an immediate rejection code path
since it is now possible for queries to not be admittable
based on the # of available slots.
We only factor in the "width" of the plan - i.e. the number
of instances of fragments. We don't account for the number
of distinct fragments, since they may not actually execute
in parallel with each other because of dependencies.
This number is added to the per-host profile as the
"AdmissionSlots" counter.
Testing:
Added unit tests for rejection and queue/admit checks.
Also includes a fix for IMPALA-9054 where we increase
the timeout.
Added end-to-end tests:
* test_admission_slots in test_mt_dop.py that checks the
admission slot calculation via the profile.
* End-to-end admission test that exercises the admit
immediately and queueing code paths.
Added checks to test_verify_metrics (which runs after
end-to-end tests) to ensure that the per-backend
slots in use goes to 0 when the cluster is quiesced.
Change-Id: I7b6b6262ef238df26b491352656a26e4163e46e5
Reviewed-on: http://gerrit.cloudera.org:8080/14357
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
With this patch, all executor groups with at least one executor
will have a metric added that displays the number of queries
(admitted by the local coordinator) running on them. The metric
is removed only when the group has no executors in it. It gets updated
when either the cluster membership changes or a query gets admitted or
released by the admission controller. Also adds the ability to delete
metrics from a metric group after registration.
Testing:
- Added a custom cluster test and a BE metric test.
- Had to modify some metric tests that relied on ordering of metrics by
their name.
Change-Id: I58cde8699c33af8b87273437e9d8bf6371a34539
Reviewed-on: http://gerrit.cloudera.org:8080/14103
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In some situations, users might actually expect not to have a healthy
executor group around, e.g. when they're starting one and it takes a
while to come online. This change makes the queuing reason more generic
and drops the "unhealthy" concept from it to reduce confusion.
Change-Id: Idceab7fb56335bab9d787b0f351a41e6efd7dd59
Reviewed-on: http://gerrit.cloudera.org:8080/14210
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Changes the Coordinator to release admitted memory when each Backend
completes, rather than waiting for the entire query to complete before
releasing admitted memory. When the Coordinator detects that a Backend
has completed (via ControlService::ReportExecStatus) it updates the
state of the Backend in Coordinator::BackendResourceState.
BackendResourceState tracks the state of the admitted resources for
each Backend and decides when the resources for a group of Backend
states should be released. BackendResourceState defines a state machine
to help coordinate the state of the admitted memory for each Backend.
It guarantees that by the time the query is shut down, all Backends
have released their admitted memory.
BackendResourceState implements three rules to control the rate at
which the Coordinator releases admitted memory from the
AdmissionController:
* Resources are released at most once every second; this prevents
short-lived queries from causing high load on the admission controller
* Resources are released at most O(log(num_backends)) times; the
BackendResourceStates can release multiple BackendStates from the
AdmissionController at a time
* All pending resources are released if the only remaining Backend is
the Coordinator Backend; this is useful for result spooling where all
Backends may complete, except for the Coordinator Backend
Exposes the following hidden startup flags to help tune the heuristics
above:
--batched_release_decay_factor
* Defaults to 2
* Controls the base value for the O(log(num_backends)) bound when
batching the release of Backends.
--release_backend_states_delay_ms
* Defaults to 1000 milliseconds
* Controls how often Backends can release their resources.
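The following hedged C++ sketch mirrors the three rules above; the
ReleasePolicy class is a hypothetical simplification of
BackendResourceState, with the two flags mapped onto the decay_factor
and delay_ms parameters.

    #include <algorithm>
    #include <chrono>
    #include <cmath>

    // Hypothetical policy object mirroring the three release rules above; the
    // real BackendResourceState tracks considerably more state.
    class ReleasePolicy {
     public:
      ReleasePolicy(int num_backends, int decay_factor = 2, int delay_ms = 1000)
          // At most ceil(log_decay(num_backends)) batched releases per query.
          : max_releases_(static_cast<int>(std::ceil(
                std::log(std::max(num_backends, 2)) / std::log(decay_factor)))),
            delay_(std::chrono::milliseconds(delay_ms)) {}

      // Should the completed-but-unreleased backends be released now?
      bool ShouldRelease(std::chrono::steady_clock::time_point now,
                         bool only_coordinator_left) {
        // Rule 3: if the coordinator backend is the only one still running,
        // release everything that is pending (e.g. result spooling).
        if (only_coordinator_left) return true;
        // Rule 1: no more often than once per delay_ (default 1 second).
        if (now - last_release_ < delay_) return false;
        // Rule 2: cap the number of batched releases at O(log(num_backends)).
        if (releases_done_ >= max_releases_) return false;
        last_release_ = now;
        ++releases_done_;
        return true;
      }

     private:
      int max_releases_;
      std::chrono::milliseconds delay_;
      int releases_done_ = 0;
      std::chrono::steady_clock::time_point last_release_{};
    };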
Testing:
* Ran core tests
* Added new tests to test_result_spooling.py and
test_admission_controller.py
Change-Id: I88bb11e0ede7574568020e0277dd8ac8d2586dc9
Reviewed-on: http://gerrit.cloudera.org:8080/14104
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds 3 metrics under a new metric group called
"cluster-membership" that keep track of the number of executor groups
that have at least one live executor, the number of executor groups that
are in a healthy state, and the number of backends registered with the
statestore.
Testing:
Modified tests to use these metrics for verification.
Change-Id: I7745ea1c7c6778d3fb5e59adbc873697beb0f3b9
Reviewed-on: http://gerrit.cloudera.org:8080/13979
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds support for running queries inside a single admission
control pool on one of several, disjoint sets of executors called
"executor groups".
Executors can be configured with an executor group through the newly
added '--executor_groups' flag. Note that in anticipation of future
changes, the flag already uses the plural form, but only a single
executor group may be specified for now. Each executor group
specification can optionally contain a minimum size, separated by a
':', e.g. --executor_groups default-pool-1:3. Only when the cluster
membership contains at least that number of executors for the group
will it be considered for admission.
Executor groups are mapped to resource pools by their name: an executor
group can service queries from a resource pool if the pool name is a
prefix of the group name, separated by a '-'. For example, queries in
pool poolA can be serviced by executor groups named poolA-1 and poolA-2,
but not by groups named foo or poolB-1.
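A minimal sketch of this name-matching rule (GroupServesPool is a
hypothetical helper, not Impala's actual code):

    #include <iostream>
    #include <string>

    // Can an executor group named 'group' service queries from resource pool
    // 'pool'? The pool name must be a prefix of the group name, followed by '-'.
    bool GroupServesPool(const std::string& group, const std::string& pool) {
      return group.size() > pool.size() + 1 &&
             group.compare(0, pool.size(), pool) == 0 &&
             group[pool.size()] == '-';
    }

    int main() {
      std::cout << GroupServesPool("poolA-1", "poolA") << "\n";  // 1
      std::cout << GroupServesPool("poolA-2", "poolA") << "\n";  // 1
      std::cout << GroupServesPool("foo", "poolA") << "\n";      // 0
      std::cout << GroupServesPool("poolB-1", "poolA") << "\n";  // 0
      return 0;
    }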
During scheduling, executor groups are considered in alphabetical order.
This means that one group is filled up entirely before a subsequent
group is considered for admission. Groups also need to pass a health
check before being considered. In particular, they must contain at least
the minimum number of executors specified.
If no group is specified during startup, executors are added to the
default executor group. If - during admission - no executor group for a
pool can be found and the default group is non-empty, then the default
group is considered. The default group does not have a minimum size.
This change inverts the order of scheduling and admission. Prior to this
change, queries were scheduled before submitting them to the admission
controller. Now the admission controller computes schedules for all
candidate executor groups before each admission attempt. If the cluster
membership has not changed, then the schedules of the previous attempt
will be reused. This means that queries will no longer fail if the
cluster membership changes while they are queued in the admission
controller.
This change also alters the default behavior when using a dedicated
coordinator and no executors have registered yet. Prior to this change,
a query would fail immediately with an error ("No executors registered
in group"). Now a query will get queued and wait until executors show
up, or it times out after the pool's queue timeout period.
Testing:
This change adds a new custom cluster test for executor groups. It
makes use of new capabilities added to start-impala-cluster.py to bring
up additional executors into an already running cluster.
Additionally, this change adds an instructional implementation of
executor group based autoscaling, which can be used during development.
It also adds a helper to run queries concurrently. Both are used in a
new test to exercise the executor group logic and to prevent regressions
to these tools.
In addition to these tests, the existing tests for the admission
controller (both BE and EE tests) thoroughly exercise the changed code.
Some of them required changes themselves to reflect the new behavior.
I looped the new tests (test_executor_groups and test_auto_scaling) for
a night (110 iterations each) without any issues.
I also started an autoscaling cluster with a single group and ran
TPC-DS, TPC-H, and test_queries on it successfully.
Known limitations:
When using executor groups, only a single coordinator and a single AC
pool (i.e. the default pool) are supported. Executors do not include the
number of currently running queries in their statestore updates, and so
admission controllers are not aware of the number of queries admitted by
other controllers per host.
Change-Id: I8a1d0900f2a82bd2fc0a906cc094e442cffa189b
Reviewed-on: http://gerrit.cloudera.org:8080/13550
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>