This patch provides count(star) optimization for ORC scans, similar to
the work done in IMPALA-5036 for Parquet scans. We use the stripe-level
num-rows statistics when computing the count star instead of
materializing empty rows. The aggregate function is changed from a count
to a special sum function initialized to 0.
This count(star) optimization is disabled for full ACID tables because
the scanner might need to read and validate the 'currentTransaction'
column in the table's special schema.
This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation in general by
serving the result just from the file's footer stats for both Parquet
and ORC. We unify the optimized count star and zero slot scan functions
into HdfsColumnarScanner.
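For illustration, the following minimal, self-contained C++ sketch shows
the idea of serving count(star) from stripe-level row counts instead of
materializing rows. The StripeStats struct and CountStarFromFooter
function are hypothetical stand-ins, not Impala's or the ORC library's
actual classes.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Hypothetical footer metadata for one ORC stripe; only the field the
    // optimization needs is modeled here.
    struct StripeStats {
      uint64_t num_rows;
    };

    // Answer count(*) for one file purely from stripe-level statistics.
    // The aggregate starts from 0 and sums row counts, so a file with no
    // stripes correctly contributes 0 rather than producing no value.
    uint64_t CountStarFromFooter(const std::vector<StripeStats>& stripes) {
      uint64_t count = 0;  // the "special sum" initialized to 0
      for (const StripeStats& s : stripes) count += s.num_rows;
      return count;
    }

    int main() {
      std::vector<StripeStats> stripes = {{1000}, {2500}, {0}};
      std::cout << CountStarFromFooter(stripes) << std::endl;  // prints 3500
      return 0;
    }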
The following table shows a performance comparison before and after the
patch. The primitive_count_star query targets the tpch10_parquet.lineitem
table (10GB scale TPC-H). The count_star_parq and count_star_orc queries
are modified versions of primitive_count_star that target the
tpch_parquet.lineitem and tpch_orc_def.lineitem tables respectively.
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload | Query | File Format | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| tpch_parquet | count_star_parq | parquet / none / none | 0.06 | 0.07 | -10.45% | 2.87% | * 25.51% * | 9 | -1.47% | -1.26 | -1.22 |
| tpch_orc_def | count_star_orc | orc / def / none | 0.06 | 0.08 | -22.37% | 6.22% | * 30.95% * | 9 | -1.85% | -1.16 | -2.14 |
| TARGETED-PERF(10) | primitive_count_star | parquet / none / none | 0.06 | 0.08 | I -30.40% | 2.68% | * 29.63% * | 9 | I -7.20% | -2.42 | -3.07 |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests
Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Reviewed-on: http://gerrit.cloudera.org:8080/18327
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch provides replan support for multiple executor group sets.
Each executor group set is associated with a distinct number of nodes
and a threshold for estimated memory per host in bytes that can be
denoted as [<group_name_prefix>:<#nodes>, <threshold>].
In the patch, a query of type EXPLAIN, QUERY or DML can be compiled
more than once. In each attempt, the per-host memory is estimated and
compared with the threshold of an executor group set. If the estimated
memory is no more than the threshold, the iteration terminates and the
final plan is determined. The executor group set with that threshold is
selected to run the query.
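For illustration only, the following minimal C++ sketch shows the shape
of this iteration. CompileForGroupSet and its fixed 500MB estimate are
hypothetical stand-ins for the planner, the group sets are assumed to be
sorted by ascending threshold, and the 64MB/8PB values mirror the test
setup described further below.

    #include <cstdint>
    #include <iostream>
    #include <optional>
    #include <string>
    #include <vector>

    // Hypothetical executor group set, denoted
    // [<group_name_prefix>:<#nodes>, <threshold>].
    struct ExecutorGroupSet {
      std::string prefix;
      int num_nodes;
      int64_t mem_threshold_bytes;  // threshold for estimated memory per host
    };

    struct Plan {
      int64_t per_host_mem_estimate_bytes;
      std::string group_prefix;  // group set the plan was compiled for
    };

    // Placeholder standing in for the real planner: compiles the query against
    // a cluster of gs.num_nodes executors and returns the per-host estimate.
    Plan CompileForGroupSet(const std::string& query, const ExecutorGroupSet& gs) {
      (void)query;
      return Plan{500LL * 1024 * 1024, gs.prefix};  // pretend estimate: 500MB
    }

    // Group sets are tried in ascending threshold order; the iteration stops at
    // the first set whose threshold covers the estimate, otherwise the last
    // compiled plan is used.
    std::optional<Plan> PlanWithReplan(const std::string& query,
                                       const std::vector<ExecutorGroupSet>& sets) {
      std::optional<Plan> plan;
      for (const ExecutorGroupSet& gs : sets) {
        plan = CompileForGroupSet(query, gs);
        if (plan->per_host_mem_estimate_bytes <= gs.mem_threshold_bytes) break;
      }
      return plan;
    }

    int main() {
      std::vector<ExecutorGroupSet> sets = {
          {"regular", 3, 64LL * 1024 * 1024},                           // 64MB
          {"large", 3, 8LL * 1024 * 1024 * 1024 * 1024 * 1024}};        // 8PB
      std::optional<Plan> plan = PlanWithReplan("select 1", sets);
      if (plan) std::cout << "run on group set: " << plan->group_prefix << "\n";
      return 0;
    }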
A new query option 'enable_replan', defaulting to 1 (enabled), is added.
It can be set to 0 to disable this feature and generate the distributed
plan for the default executor group.
To avoid long compilation times, the following enhancements are enabled.
Note that enhancement 1) is disabled when a relevant metadata change is
detected.
1. Authorization is performed only for the 1st compilation;
2. openTransaction() is called for transactional queries in the 1st
   compilation and the saved transaction info is reused in
   subsequent compilations. Similar logic is applied to Kudu
   transactional queries.
To facilitate testing, the patch imposes an artificial setup of two
executor group sets in the FE as follows.
1. [regular:<#nodes>, 64MB]
2. [large:<#nodes>, 8PB]
This setup is enabled when the new query option 'test_replan' is set
to 1 in backend tests, or when RuntimeEnv.INSTANCE.isTestEnv() is true,
as in most frontend tests. This query option defaults to 0.
Compilation time increases when a query is compiled in several
iterations, as shown below for several TPC-DS queries. The increase
is mostly due to redundant work in either the single node plan creation
or the value transfer graph recomputation phase. For small queries, the
increase can be avoided if they can be compiled in a single iteration
by properly setting the smallest threshold among all executor group
sets. For example, for the set of queries listed below, the smallest
threshold can be set to 320MB to catch both q15 and q21 in one
compilation.
Compilation time (ms)
Query   Estimated Memory   2 iterations   1 iteration   Increase
q1      408MB              60.14          25.75         133.56%
q11     1.37GB             261.00         109.61        138.11%
q10a    519MB              139.24         54.52         155.39%
q13     339MB              143.82         60.08         139.38%
q14a    3.56GB             762.68         312.92        143.73%
q14b    2.20GB             522.01         245.13        112.95%
q15     314MB              9.73           4.28          127.33%
q21     275MB              16.00          8.18          95.59%
q23a    1.50GB             461.69         231.78        99.19%
q23b    1.34GB             461.31         219.61        110.05%
q4      2.60GB             218.05         105.07        107.52%
q67     5.16GB             694.59         334.24        101.82%
Testing:
1. Almost all FE and BE tests are now run in the artificial two
executor setup except a few where a specific cluster configuration
is desirable;
2. Ran core tests successfully;
3. Added a new observability test and a new query assignment test;
4. Disabled the concurrent inserts test (test_concurrent_inserts) and
   the failing inserts test (test_failing_inserts) in local catalog mode
   due to flakiness. Reported in IMPALA-11189 and IMPALA-11191.
Change-Id: I75cf17290be2c64fd4b732a5505bdac31869712a
Reviewed-on: http://gerrit.cloudera.org:8080/18178
Reviewed-by: Qifan Chen <qchen@cloudera.com>
Tested-by: Qifan Chen <qchen@cloudera.com>
This adds metrics for each executor group set that expose the number
of executor groups, the number of healthy executor groups and the
total number of backends associated with that group set.
Testing:
Added an e2e test to verify metrics are updated correctly.
Change-Id: Ib39f940de830ef6302785aee30eeb847fa5deeba
Reviewed-on: http://gerrit.cloudera.org:8080/18142
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch introduces the concept of executor group sets. Each group
set specifies an executor group name prefix and an expected group
size (the number of executors in each group). Every executor group
that is a part of this set will have the same prefix which will
also be equivalent to the resource pool name that it maps to.
These sets are specified via a startup flag
'expected_executor_group_sets' which is a comma separated list in
the format <executor_group_name_prefix>:<expected_group_size>.
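As a hedged sketch only, the snippet below shows one way such a flag
value could be parsed; ParseExpectedGroupSets and the example prefixes
are hypothetical and not the actual Impala flag-parsing code.

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    struct ExpectedGroupSet {
      std::string prefix;
      int expected_size;
    };

    // Parse a value like "root.small:2,root.large:10" into group set entries,
    // splitting on ',' and then on the last ':' of each entry.
    std::vector<ExpectedGroupSet> ParseExpectedGroupSets(const std::string& flag) {
      std::vector<ExpectedGroupSet> sets;
      std::stringstream ss(flag);
      std::string item;
      while (std::getline(ss, item, ',')) {
        size_t colon = item.rfind(':');
        if (colon == std::string::npos) continue;  // malformed; real code would error
        sets.push_back({item.substr(0, colon), std::stoi(item.substr(colon + 1))});
      }
      return sets;
    }

    int main() {
      for (const auto& gs : ParseExpectedGroupSets("root.small:2,root.large:10")) {
        std::cout << gs.prefix << " -> " << gs.expected_size << " executors\n";
      }
      return 0;
    }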
Testing:
- Added unit tests
Change-Id: I9e0a3a5fe2b1f0b7507b7c096b7a3c373bc2e684
Reviewed-on: http://gerrit.cloudera.org:8080/18093
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds a test to verify that admission control accounting
works when using multiple coordinators and multiple executor groups
mapped to different resource pools and having different sizes.
Change-Id: If76d386d8de5730da937674ddd9a69aa1aa1355e
Reviewed-on: http://gerrit.cloudera.org:8080/17891
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds the ability to share the per-host stats for locally
admitted queries across all coordinators. This helps to get a more
consolidated view of the cluster for stats like slots_in_use and
mem_admitted when making local admission decisions.
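A minimal sketch of the aggregation idea, assuming hypothetical struct
and function names rather than Impala's actual admission controller
types:

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    // Per-host stats a single coordinator tracks for queries it admitted locally.
    struct PerHostStats {
      int64_t slots_in_use = 0;
      int64_t mem_admitted = 0;
    };

    using HostStatsMap = std::map<std::string, PerHostStats>;  // keyed by host

    // Merge the per-host stats reported by every coordinator so that a local
    // admission decision sees cluster-wide slot and memory usage.
    HostStatsMap AggregateAcrossCoordinators(const std::vector<HostStatsMap>& reports) {
      HostStatsMap total;
      for (const HostStatsMap& coord : reports) {
        for (const auto& [host, stats] : coord) {
          total[host].slots_in_use += stats.slots_in_use;
          total[host].mem_admitted += stats.mem_admitted;
        }
      }
      return total;
    }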
Testing:
Added e2e py test
Change-Id: I2946832e0a89b077d0f3bec755e4672be2088243
Reviewed-on: http://gerrit.cloudera.org:8080/17683
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
With this fix, coordinator-only queries are submitted to a pseudo
executor group named "empty group (using coordinator only)" which
is empty. This allows coordinator-only queries to run regardless
of the presence of any healthy executor groups.
Testing:
Added a custom cluster test and modified tests that relied on
coordinator-only queries being queued in the absence of executor groups.
Change-Id: I8fe098032744aa20bbbe4faddfc67e7a46ce03d5
Reviewed-on: http://gerrit.cloudera.org:8080/14183
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some tests in test_executor_groups immediately tried fetching the query
profile after executing a query asynchronously, to verify that the query
was queued. However, there is a small window between the Exec RPC
returning and the query being queued during which the query profile does
not contain any info about the query being queued. This was causing some
asserts in the test to fail.
Change-Id: I47070045250a12d86c99f9a30a956a268be5fa7e
Reviewed-on: http://gerrit.cloudera.org:8080/14810
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change improves the cluster membership snapshot we maintain in the
frontend in cases where all executors have been shut down or none have
started yet.
Prior to this change, when configuring Impala with executor groups, the
planner might see an ExecutorMembershipSnapshot that has no executors in
it. This could happen if the first executor group had not started up
yet, or if all executor groups had been shut down. If this happened, the
planner would make sub-optimal decisions, e.g. choosing a broadcast
join over a partitioned hash join.
With this change if no executors have been registered so far, the
planner will use the expected number of executors which can be set using
the -num_expected_executors flag and is 20 by default. After executors
come online, the planner will use the size of the largest healthy
executor group, and it will hold on to the group's size even if it shuts
down or becomes unhealthy. This allows the planner to work on the
assumption that a healthy executor group of the same size will
eventually come online to execute the query.
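The following simplified C++ sketch mirrors that fallback behavior under
the assumption that remembering the largest healthy group size seen so
far is a close enough approximation; the class name is hypothetical.

    #include <algorithm>
    #include <vector>

    // Hypothetical snapshot-update logic mirroring the behavior described above.
    class ExecutorCountForPlanning {
     public:
      explicit ExecutorCountForPlanning(int num_expected_executors = 20)
          : num_expected_executors_(num_expected_executors) {}

      // Called whenever cluster membership changes; 'healthy_group_sizes' holds
      // the current size of each healthy executor group.
      void OnMembershipUpdate(const std::vector<int>& healthy_group_sizes) {
        for (int size : healthy_group_sizes) {
          last_seen_largest_group_ = std::max(last_seen_largest_group_, size);
        }
      }

      // Number of executors the planner should assume.
      int NumExecutorsForPlanning() const {
        // Before any executor has registered, fall back to the expected count
        // (the -num_expected_executors flag, 20 by default).
        if (last_seen_largest_group_ == 0) return num_expected_executors_;
        // Otherwise hold on to the largest healthy group size seen so far,
        // even if that group has since shut down or become unhealthy.
        return last_seen_largest_group_;
      }

     private:
      int num_expected_executors_;
      int last_seen_largest_group_ = 0;
    };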
Change-Id: Ib6b05326c82fb3ca625c015cfcdc38f891f5d4f9
Reviewed-on: http://gerrit.cloudera.org:8080/14756
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The test was checking the incorrect invariant: the
slot mechanism only prevents more than that number of
queries from running on a backend. More queries can run on
the cluster, since a query's backends are freed up before
the query itself finishes.
It was a little tricky to pick an appropriate metric,
since there is no strong consistency between the
metrics, e.g. decrementing a metric after a backend
finishes may race with admitting the next query.
So I simply used the same metric used by the admission
controller in making decisions, which should be
strongly consistent w.r.t. admission control decisions.
Also remove the concurrency limit on the coordinator,
which seemed inconsistent with the purpose of the
test, because we only want concurrency to be limited
by the executors.
Testing:
Looped the test for a bit.
Change-Id: I910028919f248a3bf5de345e9eade9dbc4353ebd
Reviewed-on: http://gerrit.cloudera.org:8080/14606
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This integrates mt_dop with the "slots" mechanism that's used
for non-default executor groups.
The idea is simple - the degree of parallelism on a backend
determines the number of slots consumed. The effective
degree of parallelism is used, not the raw mt_dop setting.
E.g. if the query only has a single input split and executes
only a single fragment instance on a host, we don't want
to count the full mt_dop value for admission control.
--admission_control_slots is added as a new flag that
replaces --max_concurrent_queries, since the name better
reflects the concept. --max_concurrent_queries is kept
for backwards compatibility and has the same meaning
as --admission_control_slots.
The admission control logic is extended to take this into
account. We also add an immediate rejection code path
since it is now possible for queries to not be admittable
based on the # of available slots.
We only factor in the "width" of the plan - i.e. the number
of instances of fragments. We don't account for the number
of distinct fragments, since they may not actually execute
in parallel with each other because of dependencies.
This number is added to the per-host profile as the
"AdmissionSlots" counter.
Testing:
Added unit tests for rejection and queue/admit checks.
Also includes a fix for IMPALA-9054 where we increase
the timeout.
Added end-to-end tests:
* test_admission_slots in test_mt_dop.py that checks the
admission slot calculation via the profile.
* End-to-end admission test that exercises the admit
immediately and queueing code paths.
Added checks to test_verify_metrics (which runs after
end-to-end tests) to ensure that the per-backend
slots in use goes to 0 when the cluster is quiesced.
Change-Id: I7b6b6262ef238df26b491352656a26e4163e46e5
Reviewed-on: http://gerrit.cloudera.org:8080/14357
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
With this patch, all executor groups with at least one executor
will have a metric added that displays the number of queries
(admitted by the local coordinator) running on them. The metric
is removed only when the group has no executors in it. It gets updated
when either the cluster membership changes or a query gets admitted or
released by the admission controller. Also adds the ability to delete
metrics from a metric group after registration.
Testing:
- Added a custom cluster test and a BE metric test.
- Had to modify some metric tests that relied on ordering of metrics by
their name.
Change-Id: I58cde8699c33af8b87273437e9d8bf6371a34539
Reviewed-on: http://gerrit.cloudera.org:8080/14103
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In some situations, users might actually expect not to have a healthy
executor group around, e.g. when they're starting one and it takes a
while to come online. This change makes the queuing reason more generic
and drops the "unhealthy" concept from it to reduce confusion.
Change-Id: Idceab7fb56335bab9d787b0f351a41e6efd7dd59
Reviewed-on: http://gerrit.cloudera.org:8080/14210
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Changes the Coordinator to release admitted memory when each Backend
completes, rather than waiting for the entire query to complete before
releasing admitted memory. When the Coordinator detects that a Backend
has completed (via ControlService::ReportExecStatus) it updates the
state of the Backend in Coordinator::BackendResourceState.
BackendResourceState tracks the state of the admitted resources for
each Backend and decides when the resources for a group of Backend
states should be released. BackendResourceState defines a state machine
to help coordinate the state of the admitted memory for each Backend.
It guarantees that by the time the query is shut down, all Backends
have released their admitted memory.
BackendResourceState implements three rules to control the rate at
which the Coordinator releases admitted memory from the
AdmissionController:
* Resources are released at most once every second; this prevents
short-lived queries from causing high load on the admission controller
* Resources are released at most O(log(num_backends)) times; the
BackendResourceStates can release multiple BackendStates from the
AdmissionController at a time
* All pending resources are released if the only remaining Backend is
the Coordinator Backend; this is useful for result spooling where all
Backends may complete, except for the Coordinator Backend
Exposes the following hidden startup flags to help tune the heuristics
above:
--batched_release_decay_factor
* Defaults to 2
* Controls the base value for the O(log(num_backends)) bound when
batching the release of Backends.
--release_backend_states_delay_ms
* Defaults to 1000 milliseconds
* Controls how often Backends can release their resources.
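The following hedged C++ sketch mirrors the three rules above; the
ReleasePolicy class is a hypothetical simplification of
BackendResourceState, with the two flags mapped onto the decay_factor
and delay_ms parameters.

    #include <algorithm>
    #include <chrono>
    #include <cmath>

    // Hypothetical policy object mirroring the three release rules above; the
    // real BackendResourceState tracks considerably more state.
    class ReleasePolicy {
     public:
      ReleasePolicy(int num_backends, int decay_factor = 2, int delay_ms = 1000)
          // At most ceil(log_decay(num_backends)) batched releases per query.
          : max_releases_(static_cast<int>(std::ceil(
                std::log(std::max(num_backends, 2)) / std::log(decay_factor)))),
            delay_(std::chrono::milliseconds(delay_ms)) {}

      // Should the completed-but-unreleased backends be released now?
      bool ShouldRelease(std::chrono::steady_clock::time_point now,
                         bool only_coordinator_left) {
        // Rule 3: if the coordinator backend is the only one still running,
        // release everything that is pending (e.g. result spooling).
        if (only_coordinator_left) return true;
        // Rule 1: no more often than once per delay_ (default 1 second).
        if (now - last_release_ < delay_) return false;
        // Rule 2: cap the number of batched releases at O(log(num_backends)).
        if (releases_done_ >= max_releases_) return false;
        last_release_ = now;
        ++releases_done_;
        return true;
      }

     private:
      int max_releases_;
      std::chrono::milliseconds delay_;
      int releases_done_ = 0;
      std::chrono::steady_clock::time_point last_release_{};
    };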
Testing:
* Ran core tests
* Added new tests to test_result_spooling.py and
test_admission_controller.py
Change-Id: I88bb11e0ede7574568020e0277dd8ac8d2586dc9
Reviewed-on: http://gerrit.cloudera.org:8080/14104
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds 3 metrics under a new metric group called
"cluster-membership" that keep track of the number of executor groups
that have at least one live executor, the number of executor groups that
are in a healthy state, and the number of backends registered with the
statestore.
Testing:
Modified tests to use these metrics for verification.
Change-Id: I7745ea1c7c6778d3fb5e59adbc873697beb0f3b9
Reviewed-on: http://gerrit.cloudera.org:8080/13979
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds support for running queries inside a single admission
control pool on one of several, disjoint sets of executors called
"executor groups".
Executors can be configured with an executor group through the newly
added '--executor_groups' flag. Note that in anticipation of future
changes, the flag already uses the plural form, but only a single
executor group may be specified for now. Each executor group
specification can optionally contain a minimum size, separated by a
':', e.g. --executor_groups default-pool-1:3. Only when the cluster
membership contains at least that number of executors for the group
will it be considered for admission.
Executor groups are mapped to resource pools by their name: an executor
group can service queries from a resource pool if the pool name is a
prefix of the group name, separated by a '-'. For example, queries in
pool poolA can be serviced by executor groups named poolA-1 and poolA-2,
but not by groups named foo or poolB-1.
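A minimal sketch of this name-matching rule (GroupServesPool is a
hypothetical helper, not Impala's actual code):

    #include <iostream>
    #include <string>

    // Can an executor group named 'group' service queries from resource pool
    // 'pool'? The pool name must be a prefix of the group name, followed by '-'.
    bool GroupServesPool(const std::string& group, const std::string& pool) {
      return group.size() > pool.size() + 1 &&
             group.compare(0, pool.size(), pool) == 0 &&
             group[pool.size()] == '-';
    }

    int main() {
      std::cout << GroupServesPool("poolA-1", "poolA") << "\n";  // 1
      std::cout << GroupServesPool("poolA-2", "poolA") << "\n";  // 1
      std::cout << GroupServesPool("foo", "poolA") << "\n";      // 0
      std::cout << GroupServesPool("poolB-1", "poolA") << "\n";  // 0
      return 0;
    }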
During scheduling, executor groups are considered in alphabetical order.
This means that one group is filled up entirely before a subsequent
group is considered for admission. Groups also need to pass a health
check before being considered. In particular, they must contain at least
the minimum number of executors specified.
If no group is specified during startup, executors are added to the
default executor group. If - during admission - no executor group for a
pool can be found and the default group is non-empty, then the default
group is considered. The default group does not have a minimum size.
This change inverts the order of scheduling and admission. Prior to this
change, queries were scheduled before submitting them to the admission
controller. Now the admission controller computes schedules for all
candidate executor groups before each admission attempt. If the cluster
membership has not changed, then the schedules of the previous attempt
will be reused. This means that queries will no longer fail if the
cluster membership changes while they are queued in the admission
controller.
This change also alters the default behavior when using a dedicated
coordinator and no executors have registered yet. Prior to this change,
a query would fail immediately with an error ("No executors registered
in group"). Now a query will get queued and wait until executors show
up, or it times out after the pool's queue timeout period.
Testing:
This change adds a new custom cluster test for executor groups. It
makes use of new capabilities added to start-impala-cluster.py to bring
up additional executors into an already running cluster.
Additionally, this change adds an instructional implementation of
executor group based autoscaling, which can be used during development.
It also adds a helper to run queries concurrently. Both are used in a
new test to exercise the executor group logic and to prevent regressions
to these tools.
In addition to these tests, the existing tests for the admission
controller (both BE and EE tests) thoroughly exercise the changed code.
Some of them required changes themselves to reflect the new behavior.
I looped the new tests (test_executor_groups and test_auto_scaling) for
a night (110 iterations each) without any issues.
I also started an autoscaling cluster with a single group and ran
TPC-DS, TPC-H, and test_queries on it successfully.
Known limitations:
When using executor groups, only a single coordinator and a single AC
pool (i.e. the default pool) are supported. Executors do not include the
number of currently running queries in their statestore updates, and so
admission controllers are not aware of the number of queries admitted by
other controllers per host.
Change-Id: I8a1d0900f2a82bd2fc0a906cc094e442cffa189b
Reviewed-on: http://gerrit.cloudera.org:8080/13550
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>