Commit Graph

62 Commits

Author SHA1 Message Date
Joe McDonnell
1913ab46ed IMPALA-14501: Migrate most scripts from impala-python to impala-python3
To remove the dependency on Python 2, existing scripts need to use
python3 rather than python. These commands find those
locations (for impala-python and regular python):
git grep impala-python | grep -v impala-python3 | grep -v impala-python-common | grep -v init-impala-python
git grep bin/python | grep -v python3

This removes or switches most of these locations by various means:
1. If a Python file has a #!/bin/env impala-python (or python) shebang
   but doesn't have a main function, this change removes the shebang and
   makes sure that the file is not executable.
2. Most scripts can simply switch from impala-python to impala-python3
   (or python to python3) with minimal changes.
3. The cm-api pypi package (which doesn't support Python 3) has been
   replaced by the cm-client pypi package and interfaces have changed.
   Rather than migrating the code (which hasn't been used in years), this
   deletes the old code and stops installing cm-api into the virtualenv.
   The code can be restored and revamped if there is any interest in
   interacting with CM clusters.
4. This switches tests/comparison over to impala-python3, but this code has
   bit-rotted. Some pieces can be run manually, but it can't be fully
   verified with Python 3. It shouldn't hold back the migration on its own.
5. This also replaces locations of impala-python in comments / documentation /
   READMEs.
6. kazoo (used for interacting with HBase) needed to be upgraded to a
   version that supports Python 3. The newest version of kazoo requires
   upgrades of other component versions, so this uses kazoo 2.8.0 to avoid
   needing other upgrades.
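
For context, the kind of ZooKeeper/HBase interaction kazoo is used for
looks roughly like this under kazoo 2.8.0 (a minimal sketch with a
placeholder address and znode path, not the actual test code):

```python
from kazoo.client import KazooClient

# kazoo 2.8.0 runs on Python 3 without forcing the other component
# upgrades mentioned above; the host and znode path are placeholders.
zk = KazooClient(hosts='localhost:2181')
zk.start()
try:
    print(zk.get_children('/hbase'))
finally:
    zk.stop()
```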

The two remaining uses of impala-python are:
 - bin/cmake_aux/create_virtualenv.sh
 - bin/impala-env-versioned-python
These will be removed separately when we drop Python 2 support
completely. In particular, these are useful for testing impala-shell
with Python 2 until we stop supporting Python 2 for impala-shell.

The docker-based tests still use /usr/bin/python, but this can be
switched over independently (and doesn't impact impala-python).

Testing:
 - Ran core job
 - Ran build + dataload on CentOS 7 and Red Hat 8
 - Manual testing of individual scripts (except some bitrotted areas like the
   random query generator)

Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc
Reviewed-on: http://gerrit.cloudera.org:8080/23468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-10-22 16:30:17 +00:00
Riza Suminto
f28a32fbc3 IMPALA-13916: Change BaseTestSuite.default_test_protocol to HS2
This is the final patch to move all Impala e2e and custom cluster tests
to use the HS2 protocol by default. Only the beeswax-specific tests still
run against the Beeswax protocol by default. We can remove them once
Impala officially removes Beeswax support.
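
A minimal, self-contained sketch of the resulting default/override
pattern (class names echo the commit title; the attribute value format
is an assumption, not the exact test-framework code):

```python
class BaseTestSuite(object):
    # After this patch the default protocol moves from 'beeswax' to 'hs2'.
    default_test_protocol = 'hs2'


class BeeswaxOnlySuite(BaseTestSuite):
    # Beeswax-specific suites keep overriding the default until Beeswax
    # support is removed from Impala.
    default_test_protocol = 'beeswax'
```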

HS2 error message formatting in impala-hs2-server.cc is adjusted a bit
to match the formatting in impala-beeswax-server.cc.

Move TestWebPageAndCloseSession from webserver/test_web_pages.py to
custom_cluster/test_web_pages.py to disable glog log buffering.

Testing:
- Pass exhaustive tests, except for some known and unrelated flaky
  tests.

Change-Id: I42e9ceccbba1e6853f37e68f106265d163ccae28
Reviewed-on: http://gerrit.cloudera.org:8080/22845
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Jason Fehr <jfehr@cloudera.com>
2025-05-20 14:32:10 +00:00
Riza Suminto
e73e2d40da IMPALA-13864: Implement ImpylaHS2ResultSet.exec_summary
This patch implements building the exec summary table for
ImpylaHS2Connection. It adds a fetch_exec_summary argument to
ImpalaConnection.execute(). If this argument is True, an exec summary
table is added to the returned result object.
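
In test code, the new argument would presumably be used along these
lines (a sketch; exec_summary comes from the commit title, the rest is
assumed):

```python
def get_exec_summary(client, query):
    # 'client' is an ImpalaConnection from the test framework; the exec
    # summary table is only built when explicitly requested.
    result = client.execute(query, fetch_exec_summary=True)
    return result.exec_summary
```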

fetch_exec_summary is also implemented for BeeswaxConnection. Thus,
BeeswaxConnection no longer fetches the exec summary by default every time.

Tests that validate the exec summary table are updated to set
fetch_exec_summary=True and migrated to test against the HS2 protocol.
TestExecutorGroup._set_query_options() is changed to set query options
through hs2_client iconfig instead of a SET query. Some flake8 issues are
addressed as well.

build_exec_summary_table is moved to a separate exec_summary.py file and
tweaked to return early if the given TExecSummary is empty.

Fixed a bug in ImpalaBeeswaxClient.fetch_results() where no fetch would
happen at all if the discard_result argument is True.

Testing:
- Run and pass affected tests locally.

Change-Id: I7d88f78e58eeda29ce21e7828884c7a129d7efe6
Reviewed-on: http://gerrit.cloudera.org:8080/22626
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-24 22:34:20 +00:00
Riza Suminto
ad124d1dba IMPALA-13872: Deflake test_query_cpu_count_divisor_default assertion
IMPALA-13333 added a test that asserts the value of the "Cluster Memory
Admitted" counter. However, this counter can have a slightly different
value depending on the target filesystem (HDFS, Ozone, S3). This causes
flakiness in test_query_cpu_count_divisor_default.

This patch removes that assertion from
test_query_cpu_count_divisor_default. The remaining assertion is
sufficient to ensure the correctness of the system under test.

Testing:
- Run and pass test_query_cpu_count_divisor_default.

Change-Id: I676ee31728de2886acc72d11b8ece14f0238814b
Reviewed-on: http://gerrit.cloudera.org:8080/22636
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-18 21:19:09 +00:00
Riza Suminto
7a5adc15ca IMPALA-13333: Limit memory estimation if PlanNode can spill
SortNode, AggregationNode, and HashJoinNode (the build side) can spill
to disk. However, their memory estimation does not consider this
capability and assumes they will hold all rows in memory. This causes
memory overestimation when cardinality is also overestimated. In reality,
the whole query execution on a single host is often subject to a much
lower memory upper bound and is not allowed to exceed it.

This upper-bound is dictated by, but not limited to:
- MEM_LIMIT
- MEM_LIMIT_COORDINATORS
- MEM_LIMIT_EXECUTORS
- MAX_MEM_ESTIMATE_FOR_ADMISSION
- impala.admission-control.max-query-mem-limit.<pool_name>
  from admission control.

This patch adds a SpillableOperator interface that defines an alternative
to either PlanNode.computeNodeResourceProfile() or
DataSink.computeResourceProfile() when a lower memory upper bound can be
derived from the configs mentioned above. This interface is applied to
SortNode, AggregationNode, HashJoinNode, and JoinBuildSink.

The in-memory vs. spill-to-disk bias is controlled through the
MEM_ESTIMATE_SCALE_FOR_SPILLING_OPERATOR option, a scale in [0.0, 1.0]
that controls the estimated peak memory of query operators with
spill-to-disk capabilities. A value closer to 1.0 biases the Planner
towards keeping as many rows as possible in memory, while a value closer
to 0.0 biases it towards spilling rows to disk under memory pressure.
Note that lowering MEM_ESTIMATE_SCALE_FOR_SPILLING_OPERATOR can make a
query that was previously rejected by the admission controller become
admittable, but it may also spill to disk more and carry a higher risk of
exhausting scratch space than before.
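
Read literally, the option can be thought of as interpolating between a
spill-friendly floor and the full in-memory estimate. The sketch below
only illustrates that reading; it is not the planner's actual Java code:

```python
def scaled_mem_estimate(in_memory_estimate, spill_floor, scale):
    # Illustration only: scale=1.0 keeps the full in-memory estimate,
    # scale=0.0 drops to a spill-to-disk floor (e.g. the operator's
    # minimum reservation). The real computation may differ in detail.
    assert 0.0 <= scale <= 1.0
    return spill_floor + scale * (in_memory_estimate - spill_floor)
```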

There are some caveats to this memory bounding patch:
- It checks whether spill-to-disk is enabled on the coordinator, but
  individual backend executors might not have it configured. A mismatch
  of spill-to-disk configs between the coordinator and a backend
  executor, however, is rare and can be considered a misconfiguration.
- It does not check the actual total scratch space available for
  spill-to-disk. However, query execution will be forced to spill anyway
  if memory usage exceeds the three memory configs above. Raising the
  MEM_LIMIT / MEM_LIMIT_EXECUTORS options can help increase the memory
  estimation and the likelihood of the query getting assigned to a
  larger executor group set, which usually has a bigger total scratch
  space.
- The memory bound is divided evenly among all instances of a fragment
  kind on a single host. But in theory, they should be able to share and
  grow their memory usage independently beyond the memory estimate as
  long as the max memory reservation is not set.
- It does not consider other memory-related configs such as the
  clamp_query_mem_limit_backend_mem_limit or disable_pool_mem_limits
  flags, but the admission controller will still enforce them if set.

Testing:
- Pass FE and custom cluster tests with core exploration.

Change-Id: I290c4e889d4ab9e921e356f0f55a9c8b11d0854e
Reviewed-on: http://gerrit.cloudera.org:8080/21762
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-03-13 18:38:27 +00:00
Riza Suminto
71feb617e4 IMPALA-13835: Remove reference to protocol-specific states
With IMPALA-13682 merged, checking for query state can be done via
wait_for_impala_state(), wait_for_any_impala_state(), and other helper
methods of ImpalaConnection. This patch removes all references to
protocol-specific states such as BeeswaxService.QueryState.
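
In test code this reads roughly as follows (a sketch; the argument order
and state values are assumptions, only the helper names come from the
commit message):

```python
def wait_until(client, handle, target_state, timeout_s=60):
    # Protocol-agnostic helper replacing direct comparisons against
    # protocol-specific enums such as BeeswaxService.QueryState.
    client.wait_for_impala_state(handle, target_state, timeout_s)

def wait_until_any(client, handle, target_states, timeout_s=60):
    client.wait_for_any_impala_state(handle, target_states, timeout_s)
```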

Also fixes flake8 errors and an unused variable in the modified test files.

Testing:
- Run and pass all affected tests.

Change-Id: Id6b56024fbfcea1ff005c34cd146d16e67cb6fa1
Reviewed-on: http://gerrit.cloudera.org:8080/22586
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-09 00:04:05 +00:00
Xuebin Su
242095ac8a IMPALA-13729: Accept error messages not starting with prompt
Previously, error_msg_expected() only accepted error messages starting
with the following error prompt:
```
Query <query_id> failed:\n
```
However, for some tests using the Beeswax protocol, the error prompt may
appear in the middle of the error message instead of at its beginning.

Therefore, this patch adapts error_msg_expected() to accept error
messages not starting with the error prompt.

The error_msg_expected() function is renamed to error_msg_startswith()
to better describe its behavior.
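
A minimal sketch of what such a check could look like (illustrative
only; the real helper lives in Impala's test utilities and its query-id
pattern may differ):

```python
import re

# Hypothetical prompt pattern for illustration.
_ERROR_PROMPT = re.compile(r"Query \S+ failed:\n")

def error_msg_startswith(actual, expected):
    # Tolerate the prompt appearing at the start or mid-message by
    # stripping it before the prefix comparison.
    return _ERROR_PROMPT.sub("", actual, count=1).startswith(expected)
```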

Change-Id: Iac3e68bcc36776f7fd6cc9c838dd8da9c3ecf58b
Reviewed-on: http://gerrit.cloudera.org:8080/22468
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-02-26 15:29:36 +00:00
Riza Suminto
c298c54262 IMPALA-13644: Generalize and move getPerInstanceNdvForCpuCosting
getPerInstanceNdvForCpuCosting is a method to estimate the number of
distinct values of exprs per fragment instance while accounting for the
likelihood of duplicate keys across fragment instances. It borrows the
probabilistic model described in IMPALA-2945. This method is used
exclusively by AggregationNode.

getPerInstanceNdvForCpuCosting runs the probabilistic formula
individually for each grouping expression and then multiplies the results
together. That matches how we estimated group NDV in the past, where we
simply multiplied the NDVs of the grouping expressions.

Recently, we added tuple-based analysis to lower the cardinality
estimate for all kinds of aggregation nodes (IMPALA-13045, IMPALA-13465,
IMPALA-13086). All of the bounding happens in
AggregationNode.computeStats(), where we call the estimateNumGroups()
function that returns a globalNdv estimate for a specific aggregation
class.

To take advantage of that more precise globalNdv, this patch replaces
getPerInstanceNdvForCpuCosting() with estimatePreaggCardinality(), which
applies the probabilistic formula over this single globalNdv number
rather than the old way, which often returned an overestimate from the
NDV multiplication method. Its use is still limited to calculating
ProcessingCost. Using it for the preagg output cardinality will be done
by IMPALA-2945.

estimatePreaggCardinality is skipped if the data partition of the input
is a subset of the grouping expressions.
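
A standard balls-and-bins reading of that probabilistic model looks
roughly like this (an illustration of the idea only; the Java
implementation may differ in detail):

```python
import math

def estimate_preagg_cardinality(global_ndv, input_rows, num_instances):
    # Probability that a particular group key shows up in one instance's
    # share of the input rows, assuming rows are distributed uniformly.
    rows_per_instance = float(input_rows) / num_instances
    p_seen = 1.0 - math.pow(1.0 - 1.0 / global_ndv, rows_per_instance)
    # Each instance pre-aggregates its local groups, so the total preagg
    # output is the per-instance distinct count summed over instances.
    return num_instances * global_ndv * p_seen
```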

Testing:
- Run and pass PlannerTests that set COMPUTE_PROCESSING_COST=True.
  ProcessingCost changes, but all cardinality numbers stay the same.
- Add CardinalityTest#testEstimatePreaggCardinality.
- Update test_executor_groups.py. Enable v2 profile as well for easier
  runtime profile debugging.

Change-Id: Iddf75833981558fe0188ea7475b8d996d66983c1
Reviewed-on: http://gerrit.cloudera.org:8080/22320
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-17 10:05:09 +00:00
Riza Suminto
4553e89ed4 IMPALA-13477: Set request_pool in QueryStateRecord for CTAS query
Resource pool information for CTAS queries is missing from the /queries
page of the WebUI. This is because a CTAS query has
TExecRequest.stmt_type = DDL. However, CTAS also has
TQueryExecRequest.stmt_type = DML and is subject to admission control.
Therefore, its request pool must be recorded in QueryStateRecord and
displayed on the /queries page of the WebUI.

Testing:
- Update assertion in test_executor_groups.py to reflect this change.
- Pass core tests.

Change-Id: I6885192f1c2d563e58670f142b3c0df528032a6e
Reviewed-on: http://gerrit.cloudera.org:8080/21975
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-26 04:54:36 +00:00
Riza Suminto
b07bd6ddeb IMPALA-13469: Deflake test_query_cpu_count_on_insert
A new test case from IMPALA-13445 revealed a pre-existing bug where
cost-based planning may increase expectedNumInputInstance beyond
inputFragment.getNumInstances(), which leads to a precondition violation.
The following conditions all held when the Precondition was hit:

1. The environment is either Erasure Coded HDFS or Ozone.
2. The source table does not have stats nor numRows table property.
3. There is only one fragment, consisting of a ScanNode, in the plan
   tree before the addition of the DML fragment.
4. Byte-based cardinality estimation logic kicks in.
5. Byte-based cardinality causes high scan cost, which leads to
   maxScanThread exceeding inputFragment.getPlanRoot().
6. expectedNumInputInstance is assigned equal to maxScanThread.
7. Precondition expectedNumInputInstance < inputFragment.getPlanRoot()
   is violated.

This scenario triggers a special condition that attempts to lower
expectedNumInputInstance. But instead of lowering
expectedNumInputInstance, the special logic increases it due to higher
byte-based cardinality estimation.

There is also a new bug where DistributedPlanner.java mistakenly passes
root.getInputCardinality() instead of root.getCardinality().

This patch fixes both issues and does minor refactoring to change
variable names to camel case. Validation of the last test case of
test_query_cpu_count_on_insert is relaxed to let it pass in Erasure
Coded HDFS and Ozone setups.

Testing:
- Make several assertions in test_executor_groups.py more verbose.
- Pass test_executor_groups.py in Erasure Coded HDFS and Ozone setup.
- Added new Planner tests with unknown cardinality estimation.
- Pass core tests in regular setup.

Change-Id: I834eb6bf896752521e733cd6b77a03f746e6a447
Reviewed-on: http://gerrit.cloudera.org:8080/21966
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-24 20:45:21 +00:00
Riza Suminto
9ca3ccb347 IMPALA-13445: Ignore num partition for unpartitioned writes
When cost-based planning is used, writer parallelism is limited by the
number of partitions. In the unpartitioned insert scenario, there is
just a single partition. That leads to only a single fs writer, which
causes slow writes.

This patch fixes the issue by distinguishing between partitioned and
unpartitioned inserts, which causes the following BEHAVIOR CHANGE if
COMPUTE_PROCESSING_COST=1:

1. If the insert is unpartitioned, use the byte-based estimate fully.
   Shuffling should only happen if the number of writers is less than
   the number of input fragment instances.
2. If the insert is partitioned, try to plan at least one writer for
   each shuffling executor node, but do not exceed the number of
   partitions. However, if the number of partitions is 1, try to force
   writer colocation with the input fragment.

Both partitioned and unpartitioned inserts still respect the
MAX_FS_WRITERS option. This patch also does minor cleanup in
DistributedPlanner.java.

Testing:
- In test_executor_groups.py, move insert tests from
  test_query_cpu_count_divisor_default into separate
  test_query_cpu_count_on_insert. Add some new insert test cases there.
- Add and pass CardinalityTest.testByteBasedNumWriters().
- Add new planner tests under TpcdsCpuCostPlannerTest.
- Pass test_executor_groups.py.

Change-Id: I51ab8fc35a5489351a88d372b28642b35449acfc
Reviewed-on: http://gerrit.cloudera.org:8080/21927
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-19 10:22:23 +00:00
Xuebin Su
ad868b9947 IMPALA-13115: Add query id to error messages
This patch adds the query id to the error messages in both

- the result of the `get_log()` RPC, and
- the error message in an RPC response

before they are returned to the client, so that the users can easily
figure out the errored queries on the client side.

To achieve this, the query id of the thread debug info is set in the
RPC handler method, and is retrieved from the thread debug info each
time the error reporting function or `get_log()` gets called.

Due to the change in the error message format, some checks in
impala-shell.py are adapted to keep them valid.
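
A rough sketch of the query-id-agnostic comparison helper
`error_msg_equal()` described under Testing below (illustrative; the
real helpers are in Impala's test utilities and may differ):

```python
import re

# Hypothetical query-id pattern for illustration only.
_QUERY_ID = re.compile(r"[0-9a-f]{16}:[0-9a-f]{16}")

def error_msg_equal(msg_a, msg_b):
    # Compare two error messages while ignoring the concrete query ids.
    def normalize(msg):
        return _QUERY_ID.sub("<query_id>", msg)
    return normalize(msg_a) == normalize(msg_b)
```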

Testing:
- Added helper function `error_msg_expected()` to check whether an
  error message is expected. It is stricter than only using the `in`
  operator.
- Added helper function `error_msg_equal()` to check if two error
  messages are equal regardless of the query ids.
- Various test cases are adapted to match the new error message format.
- `ImpalaBeeswaxException`, which is used in tests only, is simplified
  so that it has the same error message format as the exceptions for
  HS2.
- Added an assertion to the case of killing and restarting a worker
  in the custom cluster test to ensure that the query id is in
  the error message in the client log retrieved with `get_log()`.

Change-Id: I67e659681e36162cad1d9684189106f8eedbf092
Reviewed-on: http://gerrit.cloudera.org:8080/21587
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-08-08 14:11:04 +00:00
Riza Suminto
29e4186793 IMPALA-13024: Ignore slots if using default pool and empty group
Slot-based admission should not be enabled when using the default pool.
There is a bug where a coordinator-only query still does slot-based
admission because the executor group name is set to
ClusterMembershipMgr::EMPTY_GROUP_NAME ("empty group (using coordinator
only)"). This patch adds a check to recognize a coordinator-only query
in the default pool and skip slot-based admission for it.

Testing:
- Add BE test AdmissionControllerTest.CanAdmitRequestSlotsDefault.
- In test_executor_groups.py, split test_coordinator_concurrency to
  test_coordinator_concurrency_default and
  test_coordinator_concurrency_two_exec_group_cluster to show the
  behavior change.
- Pass core tests in ASAN build.

Change-Id: I0b08dea7ba0c78ac6b98c7a0b148df8fb036c4d0
Reviewed-on: http://gerrit.cloudera.org:8080/21340
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-26 21:18:42 +00:00
David Rorke
25a8d70664 IMPALA-12657: Improve ProcessingCost of ScanNode and NonGroupingAggregator
This patch improves the accuracy of the CPU ProcessingCost estimates for
several of the CPU intensive operators by basing the costs on benchmark
data. The general approach for a given operator was to run a set of queries
that exercised the operator under various conditions (e.g. large vs small
row sizes and row counts, varying NDV, different file formats, etc) and
capture the CPU time spent per unit of work (the unit of work might be
measured as some number of rows, some number of bytes, some number of
predicates evaluated, or some combination of these). The data was then
analyzed in an attempt to fit a simple model that would allow us to
predict CPU consumption of a given operator based on information available
at planning time.

For example, the CPU ProcessingCost for a Parquet scan is estimated as:
TotalCost = (0.0144 * BytesMaterialized) + (0.0281 * Rows * Predicate Count)

The coefficients (0.0144 and 0.0281) are derived from benchmarking
scans under a variety of conditions. Similar cost functions and
coefficients were derived for all of the benchmarked operators. The
coefficients for all the operators are normalized such that a single
unit of cost equates to roughly 100 nanoseconds of CPU time on an
r5d.4xlarge instance. So we would predict that an operator with a cost
of 10,000,000 would complete in roughly one second on a single core.
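
Plugging assumed numbers into that formula shows the scale of the
estimates (the coefficients are from above; the inputs are made up for
illustration):

```python
bytes_materialized = 1 * 1024 ** 3      # 1 GiB materialized from Parquet
rows = 10 * 1000 * 1000                 # 10M rows
predicate_count = 2

total_cost = 0.0144 * bytes_materialized + 0.0281 * rows * predicate_count
cpu_seconds = total_cost * 100e-9       # one cost unit ~= 100 ns of CPU
print(round(total_cost), round(cpu_seconds, 2))  # ~16023882, ~1.6 s/core
```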

Limitations:
* Costing only addresses CPU time spent and doesn't account for any IO
  or other wait time.
* Benchmarking scenarios didn't provide comprehensive coverage of the
  full range of data types, distributions, etc. More thorough
  benchmarking could improve the costing estimates further.
* This initial patch only covers a subset of the operators, focusing
  on those that are most common and most CPU intensive. Specifically
  the following operators are covered by this patch. All others
  continue to use the previous ProcessingCost code:
  AggregationNode
  DataStreamSink (exchange sender)
  ExchangeNode
  HashJoinNode
  HdfsScanNode
  HdfsTableSink
  NestedLoopJoinNode
  SortNode
  UnionNode

Benchmark-based costing of the remaining operators will be covered by
a future patch.

Future patches will automate the collection and analysis of the benchmark
data and the computation of the cost coefficients to simplify maintenance
of the costing as performance changes over time.

Change-Id: Icf1edd48d4ae255b7b3b7f5b228800d7bac7d2ca
Reviewed-on: http://gerrit.cloudera.org:8080/21279
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-20 06:48:58 +00:00
Riza Suminto
d437334e53 IMPALA-12988: Calculate an unbounded version of CpuAsk
Planner calculates CpuAsk through a recursive call beginning at
Planner.computeBlockingAwareCores(), which is called after
Planner.computeEffectiveParallelism(). It does blocking operator
analysis over the selected degree of parallelism that was decided during
the computeEffectiveParallelism() traversal. That selected degree of
parallelism, however, is already bounded by the min and max parallelism
config, derived from the PROCESSING_COST_MIN_THREADS and
MAX_FRAGMENT_INSTANCES_PER_NODE options respectively.

This patch calculates an unbounded version of CpuAsk that is not bounded
by min and max parallelism config. It is purely based on the fragment's
ProcessingCost and query plan relationship constraint (for example, the
number of JOIN BUILDER fragments should equal the number of destination
JOIN fragments for partitioned join).

The Frontend receives both bounded and unbounded CpuAsk values from
TQueryExecRequest on each executor group set selection round. The
unbounded CpuAsk is then scaled down once using an nth-root-based
sublinear function, controlled by the total CPU count of the smallest
executor group set and the bounded CpuAsk number. Another linear scaling
is then applied to both the bounded and unbounded CpuAsk using the
QUERY_CPU_COUNT_DIVISOR option. The Frontend then compares the unbounded
CpuAsk after scaling against CpuMax to avoid assigning a query to a
small executor group set too soon. The last executor group set stays as
the "catch-all" executor group set.

After this patch, setting COMPUTE_PROCESSING_COST=True will show
following changes in query profile:
- The "max-parallelism" fields in the query plan will all be set to
  maximum parallelism based on ProcessingCost.
- The CpuAsk counter is changed to show the unbounded CpuAsk after
  scaling.
- A new counter CpuAskBounded shows the bounded CpuAsk after scaling. If
  QUERY_CPU_COUNT_DIVISOR=1 and PLANNER_CPU_ASK slot counting strategy
  is selected, this CpuAskBounded is also the minimum total admission
  slots given to the query.
- A new counter MaxParallelism shows the unbounded CpuAsk before
  scaling.
- The EffectiveParallelism counter remains unchanged,
  showing bounded CpuAsk before scaling.

Testing:
- Update and pass FE test TpcdsCpuCostPlannerTest and
  PlannerTest#testProcessingCost.
- Pass EE test tests/query_test/test_tpcds_queries.py
- Pass custom cluster test tests/custom_cluster/test_executor_groups.py

Change-Id: I5441e31088f90761062af35862be4ce09d116923
Reviewed-on: http://gerrit.cloudera.org:8080/21277
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-20 00:28:53 +00:00
Riza Suminto
6abfdbc56c IMPALA-12980: Translate CpuAsk into admission control slots
Impala has a concept of "admission control slots" - the amount of
parallelism that should be allowed on an Impala daemon. This defaults to
the number of processors per executor and can be overridden with
the --admission_control_slots flag.

Admission control slot accounting is described in IMPALA-8998. It
computes 'slots_to_use' for each backend based on the maximum number of
instances of any fragment on that backend. This can lead to slot
underestimation and query overadmission. For example, assume an executor
node with 48 CPU cores and configured with --admission_control_slots=48.
It is assigned 4 non-blocking query fragments, each with 12 instances
scheduled on this executor. The IMPALA-8998 algorithm will request the
max-instance count (12) in slots rather than the sum of all non-blocking
fragment instances (48). With the 36 remaining slots free, the executor
can still admit another fragment from a different query but will
potentially have CPU contention with the one that is currently running.

When COMPUTE_PROCESSING_COST is enabled, the Planner generates a CpuAsk
number that represents the CPU requirement of that query over a
particular executor group set. This number is an estimate of the largest
number of query fragment instances that can run in parallel without
waiting, given by the blocking operator analysis. Therefore, the
fragment trace that sums into that CpuAsk number can also be translated
into 'slots_to_use', which more closely resembles the maximum parallel
execution of fragment instances.

This patch adds a new query option called SLOT_COUNT_STRATEGY to control
which admission control slot accounting to use. There are two possible
values:
- LARGEST_FRAGMENT, which is the original algorithm from IMPALA-8998.
  This is still the default value for the SLOT_COUNT_STRATEGY option.
- PLANNER_CPU_ASK, which will follow the fragment trace that contributes
  towards CpuAsk number. This strategy will schedule more or equal
  admission control slots than the LARGEST_FRAGMENT strategy.

For the PLANNER_CPU_ASK strategy, the Planner marks fragments that
contribute to CpuAsk as dominant fragments. It also passes the
max_slot_per_executor information it knows about the executor group set
to the scheduler.

The AvgAdmissionSlotsPerExecutor counter is added to describe what the
Planner thinks the average 'slots_to_use' per backend will be, which
follows this formula:

  AvgAdmissionSlotsPerExecutor = ceil(CpuAsk / num_executors)

The actual 'slots_to_use' on each backend may differ from
AvgAdmissionSlotsPerExecutor, depending on what is scheduled on that
backend. 'slots_to_use' will be shown as the 'AdmissionSlots' counter
under each executor profile node.
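
Worked through with assumed numbers (Python 3 division), the counter
behaves like this:

```python
import math

# A query whose scaled CpuAsk is 100 on a 12-executor group set averages
# ceil(100 / 12) = 9 admission slots per backend.
cpu_ask = 100
num_executors = 12
print(math.ceil(cpu_ask / num_executors))  # 9
```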

Testing:
- Update test_executors.py with AvgAdmissionSlotsPerExecutor assertion.
- Pass test_tpcds_queries.py::TestTpcdsQueryWithProcessingCost.
- Add EE test test_processing_cost.py.
- Add FE test PlannerTest#testProcessingCostPlanAdmissionSlots.

Change-Id: I338ca96555bfe8d07afce0320b3688a0861663f2
Reviewed-on: http://gerrit.cloudera.org:8080/21257
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-18 21:58:13 +00:00
Riza Suminto
ac8ffa9125 IMPALA-12654: Add query option QUERY_CPU_COUNT_DIVISOR
IMPALA-11604 added a hidden backend flag named query_cpu_count_divisor
to allow oversubscribing CPU cores beyond what is available in the
executor group set. This patch adds a query option with the same name
and function so that CPU core matching can be tuned for individual
queries. The query option takes precedence over the flag.

Testing:
- Add test case in test_executor_groups.py and query-options-test.cc

Change-Id: I34ab47bd67509a02790c3caedb3fde4d1b6eaa78
Reviewed-on: http://gerrit.cloudera.org:8080/20819
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-01-04 08:18:10 +00:00
Riza Suminto
04f31aea71 IMPALA-12567: Deflake test_75_percent_availability
TestExecutorGroups.test_75_percent_availability can fail in certain
build/test setups when the last impalad does not start within the 5s
delay injection. This patch simplifies the test by launching fewer
impalads in total (reduced from 8 to 5, excluding the coordinator) and
increases the delay injection to ensure the test query runs on all five
executors. The test is renamed to test_partial_availability accordingly.

Testing:
- Run and pass the test against HDFS and S3.

Change-Id: I2e70f1dde10045c32c2bb4f6f78e8a707c9cd97d
Reviewed-on: http://gerrit.cloudera.org:8080/20712
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-11-17 19:06:49 +00:00
Riza Suminto
89a48b80a2 IMPALA-12444: Fix minimum parallelism bug in scan fragment
The scan fragment did not follow the PROCESSING_COST_MIN_THREADS set by
the user even if the total scan ranges allowed it to. This patch fixes
the issue by exposing ScanNode.maxScannerThreads_ to
PlanFragment.adjustToMaxParallelism(). By using
ScanNode.maxScannerThreads_ as an upper bound, ScanNode does not need to
artificially lower its ProcessingCost if maxScannerThreads_ is lower
than the minimum parallelism dictated by the original ProcessingCost.
Thus, the synthetic ProcessingCost logic in the ScanNode class is
revised to only apply if the input cardinality is unknown (-1).

This patch also makes the following adjustments:
- Remove some dead code in Frontend.java and PlanFragment.java.
- Add a sanity check such that PROCESSING_COST_MIN_THREADS <=
  MAX_FRAGMENT_INSTANCES_PER_NODE.
- Tidy up test_query_cpu_count_divisor_default to reduce the number of
  SET queries.

Testing:
- Update test_query_cpu_count_divisor_default to ensure that
  PROCESSING_COST_MIN_THREADS is respected by scan fragment and error
  is returned if PROCESSING_COST_MIN_THREADS is greater than
  MAX_FRAGMENT_INSTANCES_PER_NODE.
- Pass test_executor_groups.py.

Change-Id: I69e5a80146d4ac41de5ef406fc2bdceffe3ec394
Reviewed-on: http://gerrit.cloudera.org:8080/20475
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2023-10-02 16:36:29 +00:00
Abhishek Rawat
b73bc49ea7 IMPALA-12400: Test expected executors used for planning when no executor groups are healthy
Added a custom cluster test for the number of executors used for
planning when no executor groups are healthy. The Planner should use the
number of executors from 'num_expected_executors' or
'expected_executor_group_sets' when executor groups aren't healthy.

Change-Id: Ib71ca0a5402c74d07ee875878f092d6d3827c6b7
Reviewed-on: http://gerrit.cloudera.org:8080/20419
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-09-01 00:07:46 +00:00
Riza Suminto
0c8fc997ef IMPALA-12395: Override scan cardinality for optimized count star
The cardinality estimate in HdfsScanNode.java for count queries does
not account for the fact that the count optimization only scans metadata
and not the actual columns. The optimized count star scan returns only 1
row per Parquet row group.

This patch overrides the scan cardinality with the total number of
files, which is the closest estimate to the number of row groups. A
similar override already exists in IcebergScanNode.java.

Testing:
- Add count query testcases in test_query_cpu_count_divisor_default
- Pass core tests

Change-Id: Id5ce967657208057d50bd80adadac29ebb51cbc5
Reviewed-on: http://gerrit.cloudera.org:8080/20406
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-08-28 20:32:13 +00:00
Riza Suminto
0d3fc33bbb IMPALA-12300: (addendum) Remove HDFS specific assertion
test_75_percent_availability fails against the Ozone and S3 test
environments because the test expects the string "SCAN HDFS" to be found
in the profile. Instead, there is "SCAN OZONE" or "SCAN S3" for the
Ozone and S3 test environments respectively. This patch fixes the test
by removing that assertion from test_75_percent_availability. The
remaining assertions are enough to verify that the FE planner and BE
scheduler can see the cluster membership change.

Change-Id: Id14934d2fce0f6cf03242c36c0142bc697b4180e
Reviewed-on: http://gerrit.cloudera.org:8080/20259
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-07-24 21:35:47 +00:00
Riza Suminto
0bee563073 IMPALA-12300: Turn CheckEffectiveInstanceCount to print warning
Scheduler::CheckEffectiveInstanceCount was added to check consistency
between FE planning and BE scheduling if COMPUTE_PROCESSING_COST=true.
This consistency can be broken if there is a cluster membership
change (a new executor comes online) between FE planning and BE
scheduling. Say, in an executor group of size 10 with a 90% health
threshold, the admission controller is allowed to run a query when only
9 executors are available. If the 10th executor comes online between FE
planning and BE scheduling, CheckEffectiveInstanceCount can fail and
return an error.

This patch turns two error statuses in CheckEffectiveInstanceCount into
warnings, written either to the query profile as an InfoString or to the
WARNING log. The MAX_FRAGMENT_INSTANCES_PER_NODE violation check still
returns an error.

Testing:
- Add test_75_percent_availability
- Pass test_executors.py

Change-Id: Ieaf6a46c4f12dbf8b03d1618c2f090ab4f2ac665
Reviewed-on: http://gerrit.cloudera.org:8080/20231
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-07-21 00:37:06 +00:00
Riza Suminto
c9aed342f5 IMPALA-12281: Disallow unsetting REQUEST_POOL if it is set by client
IMPALA-12056 enabled child queries to unset REQUEST_POOL if it is set
by Frontend.java as part of executor group selection. However, the
implementation missed calling setRequest_pool_set_by_frontend(false) if
REQUEST_POOL is explicitly set by the client request through the
impala-shell configuration. This caused child queries to always unset
REQUEST_POOL if the parent query was executed via impala-shell. This
patch fixes the issue by checking the query options that come from the
client.

This patch also tidies up null and empty REQUEST_POOL checking by using
StringUtils.isNotEmpty().

Testing:
- Add testcase in test_query_cpu_count_divisor_default

Change-Id: Ib5036859d51bc64f568da405f730c8f3ffebb742
Reviewed-on: http://gerrit.cloudera.org:8080/20189
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2023-07-13 22:40:41 +00:00
Riza Suminto
7a94adbc30 IMPALA-12192: Fix scaling bug in scan fragment
IMPALA-12091 has a bug where scan fragment parallelism is always
limited solely by the ScanNode cost. If the ScanNode is colocated with
other query operators that have higher processing costs, the Planner
will not scale the fragment up beyond what the ScanNode cost allows.

This patch fixes the problem in two aspects. The first is to allow a
scan fragment to scale up higher as long as it is within the total
fragment cost and the number of effective scan ranges. The second is to
add a missing Math.max() in CostingSegment.java, whose absence caused
lower fragment parallelism even when the total fragment cost is high.

The IMPALA-10287 optimization is re-enabled to reduce the regression in
TPC-DS Q78. Ideally, the broadcast vs. partitioned costing formula
during distributed planning should not rely on numInstance, but enabling
this optimization ensures a consistent query plan shape when comparing
against the MT_DOP plan. This optimization can still be disabled by
specifying USE_DOP_FOR_COSTING=false.

This patch also does some cleanup including:
- Fix "max-parallelism" value in explain string.
- Make a constant in ScanNode.rowMaterializationCost() into a backend
  flag named scan_range_cost_factor for experimental purposes.
- Replace all references to ProcessingCost.isComputeCost() with direct
  calls to queryOptions.isCompute_processing_cost().
- Add a Precondition in PlanFragment.getNumInstances() to verify that
  the fragment's instance count is not modified anymore after the
  costing algorithm finishes.

Testing:
- Manually run TPCDS Q84 over tpcds10_parquet and confirm that the
  leftmost scan fragment parallelism is raised from 12 (before the
  patch) to 18 (after the patch).
- Add test in PlannerTest.testProcessingCost that reproduces the issue.
- Update compute stats test in test_executor_groups.py to maintain test
  assertion.
- Pass core tests.

Change-Id: I7010f6c3bc48ae3f74e8db98a83f645b6c157226
Reviewed-on: http://gerrit.cloudera.org:8080/20024
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-06-24 02:24:41 +00:00
Riza Suminto
dbddb08447 IMPALA-12120: Limit output writer parallelism based on write volume
The new processing cost-based planner changes (IMPALA-11604,
IMPALA-12091) will impact output writer parallelism for insert queries,
with the potential for more small files if the processing cost-based
planning results in too many writer fragments. It can further exacerbate
a problem introduced by MT_DOP (see IMPALA-8125).

The MAX_FS_WRITERS query option can help mitigate this. But even without
the MAX_FS_WRITERS set, the default output writer parallelism should
avoid creating excessive writer parallelism for partitioned and
unpartitioned inserts.

This patch implements such a limit when using the cost-based planner. It
limits the number of writer fragments such that each writer fragment
writes at least 256MB of rows. This patch also allows CTAS (a kind of
DDL query) to be eligible for auto-scaling.

This patch also removes comments about NUM_SCANNER_THREADS added by
IMPALA-12029, since they no longer apply after IMPALA-12091.

Testing:
- Add test cases in test_query_cpu_count_divisor_default
- Add test_processing_cost_writer_limit in test_insert.py
- Pass test_insert.py::TestInsertHdfsWriterLimit
- Pass test_executor_groups.py

Change-Id: I289c6ffcd6d7b225179cc9fb2f926390325a27e0
Reviewed-on: http://gerrit.cloudera.org:8080/19880
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-26 00:08:02 +00:00
Riza Suminto
1d0b111bcf IMPALA-12091: Control scan parallelism by its processing cost
Before this patch, Impala still relied on the MT_DOP option to decide
the degree of parallelism of the scan fragment when a query ran with
COMPUTE_PROCESSING_COST=1. This patch adds the scan node's processing
cost as another consideration to raise scan parallelism beyond MT_DOP.

The scan node cost is now adjusted to also consider the number of
effective scan ranges. Each scan range is given a weight of (0.5% *
min_processing_per_thread), which roughly means that one scan node
instance can handle at most 200 scan ranges.
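
The arithmetic behind the "at most 200 scan ranges" statement (the
min_processing_per_thread value is a placeholder, not Impala's actual
default):

```python
min_processing_per_thread = 1000 * 1000
scan_range_weight = 0.005 * min_processing_per_thread  # 0.5% per scan range
print(min_processing_per_thread / scan_range_weight)   # 200.0 ranges/thread
```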

The query option MAX_FRAGMENT_INSTANCES_PER_NODE is added as an upper
bound on scan parallelism if COMPUTE_PROCESSING_COST=true. If the number
of scan ranges is fewer than the maximum parallelism allowed by the scan
node's processing cost, that processing cost will be clamped down
to (min_processing_per_thread / number of scan ranges). Lowering
MAX_FRAGMENT_INSTANCES_PER_NODE can also clamp down the scan processing
cost in a similar way. For interior fragments, a combination of
MAX_FRAGMENT_INSTANCES_PER_NODE, PROCESSING_COST_MIN_THREADS, and the
number of available cores per node is taken into account to determine
the maximum fragment parallelism per node. For the scan fragment, only
the first two are considered, to encourage the Frontend to choose a
larger executor group as needed.

Two new static states are added to exec-node.h: is_mt_fragment_ and
num_instances_per_node_. The backend code that refers to the MT_DOP
option is replaced with either is_mt_fragment_ or
num_instances_per_node_.

Two new criteria are added during effective parallelism calculation in
PlanFragment.adjustToMaxParallelism():

- If a fragment has UnionNode, its parallelism is the maximum between
  its input fragments and its collocated ScanNode's expected
  parallelism.
- If a fragment only has a single ScanNode (and no UnionNode), its
  parallelism is calculated in the same fashion as the interior fragment
  but will not be lowered anymore since it will not have any child
  fragment to compare with.

Admission control slots remain unchanged. This may cause a query to
fail admission if the Planner selects a scan parallelism that is higher
than the configured admission control slots value. Setting
MAX_FRAGMENT_INSTANCES_PER_NODE equal to or lower than the configured
admission control slots value can help lower scan parallelism and pass
the admission controller.

The previous workaround to control scan parallelism from IMPALA-12029
is now removed. This patch also disables the IMPALA-10287 optimization
if COMPUTE_PROCESSING_COST=true. This is because IMPALA-10287 relies on
a fixed number of fragment instances in DistributedPlanner.java, whereas
the effective parallelism calculation is done much later and may change
the final number of instances of the hash join fragment, rendering the
DistributionMode selected by IMPALA-10287 inaccurate.

This patch is benchmarked using single_node_perf_run.py with the
following parameters:

args="-gen_experimental_profile=true -default_query_options="
args+="mt_dop=4,compute_processing_cost=1,processing_cost_min_threads=1 "
./bin/single_node_perf_run.py --num_impalads=3 --scale=10 \
    --workloads=tpcds --iterations=5 --table_formats=parquet/none/none \
    --impalad_args="$args" \
    --query_names=TPCDS-Q3,TPCDS-Q14-1,TPCDS-Q14-2,TPCDS-Q23-1,TPCDS-Q23-2,TPCDS-Q49,TPCDS-Q76,TPCDS-Q78,TPCDS-Q80A \
    "IMPALA-12091~1" IMPALA-12091

The benchmark result is as follows:
+-----------+-------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload  | Query       | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-----------+-------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| TPCDS(10) | TPCDS-Q23-1 | parquet / none / none | 4.62   | 4.54        |   +1.92%   |   0.23%    |   1.59%        | 5     |   +2.32%       | 1.15    | 2.67  |
| TPCDS(10) | TPCDS-Q14-1 | parquet / none / none | 5.82   | 5.76        |   +1.08%   |   5.27%    |   3.89%        | 5     |   +2.04%       | 0.00    | 0.37  |
| TPCDS(10) | TPCDS-Q23-2 | parquet / none / none | 4.65   | 4.58        |   +1.38%   |   1.97%    |   0.48%        | 5     |   +0.81%       | 0.87    | 1.51  |
| TPCDS(10) | TPCDS-Q49   | parquet / none / none | 1.49   | 1.48        |   +0.46%   | * 36.02% * | * 34.95% *     | 5     |   +1.26%       | 0.58    | 0.02  |
| TPCDS(10) | TPCDS-Q14-2 | parquet / none / none | 3.76   | 3.75        |   +0.39%   |   1.67%    |   0.58%        | 5     |   -0.03%       | -0.58   | 0.49  |
| TPCDS(10) | TPCDS-Q78   | parquet / none / none | 2.80   | 2.80        |   -0.04%   |   1.32%    |   1.33%        | 5     |   -0.42%       | -0.29   | -0.05 |
| TPCDS(10) | TPCDS-Q80A  | parquet / none / none | 2.87   | 2.89        |   -0.51%   |   1.33%    |   0.40%        | 5     |   -0.01%       | -0.29   | -0.82 |
| TPCDS(10) | TPCDS-Q3    | parquet / none / none | 0.18   | 0.19        |   -1.29%   | * 15.26% * | * 15.87% *     | 5     |   -0.54%       | -0.87   | -0.13 |
| TPCDS(10) | TPCDS-Q76   | parquet / none / none | 1.08   | 1.11        |   -2.98%   |   0.92%    |   1.70%        | 5     |   -3.99%       | -2.02   | -3.47 |
+-----------+-------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

Testing:
- Pass PlannerTest.testProcessingCost
- Pass test_executor_groups.py
- Reenable test_tpcds_q51a in TestTpcdsQueryWithProcessingCost with
  MAX_FRAGMENT_INSTANCES_PER_NODE set to 5
- Pass test_tpcds_queries.py::TestTpcdsQueryWithProcessingCost
- Pass core tests

Change-Id: If948e45455275d9a61a6cd5d6a30a8b98a7c729a
Reviewed-on: http://gerrit.cloudera.org:8080/19807
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-11 22:46:31 +00:00
Riza Suminto
7ca20b3c94 Revert "IMPALA-11123: Optimize count(star) for ORC scans"
This reverts commit f932d78ad0.

The commit is reverted because it causes a significant regression for
non-optimized count star queries in Parquet format.

There are several conflicts that need to be resolved manually:
- Removed assertion against 'NumFileMetadataRead' counter that is lost
  with the revert.
- Adjust the assertions in test_plain_count_star_optimization,
  test_in_predicate_push_down, and test_partitioned_insert of
  test_iceberg.py due to the missing improvement in the Parquet
  optimized count star code path.
- Keep the "override" specifier in hdfs-parquet-scanner.h to pass
  clang-tidy.
- Keep the Python 3 style of RuntimeError instantiation in
  test_file_parser.py to pass check-python-syntax.sh.

Change-Id: Iefd8fd0838638f9db146f7b706e541fe2aaf01c1
Reviewed-on: http://gerrit.cloudera.org:8080/19843
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2023-05-06 22:55:05 +00:00
Riza Suminto
1830abcc7d IMPALA-12056: Let child queries unset REQUEST_POOL if auto-scaling
'Compute Stats' queries get scheduled on the smallest executor group
set since these queries don't do any real work. However, their child
queries also get scheduled on the smallest executor group. This may not
be ideal for cases where the child query computes NDVs and counts on a
big, wide table.

This patch lets child queries unset the REQUEST_POOL query option if
that option was set by the frontend planner rather than the client. With
REQUEST_POOL unset, the child query can select the executor group that
best fits its workload.

Testing:
- Add test in test_query_cpu_count_divisor_default

Change-Id: I6dc559aa161a27a7bd5d3034788cc6241490d3b5
Reviewed-on: http://gerrit.cloudera.org:8080/19832
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-04 15:54:39 +00:00
Riza Suminto
f0b9340983 IMPALA-12041: Select first executor group if query not auto-scalable
In a setup with multiple executor groups, some trivial queries like
"select 1;" fail admission with a "No mapping found for request" error
message. This patch fixes a bug where the Frontend does not set the
group name prefix when a query is not auto-scalable. In cases like a
trivial query run, the correct executor group name prefix is still
needed for the backend to correctly resolve the target pool.
Testing:
- Pass test_executor_groups.py

Change-Id: I89497c8f67bfd176c2b60fa1b70fe53f905bbab0
Reviewed-on: http://gerrit.cloudera.org:8080/19691
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-04-06 07:51:30 +00:00
wzhou-code
ac23deab4d IMPALA-12036: Fix Web UI to show right resource pools
The Web UI queries page shows no resource pool unless it is specified
with a query option. The Planner can set TQueryCtx.request_pool in
TQueryExecRequest when auto-scaling is enabled, but the backend ignores
the TQueryCtx.request_pool in TQueryExecRequest when getting resource
pools for the Web UI.
This patch fixes the issue in ClientRequestState::request_pool() by
checking TQueryCtx.request_pool in TQueryExecRequest. It also removes
the error path in RequestPoolService::ResolveRequestPool() when
requested_pool is an empty string.

Testing:
 - Updated TestExecutorGroups::test_query_cpu_count_divisor_default,
   TestExecutorGroups::test_query_cpu_count_divisor_two, and
   TestExecutorGroups::test_query_cpu_count_divisor_fraction to
   verify resource pools on Web queries site and Web admission site.
 - Updated expected error message in
   TestAdmissionController::test_set_request_pool.
 - Passed core test.

Change-Id: Iceacb3a8ec3bd15a8029ba05d064bbbb81e3a766
Reviewed-on: http://gerrit.cloudera.org:8080/19688
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
2023-04-05 15:45:06 +00:00
Riza Suminto
b2bc488402 IMPALA-12029: Relax scan fragment parallelism on first planning
In a setup with multiple executor group sets, the Frontend tries to
match a query with the smallest executor group set that can fit the
memory and CPU requirements of the compiled query. There are kinds of
queries where the compiled plan fits any executor group set but does not
necessarily deliver the best performance. An example of this is Impala's
COMPUTE STATS query: it does a full table scan and aggregates the stats,
has a fairly simple query plan shape, but can benefit from higher scan
parallelism.

This patch relaxes the scan fragment parallelism on the first round of
query planning. This allows the scan fragment to increase its
parallelism based on its ProcessingCost estimation. If the relaxed plan
fits in an executor group set, we replan once again with that executor
group set but with scan fragment parallelism returned to MT_DOP. This
one extra round of query planning adds a couple of milliseconds of
overhead depending on the complexity of the query plan, but it is
necessary since the backend scheduler still expects at most MT_DOP scan
fragment instances. We can remove the extra replanning in the future
once we can fully manage scan node parallelism without MT_DOP.

This patch also adds some improvements, including:
- Tune computeScanProcessingCost() to guard against scheduling too many
  scan fragments by comparing with the actual scan range count that the
  Planner knows.
- Use NUM_SCANNER_THREADS as a hint to cap the scan node cost during the
  first round of planning.
- Multiply memory-related counters by the number of executors to make
  them per group set rather than per node.
- Fix a bug in doCreateExecRequest() in the selection of the number of
  executors for planning.

Testing:
- Pass test_executor_groups.py
- Add test cases in test_min_processing_per_thread_small.
- Raised impala.admission-control.max-query-mem-limit.root.small from
  64MB to 70MB in llama-site-3-groups.xml so that the new grouping query
  can fit in root.small pool.

Change-Id: I7a2276fbd344d00caa67103026661a3644b9a1f9
Reviewed-on: http://gerrit.cloudera.org:8080/19656
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2023-04-04 17:37:05 +00:00
zhangyifan27
35fe1f37f5 IMPALA-8731: Balance queries across multiple executor groups
This patch adds a new admission control policy that attempts to
balance queries across multiple executor groups belonging to the
same request pool based on available memory and slots in each
executor group. This feature can be enabled by setting the startup
flag '-balance_queries_across_executor_groups=true'. The setting is
off by default.

Testing:
  - Add e2e tests to verify the default policy and the new policy.

Change-Id: I25e851fb57c1d820c25cef5316f4ed800e4c6ac5
Reviewed-on: http://gerrit.cloudera.org:8080/19630
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2023-03-25 21:52:18 +00:00
Riza Suminto
1eb6ed6f29 IMPALA-12023: Skip resource checking on last executor group set
This patch adds the flag
skip_resource_checking_on_last_executor_group_set. If this backend flag
is set to true, memory and CPU resource checking will be skipped when a
query is being planned against the last (largest) executor group set.
Setting it to true ensures that a query will always be admitted into the
last executor group set if it does not fit in any other group set.
Testing:
- Tune test_query_cpu_count_divisor_fraction to run two test cases:
  CPU within limit, and CPU outside limit.
- Add test_no_skip_resource_checking

Change-Id: I5848e4f67939d3dd2fb105c1ae4ca8e15f2e556f
Reviewed-on: http://gerrit.cloudera.org:8080/19649
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-24 08:32:14 +00:00
Riza Suminto
f2b01c1ddb IMPALA-12017: Skip memory and cpu limit check if REQUEST_POOL is set
Memory and CPU limit checking in executor group
selection (Frontend.java) should be skipped if the REQUEST_POOL query
option is set. Setting REQUEST_POOL means the user is specifying the
pool to run the query in, regardless of memory and CPU limits.

Testing:
- Add test cases in test_query_cpu_count_divisor_default

Change-Id: I14bf7fe71e2dda1099651b3edf62480e1fdbf845
Reviewed-on: http://gerrit.cloudera.org:8080/19645
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2023-03-23 13:01:52 +00:00
Riza Suminto
5c1693d03d IMPALA-12005: Describe executor group set selection in query profile
This patch adds new profile counters under the Frontend profile node to
describe executor group set selection during query planning. It modifies
FrontendProfile.java to allow one level of TRuntimeProfileNode nesting
under the Frontend profile node. This makes it possible to group profile
counters specific to each executor group set under consideration. The
"fragment-costs" hint is renamed to "segment-costs". A new
"cpu-comparison-result" hint is added after "segment-costs" to help
navigate how the CPU sizing decision is made.

This patch also adds some function overloading in runtime-profile.cc to
hide TotalTime and InactiveTotalTime, which are meaningless for anything
under the Frontend profile node. Additional context is also added to the
AnalysisException thrown when none of the executor group sets fits the
query requirements.

This is how the Frontend profile node looks after running
TestExecutorGroups::test_query_cpu_count_divisor_fraction:

    Frontend:
      Referenced Tables: tpcds_parquet.store_sales
       - CpuCountDivisor: 0.20
       - ExecutorGroupsConsidered: 3 (3)
      Executor group 1 (root.tiny):
        Verdict: not enough cpu cores
         - CpuAsk: 15 (15)
         - CpuMax: 2 (2)
         - EffectiveParallelism: 3 (3)
         - MemoryAsk: 36.83 MB (38617088)
         - MemoryMax: 64.00 MB (67108864)
      Executor group 2 (root.small):
        Verdict: not enough cpu cores
         - CpuAsk: 25 (25)
         - CpuMax: 16 (16)
         - EffectiveParallelism: 5 (5)
         - MemoryAsk: 36.83 MB (38624004)
         - MemoryMax: 64.00 MB (67108864)
      Executor group 3 (root.large):
        Verdict: Match
         - CpuAsk: 35 (35)
         - CpuMax: 192 (192)
         - EffectiveParallelism: 7 (7)
         - MemoryAsk: 36.84 MB (38633570)
         - MemoryMax: 8388608.00 GB (9007199254740992)

Testing:
- Pass core tests

Change-Id: I6c0ac7f5216d631e4439fe97702e21e06d2eda8a
Reviewed-on: http://gerrit.cloudera.org:8080/19628
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2023-03-23 13:01:52 +00:00
Joe McDonnell
0c7c6a335e IMPALA-11977: Fix Python 3 broken imports and object model differences
Python 3 changed some object model methods:
 - __nonzero__ was removed in favor of __bool__
 - func_dict / func_name were removed in favor of __dict__ / __name__
 - The next() method on iterators was replaced by __next__
   (Code locations should use next(iter) rather than iter.next())
 - metaclasses are specified in a different way
 - Locations that specify __eq__ should also specify __hash__

Python 3 also moved some packages around (urllib2, Queue, httplib,
etc), and this adapts the code to use the new locations (usually
handled on Python 2 via future). This also fixes the code to
avoid referencing exception variables outside the exception block
and variables outside of a comprehension. Several of these seem
like false positives, but it is better to avoid the warning.
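
As a hypothetical example (not code from the Impala repo), a class
written to satisfy these rules on both Python 2 and 3 might look like:

  class Batch(object):
      def __init__(self, rows):
          self._rows = list(rows)

      def __bool__(self):          # Python 3 truth-testing hook
          return bool(self._rows)
      __nonzero__ = __bool__       # keep the Python 2 spelling working

      def __eq__(self, other):
          return isinstance(other, Batch) and self._rows == other._rows

      def __hash__(self):          # required alongside __eq__ on Python 3
          return hash(tuple(self._rows))

  it = iter([1, 2, 3])
  print(next(it))                  # use next(iter), not iter.next()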

This fixes these pylint warnings:
bad-python3-import
eq-without-hash
metaclass-assignment
next-method-called
nonzero-method
exception-escape
comprehension-escape

Testing:
 - Ran core tests
 - Ran release exhaustive tests

Change-Id: I988ae6c139142678b0d40f1f4170b892eabf25ee
Reviewed-on: http://gerrit.cloudera.org:8080/19592
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
eb66d00f9f IMPALA-11974: Fix lazy list operators for Python 3 compatibility
Python 3 changes list operators such as range, map, and filter
to be lazy. Some code that expects the list operators to evaluate
immediately will fail, e.g.

Python 2:
range(0,5) == [0,1,2,3,4]
True

Python 3:
range(0,5) == [0,1,2,3,4]
False

The fix is to wrap these locations with list(), i.e.

Python 3:
list(range(0,5)) == [0,1,2,3,4]
True

Since the base operators are now lazy, Python 3 also removes the
old lazy versions (e.g. xrange, ifilter, izip, etc). This uses
future's builtins package to convert the code to the Python 3
behavior (i.e. xrange -> future's builtins.range).
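
A small sketch of the resulting pattern, assuming the 'future' package
is installed (as it is in Impala's virtualenv):

  # On Python 2 this pulls Python 3 semantics from the 'future' package;
  # on Python 3 it is simply the standard builtins module.
  from builtins import filter, map, range

  evens = list(filter(lambda x: x % 2 == 0, range(10)))
  squares = list(map(lambda x: x * x, range(4)))
  assert evens == [0, 2, 4, 6, 8]
  assert squares == [0, 1, 4, 9]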

Most of the changes were done via these futurize fixes:
 - libfuturize.fixes.fix_xrange_with_import
 - lib2to3.fixes.fix_map
 - lib2to3.fixes.fix_filter

This eliminates the pylint warnings:
 - xrange-builtin
 - range-builtin-not-iterating
 - map-builtin-not-iterating
 - zip-builtin-not-iterating
 - filter-builtin-not-iterating
 - reduce-builtin
 - deprecated-itertools-function

Testing:
 - Ran core job

Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f
Reviewed-on: http://gerrit.cloudera.org:8080/19589
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
82bd087fb1 IMPALA-11973: Add absolute_import, division to all eligible Python files
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
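
A sketch of the header each file now carries and the effect of the
division import:

  from __future__ import absolute_import, division, print_function

  # With 'division' in effect, / is true division even on Python 2:
  assert 7 / 2 == 3.5
  # Use // wherever an integer result is required (indices, counts):
  assert 7 // 2 == 3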

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Riza Suminto
dafc0fb7a8 IMPALA-11604 (part 2): Compute Effective Parallelism of Query
Part 1 of IMPALA-11604 implements the ProcessingCost model for each
PlanNode and DataSink. This second part builds on top of the
ProcessingCost model by adjusting the number of instances for each
fragment after considering their production-consumption ratio, and then
finally returns a number representing the ideal CPU core count required
for a query to run efficiently. A more detailed explanation of the CPU
costing algorithm can be found in the three steps below.

I. Compute the total ProcessingCost of a fragment.

The costing algorithm splits a query fragment into several segments
divided by blocking PlanNode/DataSink boundaries. Each fragment segment
is a subtree of PlanNodes/DataSinks in the fragment with a DataSink or
blocking PlanNode as root and non-blocking leaves. All other nodes in
the segment are non-blocking. PlanNodes or DataSinks that belong to the
same segment have their ProcessingCosts summed. A new CostingSegment
class is added to represent this segment.

A fragment that has a blocking PlanNode or blocking DataSink is called a
blocking fragment. Currently, only JoinBuildSink is considered a
blocking DataSink. A fragment without any blocking nodes is called a
non-blocking fragment. Step III discusses blocking and non-blocking
fragments further.

Take as an example the following fragment plan, which is blocking since
it has three blocking PlanNodes: 12:AGGREGATE, 06:SORT, and 08:TOP-N.

  F03:PLAN FRAGMENT [HASH(i_class)] hosts=3 instances=6 (adjusted from 12)
  fragment-costs=[34974657, 2159270, 23752870, 22]
  08:TOP-N [LIMIT=100]
  |  cost=900
  |
  07:ANALYTIC
  |  cost=23751970
  |
  06:SORT
  |  cost=2159270
  |
  12:AGGREGATE [FINALIZE]
  |  cost=34548320
  |
  11:EXCHANGE [HASH(i_class)]
     cost=426337

In bottom-up order, there are four segments in F03:
  Blocking segment 1: (11:EXCHANGE, 12:AGGREGATE)
  Blocking segment 2: 06:SORT
  Blocking segment 3: (07:ANALYTIC, 08:TOP-N)
  Non-blocking segment 4: DataStreamSink of F03

Therefore we have:
  PC(segment 1) = 426337+34548320
  PC(segment 2) = 2159270
  PC(segment 3) = 23751970+900
  PC(segment 4) = 22

These per-segment costs are stored in a CostingSegment tree rooted at
PlanFragment.rootSegment_, and are [34974657, 2159270, 23752870, 22]
respectively after the post-order traversal.

This is implemented in PlanFragment.computeCostingSegment() and
PlanFragment.collectCostingSegmentHelper().
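
The arithmetic for the F03 example can be sketched in a few lines of
Python (conceptual only; the real implementation is the Java code named
above):

  f03_segments = [
      [426337, 34548320],   # segment 1: 11:EXCHANGE + 12:AGGREGATE
      [2159270],            # segment 2: 06:SORT
      [23751970, 900],      # segment 3: 07:ANALYTIC + 08:TOP-N
      [22],                 # segment 4: DataStreamSink of F03
  ]
  fragment_costs = [sum(seg) for seg in f03_segments]
  assert fragment_costs == [34974657, 2159270, 23752870, 22]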

II. Compute the effective degree of parallelism (EDoP) of fragments.

The costing algorithm walks the PlanFragments of the query plan tree in
a post-order traversal. Upon visiting a PlanFragment, the costing
algorithm attempts to adjust the number of instances (effective
parallelism) of that fragment by comparing the last segment's
ProcessingCost of its child with the production-consumption rate between
its adjacent segments from step I. To simplify this initial
implementation, the parallelism of a PlanFragment containing an
EmptySetNode, UnionNode, or ScanNode remains unchanged (it follows
MT_DOP).

This step is implemented at PlanFragment.traverseEffectiveParallelism().

III. Compute the EDoP of the query.

The effective parallelism of a query is an upper bound on the number of
CPU cores that can work on the query in parallel, considering the
overlap between fragment execution and blocking operators. We compute
this in a post-order traversal similar to step II and split the query
tree into blocking fragment subtrees similar to step I. The following is
an example of a query plan from TPCDS-Q12.

  F04:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
  PLAN-ROOT SINK
  |
  13:MERGING-EXCHANGE [UNPARTITIONED]
  |
  F03:PLAN FRAGMENT [HASH(i_class)] hosts=3 instances=3 (adjusted from 12)
  08:TOP-N [LIMIT=100]
  |
  07:ANALYTIC
  |
  06:SORT
  |
  12:AGGREGATE [FINALIZE]
  |
  11:EXCHANGE [HASH(i_class)]
  |
  F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=12
  05:AGGREGATE [STREAMING]
  |
  04:HASH JOIN [INNER JOIN, BROADCAST]
  |
  |--F05:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
  |  JOIN BUILD
  |  |
  |  10:EXCHANGE [BROADCAST]
  |  |
  |  F02:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
  |  02:SCAN HDFS [tpcds10_parquet.date_dim, RANDOM]
  |
  03:HASH JOIN [INNER JOIN, BROADCAST]
  |
  |--F06:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
  |  JOIN BUILD
  |  |
  |  09:EXCHANGE [BROADCAST]
  |  |
  |  F01:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
  |  01:SCAN HDFS [tpcds10_parquet.item, RANDOM]
  |
  00:SCAN HDFS [tpcds10_parquet.web_sales, RANDOM]

A blocking fragment is a fragment that has a blocking PlanNode or
blocking DataSink in it. The costing algorithm splits the query plan
tree into blocking subtrees divided by blocking fragment boundaries.
Each blocking subtree has a blocking fragment as its root and
non-blocking fragments as intermediate or leaf nodes. From the TPCDS-Q12
example above, the query plan is divided into five blocking subtrees:
[(F05, F02), (F06, F01), F00, F03, F04].

A CoreCount is a container class that represents the CPU core
requirement of a subtree of a query or the query itself. Each blocking
subtree has its fragments' adjusted instance counts summed into a
single CoreCount. This means that all fragments within a blocking
subtree can run in parallel and should be assigned one core per fragment
instance. The CoreCount for each blocking subtree in the TPCDS-Q12
example is [4, 4, 12, 3, 1].

Upon visiting a blocking fragment, the maximum of the current
CoreCount (rooted at that blocking fragment) and the CoreCounts of the
previously visited blocking subtrees is taken, and the algorithm
continues up to the next ancestor PlanFragment. The final CoreCount for
the TPCDS-Q12 example is 12.

This step is implemented at Planner.computeBlockingAwareCores() and
PlanFragment.traverseBlockingAwareCores().
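
Conceptually, fragments within one blocking subtree contribute their
instance counts additively, while blocking subtrees contribute through a
maximum. A Python sketch of the TPCDS-Q12 numbers (illustrative only,
not the Java implementation):

  blocking_subtrees = {
      "F05+F02": [3, 1],   # join build subtree
      "F06+F01": [3, 1],   # join build subtree
      "F00":     [12],
      "F03":     [3],      # adjusted from 12
      "F04":     [1],
  }
  per_subtree = [sum(n) for n in blocking_subtrees.values()]
  assert per_subtree == [4, 4, 12, 3, 1]
  assert max(per_subtree) == 12   # final CoreCount / EDoP of the query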

The resulting CoreCount at the root PlanFragment is then taken as the
ideal CPU core count / EDoP of the query. This number is compared
against the total CPU count of an Impala executor group to determine
whether the query fits to run in that group set. A backend flag,
query_cpu_count_divisor, is added to scale the EDoP of a query down or
up if needed.

Two query options are added to control the entire computation of EDoP.
1. COMPUTE_PROCESSING_COST
   Controls whether this CPU costing algorithm is enabled.
   MT_DOP must also be set > 0 for this query option to take effect.

2. PROCESSING_COST_MIN_THREADS
   Controls the minimum number of fragment instances (threads) that the
   costing algorithm is allowed to adjust. The costing algorithm is in
   charge of increasing the fragment's instance count beyond this
   minimum number through producer-consumer rate comparison. The maximum
   number of fragment instances is the max of PROCESSING_COST_MIN_THREADS,
   MT_DOP, and the number of cores per executor.

This patch also adds three backend flags to tune the algorithm.
1. query_cpu_count_divisor
   Divides the CPU requirement of a query to fit the total available CPU
   in the executor group. For example, setting the value to 2 will fit a
   query with CPU requirement 2X into an executor group with total
   available CPU X. Note that a fractional value less than 1 effectively
   multiplies the query's CPU requirement (see the sketch after this
   list). A valid value is > 0.0. The default value is 1.

2. processing_cost_use_equal_expr_weight
   If true, all expression evaluations are weighted equally to 1 during
   a plan node's processing cost calculation. If false, the expression
   costs from IMPALA-2805 are used. Defaults to true.

3. min_processing_per_thread
   Minimum processing load (in processing cost units) that a fragment
   instance needs to work on before the planner considers increasing the
   instance count based on processing cost rather than the MT_DOP
   setting. The decision is made per fragment. Setting this to a high
   number reduces the parallelism of a fragment (more work per
   instance), while setting it to a low number increases parallelism
   (less work per instance). Actual parallelism might still be
   constrained by the total number of cores in the selected executor
   group, MT_DOP, or the PROCESSING_COST_MIN_THREADS query option. Must
   be a positive integer. Currently defaults to 10M.
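
A rough sketch of the query_cpu_count_divisor scaling referred to in
item 1 above (Python; rounding up is assumed here purely for
illustration):

  import math

  def scaled_cpu_requirement(edop, query_cpu_count_divisor=1.0):
      # query_cpu_count_divisor must be > 0.0; values < 1 inflate the ask.
      return int(math.ceil(edop / query_cpu_count_divisor))

  assert scaled_cpu_requirement(24, 2.0) == 12  # 2X requirement fits in X cores
  assert scaled_cpu_requirement(3, 0.2) == 15   # fractional divisor multiplies the ask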

As an example, the following is additional ProcessingCost information
printed to the coordinator log for Q3, Q12, and Q15, run on TPCDS 10GB
scale with 3 executors, MT_DOP=4, PROCESSING_COST_MAX_THREADS=4, and
processing_cost_use_equal_expr_weight=false.

  Q3
  CoreCount={total=12 trace=F00:12}

  Q12
  CoreCount={total=12 trace=F00:12}

  Q15
  CoreCount={total=15 trace=N07:3+F00:12}

There are a few TODOs which will be done in follow-up tasks:
1. Factor in row width in ProcessingCost calculation (IMPALA-11972).
2. Tune the individual expression cost from IMPALA-2805.
3. Benchmark and tune min_processing_per_thread with an optimal value.
4. Revisit cases where cardinality is not available (getCardinality() or
   getInputCardinality() return -1).
5. Bound SCAN and UNION fragments by ProcessingCost as well (need to
   address IMPALA-8081).

Testing:
- Add TestTpcdsQueryWithProcessingCost, which is a similar run of
  TestTpcdsQuery, but with COMPUTE_PROCESSING_COST=1 and MT_DOP=4.
  Setting log level TRACE for PlanFragment and manually running
  TestTpcdsQueryWithProcessingCost in the minicluster shows several
  fragment instance count reductions from 12 to 9, 6, or 3 in the
  coordinator log.
- Add PlannerTest#testProcessingCost
  Adjusted fragment count is indicated by "(adjusted from 12)" in the
  query profile.
- Add TestExecutorGroups::test_query_cpu_count_divisor.

Co-authored-by: Qifan Chen <qchen@cloudera.com>

Change-Id: Ibb2a796fdf78336e95991955d89c671ec82be62e
Reviewed-on: http://gerrit.cloudera.org:8080/19593
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2023-03-08 15:32:28 +00:00
zhangyifan27
253423f99c IMPALA-11891: Remove empty executor groups
This patch removes executor groups from cluster membership after they
have no executors, so that executor groups' configurations can be
updated without restarting all impalads in the cluster.

Testing:
- Added an e2e test to verify the new functionality.

Change-Id: I480b84b26a780d345216004f1a4657c7b95dda45
Reviewed-on: http://gerrit.cloudera.org:8080/19468
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2023-02-04 01:11:17 +00:00
wzhou-code
43928b190b IMPALA-11617: Pool service should be made aware of cpu core limit
IMPALA-11604 enables the planner to compute CPU usage for certain
queries and to select suitable executor groups to run. The CPU usage is
expressed as the CPU cores required to process a query.

This patch added the CPU core limit, which is the maximum number of
CPU cores available per node and coordinator for each executor group,
to the pool service.

Testing:
 - Passed core run.
 - Verified that CPU cores were shown on the admission and
   metrics pages of the Impala debug web server.

Change-Id: Id4c5ee519ce7c329b06ac821283e215a3560f525
Reviewed-on: http://gerrit.cloudera.org:8080/19366
Reviewed-by: Andrew Sherman <asherman@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-28 01:34:14 +00:00
Yida Wu
839a25c89b IMPALA-11786: Preserve memory for codegen cache
IMPALA-11470 adds support for the codegen cache; however, the admission
controller is not aware of the memory usage of the codegen cache, even
though the codegen cache actually uses memory from the query memory
quota. This could result in query failures when running heavy workloads
once the admission controller has fully admitted queries.

This patch subtracts the codegen cache capacity from the admission
memory limit during initialization, thereby reserving the memory
consumption of the codegen cache from the beginning and treating it as
separate memory, independent of the query memory reservation.

It also reduces the max codegen cache memory from 20 percent to 10
percent, and updates some test cases that failed due to the reduction of
the admission memory limit.
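
A rough arithmetic sketch of the change (Python; identifiers are
illustrative, and the 10 percent is assumed here to be taken from the
admission memory limit):

  admit_mem_limit = 8 * 1024 ** 3                   # e.g. 8 GB at startup
  codegen_cache_capacity = admit_mem_limit // 10    # max cache, now 10% (was 20%)
  admit_mem_limit -= codegen_cache_capacity         # carve the cache out up front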

Tests:
Passed exhaustive tests.

Change-Id: Iebdc04ba1b91578d74684209a11c815225b8505a
Reviewed-on: http://gerrit.cloudera.org:8080/19377
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-06 06:28:07 +00:00
Wenzhe Zhou
1b59d32eff Revert "IMPALA-11617: Pool service should be made aware of processing cost limit"
This reverts commit 1d62bddb84.

Change-Id: I1ebf5ff9685072079e18497d869d06b2c55153fe
Reviewed-on: http://gerrit.cloudera.org:8080/19139
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2022-10-14 04:01:06 +00:00
wzhou-code
1d62bddb84 IMPALA-11617: Pool service should be made aware of processing cost limit
IMPALA-11604 enables the planner to compute CPU usage for certain
queries and to select suitable executor groups to run. The CPU usage is
expressed as the total amount of data to be processed per query.

This patch added the processing cost limit, which is the total amount of
data that each executor group can handle, to the pool service.

Testing:
 - Passed core run.
 - Verified that processing costs were shown on the admission and
   metrics pages of the Impala debug web server.

Change-Id: I9bd2a7284eda47a969ef91e4be19f96d2af53449
Reviewed-on: http://gerrit.cloudera.org:8080/19121
Reviewed-by: Qifan Chen <qchen@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-10-12 23:46:57 +00:00
Michael Smith
11157a8701 IMPALA-10889: Allow extra 1MB for fragment cancellation
After queries are cancelled, it can take some time (>30s in some
instances) to fully cancel all fragment instances and fully reclaim
reserved memory. The test and query limits were exactly matched, so any
extra reservation would prevent scheduling, causing the test to
frequently time out. With the fix, 1MB of extra memory is reserved to
break the tie, thus avoiding the timeout. The extra 1MB of memory can be
seen in logs printing agg_mem_reserved.

Rather than extend timeouts and make the test run longer, add a small
buffer to the admission limit to allow for fragment instance cleanup
while the test runs.

Change-Id: Iaee557ad87d3926589b30d6dcdd850e9af9b3476
Reviewed-on: http://gerrit.cloudera.org:8080/19092
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-10-11 00:18:07 +00:00
Riza Suminto
f932d78ad0 IMPALA-11123: Optimize count(star) for ORC scans
This patch provides count(star) optimization for ORC scans, similar to
the work done in IMPALA-5036 for Parquet scans. We use the stripes'
num-rows statistics when computing the count star instead of
materializing empty rows. The aggregate function is changed from a count
to a special sum function initialized to 0.
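
A conceptual Python sketch of the optimization (illustrative names; the
actual change is in the C++ scanner and aggregate functions):

  def count_star_from_stripe_stats(stripe_num_rows):
      total = 0                    # the special SUM starts at 0
      for num_rows in stripe_num_rows:
          total += num_rows        # add each ORC stripe's row count
      return total

  assert count_star_from_stripe_stats([1000, 2500, 750]) == 4250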

This count(star) optimization is disabled for full ACID tables
because the scanner might need to read and validate the
'currentTransaction' column in the table's special schema.

This patch drops 'parquet' from names related to the count star
optimization. It also improves the count(star) operation in general by
serving the result just from the file's footer stats for both Parquet
and ORC. We unify the optimized count star and zero slot scan functions
into HdfsColumnarScanner.

The following table shows a performance comparison before and after the
patch. The primitive_count_star query targets the tpch10_parquet.lineitem
table (10GB scale TPC-H). Meanwhile, count_star_parq and count_star_orc
are modified primitive_count_star queries that target the
tpch_parquet.lineitem and tpch_orc_def.lineitem tables respectively.

+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload          | Query                | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| tpch_parquet      | count_star_parq      | parquet / none / none | 0.06   | 0.07        |   -10.45%  |   2.87%    | * 25.51% *     | 9     |   -1.47%       | -1.26   | -1.22 |
| tpch_orc_def      | count_star_orc       | orc / def / none      | 0.06   | 0.08        |   -22.37%  |   6.22%    | * 30.95% *     | 9     |   -1.85%       | -1.16   | -2.14 |
| TARGETED-PERF(10) | primitive_count_star | parquet / none / none | 0.06   | 0.08        | I -30.40%  |   2.68%    | * 29.63% *     | 9     | I -7.20%       | -2.42   | -3.07 |
+-------------------+----------------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

Testing:
- Add PlannerTest.testOrcStatsAgg
- Add TestAggregationQueries::test_orc_count_star_optimization
- Exercise count(star) in TestOrc::test_misaligned_orc_stripes
- Pass core tests

Change-Id: I0fafa1182f97323aeb9ee39dd4e8ecd418fa6091
Reviewed-on: http://gerrit.cloudera.org:8080/18327
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-04-05 13:27:10 +00:00
Qifan Chen
07a3e6e0df IMPALA-10992 Planner changes for estimate peak memory
This patch provides replan support for multiple executor group sets.
Each executor group set is associated with a distinct number of nodes
and a threshold for estimated memory per host in bytes that can be
denoted as [<group_name_prefix>:<#nodes>, <threshold>].

In the patch, a query of type EXPLAIN, QUERY or DML can be compiled
more than once. In each attempt, per-host memory is estimated and
compared with the threshold of an executor group set. If the estimated
memory is no more than the threshold, the iteration process terminates
and the final plan is determined. The executor group set with that
threshold is selected to run the query.
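
A conceptual sketch (Python) of this replan loop; the names are
illustrative and the real logic lives in the Java frontend:

  def select_executor_group_set(estimate_per_host_mem, group_sets):
      # group_sets: [(name_prefix, num_nodes, mem_threshold_bytes), ...]
      # ordered from the smallest to the largest threshold.
      for prefix, num_nodes, threshold in group_sets:
          if estimate_per_host_mem(num_nodes) <= threshold:
              return prefix        # final plan determined; stop iterating
      return None                  # no set fits (handled elsewhere)

  # Example using the artificial test setup described further below:
  sets = [("regular", 3, 64 * 1024 ** 2), ("large", 3, 8 * 1024 ** 5)]
  assert select_executor_group_set(lambda n: 300 * 1024 ** 2, sets) == "large"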

A new query option 'enable_replan', defaulting to 1 (enabled), is
added. It can be set to 0 to disable this feature and to generate the
distributed plan for the default executor group.

To avoid long compilation times, the following enhancements are
enabled. Note that 1) can be disabled when a relevant metadata change is
detected.

 1. Authorization is performed only for the 1st compilation;
 2. openTransaction() is called for transactional queries in the 1st
    compilation and the saved transactional info is used in
    subsequent compilations. Similar logic is applied to Kudu
    transactional queries.

To facilitate testing, the patch imposes an artificial two-executor-group
setup in the FE as follows.

 1. [regular:<#nodes>, 64MB]
 2. [large:<#nodes>, 8PB]

This setup is enabled when a new query option 'test_replan' is set
to 1 in backend tests, or RuntimeEnv.INSTANCE.isTestEnv() is true as
in most frontend tests. This query option is set to 0 by default.

Compilation time increases when a query is compiled in several
iterations, as shown below for several TPCDS queries. The increase
is mostly due to redundant work in either single-node plan creation
or recomputing the value transfer graph. For small queries, the
increase can be avoided if they can be compiled in a single iteration
by properly setting the smallest threshold among all executor group
sets. For example, for the set of queries listed below, the smallest
threshold can be set to 320MB to catch both q15 and q21 in one
compilation.

                              Compilation time (ms)
Queries	 Estimated Memory   2-iterations  1-iteration  Percentage of
                                                         increase
 q1         408MB              60.14         25.75       133.56%
 q11	   1.37GB             261.00        109.61       138.11%
 q10a	    519MB             139.24         54.52       155.39%
 q13	    339MB             143.82         60.08       139.38%
 q14a	   3.56GB             762.68        312.92       143.73%
 q14b	   2.20GB             522.01        245.13       112.95%
 q15	    314MB               9.73          4.28       127.33%
 q21	    275MB              16.00          8.18        95.59%
 q23a	   1.50GB             461.69        231.78        99.19%
 q23b	   1.34GB             461.31        219.61       110.05%
 q4	   2.60GB             218.05        105.07       107.52%
 q67	   5.16GB             694.59        334.24       101.82%

Testing:
 1. Almost all FE and BE tests are now run in the artificial two
    executor setup except a few where a specific cluster configuration
    is desirable;
 2. Ran core tests successfully;
 3. Added a new observability test and a new query assignment test;
 4. Disabled concurrent insert test (test_concurrent_inserts) and
    failing inserts (test_failing_inserts) test in local catalog mode
    due to flakiness. Reported both in IMPALA-11189 and IMPALA-11191.

Change-Id: I75cf17290be2c64fd4b732a5505bdac31869712a
Reviewed-on: http://gerrit.cloudera.org:8080/18178
Reviewed-by: Qifan Chen <qchen@cloudera.com>
Tested-by: Qifan Chen <qchen@cloudera.com>
2022-03-21 20:17:28 +00:00
Bikramjeet Vig
0aedf6021f IMPALA-11063: Add metrics to expose state of each executor group set
This adds metrics for each executor group set that expose the number
of executor groups, the number of healthy executor groups and the
total number of backends associated with that group set.

Testing:
Added an e2e test to verify metrics are updated correctly.

Change-Id: Ib39f940de830ef6302785aee30eeb847fa5deeba
Reviewed-on: http://gerrit.cloudera.org:8080/18142
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-01-20 09:46:32 +00:00
Bikramjeet Vig
65c3b784a2 IMPALA-11033: Add support for specifying multiple executor group sets
This patch introduces the concept of executor group sets. Each group
set specifies an executor group name prefix and an expected group
size (the number of executors in each group). Every executor group
that is a part of this set will have the same prefix which will
also be equivalent to the resource pool name that it maps to.
These sets are specified via a startup flag
'expected_executor_group_sets' which is a comma separated list in
the format <executor_group_name_prefix>:<expected_group_size>.
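
For illustration, a tiny Python sketch of parsing that flag format
(hypothetical values; the real parsing happens in the startup code):

  def parse_expected_executor_group_sets(flag_value):
      sets = []
      for entry in flag_value.split(","):
          prefix, size = entry.rsplit(":", 1)
          sets.append((prefix, int(size)))
      return sets

  assert parse_expected_executor_group_sets("root.small:2,root.large:10") == [
      ("root.small", 2), ("root.large", 10)]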

Testing:
- Added unit tests

Change-Id: I9e0a3a5fe2b1f0b7507b7c096b7a3c373bc2e684
Reviewed-on: http://gerrit.cloudera.org:8080/18093
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-01-08 09:26:37 +00:00