153 Commits

Author SHA1 Message Date
Yida Wu
0c9fe293c3 IMPALA-14612: Add global metrics for admission state map size
We need better observability for the admission state map to warn
about potential memory leaks.

The admission state map tracks queries currently being processed or
queued. An entry is added when a query is submitted for admission.
The entry is removed when the query finishes execution, is rejected
by admission control, times out while queuing, or is cancelled. If
the removal logic is missed due to bugs, the map size grows
indefinitely, causing a memory leak. We have observed cases where
admission state entries were not released, causing memory leaks in
admissiond.

Adds the metric admission-control-service.num-queries and its high
water mark to track the number of active entries. This patch updates
GenericShardedQueryMap to support an optional
AtomicHighWaterMarkGauge. When set, the map automatically increments
or decrements the gauge during Add and Delete operations. This
ensures the metric accurately reflects the map size without requiring
manual updates at every call site.
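
The gauge hook described above can be illustrated with a minimal Python
sketch. The real GenericShardedQueryMap and AtomicHighWaterMarkGauge are
C++ classes; the names below (HighWaterMarkGauge, ShardedMap, add,
delete) are hypothetical stand-ins, and sharding is reduced to a list of
dicts guarded by per-shard locks.

```python
import threading


class HighWaterMarkGauge:
    """Tracks a current value and the largest value seen (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.high_water_mark = 0

    def increment(self, delta):
        with self._lock:
            self.current += delta
            self.high_water_mark = max(self.high_water_mark, self.current)


class ShardedMap:
    """Toy sharded map that updates an optional gauge on add and delete."""

    def __init__(self, num_shards=4, gauge=None):
        self._shards = [{} for _ in range(num_shards)]
        self._locks = [threading.Lock() for _ in range(num_shards)]
        self._gauge = gauge

    def _index(self, key):
        return hash(key) % len(self._shards)

    def add(self, key, value):
        i = self._index(key)
        with self._locks[i]:
            if key in self._shards[i]:
                return False
            self._shards[i][key] = value
        # The gauge is only touched when an entry was really inserted.
        if self._gauge is not None:
            self._gauge.increment(1)
        return True

    def delete(self, key):
        i = self._index(key)
        with self._locks[i]:
            if key not in self._shards[i]:
                return False
            del self._shards[i][key]
        # The gauge is only touched when an entry was really removed.
        if self._gauge is not None:
            self._gauge.increment(-1)
        return True
```

Because the map owns the gauge updates, callers cannot forget to adjust
the metric, which is the property the commit relies on.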

Tests:
Updated and passed test_admission_state_map_mem_leak to verify the
metrics.

Change-Id: Ie803aabf8d91b6381c5d0d7534cd9c9fc2166a73
Reviewed-on: http://gerrit.cloudera.org:8080/23760
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-12-19 05:10:31 +00:00
Yida Wu
1bc7cdbff6 IMPALA-14493: Cap memory usage of global admission service
The global admission service can experience OOM errors under
high concurrency because its process memory tracker is inaccurate
and doesn't account for all memory allocations.

Ensuring the memory tracker accurately accounts for every allocation
could be difficult, so this patch uses a simpler solution: it
introduces a hard memory cap based on tcmalloc statistics, which
accurately reflect the true process memory usage. If a new query
is submitted while tcmalloc memory usage is over the process
limit, the query is rejected immediately to protect against OOM.

Adds a new flag enable_admission_service_mem_safeguard allowing
this feature to be enabled or disabled. By default, this feature is
turned on.
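
The admission-time check can be sketched in Python as follows. This is
only an illustration of the decision: the function name and parameters
are hypothetical, and bytes_in_use stands in for the tcmalloc-reported
usage, whose retrieval is outside the sketch.

```python
def should_reject_for_memory(bytes_in_use, process_mem_limit,
                             safeguard_enabled=True):
    """Return True when a new admission request should be rejected.

    bytes_in_use is assumed to come from allocator statistics rather
    than from the process memory tracker.
    """
    if not safeguard_enabled:
        # Mirrors turning the safeguard flag off: never reject here.
        return False
    return bytes_in_use > process_mem_limit
```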

Tests:
Added test test_admission_service_low_mem_limit.
Passed exhaustive tests.

Change-Id: I2ee2c942a73fcd69358851fc2fdc0fc4fe531c73
Reviewed-on: http://gerrit.cloudera.org:8080/23542
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-10-23 12:13:11 +00:00
Yida Wu
859c9c1f66 IMPALA-14276: Fix memory leak by removing AdmissionState on rejection
Normally, AdmissionState entries in admissiond are cleaned up when
a query is released. However, for requests that are rejected, the
query release path is never called, so their AdmissionState was not
removed from admission_state_map_, resulting in a memory leak over
time.

This leak was less noticeable because AdmissionState entries were
relatively small. However, when admissiond is run as a standalone
process, each AdmissionState includes a profile sidecar, which
can be large, making the leak much more noticeable.

This change adds logic to remove AdmissionState entries when the
admission request is rejected.
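
A minimal Python sketch of the cleanup on the rejection path, assuming a
plain dict in place of admission_state_map_ and a hypothetical try_admit
callable that returns False on rejection:

```python
def admit(admission_state_map, query_id, state, try_admit):
    """Register a request and drop its entry again if it is rejected.

    try_admit is a hypothetical callable returning True when the query
    is admitted and False when it is rejected.
    """
    admission_state_map[query_id] = state
    admitted = try_admit(state)
    if not admitted:
        # Rejected requests never reach the release path, so the entry
        # has to be removed here or it stays in the map forever.
        admission_state_map.pop(query_id, None)
    return admitted
```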

Testing:
Added test_admission_state_map_mem_leak as a regression test.

Change-Id: I9fba4f176c648ed7811225f7f94c91342a724d10
Reviewed-on: http://gerrit.cloudera.org:8080/23257
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-08-07 20:26:58 +00:00
Yida Wu
59fdd7169a IMPALA-10866: Add testcases for failure cases involving the admission service
The admission service uses the statestore as the only source of
truth to determine whether a coordinator is down. If the statestore
reports a coordinator is down, all running and queued queries
associated with it should be cancelled or rejected.

In IMPALA-12057, we introduced logic to reject queued queries if
the corresponding coordinator has been removed, along with tests
for that behavior.

This patch adds additional test cases to cover other failure
scenarios, such as the coordinator or the statestore going down
with running queries, and verifies that the behavior is as expected
in each case.

Tests:
Passed exhaustive tests.

Change-Id: If617326cbc6fe2567857d6323c6413d98c92d009
Reviewed-on: http://gerrit.cloudera.org:8080/23217
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-08-05 06:43:41 +00:00
Yida Wu
8d56eea725 IMPALA-12057: Track removed coordinators to reject queued queries early
Queries in global admission control can remain queued for a long time
if they are assigned to a coordinator that has already left the
cluster. Admissiond can't distinguish between a coordinator that
hasn’t yet been propagated via the statestore and one that has
already been removed, resulting in unnecessary waiting until timeout.
This timeout is determined by either FLAGS_queue_wait_timeout_ms or
the queue_timeout_ms in the pool config. By default,
FLAGS_queue_wait_timeout_ms is 1 minute, but in production it's
normally configured to 10 to 15 minutes.

This change tracks recently removed coordinators and rejects such
queued queries immediately using REASON_COORDINATOR_REMOVED.
To ensure the removed coordinator list remains simple and bounded,
it avoids duplicate entries and enforces FIFO eviction at
the minimum of MAX_REMOVED_COORD_SIZE (1000) and
FLAGS_cluster_membership_retained_removed_coords.
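
The bounded, duplicate-free FIFO list can be sketched in Python as
below. The class and parameter names are hypothetical; the real logic
lives in C++. Whether a duplicate insert refreshes an entry's position
is an assumption here: this sketch leaves the existing entry in place.

```python
from collections import OrderedDict


class RemovedCoordinators:
    """Bounded, duplicate-free record of removed coordinators."""

    def __init__(self, hard_cap, configured_cap):
        # Effective capacity is the smaller of the hard cap and the
        # configured flag value.
        self._capacity = min(hard_cap, configured_cap)
        self._entries = OrderedDict()

    def add(self, backend_id):
        if self._capacity <= 0 or backend_id in self._entries:
            return  # Keep the existing entry and its position as-is.
        if len(self._entries) >= self._capacity:
            self._entries.popitem(last=False)  # Evict the oldest entry.
        self._entries[backend_id] = True

    def __contains__(self, backend_id):
        return backend_id in self._entries

    def __len__(self):
        return len(self._entries)
```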

It's possible that a coordinator marked as removed comes back
with the same backend id. In that case, admissiond will see it in
current_backends and won't need to check the removed list. Even
if a coordinator briefly flaps and a request is rejected, it's not
critical because the coordinator can retry. So to keep the design
simple and safe, we keep the removed coord entry as-is.

Added a parameter is_admissiond to the ClusterMembershipMgr
constructor to indicate whether it is running within the admissiond.

Tests:
Passed exhaustive tests.
Added unit tests to verify the eviction logic and the duplicate
case.
Added regression test test_coord_not_registered_in_ac.

Change-Id: I1e0f270299f8c20975d7895c17f4e2791c3360e0
Reviewed-on: http://gerrit.cloudera.org:8080/23094
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-07-13 05:37:31 +00:00
Riza Suminto
013bcc127f IMPALA-14163: (Addendum) Always reset max-query-mem-limit
test_pool_config_change_while_queued now consistently passes in
TestAdmissionController and fails in
TestAdmissionControllerWithACService. The root cause is that
copy-mem-limit-test-llama-site.xml is only copied once for both
tests. TestAdmissionController left max-query-mem-limit of
invalidTestPool at 25MB without resetting it back to 0, which then
caused the test failure in TestAdmissionControllerWithACService.

This patch improves the test by always setting max-query-mem-limit of
invalidTestPool to 0 at both the beginning and the end of the test.
Changed ResourcePoolConfig to use mem_limit_coordinators and
mem_limit_executors because, unlike the mem_limit option, they are not
subject to pool-level memory clamping. Disabled the
--clamp_query_mem_limit_backend_mem_limit flag so that
coord_backend_mem_limit is not clamped to the coordinator's process
limit.

Removed make_copy parameter in test_pool_mem_limit_configs since it does
not mutate the config files.

Added more log details in admission-controller.cc to help make better
association.

Testing:
- Loop and pass the test in ARM build.

Change-Id: I41f671b8fb3eabf263041a834b54740fbacda68e
Reviewed-on: http://gerrit.cloudera.org:8080/23106
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-07-03 02:39:11 +00:00
Riza Suminto
d36df0eb88 IMPALA-14163: Raise test_pool_config_change_while_queued MEM_LIMIT
test_pool_config_change_while_queued hit a timeout in
TestAdmissionControllerWithACService. When running this test locally, we
noticed that some trigger queries ("select 'wait_for_config_change'")
passed when they were expected to be rejected (hit EXCEPTION during
admission).

This patch raises the MEM_LIMIT to 128GB to ensure rejection.
It also adds wait_for_admission_control(), which should return
immediately once the trigger query hits an exception. Removed redundant
"set enable_trivial_query_for_admission=false" query in
test_pool_config_change_while_queued.

Testing:
- Loop the test a couple of times and confirm that all trigger query
  executions hit an exception.

Change-Id: Iee808d0fc92308604ed0ee27dde795e9aa69eb5d
Reviewed-on: http://gerrit.cloudera.org:8080/23072
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-06-25 02:39:41 +00:00
Mihaly Szjatinya
e0cb533c25 IMPALA-13912: Use SHARED_CLUSTER_ARGS in more custom cluster tests
In addition to IMPALA-13503 which allowed having the single cluster
running for the entire test class, this attempts to minimize restarting
between the existing tests without modifying any of their code.

This changeset saves the command line with which
'start-impala-cluster.py' has been run and skips the restarting if the
command line is the same for the next test.

Some tests however do require restart due to the specific metrics being
tested. Such tests are defined with the 'force_restart' flag within the
'with_args' decorator. NOTE: there might be more tests like that
revealed after running the tests in different order resulting in test
failures.
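
The restart-skipping idea can be sketched in Python as below. This is
not the test framework's code: ClusterStarter, start_fn, and
ensure_started are hypothetical names, and only the comparison of the
saved command line plus a force_restart override is modelled.

```python
class ClusterStarter:
    """Skips a restart when the requested command line is unchanged."""

    def __init__(self, start_fn):
        # start_fn is a hypothetical callable that (re)starts the
        # cluster with the given argument list.
        self._start_fn = start_fn
        self._last_cmd = None

    def ensure_started(self, cmd, force_restart=False):
        cmd = tuple(cmd)
        if not force_restart and cmd == self._last_cmd:
            return False  # Reuse the cluster that is already running.
        self._start_fn(list(cmd))
        self._last_cmd = cmd
        return True
```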

Experimentally, this results in ~150 fewer restarts, mostly coming from
restarts between tests. As for restarts between different variants of
the same test, most of the cluster tests are restricted to single
variant, although multi-variant tests occur occasionally.

Change-Id: I7c9115d4d47b9fe0bfd9dbda218aac2fb02dbd09
Reviewed-on: http://gerrit.cloudera.org:8080/22901
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-06-19 17:48:25 +00:00
Joe McDonnell
71b47dfdb4 IMPALA-14103: Fix TestAdmissionControllerStress on Python 3
TestAdmissionControllerStress has an invalid except clause
where it catches Exception as well as ImpalaHiveServer2Service.
This is an error on Python 3, because ImpalaHiveServer2Service
is not an exception class. This changes the except clause to
only catch Exception.
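
The failure mode can be reproduced with a small self-contained Python 3
example. NotAnException is a hypothetical stand-in for a class that does
not derive from BaseException; it is not the class from the test.

```python
class NotAnException:
    """Stand-in for a class that does not derive from BaseException."""


def run_with_bad_except():
    try:
        raise ValueError("boom")
    except (Exception, NotAnException):
        # Python 3 raises TypeError while evaluating this clause,
        # because one member of the tuple is not an exception class.
        return "handled"


def run_with_fixed_except():
    try:
        raise ValueError("boom")
    except Exception:
        return "handled"
```

The invalid clause only fails once an exception actually reaches it,
which is why such a bug can stay hidden until the error path runs.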

Testing:
 - Ran TestAdmissionControllerStress locally

Change-Id: Iefe9306cd6b76bd27ca5be1d62b05aff1e5deafe
Reviewed-on: http://gerrit.cloudera.org:8080/22954
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-05-28 15:55:52 +00:00
Riza Suminto
9000c83efc IMPALA-13971: Deflake TestAdmissionController.test_user_loads_rules
TestAdmissionController.test_user_loads_rules is flaky for not failing
the last query that should exceed the user quota. The test executes
queries in a round-robin fashion across all impalad. These impalads are
expected to synchronize user quota count through statestore updates.

This patch attempts to deflake the test by raising the heartbeat wait
time from 1 heartbeat period to 3 heartbeat periods. It also changes the
reject query to a fast version of SLOW_QUERY (without sleep) so the test
can fail fast if it is not rejected.

Testing:
Loop the test 50 times and pass them all.

Change-Id: Ib2ae8e1c2edf174edbf0e351d3c2ed06a0539f08
Reviewed-on: http://gerrit.cloudera.org:8080/22787
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-04-18 15:27:22 +00:00
Riza Suminto
648209b172 IMPALA-13967: Move away from setting user parameter in execute
ImpalaConnection.execute and ImpalaConnection.execute_async have 'user'
parameter to set specific user to run the query. This is mainly legacy
of BeeswaxConnection, which allows using 1 client to run queries under
different usernames.

BeeswaxConnection and ImpylaHS2Connection actually allow specifying one
user per client. Doing so will simplify user-specific tests such as
test_ranger.py that often instantiates separate clients for admin user
and regular user. There is no need to specify the 'user' parameter
anymore when calling execute() or execute_async(), which reduces
potential bugs from forgetting to set it or setting it incorrectly.

This patch applies one-user-per-client practice as much as possible for
test_ranger.py, test_authorization.py, and test_admission_controller.py.
Unused code and pytest fixtures are removed. Few flake8 issues are
addressed too. Their default_test_protocol() is overridden to return
'hs2'.

ImpylaHS2Connection.execute() and ImpylaHS2Connection.execute_async()
are slightly modified to assume ImpylaHS2Connection.__user if the 'user'
parameter is None. BeeswaxConnection remains unchanged.
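
The fallback can be illustrated with a toy Python class. Connection and
its attributes are hypothetical and do not reproduce the real
ImpylaHS2Connection; the sketch only shows the "use the connection's
user when none is given" rule.

```python
class Connection:
    """Toy client that binds one user per connection."""

    def __init__(self, user):
        self._user = user

    def execute(self, query, user=None):
        # Fall back to the connection's own user when none is given.
        effective_user = self._user if user is None else user
        return (effective_user, query)
```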

Extend ImpylaHS2ResultSet.__convert_result_value() to lower case boolean
return value to match beeswax result.

Testing:
Run and pass all modified tests in exhaustive exploration.

Change-Id: I20990d773f3471c129040cefcdff1c6d89ce87eb
Reviewed-on: http://gerrit.cloudera.org:8080/22782
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-04-18 15:27:22 +00:00
Joe McDonnell
c5a0ec8bdf IMPALA-11980 (part 1): Put all thrift-generated python code into the impala_thrift_gen package
This puts all of the thrift-generated python code into the
impala_thrift_gen package. This is similar to what Impyla
does for its thrift-generated python code, except that it
uses the impala_thrift_gen package rather than impala._thrift_gen.
This is a preparatory patch for fixing the absolute import
issues.

This patches all of the thrift files to add the python namespace.
This has code to apply the patching to the thirdparty thrift
files (hive_metastore.thrift, fb303.thrift) to do the same.

Putting all the generated python into a package makes it easier
to understand where the imports are getting code. When the
subsequent change rearranges the shell code, the thrift generated
code can stay in a separate directory.

This uses isort to sort the imports for the affected Python files
with the provided .isort.cfg file. This also adds an impala-isort
shell script to make it easy to run.

Testing:
 - Ran a core job

Change-Id: Ie2927f22c7257aa38a78084efe5bd76d566493c0
Reviewed-on: http://gerrit.cloudera.org:8080/20169
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-04-15 17:03:02 +00:00
Csaba Ringhofer
f98b697c7b IMPALA-13929: Make 'functional-query' the default workload in tests
This change adds get_workload() to ImpalaTestSuite and removes it
from all test suites that already returned 'functional-query'.
get_workload() is also removed from CustomClusterTestSuite which
used to return 'tpch'.
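
The effect of moving the default into the base class can be sketched in
Python. The class names are hypothetical, and modelling get_workload()
as a classmethod is an assumption of this sketch.

```python
class BaseSuite:
    """Base class that supplies the default workload."""

    @classmethod
    def get_workload(cls):
        return 'functional-query'


class PlainSuite(BaseSuite):
    """Inherits the default, so no get_workload() override is needed."""


class TpchSuite(BaseSuite):
    """Suites that need a different workload still override it."""

    @classmethod
    def get_workload(cls):
        return 'tpch'
```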

All other changes besides impala_test_suite.py and
custom_cluster_test_suite.py are just mass removals of
get_workload() functions.

The behavior is only changed in custom cluster tests that didn't
override get_workload(). By returning 'functional-query' instead
of 'tpch', exploration_strategy() will no longer return 'core' in
'exhaustive' test runs. See IMPALA-3947 on why workload affected
exploration_strategy. An example for affected test is
TestCatalogHMSFailures which was skipped both in core and exhaustive
runs before this change.

get_workload() functions that return a different workload than
'functional-query' are not changed - it is possible that some of
these also don't handle exploration_strategy() as expected, but
individually checking these tests is out of scope in this patch.

Change-Id: I9ec6c41ffb3a30e1ea2de773626d1485c69fe115
Reviewed-on: http://gerrit.cloudera.org:8080/22726
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-04-08 07:12:55 +00:00
Riza Suminto
00dc79adf6 IMPALA-13907: Remove reference to create_beeswax_client
This patch replaces create_beeswax_client() references with
create_hs2_client() or vector-based client creation to prepare for the
hs2 test migration.

test_session_expiration_with_queued_query is changed to use impala.dbapi
directly from Impyla due to limitation in ImpylaHS2Connection.

TestAdmissionControllerRawHS2 is migrated to use hs2 as default test
protocol.

Modify test_query_expiration.py to set query option through client
instead of SET query. test_query_expiration is slightly modified due to
behavior difference in hs2 ImpylaHS2Connection.

Remove remaining reference to BeeswaxConnection.QueryState.

Fixed a bug in ImpylaHS2Connection.wait_for_finished_timeout().

Fix some easy flake8 issues caught through this command:
git show HEAD --name-only | grep '^tests.*py' \
  | xargs -I {} impala-flake8 {} \
  | grep -e U100 -e E111 -e E301 -e E302 -e E303 -e F...

Testing:
- Pass exhaustive tests.

Change-Id: I1d84251835d458cc87fb8fedfc20ee15aae18d51
Reviewed-on: http://gerrit.cloudera.org:8080/22700
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-29 18:37:45 +00:00
Riza Suminto
7f38c7ed61 IMPALA-13890: Deflake test_coord_only_pool_one_quiescing_coord
test_coord_only_pool_one_quiescing_coord was flaky in UBSAN build for
not finding the pid of the quiescing coordinator. It is possible that
the quiescing coordinator was still alive when the test method arrived
inside the ImpalaCluster.graceful_shutdown_impalads() loop. This patch
attempts to deflake it by additionally calling
coord_to_quiesce.wait_for_exit() at the end of the test method. The test
method is renamed to test_coord_only_pool_one_coord_quiescing to share
a prefix with test_coord_only_pool_one_coord_terminate.

This patch also hardens test_coord_only_pool_one_coord_terminate by
applying the same change.

Testing:
Loop the following test command 10 times in UBSAN build:
impala-py.test --exploration=exhaustive \
  -k test_coord_only_pool_one_coord \
  custom_cluster/test_admission_controller.py

Change-Id: I34e369a6a6eb77ef95e7526c17028bf6b8b04172
Reviewed-on: http://gerrit.cloudera.org:8080/22673
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-26 22:28:11 +00:00
Riza Suminto
ea8f74a6ac IMPALA-13861: Standardize workload management tests
This patch standardizes tests against workload management tables
(sys.impala_query_log and sys.impala_query_live) to use a common
superclass named WorkloadManagementTestSuite. The setup_method of this
superclass waits for workload management init completion
(wait_for_wm_init_complete()), while the teardown_method waits until
impala-server.completed-queries.queued metric reaches
0 (wait_for_wm_idle()).

test_query_log.py and test_workload_mgmt_sql_details.py are refactored
to extend from WorkloadManagementTestSuite. Tests to assert the query
log table flush behavior are grouped together in TestQueryLogTableFlush.
test_workload_mgmt_sql_details.py::TestWorkloadManagementSQLDetails now
uses 1 minicluster instance for all tests.

test_workload_mgmt_init.py does not extend from
WorkloadManagementTestSuite because it is testing cluster start and
restart scenarios. This patch only adds wait_for_wm_idle() at
teardown_method where it makes sense to do so.

test_query_live.py does not extend from WorkloadManagementTestSuite
because most of its test methods require a long
--query_log_write_interval_s so that DML queries from the workload
management worker do not disturb sys.impala_query_live.

workload_mgmt parameter in CustomClusterTestSuite.with_args() is
standardized to setup appropriate default flags in cluster_setup()
rather than passing it down to _start_impala_cluster():
IMPALAD_ARGS
  --enable_workload_mgmt=true --query_log_write_interval_s=1 \
  --shutdown_grace_period_s=0 --shutdown_deadline_s=60
and CATALOGD_ARGS
  --enable_workload_mgmt=true

Note that IMPALAD_ARGS and CATALOGD_ARGS flags added by workload_mgmt
and impalad_graceful_shutdown parameter are still overridable to
different value by explicitly adding it in the impalad_args and
catalogd_args parameters. Setting workload_mgmt=True now automatically
enables graceful shutdown for the test. Thus,
impalad_graceful_shutdown=True is now removed.

With beeswax protocol deprecated, this patch also changes the protocol
under test from beeswax to hs2. TestQueryLogTableBeeswax is now renamed
to TestQueryLogTableBasic.

Additionally, print total wait time in wait_for_metric_value().

Testing:
- Run modified tests and pass.

Change-Id: Iecf6452fa963304e263805ebeb017c843d17dd16
Reviewed-on: http://gerrit.cloudera.org:8080/22617
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-21 22:31:11 +00:00
Riza Suminto
95c52aa1a0 IMPALA-13874: Fix typo in test_coord_only_pool_exec_groups
Graceful shutdown following test_coord_only_pool_exec_groups completion
produces a FATAL log for trying to remove a non-existing backend from
the per-host list. This is because the test has a typo in the executor
group name (should be 'large' instead of 'small'). This patch fixes the
typo.

Testing:
Manually run test_coord_only_pool_exec_groups exhaustively. Verified in
all runs that such a fatal log or warning log from
ExecutorGroup::CheckConsistencyOrWarn does not show up again.

Change-Id: I6331e000c8c6481b1e569691f54fa2b7e3f3605a
Reviewed-on: http://gerrit.cloudera.org:8080/22637
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-18 22:55:42 +00:00
Riza Suminto
514ecef3cb IMPALA-13860: Fix DCHECK hit in cluster-membership-mgr.cc
Enabling graceful shutdown in test_coord_only_pool_exec_groups reveals a
DCHECK hit caused by two back-to-back calls to RemoveExecutorAndGroup
during graceful shutdown. This patch fixes it by turning the DCHECK into
an if and VLOG(1).

Testing:
- Pass test_coord_only_pool_exec_groups with graceful shutdown after the
  change without any FATAL log.

Change-Id: If678ae472bade50c18842df9e98c536fb9f1fe9c
Reviewed-on: http://gerrit.cloudera.org:8080/22620
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-14 06:12:48 +00:00
Riza Suminto
1164eb5626 IMPALA-13827: Deflake test_user_loads_propagate
TestAdmissionController.test_user_loads_propagate has been flaky for not
finding the expected username metric in the first impalad. This patch
attempts to deflake the test by raising the wait time between running
the query and the metric check. The order of the metric checks is also
reversed to give slightly more time for the first impalad to hear about
the query running in the second impalad.

Testing:
- Loop and pass the test 50 times. Before the patch, the test failed
  within 10 iterations.

Change-Id: I6920c7bc9ba1a9fc646aaf0483a1a72608a2a90e
Reviewed-on: http://gerrit.cloudera.org:8080/22584
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-06 05:50:59 +00:00
Riza Suminto
daaf73a7c2 IMPALA-13682: Implement missing capabilities in ImpylaHS2Connection
This patch implements 'wait_for_finished_timeout',
'wait_for_admission_control', and 'get_admission_result' for
ImpylaHS2Client.

This patch also changes the behavior of ImpylaHS2Connection to produce
several extra cursors aside from self.__cursor for 'execute' call that
supplies user argument and each 'execute_async' to make issuing multiple
concurrent queries possible. Note that each HS2 cursor opens its own HS2
Session. Therefore, this patch breaks the assumption that an
ImpylaHS2Connection is always under a single HS2 Session (see HIVE-11402
and HIVE-14247 on why concurrent query with shared HS2 Session is
problematic). However, they do share the same query options stored at
self.__query_options. In practice, most Impala tests do not care about
running concurrent queries under a single HS2 session but only require
them to use the same query options.

The following additions are new for both BeeswaxConnection and
ImpylaHS2Connection:
- Add method 'log_client' for convenience.
- Implement consistent query state mapping and checking through several
  accessor methods.
- Add methods 'wait_for_impala_state' and 'wait_for_any_impala_state'
  that use 'get_impala_exec_state' method internally.
- Add 'fetch_profile_after_close' parameter to 'close_query' method. If
  True, 'close_query' will return the query profile after closing the
  query.
- Add 'discard_results' parameter for 'fetch' method. This can save time
  parsing results if the test does not care about the result.

Reuse existing op_handle_to_query_id and add new
session_handle_to_session_id to parse HS2
TOperationHandle.operationId.guid and TSessionHandle.sessionId.guid
respectively.

To show that ImpylaHS2Connection is on par with BeeswaxConnection, this
patch refactors test_admission_controller.py to test using HS2 protocol
by default. Tests that do raw HS2 RPC (requiring capabilities from
HS2TestSuite) are separated out into a new TestAdmissionControllerRawHS2
class and stay on the beeswax protocol by default. All calls to
copy.copy are replaced with copy.deepcopy for safety.
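
Why deepcopy is the safer choice can be shown with a short Python
example. The dictionary contents are made up for illustration and are
not taken from the test file.

```python
import copy

base = {"query_options": {"mem_limit": "1g"}}

# A shallow copy shares the nested dict with the original.
shallow = copy.copy(base)
shallow["query_options"]["mem_limit"] = "2g"  # Also changes base.

# A deep copy duplicates nested objects, so the original is untouched.
deep = copy.deepcopy(base)
deep["query_options"]["mem_limit"] = "3g"
```

After this runs, base holds "2g" because of the shallow copy, while the
later change through the deep copy does not leak back into it.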

Testing:
- Pass exhaustive tests.

Change-Id: I9ac07732424c16338e060c9392100b54337f11b8
Reviewed-on: http://gerrit.cloudera.org:8080/22362
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-04 06:58:23 +00:00
Riza Suminto
fc152b376c IMPALA-13816: Reduce test_queue_reasons_slots
test_queue_reasons_slots continues to be flaky in ASAN build even after
fix from IMPALA-10338.

This patch simplifies the test by making the test query run
faster and raising the timeout value sufficiently high in slow build
flavors by:
1. Lowering MT_DOP to 2.
2. Limiting the scan to just 6 partitions of store_sales.
3. Reducing the number of parallel queries from 5 to 3.
4. Replacing TIMEOUT_S with STRESS_TIMEOUT, which is equal to 90s in
   normal builds and 600s in slow builds.

Testing:
- Loop the test 50x in ASAN build.

Change-Id: Ic2d6d68d381d22c599d4c5cdc78cc997ddef749b
Reviewed-on: http://gerrit.cloudera.org:8080/22566
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-03-04 00:41:51 +00:00
jasonmfehr
b3b2dbaca3 IMPALA-13772: Fix Workload Management DMLs Timeouts
The insert DMLs executed by workload management to add rows to the
completed queries Iceberg table time out after 10 seconds because
that is the default FETCH_ROWS_TIMEOUT_MS value. If the DML queues up
in admission control, this timeout will quickly cause the DML to be
cancelled. The fix is to set the FETCH_ROWS_TIMEOUT_MS query option
to 0 for the workload management insert DMLs.

Even though the workload management DMLs do not retrieve any rows,
the FETCH_ROWS_TIMEOUT_MS value still applies because the internal
server functions call into the client request state's
ExecQueryOrDmlRequest() function which starts query execution and
immediately returns. Then, the BlockOnWait function in
impala-server.cc is called. This function times out based on the
FETCH_ROWS_TIMEOUT_MS value.

A new coordinator startup flag 'query_log_dml_exec_timeout_s' is
added to specify the EXEC_TIME_LIMIT_S query option on the workload
management insert DML statements. This flag ensures the DMLs will
time out if they do not complete in a reasonable timeframe.

While adding the new coordinator startup flag, a bug in the
internal-server code was discovered. This bug caused a return status
of 'ok' even when the query exec time limit was reached and the query
cancelled. This bug has also been fixed.

Testing:
  1. Added new custom cluster test that simulates a busy cluster where
       the workload management DML queues for longer than 10 seconds.
  2. Existing tests in test_query_log and test_admission_controller
       passed.
  3. One internal-server-test ctest was modified to assert for a
       returned status of error when a query is cancelled.
  4. Added a new custom cluster test that asserts the workload
       management DML is cancelled based on the value of the new
       coordinator startup flag.

Change-Id: I0cc7fbce40eadfb253d8cff5cbb83e2ad63a979f
Reviewed-on: http://gerrit.cloudera.org:8080/22511
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-26 03:12:31 +00:00
Riza Suminto
d682e226d5 IMPALA-10338: Deflake test_queue_reasons_slots in ASAN build.
test_queue_reasons_slots has been flaky in ASAN build for not completing
all of its test queries within 60 seconds from submission time. In the
latest occurrence, the 4th out of 5 queries submitted in parallel was
not admitted after 60 seconds. It then ended in ERROR state, while
the test expects it to end in FINISHED state.

This patch attempts to deflake the test by increasing slots and MT_DOP
option from 4 to 8 in test_queue_reasons_slots. The test query is a
simple GROUP BY query that is expected to run faster with increased
degree of parallelism.

Testing:
- Loop and pass the test 100x in ASAN build.

Change-Id: I4546aa3ce66c480504685e842a4b610a9a7e01ee
Reviewed-on: http://gerrit.cloudera.org:8080/22530
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-25 03:02:56 +00:00
Riza Suminto
9cb9bae84e IMPALA-13758: Use context manager in ImpalaTestSuite.change_database
ImpalaTestSuite.change_database is responsible for pointing the impala
client to the database under test. However, it left the client pointing
to that database after the test without reverting it back to the default
database. This patch does the reversal by changing
ImpalaTestSuite.change_database to use a context manager.
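
The revert-on-exit behavior can be sketched with contextlib. FakeClient
and this standalone change_database function are hypothetical and do not
mirror the real ImpalaTestSuite signature; the sketch only shows the
pattern of restoring the default database when the block ends.

```python
from contextlib import contextmanager


class FakeClient:
    """Minimal stand-in for a client that remembers its database."""

    def __init__(self):
        self.database = "default"

    def use(self, database):
        self.database = database


@contextmanager
def change_database(client, database):
    """Point the client at `database`, then restore 'default' on exit."""
    client.use(database)
    try:
        yield client
    finally:
        # Runs even if the body raises, so the client never stays
        # pointed at the test database.
        client.use("default")
```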

This patch changes the behavior of execute_query_using_client() and
execute_query_async_using_client(). They used to change database
according to the given vector parameter, but not anymore after this
patch. In practice, this behavior change does not affect many tests
because most queries going through these functions already use fully
qualified table name. Going forward, querying through function other
than run_test_case() should try to use fully qualified table name as
much as possible.

Retain the behavior of ImpalaTestSuite._get_table_location() (changing
database when called) since a considerable number of tests rely on
it.

Removed unused test fixtures and fixed several flake8 issues in modified
test files.

Testing:
- Moved nested-types-subplan-single-node.test. This allows the test
  framework to point to the right tpch_nested* database.
- Pass exhaustive test except IMPALA-13752 and IMPALA-13761. They will
  be fixed in separate patch.

Change-Id: I75bec7403cc302728a630efe3f95e852a84594e2
Reviewed-on: http://gerrit.cloudera.org:8080/22487
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-19 23:50:34 +00:00
Riza Suminto
443ca4e8ab IMPALA-13761: Fix test_coord_only_pool_exec_groups
IMPALA-13201 adds test_coord_only_pool_exec_groups. This test failed in
TestAdmissionControllerWithACService due to unaccounted AdmissionD as
the extra statestore subscriber. This patch fixes the issue by using the
expected_subscribers and expected_num_impalads arguments of
CustomClusterTestSuite._start_impala_cluster() and relying on the wait
mechanism inside it.

This patch also does some adjustments:
1. Tweak __run_assert_systables_query to not use
   execute_query_using_vector from IMPALA-13694.
2. Remove default_impala_client() call, first added by IMPALA-13668, and
   use self.client instead.
3. Fixed minor flake8 issue at test_coord_only_pool_happy_path.
These adjustments make it possible to backport IMPALA-13201 and
IMPALA-13761 together to older release/maintenance branches.

Testing:
Run the test with this command:
impala-py.test --exploration=exhaustive \
  -k test_coord_only_pool custom_cluster/test_admission_controller.py

Change-Id: I00b83e706aca3325890736133b2d1dcf735b19df
Reviewed-on: http://gerrit.cloudera.org:8080/22486
Reviewed-by: Peter Rozsa <prozsa@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Jason Fehr <jfehr@cloudera.com>
2025-02-18 22:05:27 +00:00
jasonmfehr
aac67a077e IMPALA-13201: System Table Queries Execute When Admission Queues are Full
Queries that run only against in-memory system tables are currently
subject to the same admission control process as all other queries.
Since these queries do not use any resources on executors, admission
control does not need to consider the state of executors when
deciding to admit these queries.

This change adds a boolean configuration option 'onlyCoordinators'
to the fair-scheduler.xml file for specifying a request pool only
applies to the coordinators. When a query is submitted to a
coordinator only request pool, then no executors are required to be
running. Instead, all fragment instances are executed exclusively on
the coordinators.

A new member was added to the ClusterMembershipMgr::Snapshot struct
to hold the ExecutorGroup of all coordinators. This object is kept up
to date by processing statestore messages and is used when executing
queries that either require the coordinators (such as queries against
sys.impala_query_live) or that use an only coordinators request pool.

Testing was accomplished by:
1. Adding cluster membership manager ctests to assert cluster
   membership manager correctly builds the list of non-quiescing
   coordinators.
2. RequestPoolService JUnit tests to assert the new optional
   <onlyCoords> config in the fair scheduler xml file is correctly
   parsed.
3. ExecutorGroup ctests modified to assert the new function.
4. Custom cluster admission controller tests to assert queries with a
   coordinator only request pool only run on the active coordinators.

Change-Id: I5e0e64db92bdbf80f8b5bd85d001ffe4c8c9ffda
Reviewed-on: http://gerrit.cloudera.org:8080/22249
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-14 04:27:11 +00:00
Riza Suminto
4aea5d6923 IMPALA-13743: Fix setup_method calls at test_admission_controller.py
IMPALA-13694 reveals an issue in setup_method calls of
TestAdmissionControllerWithACService and
TestAdmissionControllerStressWithACService. They should be called with
their own class name instead of the superclass name.

Testing:
- Pass TestAdmissionControllerWithACService and
  TestAdmissionControllerStressWithACService in exhaustive exploration.

Change-Id: I092c4f397cba1908245ccb30111176190b2182ff
Reviewed-on: http://gerrit.cloudera.org:8080/22465
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-08 01:33:35 +00:00
Riza Suminto
6fbde72969 IMPALA-13694: Add ImpalaTestSuite.__reset_impala_clients method
This patch adds the __reset_impala_clients() method in ImpalaConnection.
__reset_impala_clients() simply clears the configuration. It is called
in each setup_method() to ensure that each EE test uses a clean test
client. All subclasses of ImpalaTestSuite that declare a setup() method
are refactored to declare setup_method() instead, to match the newer
py.test convention. Also implement teardown_method() to complement
setup_method(). See "Method and function level setup/teardown" at
https://docs.pytest.org/en/stable/how-to/xunit_setup.html.

CustomClusterTestSuite fully overrides setup_method() and
teardown_method() because its subclasses can be destructive. Custom
cluster test methods often restart the whole Impala cluster, rendering
the default Impala clients initialized at setup_class() unusable. Each
subclass of CustomClusterTestSuite is responsible for ensuring that the
Impala client it uses is in a good state.

This patch improves BeeswaxConnection and ImpylaHS2Connection to only
consider non-REMOVED options as their default options. Each looks up
valid (not REMOVED) query options in its own appropriate way and
memorizes the option names as lowercase strings and the values as
strings. List values are wrapped in double quotes. The log message in
ImpalaConnection.set_configuration_option() is differentiated from how
a SET query looks.

Note that ImpalaTestSuite.run_test_case() modifies and restores query
options written in the .test file by issuing SET queries, not by calling
ImpalaConnection.set_configuration_option(). This remains unchanged.

Consistently lowercase query options everywhere in the Impala test code
infrastructure. Fixed several tests that had been unknowingly overriding
the 'exec_option' vector dimension due to a case-sensitive mismatch.
Also fixed some flake8 issues.

Added convenience methods execute_query_using_vector() and
create_impala_client_from_vector() in ImpalaTestSuite.

Testing:
- Pass core tests.

Change-Id: Ieb47fec9f384cb58b19fdbd10ff7aa0850ad6277
Reviewed-on: http://gerrit.cloudera.org:8080/22404
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-02-06 04:03:33 +00:00
Xuebin Su
d7ee509e93 IMPALA-12648: Add KILL QUERY statement
To support killing queries programmatically, this patch adds a new
type of SQL statement, called the KILL QUERY statement, to cancel and
unregister a query on any coordinator in the cluster.

A KILL QUERY statement looks like
```
KILL QUERY '123:456';
```
where `123:456` is the query id of the query we want to kill. We follow
syntax from HIVE-17483. For backward compatibility, 'KILL' and 'QUERY'
are added as "unreserved keywords", like 'DEFAULT'. This allows the
three keywords to be used as identifiers.

A user is authorized to kill a query only if the user is an admin or is
the owner of the query. KILL QUERY statements are not affected by
admission control.

Implementation:

Since we don't know in advance which impalad is the coordinator of the
query we want to kill, we need to broadcast the kill request to all the
coordinators in the cluster. Upon receiving a kill request, each
coordinator checks whether it is the coordinator of the query:
- If yes, it cancels and unregisters the query,
- If no, it reports "Invalid or unknown query handle".

Currently, a KILL QUERY statement is not interruptible. IMPALA-13663 is
created for this.

For authorization, this patch adds a custom handler of
AuthorizationException for each statement to allow the exception to be
handled by the backend. This is because we don't know whether the user
is the owner of the query until we reach its coordinator.

To support cancelling child queries, this patch changes
ChildQuery::Cancel() to bypass the HS2 layer so that the session of the
child query will not be added to the connection used to execute the
KILL QUERY statement.

Testing:
- A new ParserTest case is added to test using "unreserved keywords" as
  identifiers.
- New E2E test cases are added for the KILL QUERY statement.
- Added a new dimension in TestCancellation to use the KILL QUERY
  statement.
- Added file tests/common/cluster_config.py and made
  CustomClusterTestSuite.with_args() composable so that common cluster
  configs can be reused in custom cluster tests.

Change-Id: If12d6e47b256b034ec444f17c7890aa3b40481c0
Reviewed-on: http://gerrit.cloudera.org:8080/21930
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
2025-01-22 22:22:54 +00:00
Riza Suminto
8c2017aa00 IMPALA-12937: (part 2) Deflake TestAdmissionControllerStress
TestAdmissionControllerStress::test_mem_limit is flaky again. One
fragment instance that was expected to stay alive until the query
submission loop ends actually finished early, even though clients are
only fetching 1 row every 0.5 seconds. This patch attempts to address
the flakiness in two ways.

The first is lowering batch_size to 10. A lower batch size is expected
to keep all running fragment instances running until the query admission
loop finishes.

The second is lowering num_queries from 50 to 40 if exploration_strategy
is exhaustive. This will shorten the query submission loop, especially
when submission_delay_ms is high (150 seconds). This is OK because,
based on the assertions, the test framework will only retain at most 15
active queries and 10 in-queue queries once the query submission loop
ends.

This patch also refactors SubmitQueryThread. Set
long_polling_time_ms=100 for all queries to get faster initial response.
The lock is removed and replaced with threading.Event to signal the end
of test. The thread client and query_handle scope is made local within
run() method for proper cleanup. Set timeout for
wait_for_admission_control instead of waiting indefinitely.
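
A minimal sketch of signalling the end of a test with threading.Event
instead of a lock-guarded flag. SubmitWorker and its fields are hypothetical
and much simpler than the real SubmitQueryThread:

```python
import threading


class SubmitWorker(threading.Thread):
    """Hypothetical worker that keeps looping until the test signals its end."""

    def __init__(self, end_of_test):
        super(SubmitWorker, self).__init__()
        self.end_of_test = end_of_test
        self.iterations = 0

    def run(self):
        # Event.wait() returns True once the event is set; the timeout
        # bounds each poll so the loop never blocks indefinitely.
        while not self.end_of_test.wait(timeout=0.01):
            self.iterations += 1


end_of_test = threading.Event()
worker = SubmitWorker(end_of_test)
worker.start()
end_of_test.set()        # signal every waiting thread at once
worker.join(timeout=5)   # bounded join instead of waiting forever
```

An Event needs no explicit acquire/release, and one set() call wakes all
threads that share it.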

impala_connection.py is refactored so that BeeswaxConnection has the
same logging functionality as ImpylaHS2Connection. Changed
ImpylaHS2Connection._collect_profile_and_log initialization for the
possibility that the experimental Calcite planner may be able to pull
the query profile and log from the Impala backend.

Testing:
- Run and pass test_mem_limit in both TestAdmissionControllerStress and
  TestAdmissionControllerStressWithACService in exhaustive exploration
  10 times.
- Run and pass the whole TestAdmissionControllerStress and
  TestAdmissionControllerStressWithACService in exhaustive exploration.

Change-Id: I706e3dedce69e38103a524c64306f39eac82fac3
Reviewed-on: http://gerrit.cloudera.org:8080/22351
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-17 09:21:46 +00:00
Joe McDonnell
5b4afb4f8f IMPALA-13368: Fixup Redhat detection for Python >= 3.8
Python 3.8 removed the platform.linux_distribution() function which is
currently used to detect Redhat. This switches to using the 'distro'
package, which implements the same functionality across different
Python versions. Since Redhat 6 is no longer supported, this removes
the detection of Redhat 6 and associated skip logic.

Testing:
 - Ran a core job

Change-Id: I0dfaf798c0239f6068f29adbd2eafafdbbfd66c3
Reviewed-on: http://gerrit.cloudera.org:8080/22073
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-12-17 07:28:51 +00:00
Andrew Sherman
2280c1362e IMPALA-12943: Document Admission Control User Quotas.
Document the feature introduced in IMPALA-12345. Add a few more tests to
the QuotaExamples test which demonstrate the examples used in the
docs.

Clarify in docs and code the behavior when a user is a member of more
than one group for which there are rules. In this case the least
restrictive rule applies.

Also document the '--max_hs2_sessions_per_user' flag introduced in
IMPALA-12264.

Change-Id: I82e044adb072a463a1e4f74da71c8d7d48292970
Reviewed-on: http://gerrit.cloudera.org:8080/22100
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-12-11 02:18:18 +00:00
Andrew Sherman
de6b902581 IMPALA-12345: Add user quotas to Admission Control
Allow administrators to configure per user limits on queries that can
run in the Impala system.

In order to do this, there are two parts. Firstly we must track the
total counts of queries in the system on a per-user basis. Secondly
there must be a user model that allows rules that control per-user
limits on the number of queries that can be run.

In a Kerberos environment the user names that are used for both the user
model and at runtime are short user names, e.g. testuser when the
Kerberos principal is testuser/scm@EXAMPLE.COM

TPoolStats (the data that is shared between Admission Control instances)
is extended to include a map from user name to a count of queries
running. This (along with some derived data structures) is updated when
queries are queued and when they are released from Admission Control.
This lifecycle is slightly different from other TPoolStats data which
usually tracks data about queries that are running. Queries can be
rejected because of user quotas at submission time. This is done for
two reasons: (1) queries can only be admitted from the front of the
queue and we do not want to block other queries due to quotas, and
(2) it is easy for users to understand what is going on when queries
are rejected at submission time.

Note that when running in configurations without an Admission Daemon
then Admission Control does not have perfect information about the
system and over-admission is possible for User-Level Admission Quotas
in the same way that it is for other Admission Control controls.

The User Model is implemented by extending the format of the
fair-scheduler.xml file. The rules controlling the per-user limits are
specified in terms of user or group names.

Two new elements ‘userQueryLimit’ and ‘groupQueryLimit’ can be added to
the fair-scheduler.xml file. These elements can be placed on the root
configuration, which applies to all pools, or the pool configuration.
The ‘userQueryLimit’ element has 2 child elements: "user"
and "totalCount". The 'user' element contains the short names of users,
and can be repeated, or have the value "*" for a wildcard name which
matches all users. The ‘groupQueryLimit’ element has 2 child
elements: "group" and "totalCount". The 'group' element contains group
names.
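
A rough sketch of reading the elements described above. The enclosing
<queue> element, the pool name, and every user name, group name, and count
are assumptions made up for illustration; the real parsing is done by
RequestPoolService, not by this code:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only the userQueryLimit/groupQueryLimit element
# names and their user/group/totalCount children come from the description.
FRAGMENT = """
<queue name="example_pool">
  <userQueryLimit>
    <user>alice</user>
    <user>bob</user>
    <totalCount>3</totalCount>
  </userQueryLimit>
  <userQueryLimit>
    <user>*</user>
    <totalCount>1</totalCount>
  </userQueryLimit>
  <groupQueryLimit>
    <group>analysts</group>
    <totalCount>5</totalCount>
  </groupQueryLimit>
</queue>
"""


def read_limits(pool, rule_tag, name_tag):
    """Map each name found in the given rule elements to its totalCount."""
    limits = {}
    for rule in pool.findall(rule_tag):
        total = int(rule.findtext("totalCount"))
        for name in rule.findall(name_tag):
            limits[name.text] = total
    return limits


pool = ET.fromstring(FRAGMENT)
user_limits = read_limits(pool, "userQueryLimit", "user")
group_limits = read_limits(pool, "groupQueryLimit", "group")
```

The sketch only extracts the configured limits; it does not attempt to
reproduce how the rules are evaluated.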

The root level rules and pool level rules must both be passed for a new
query to be queued. The rules dictate a maximum number of queries that
can run by a user. When evaluating rules at either the root level, or
at the pool level, when a rule matches a user then there is no more
evaluation done.

To support reading the ‘userQueryLimit’ and ‘groupQueryLimit’ fields the
RequestPoolService is enhanced.

If user quotas are enabled for a pool then a list of the users with
running or queued queries in that pool is visible on the coordinator
webui admission control page.

More comprehensive documentation of the user model will be provided in
IMPALA-12943

TESTING

New end-to-end tests are added to test_admission_controller.py, and
admission-controller-test is extended to provide unit tests for the
user model.

Change-Id: I4c33f3f2427db57fb9b6c593a4b22d5029549b41
Reviewed-on: http://gerrit.cloudera.org:8080/21616
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-16 06:38:38 +00:00
Riza Suminto
95f353ac4a IMPALA-13507: Allow disabling glog buffering via with_args fixture
We have plenty of custom_cluster tests that assert against the content
of Impala daemon log files while the process is still running, using
assert_log_contains() and its wrappers. The method specifically mentions
disabling glog buffering ('-logbuflevel=-1'), but not all
custom_cluster tests do that. This often results in flaky tests that are
hard to triage and often neglected if they do not run frequently in core
exploration.

This patch adds a boolean param 'disable_log_buffering' to
CustomClusterTestSuite.with_args for tests to declare their intention to
inspect log files in a live minicluster. If it is True, start the
minicluster with '-logbuflevel=-1' for all daemons. If it is False, log
a WARNING on any call to assert_log_contains().

There are several complex custom_cluster tests that are left unchanged
and print out such WARNING logs, such as:
- TestQueryLive
- TestQueryLogTableBeeswax
- TestQueryLogOtherTable
- TestQueryLogTableHS2
- TestQueryLogTableAll
- TestQueryLogTableBufferPool
- TestStatestoreRpcErrors
- TestWorkloadManagementInitWait
- TestWorkloadManagementSQLDetails

This patch also fixed some small flake8 issues on modified tests.

There is a sign of flakiness in test_query_live.py, where a test query
is submitted to the coordinator and fails because the
sys.impala_query_live table does not exist yet from the coordinator's
perspective. This patch modifies test_query_live.py to wait for a few
seconds until sys.impala_query_live is queryable.

Testing:
- Pass custom_cluster tests in exhaustive exploration.

Change-Id: I56fb1746b8f3cea9f3db3514a86a526dffb44a61
Reviewed-on: http://gerrit.cloudera.org:8080/22015
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-05 04:49:05 +00:00
Riza Suminto
9c87cf41bf IMPALA-13396: Unify tmp dir management in CustomClusterTestSuite
Many custom cluster tests require creating a temporary directory. The
temporary directory typically lives within the scope of a test method
and is cleaned up afterwards. However, some tests create temporary
directories directly and forget to clean them up afterwards, leaving
junk dirs under /tmp/ or $LOG_DIR.

This patch unifies temporary directory management inside
CustomClusterTestSuite. It introduces a new 'tmp_dir_placeholders' arg
in CustomClusterTestSuite.with_args() that lists the tmp dirs to create.
'impalad_args', 'catalogd_args', and 'impala_log_dir' now accept a
formatting pattern that is replaceable by a temporary dir path, defined
through 'tmp_dir_placeholders'.
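
A minimal sketch of the placeholder idea, assuming str.format-style
patterns. The helper expand_tmp_dirs, the flag name --some_dir_flag, and
the directory prefix are hypothetical and not part of
CustomClusterTestSuite:

```python
import os
import shutil
import tempfile


def expand_tmp_dirs(arg_template, placeholders, prefix="impala_test_"):
    """Create one temp dir per placeholder and substitute it into the template.

    Returns the expanded argument string and a name-to-path mapping.
    """
    tmp_dirs = {name: tempfile.mkdtemp(prefix=prefix + name + "_")
                for name in placeholders}
    return arg_template.format(**tmp_dirs), tmp_dirs


args, tmp_dirs = expand_tmp_dirs("--some_dir_flag={scratch}", ["scratch"])
try:
    created = os.path.isdir(tmp_dirs["scratch"])
finally:
    # Centralized cleanup, so no junk directories are left behind.
    for path in tmp_dirs.values():
        shutil.rmtree(path, ignore_errors=True)
removed = not os.path.exists(tmp_dirs["scratch"])
```

Keeping creation and removal in one place is what prevents individual
tests from forgetting the cleanup step.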

There are a few occurrences where mkdtemp is called and is not
replaceable by this work, such as tests/comparison/cluster.py. In those
cases, this patch changes them to supply a prefix arg so that developers
know that the directory comes from an Impala test script.

This patch also addressed several flake8 errors in modified files.

Testing:
- Pass custom cluster tests in exhaustive mode.
- Manually run few modified tests and observe that the temporary dirs
  are created and removed under logs/custom_cluster_tests/ as the tests
  go.

Change-Id: I8dd665e8028b3f03e5e33d572c5e188f85c3bdf5
Reviewed-on: http://gerrit.cloudera.org:8080/21836
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-02 01:25:39 +00:00
Riza Suminto
4a39144295 IMPALA-12937: Deflake test_admission_controller.py
test_mem_limit in test_admission_controller.py has been flaky,
especially in ARM environments. From the test logs, some queries are
canceled early, presumably due to the client being idle.

This patch attempts to fix the flakiness by fetching one row every
FETCH_INTERVAL, regardless of the ending behavior. Added and moved logs
around to help triage if the issue surfaces again. Also addressed some
flake8 errors.

Testing:
- Loop the whole test_admission_controller.py 10 times in exhaustive
  mode. No flakiness observed after patch.

Change-Id: I9cf2bddcbfcd63d3a6bbc0a2014774a910f6e730
Reviewed-on: http://gerrit.cloudera.org:8080/21813
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-09-27 19:31:51 +00:00
zhangyifan27
ade98362c8 IMPALA-12834: Add number of concurrent queries to profile
This patch adds a profile info string for the number of currently
running queries of the executor group on which the query is scheduled,
to help diagnose potential performance issues due to resource limits.

Testing:
- Add an e2e test to verify the information appears in profile

Change-Id: I8389215b60022b39e7d171d6fc2418acca7c0658
Reviewed-on: http://gerrit.cloudera.org:8080/21063
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-03-01 05:49:04 +00:00
Riza Suminto
fef3564248 IMPALA-12532: Fix bug in cancel_query_and_validate_state
test_cancellation.py was silently buggy: it did not exercise the
exec_option combinations configured at
TestCancellation::execute_cancel_test().

change_database() should not be called right after set_configuration()
in cancel_util.py::cancel_query_and_validate_state(). This is because
change_database() will erase all configuration that was previously set.

This patch fixes the issue by calling change_database() first, before
set_configuration(). A similar pattern is also fixed in
test_admission_controller.py.

Testing:
- Manually run test_cancellation.py and confirm at coordinator log that
  query options are set properly.

Change-Id: Ie908fc1384279482892c3f8b2cfa3592e54f9d5a
Reviewed-on: http://gerrit.cloudera.org:8080/20641
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-11-01 22:22:54 +00:00
Abhishek Rawat
05bc485851 IMPALA-10860: Allow setting mem_limit for coordinators
Added support for the MEM_LIMIT_COORDINATORS query option. This is
similar to the existing MEM_LIMIT_EXECUTORS, but applies to coordinators.
There are cases where the Planner generates inaccurate estimates for
coordinator fragments, and it would be good to be able to set a mem
limit just for the coordinator, since a query's memory requirement on
the coordinator tends to be much lower than that on executors.

If MEM_LIMIT is set, then MEM_LIMIT_COORDINATORS is ignored.

Also updated the documentation for the new query option.

Testing:
- Added new custom cluster tests which validate that
  MEM_LIMIT_COORDINATORS applies only on the coordinator. The tests also
  validate that both MEM_LIMIT_EXECUTORS and MEM_LIMIT_COORDINATORS can
  be set together.
- Built docs and made sure that the new changes have proper formatting.

Change-Id: I2dfc9a735e82dce2fd903bdaf6bc2e46e982ef8c
Reviewed-on: http://gerrit.cloudera.org:8080/20378
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-08-23 01:55:00 +00:00
wzhou-code
ac23deab4d IMPALA-12036: Fix Web UI to show right resource pools
Web queries site shows no resource pool unless it is specified with
query option. The Planner could set TQueryCtx.request_pool in
TQueryExecRequest when auto scaling is enabled. But the backend
ignores the TQueryCtx.request_pool in TQueryExecRequest when getting
resource pools for Web UI.
This patch fixes the issue in ClientRequestState::request_pool() by
checking TQueryCtx.request_pool in TQueryExecRequest. It also
removes the error path in RequestPoolService::ResolveRequestPool() if
requested_pool is empty string.

Testing:
 - Updated TestExecutorGroups::test_query_cpu_count_divisor_default,
   TestExecutorGroups::test_query_cpu_count_divisor_two, and
   TestExecutorGroups::test_query_cpu_count_divisor_fraction to
   verify resource pools on Web queries site and Web admission site.
 - Updated expected error message in
   TestAdmissionController::test_set_request_pool.
 - Passed core test.

Change-Id: Iceacb3a8ec3bd15a8029ba05d064bbbb81e3a766
Reviewed-on: http://gerrit.cloudera.org:8080/19688
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
2023-04-05 15:45:06 +00:00
Abhishek Rawat
c810c51fa7 IMPALA-11858: Cap per backend memory estimate to its memory limit for admission
Admission controller caps memory estimates for a given query to
its physical memory. The memory estimates should instead be capped to
the backend's memory limit for admission, which is computed during
daemon initialization in ExecEnv::Init().

With this patch, for a given query schedule, the coordinator backend's
memory limit is used for capping memory to admit on the coordinator, and
the minimum of all executor backends' memory limits is used for capping
memory to admit on executors. A config option
'clamp_query_mem_limit_backend_mem_limit' is also added to revert to the
old behavior where queries requesting more memory than the backend's
admission limit get rejected.
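
The capping arithmetic can be sketched as follows. The function name, its
parameters, and the sample sizes are assumptions for illustration; the
real logic lives in the C++ admission controller:

```python
GB = 1024 ** 3


def cap_mem_to_admit(coord_estimate, exec_estimate, coord_limit, exec_limits):
    """Return (coordinator, executor) memory to admit after capping.

    The coordinator estimate is capped by the coordinator's admission
    limit; the executor estimate is capped by the smallest admission
    limit among the executors in the schedule.
    """
    coord_mem = min(coord_estimate, coord_limit)
    exec_mem = min(exec_estimate, min(exec_limits))
    return coord_mem, exec_mem


coord_mem, exec_mem = cap_mem_to_admit(
    coord_estimate=2 * GB,
    exec_estimate=20 * GB,
    coord_limit=4 * GB,
    exec_limits=[16 * GB, 8 * GB, 12 * GB])
```

With these sample numbers the coordinator estimate is already below its
limit and stays unchanged, while the executor estimate is clamped to the
smallest executor limit.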

The memory requested by a query when MEM_LIMIT or MEM_LIMIT_EXECUTORS is
set is also capped to the memory limit for admission on the backends.

Also fixed the issue related to excessive logging in query profiles
when using global admission controller. If the query was queued
the remote admission controller client was logging 'Queued' status in
profile every time it checked the query status and it hadn't changed.

Testing:
- Updated existing unit tests in admission-controller-test.cc
- Added new checks in existing tests in executor-group-test.cc
- Updated custom_cluster tests in test_admission_controller.py
- Ran exhaustive tests

Change-Id: I3b1f6e530785ef832dbc831d7cc6793133f3335c
Reviewed-on: http://gerrit.cloudera.org:8080/19533
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-13 07:52:27 +00:00
Joe McDonnell
0c7c6a335e IMPALA-11977: Fix Python 3 broken imports and object model differences
Python 3 changed some object model methods:
 - __nonzero__ was removed in favor of __bool__
 - func_dict / func_name were removed in favor of __dict__ / __name__
 - The next() method was removed in favor of __next__
   (Code locations should use next(iter) rather than iter.next())
 - metaclasses are specified a different way
 - Locations that specify __eq__ should also specify __hash__
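
A small illustration of several of the points above. The Batch class is a
made-up example, not code from the patch:

```python
class Batch(object):
    """Hypothetical container used to illustrate the Python 3 object model."""

    def __init__(self, rows):
        self.rows = tuple(rows)

    def __bool__(self):
        return len(self.rows) > 0

    # Python 2 looks up __nonzero__, so alias it for code that runs on both.
    __nonzero__ = __bool__

    def __eq__(self, other):
        return isinstance(other, Batch) and self.rows == other.rows

    def __ne__(self, other):
        # Python 2 does not derive != from __eq__ automatically.
        return not self == other

    def __hash__(self):
        # Defining __eq__ alone makes the class unhashable on Python 3,
        # so provide a hash that is consistent with equality.
        return hash(self.rows)


row_iter = iter(Batch([1, 2]).rows)
first_row = next(row_iter)  # builtin next() works on Python 2.6+ and Python 3
unique_batches = {Batch([1]), Batch([1])}
```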

Python 3 also moved some packages around (urllib2, Queue, httplib,
etc), and this adapts the code to use the new locations (usually
handled on Python 2 via future). This also fixes the code to
avoid referencing exception variables outside the exception block
and variables outside of a comprehension. Several of these seem
like false positives, but it is better to avoid the warning.

This fixes these pylint warnings:
bad-python3-import
eq-without-hash
metaclass-assignment
next-method-called
nonzero-method
exception-escape
comprehension-escape

Testing:
 - Ran core tests
 - Ran release exhaustive tests

Change-Id: I988ae6c139142678b0d40f1f4170b892eabf25ee
Reviewed-on: http://gerrit.cloudera.org:8080/19592
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
aa4050b4d9 IMPALA-11976: Fix use of deprecated functions/fields removed in Python 3
Python 3 moved several things around or removed deprecated
functions / fields:
 - sys.maxint was removed, but sys.maxsize provides similar functionality
 - long was removed, but int provides the same range
 - file() was removed, but open() already provided the same functionality
 - Exception.message was removed, but str(exception) is equivalent
 - Some encodings (like hex) were moved to codecs.encode()
 - string.letters -> string.ascii_letters
 - string.lowercase -> string.ascii_lowercase
 - string.strip was removed

This fixes all of those locations. Python 3 also has slightly different
rounding behavior from round(), so this changes round() to use future's
builtins.round() to get the Python 3 behavior.
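
A short illustration of a few of these replacements when run on Python 3.
The sample values are arbitrary:

```python
import codecs
import string
import sys

# Python 3's round() rounds exact halves to the nearest even integer.
half_rounds = [round(0.5), round(1.5), round(2.5)]

# codecs.encode() replaces the removed 'hex' string encoding.
hex_bytes = codecs.encode(b"ab", "hex")

# str(exception) replaces the removed Exception.message attribute.
try:
    raise ValueError("bad value")
except ValueError as e:
    message = str(e)

letters = string.ascii_letters  # replaces string.letters
largest = sys.maxsize           # replaces sys.maxint
```

On Python 3, half_rounds is [0, 2, 2], whereas Python 2's round() rounds
halves away from zero.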

This fixes the following pylint warnings:
 - file-builtin
 - long-builtin
 - invalid-str-codec
 - round-builtin
 - deprecated-string-function
 - sys-max-int
 - exception-message-attribute

Testing:
 - Ran core tests

Change-Id: I094cd7fd06b0d417fc875add401d18c90d7a792f
Reviewed-on: http://gerrit.cloudera.org:8080/19591
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
c233634d74 IMPALA-11975: Fix Dictionary methods to work with Python 3
Python 3 made the main dictionary methods lazy (items(),
keys(), values()). This means that code that uses those
methods may need to wrap the call in list() to get a
list immediately. Python 3 also removed the old iter*
lazy variants.

This changes all locations to use Python 3 dictionary
methods and wraps calls with list() appropriately.
This also changes all iteritems(), itervalues(), iterkeys()
locations to items(), values(), keys(), etc. Python 2
will not use the lazy implementation of these, so there
is a theoretical performance impact. Our python code is
mostly for tests and the performance impact is minimal.
Python 2 will be deprecated when Python 3 is functional.
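
A small illustration of why some call sites need list() on Python 3. The
dictionary contents are arbitrary:

```python
counts = {"a": 1, "b": 2}

keys_view = counts.keys()            # a live view on Python 3
keys_snapshot = list(counts.keys())  # an independent list

counts["c"] = 3
view_len = len(keys_view)          # the view sees the new key
snapshot_len = len(keys_snapshot)  # the snapshot does not

# Deleting while iterating over a live view raises RuntimeError on
# Python 3, so iterate over a list() snapshot instead.
for key in list(counts.keys()):
    if counts[key] > 1:
        del counts[key]
```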

This addresses these pylint warnings:
dict-iter-method
dict-keys-not-iterating
dict-values-not-iterating

Testing:
 - Ran core tests

Change-Id: Ie873ece54a633a8a95ed4600b1df4be7542348da
Reviewed-on: http://gerrit.cloudera.org:8080/19590
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
eb66d00f9f IMPALA-11974: Fix lazy list operators for Python 3 compatibility
Python 3 changes list operators such as range, map, and filter
to be lazy. Some code that expects the list operators to happen
immediately will fail. e.g.

Python 2:
range(0,5) == [0,1,2,3,4]
True

Python 3:
range(0,5) == [0,1,2,3,4]
False

The fix is to wrap locations with list(). i.e.

Python 3:
list(range(0,5)) == [0,1,2,3,4]
True

Since the base operators are now lazy, Python 3 also removes the
old lazy versions (e.g. xrange, ifilter, izip, etc). This uses
future's builtins package to convert the code to the Python 3
behavior (i.e. xrange -> future's builtins.range).
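
Beyond equality checks, laziness also matters when a result is consumed
more than once. A short illustration on Python 3, with arbitrary values:

```python
# A Python 3 range object never compares equal to a list.
lazy_equal = range(0, 5) == [0, 1, 2, 3, 4]
eager_equal = list(range(0, 5)) == [0, 1, 2, 3, 4]

# A lazy map can be consumed only once; a second pass sees nothing.
squares = map(lambda x: x * x, range(3))
first_pass = list(squares)
second_pass = list(squares)

# Wrapping in list() up front gives a value that can be reused.
squares_list = list(map(lambda x: x * x, range(3)))
reused = (sum(squares_list), len(squares_list))
```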

Most of the changes were done via these futurize fixes:
 - libfuturize.fixes.fix_xrange_with_import
 - lib2to3.fixes.fix_map
 - lib2to3.fixes.fix_filter

This eliminates the pylint warnings:
 - xrange-builtin
 - range-builtin-not-iterating
 - map-builtin-not-iterating
 - zip-builtin-not-iterating
 - filter-builtin-not-iterating
 - reduce-builtin
 - deprecated-itertools-function

Testing:
 - Ran core job

Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f
Reviewed-on: http://gerrit.cloudera.org:8080/19589
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
82bd087fb1 IMPALA-11973: Add absolute_import, division to all eligible Python files
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.
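
A short illustration of the division difference. The variable names and
values are made up:

```python
# On Python 2, a file would start with:
#   from __future__ import absolute_import, division, print_function
# Python 3 behaves this way by default.

rows = [10, 20, 30, 40, 50, 60, 70]
num_batches = 2

avg_batch_size = len(rows) / num_batches     # true division yields a float
whole_batch_size = len(rows) // num_batches  # floor division yields an int

# Indices must be integers, so '//' is the right operator here.
middle_row = rows[len(rows) // 2]
```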

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Yida Wu
b3e9c4a65f IMPALA-7969: Always admit trivial queries immediately
The idea of a trivial query is to allow certain queries to bypass
admission control, thereby accelerating query execution even when the
server's resources are at capacity. It could benefit queries that
require a fast response while consuming minimal resources.

This patch adds support for trivial query detection and allows
immediate admission of trivial queries. We define a trivial query
as a coordinator-only query that returns no more than one row. The
definition is as follows:
  - Must have PLAN ROOT SINK as the root
  - Can contain UNION and EMPTYSET nodes only
  - Results cannot exceed one row

Examples of a trivial query:
  - select 1;
  - select * from table limit 0;
  - select * from table limit 0 union all select 1;
  - select 1, (2 + 3);
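
The definition above can be sketched as a small check. This is only an
illustration under stated assumptions: PlanNode, is_trivial_query and the
max_rows argument are hypothetical and are not the data structures used by
the patch.

```python
class PlanNode(object):
    """Hypothetical plan node: a node type plus child nodes."""

    def __init__(self, node_type, children=None):
        self.node_type = node_type
        self.children = children or []


# Node types that the commit message allows in a trivial query.
ALLOWED_NODE_TYPES = {"UNION", "EMPTYSET"}


def is_trivial_query(root_sink, plan_root, max_rows):
    # Rule 1: the root must be PLAN ROOT SINK.
    if root_sink != "PLAN ROOT SINK":
        return False
    # Rule 3: the result must not exceed one row (max_rows is an assumed
    # upper bound supplied by the caller).
    if max_rows > 1:
        return False
    # Rule 2: every plan node must be a UNION or EMPTYSET node.
    stack = [plan_root]
    while stack:
        node = stack.pop()
        if node.node_type not in ALLOWED_NODE_TYPES:
            return False
        stack.extend(node.children)
    return True
```

For example, a plan made of a single UNION node with a one-row bound passes
the check, while any plan that contains a SCAN node does not.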

We also restrict the parallelism of trivial queries: each resource
pool can execute no more than three trivial queries at the same
time. If the maximum parallelism is reached, the admission
controller tries to admit the trivial query via the normal process.
More precisely, if the cluster is running with a global admission
controller, the max parallelism for trivial queries is three per
resource pool. If there is no global admission controller, each
coordinator admits trivial queries based on its own local variable,
so the max parallelism is three per coordinator per resource pool
in this case.
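
A rough sketch of the per-pool limit, assuming a simple counter guarded by
a lock. TrivialQueryLimiter and its methods are hypothetical names, not the
admission controller's API. Only the limit of three comes from this message.

```python
import threading
from collections import defaultdict

# Limit stated in this commit message.
MAX_TRIVIAL_QUERIES_PER_POOL = 3


class TrivialQueryLimiter(object):
    """Hypothetical per-pool counter of running trivial queries."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = defaultdict(int)

    def try_admit(self, pool):
        # Returns False when the pool is at the limit; the caller would
        # then fall back to the normal admission process.
        with self._lock:
            if self._running[pool] >= MAX_TRIVIAL_QUERIES_PER_POOL:
                return False
            self._running[pool] += 1
            return True

    def release(self, pool):
        # Called when a trivial query finishes.
        with self._lock:
            if self._running[pool] > 0:
                self._running[pool] -= 1
```

One instance shared by a global admission controller gives three per pool.
One instance per coordinator gives three per coordinator per pool.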

As this is the first patch, we keep the trivial query definition as
simple as possible; it could be extended in the future.

Added query option enable_trivial_query_for_admission to control
whether the trivial query policy is enabled.

Tests:
Passed exhaustive tests.
Added test_trivial_query and test_trivial_query_low_mem.

Change-Id: I2a729764e3055d7eb11900c96c82ff53eb261f91
Reviewed-on: http://gerrit.cloudera.org:8080/19214
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-27 01:14:01 +00:00
Yida Wu
839a25c89b IMPALA-11786: Preserve memory for codegen cache
IMPALA-11470 added support for the codegen cache. However, the
admission controller is not aware of the codegen cache's memory
usage, even though the cache draws from the query memory quota.
This can cause query failures when running heavy workloads while
the admission controller has admitted queries up to its limit.

This patch subtracts the codegen cache capacity from the admission
memory limit during initialization, thereby reserving memory for
the codegen cache from the beginning and treating it as separate
memory, independent of the query memory reservation.

Also reduces the max codegen cache memory from 20 percent to 10
percent, and updates some test cases that failed because of the
reduced admission memory limit.
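
The arithmetic can be sketched as follows. This is an illustration only:
the function and parameter names are hypothetical, and it assumes the
10 percent cap is taken relative to the process memory limit, which this
message does not state.

```python
GIB = 1024 ** 3


def admission_mem_limit(process_mem_limit, requested_cache_bytes,
                        max_cache_percent=10):
    # Cap the cache at an assumed percentage of the process memory limit.
    cache_cap = process_mem_limit * max_cache_percent // 100
    cache_capacity = min(requested_cache_bytes, cache_cap)
    # Reserve the cache capacity up front by subtracting it from the
    # memory that admission control may hand out to queries.
    return process_mem_limit - cache_capacity, cache_capacity


admit_limit, cache = admission_mem_limit(10 * GIB, 2 * GIB)
print(admit_limit // GIB, cache // GIB)   # 9 1
```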

Tests:
Passed exhaustive tests.

Change-Id: Iebdc04ba1b91578d74684209a11c815225b8505a
Reviewed-on: http://gerrit.cloudera.org:8080/19377
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-06 06:28:07 +00:00
Michael Smith
a870a11e64 IMPALA-7098: Re-enable tests under EC
Re-enables tests under erasure coding, or provides more specific
exceptions.

Erasure coding uses multiple data blocks to construct a block group. Our
tests use RS-3-2-1024k, which includes 3 data blocks in a block group.
Each of these blocks is sized according to `dfs.block.size`, so block
groups by default hold up to 384MB of data.

Impala schedules work to executors based on blocks reported by HDFS,
which for EC actually represent block groups. So with default block
size, a file in EC has 1/3rd the number of schedulable blocks. In the
case of tpch.lineitem, this produces 2 parquet files instead of 3 and
reduces the number of executors scheduled to read parquet lineitem, as
follows:

1. lineitem.tbl is loaded via Hive. With EC it uses 2 block groups,
   without EC it uses 6 blocks.
2. parquet lineitem is created by select/insert from lineitem.tbl.
   Impala schedules reads to executors based on available blocks, so
   with EC this gets scheduled across 2 executors instead of 3 and each
   executor writes a separate parquet file.
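
The block counts above follow from simple arithmetic, sketched here. The
128MB block size is the value implied by the 384MB block group figure, and
the 725MB file size for lineitem.tbl is an assumed, illustrative number
chosen to match the 6 blocks and 2 block groups stated above.

```python
BLOCK_SIZE_MB = 128          # implied by 384MB / 3 data blocks
DATA_BLOCKS_PER_GROUP = 3    # RS-3-2-1024k uses 3 data blocks per group
BLOCK_GROUP_SIZE_MB = BLOCK_SIZE_MB * DATA_BLOCKS_PER_GROUP


def schedulable_units(file_size_mb, unit_size_mb):
    # Ceiling division: a partially filled block still counts as one unit.
    return -(-file_size_mb // unit_size_mb)


LINEITEM_TBL_MB = 725        # assumed size, for illustration only

print(BLOCK_GROUP_SIZE_MB)                                      # 384
print(schedulable_units(LINEITEM_TBL_MB, BLOCK_SIZE_MB))        # 6
print(schedulable_units(LINEITEM_TBL_MB, BLOCK_GROUP_SIZE_MB))  # 2
```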

Change-Id: Ib452024993e35d5a8d2854c6b2085115b26e40df
Reviewed-on: http://gerrit.cloudera.org:8080/19172
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-11-04 22:13:50 +00:00
Michael Smith
1eb0510eaa IMPALA-11456: Collapse filesystem Skip logic
Combines all SkipIf* classes for different filesystems into a single
SkipIfFS class. Many cases are simplified to 'not IS_HDFS', with the
rest as filesystem-specific special cases. The 'jira' option is removed
in favor of specific flags for each issue.

Change-Id: Ib928a6274baaaec45614887b9e762346a25812a1
Reviewed-on: http://gerrit.cloudera.org:8080/18781
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-10 22:37:08 +00:00