IMPALA-10244: Make non-scalable failures to dequeue observable.

An important signal of Impala throughput is whether queries are being
queued. Queuing can indicate that the cluster needs more resources, which
can be supplied by adding more executor groups. That is only a good
strategy if the extra resources will help with the current workload. In
some situations the query at the head of the queue cannot be executed
because of resource constraints on the coordinator. In these cases the
coordinator is the bottleneck, so adding more executor groups will not
help. This change makes these cases observable by adding a new counter
that is incremented when a dequeue fails because of resource constraints
on the coordinator.

The two cases that cause the counter to be incremented are:
- when there are not enough admission control slots on the coordinator
- when there is not enough memory on the coordinator
Other conditions may be added in the future.
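
The two cases above can be illustrated with a minimal Python sketch. It
is not the admission controller code touched by this change: the Host and
AdmissionSketch classes, the try_dequeue() method and the
DequeueResult enum are hypothetical names invented for illustration, and
the sketch assumes, for simplicity, that a query needs the same slots and
memory on the coordinator as on each executor. Only the idea of a counter
that is bumped when the coordinator is the limiting host comes from the
description above.

```python
from dataclasses import dataclass
from enum import Enum, auto


class DequeueResult(Enum):
    ADMITTED = auto()
    COORDINATOR_LIMITED = auto()  # adding executor groups would not help
    EXECUTOR_LIMITED = auto()     # adding executor groups might help


@dataclass
class Host:
    """Hypothetical view of the free resources on one host."""
    free_slots: int
    free_mem_bytes: int

    def can_fit(self, slots: int, mem_bytes: int) -> bool:
        return self.free_slots >= slots and self.free_mem_bytes >= mem_bytes


class AdmissionSketch:
    """Illustrative stand-in for the dequeue decision, not Impala's code."""

    def __init__(self) -> None:
        # Plays the role of the new counter described above.
        self.total_dequeue_failed_coordinator_limited = 0

    def try_dequeue(self, coordinator: Host, executors: list,
                    slots: int, mem_bytes: int) -> DequeueResult:
        # Check the coordinator first: if it lacks slots or memory,
        # count the failure as coordinator-limited.
        if not coordinator.can_fit(slots, mem_bytes):
            self.total_dequeue_failed_coordinator_limited += 1
            return DequeueResult.COORDINATOR_LIMITED
        # Failures caused by executors do not touch the counter.
        if not all(e.can_fit(slots, mem_bytes) for e in executors):
            return DequeueResult.EXECUTOR_LIMITED
        return DequeueResult.ADMITTED
```

An operator watching such a counter can tell the two kinds of queuing
apart: if it keeps rising while queries are queued, the coordinator is
the limiting host and more executor groups are unlikely to help.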

TESTING:
Added new unit tests.
Ran all end-to-end tests.

Change-Id: I3456396ac139c562ad9cd3ac1a624d8f35487518
Reviewed-on: http://gerrit.cloudera.org:8080/16613
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Andrew Sherman
2020-10-22 15:10:30 -07:00
committed by Impala Public Jenkins
parent d459b434b6
commit 4159a8085c
5 changed files with 161 additions and 55 deletions


@@ -960,6 +960,9 @@ class TestAdmissionController(TestAdmissionControllerBase, HS2TestSuite):
EXPECTED_REASON = "Latest admission queue reason: Not enough admission control " +\
"slots available on host"
NUM_QUERIES = 5
coordinator_limited_metric = \
"admission-controller.total-dequeue-failed-coordinator-limited"
original_metric_value = self.get_metric(coordinator_limited_metric)
profiles = self._execute_and_collect_profiles([STMT for i in xrange(NUM_QUERIES)],
TIMEOUT_S, config_options={"mt_dop": 4})
@@ -984,6 +987,12 @@ class TestAdmissionController(TestAdmissionControllerBase, HS2TestSuite):
verifier = MetricVerifier(impalad.service)
verifier.wait_for_backend_admission_control_state()
# The number of admission control slots on the coordinator is limited
# so the failures to dequeue should trigger a bump in the coordinator_limited_metric.
later_metric_value = self.get_metric(coordinator_limited_metric)
assert later_metric_value > original_metric_value, \
"Metric %s did not change" % coordinator_limited_metric
@pytest.mark.execute_serially
@CustomClusterTestSuite.with_args(
impalad_args=impalad_admission_ctrl_flags(max_requests=1, max_queued=10,