IMPALA-10244: Make non-scalable failures to dequeue observable.

One of the important ways to observe Impala throughput is by looking at when queries are queued. This can be an indication that more resources should be added to the cluster by adding more executor groups. This is only a good strategy if adding more resources will help with the current workload. In some situations the head of the query queue cannot be executed because of resource constraints on the coordinator. In these cases the coordinator is the bottleneck so adding more executor groups will not help.

This change is to make these cases observable by adding a new counter which is incremented when a dequeue fails because of resource constraints on the coordinator. The two cases that cause the counter to be incremented are:
- when there are not enough admission control slots on the coordinator
- when there is not enough memory on the coordinator
but it is possible that other conditions may be added in future.

TESTING:
Added new unit tests.
Ran all end-to-end tests.

Change-Id: I3456396ac139c562ad9cd3ac1a624d8f35487518
Reviewed-on: http://gerrit.cloudera.org:8080/16613
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
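
The counter behaviour described in the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions, not Impala's actual admission controller: the class `ToyAdmissionQueue`, the `Query` fields, and the attribute `total_dequeue_failed_coordinator_limited` are hypothetical names chosen to mirror the metric `admission-controller.total-dequeue-failed-coordinator-limited`. It models only the two coordinator-side conditions named above (slots and memory).

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Query:
    # Hypothetical per-query requirements on the coordinator.
    slots_needed: int
    coord_mem_needed: int


@dataclass
class ToyAdmissionQueue:
    # Hypothetical view of the resources currently free on the coordinator.
    coord_slots_free: int
    coord_mem_free: int
    queue: deque = field(default_factory=deque)
    # Stand-in for the new counter; the name is an assumption for illustration.
    total_dequeue_failed_coordinator_limited: int = 0

    def try_dequeue(self):
        """Try to admit the head of the queue.

        Returns the admitted query, or None if the queue is empty or the
        head must keep waiting.
        """
        if not self.queue:
            return None
        head = self.queue[0]
        coordinator_limited = (
            head.slots_needed > self.coord_slots_free
            or head.coord_mem_needed > self.coord_mem_free
        )
        if coordinator_limited:
            # Adding executor groups would not help here, so record the event.
            self.total_dequeue_failed_coordinator_limited += 1
            return None
        # Reserve the coordinator resources and admit the query.
        self.coord_slots_free -= head.slots_needed
        self.coord_mem_free -= head.coord_mem_needed
        return self.queue.popleft()
```

In this model, a counter that keeps rising while queries stay queued suggests the coordinator is the bottleneck, which is the signal the new metric is meant to expose.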
2025-12-25 02:03:09 -05:00 · 2020-10-22 15:10:30 -07:00
parent d459b434b6
commit 4159a8085c
5 changed files with 161 additions and 55 deletions
--- a/tests/custom_cluster/test_admission_controller.py
+++ b/tests/custom_cluster/test_admission_controller.py
@@ -960,6 +960,9 @@ class TestAdmissionController(TestAdmissionControllerBase, HS2TestSuite):
    EXPECTED_REASON = "Latest admission queue reason: Not enough admission control " +\
                      "slots available on host"
    NUM_QUERIES = 5
+    coordinator_limited_metric = \
+      "admission-controller.total-dequeue-failed-coordinator-limited"
+    original_metric_value = self.get_metric(coordinator_limited_metric)
    profiles = self._execute_and_collect_profiles([STMT for i in xrange(NUM_QUERIES)],
        TIMEOUT_S, config_options={"mt_dop": 4})

@@ -984,6 +987,12 @@ class TestAdmissionController(TestAdmissionControllerBase, HS2TestSuite):
      verifier = MetricVerifier(impalad.service)
      verifier.wait_for_backend_admission_control_state()

+    # The number of admission control slots on the coordinator is limited
+    # so the failures to dequeue should trigger a bump in the coordinator_limited_metric.
+    later_metric_value = self.get_metric(coordinator_limited_metric)
+    assert later_metric_value > original_metric_value, \
+      "Metric %s did not change" % coordinator_limited_metric
+
  @pytest.mark.execute_serially
  @CustomClusterTestSuite.with_args(
    impalad_args=impalad_admission_ctrl_flags(max_requests=1, max_queued=10,