5 Commits

Author SHA1 Message Date
Thomas Tauber-Marshall
014f973e92 IMPALA-9243: Add info about blacklisting decisions to the webui
This patch adds information about blacklisting decisions to the
/backends webui endpoint.

For the JSON, it adds an 'is_blacklisted' field to all backends, and
for any backends where 'is_blacklisted' is true it adds a
'blacklist_cause' field indicating the error status that led to the
backend getting blacklisted and a 'blacklist_time_remaining' field
indicating how much longer the backend will remain on the blacklist.
It also adds counts for the number of blacklisted and quiescing
backends, if any, and the number of active (i.e. all other) backends.
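
As a rough illustration of the JSON shape described above, the sketch
below partitions a sample payload using the new fields. Only
'is_blacklisted', 'blacklist_cause', and 'blacklist_time_remaining' come
from this patch description; the top-level 'backends' key, the 'address'
and 'is_quiescing' fields, the sample values, and the helper name are
assumptions made for the example and may not match the real endpoint
output.

```python
import json

# Hypothetical sample payload; only the three blacklist fields are taken
# from the patch description, everything else is assumed for illustration.
SAMPLE = """
{
  "backends": [
    {"address": "host-1:27000", "is_blacklisted": false, "is_quiescing": false},
    {"address": "host-2:27000", "is_blacklisted": true, "is_quiescing": false,
     "blacklist_cause": "RPC to host-2 failed", "blacklist_time_remaining": "8s"},
    {"address": "host-3:27000", "is_blacklisted": false, "is_quiescing": true}
  ]
}
"""


def partition_backends(payload):
    """Split backends into blacklisted, quiescing, and active (all other)."""
    groups = {"blacklisted": [], "quiescing": [], "active": []}
    for backend in payload.get("backends", []):
        if backend.get("is_blacklisted"):
            groups["blacklisted"].append(backend)
        elif backend.get("is_quiescing"):
            groups["quiescing"].append(backend)
        else:
            groups["active"].append(backend)
    return groups


groups = partition_backends(json.loads(SAMPLE))
# One count per state, mirroring the counts the patch describes.
counts = {name: len(members) for name, members in groups.items()}
```

A display layer could then render the blacklisted and quiescing groups
only when their counts are non-zero.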

For display, in order to prevent the table of backend information from
having too many columns (prior to this patch it already had 12), it
separates blacklisted, quiescing, and active backends into three
separate tables, with the blacklisted and quiescing tables only getting
displayed if there are any such backends.

Additionally, tooltips are added next to the headers for the
blacklisted and quiescing tables that provide a brief explanation of
what it means for a backend to appear on these lists.

Using separate tables also facilitates having state-specific columns -
the blacklisted table displays columns for the blacklist cause and
time remaining. Future work could consider adding columns to the
quiescing table, such as time until the grace period and deadline
expire.

Testing:
- Manually ran various quiescing/blacklisting scenarios and confirmed
  the /backends page displays as expected.
- Added cases to test_web_pages (to verify the new fields when nothing
  is blacklisted) and test_blacklist.

Change-Id: Ia0c309315b142a50be102dcb516b36ec6cb3cf47
Reviewed-on: http://gerrit.cloudera.org:8080/15178
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-08 05:22:48 +00:00
Sahil Takiar
55efe5caca IMPALA-9262: Bump statestore_heartbeat_frequency_ms in test_kill_impalad_with_running_queries
CustomClusterTestSuite sets statestore_heartbeat_frequency_ms to 50,
overriding the default value of 1000. This means that if a node does not
respond to heartbeats for 500 milliseconds, it will time out and be
removed from the cluster (vs the default of 10 seconds).
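
The two timeouts quoted above are consistent with a node being dropped
after ten missed heartbeats. That factor of ten is inferred from the
numbers in this message (500 / 50 and 10000 / 1000), not stated in it,
and the function name below is made up for the illustration.

```python
# Assumption: a node is removed after this many consecutive missed
# heartbeats. The value is inferred from the figures in the commit message.
ASSUMED_MAX_MISSED_HEARTBEATS = 10


def removal_timeout_ms(heartbeat_frequency_ms,
                       max_missed=ASSUMED_MAX_MISSED_HEARTBEATS):
    """Time without a heartbeat response before a node is removed."""
    return heartbeat_frequency_ms * max_missed


test_timeout_ms = removal_timeout_ms(50)       # CustomClusterTestSuite value
default_timeout_ms = removal_timeout_ms(1000)  # default value
```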

A low value for statestore_heartbeat_frequency_ms is problematic for
test_kill_impalad_with_running_queries because if a node is removed from
the cluster membership, it is removed from the blacklist as well. The
test asserts that "Blacklisted Executors" shows up in the runtime
profile of a query immediately after running a query that causes a node
to be blacklisted. Thus, there is a race condition between running the
test query vs. the node being removed from the cluster membership.
Increasing the value of statestore_heartbeat_frequency_ms should
significantly reduce the chances of such a race.

Testing:
* Ran test_kill_impalad_with_running_queries locally
* Was not able to reproduce the flakiness locally

Change-Id: I84e884efab35649b63db1a7a3b8c49b95b0b4648
Reviewed-on: http://gerrit.cloudera.org:8080/15131
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-01-30 00:45:22 +00:00
Sahil Takiar
915e811c2c IMPALA-9262: De-flake TestBlacklist::test_kill_impalad_with_running_queries
test_blacklist.py::TestBlacklist::test_kill_impalad_with_running_queries
runs a query asynchronously, waits for it to reach the RUNNING or FINISHED
state, kills an impalad, and then expects a fetch results request for
the query to fail. The test is flaky because it is possible the query can
finish successfully before an impalad is successfully killed.

The fix is to make the query slower using a debug action.

Testing:
* Looped the test for several hours

Change-Id: I8129323a7eb62cef61f1c6c34da06f08cf6d4b06
Reviewed-on: http://gerrit.cloudera.org:8080/14985
Reviewed-by: Sahil Takiar <stakiar@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-01-13 23:14:21 +00:00
Sahil Takiar
8a4fececcf IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails
Introduces a new optional field to FragmentInstanceExecStatusPB:
AuxErrorInfoPB. AuxErrorInfoPB contains optional metadata associated
with a failed fragment instance. Currently, AuxErrorInfoPB only contains
one field: RPCErrorInfoPB, which is only set if the fragment failed
because an RPC to another impalad failed. The RPCErrorInfoPB contains
the destination node of the failed RPC and the POSIX error code of the
failed RPC.

Coordinator::UpdateBackendExecStatus(ReportExecStatusRequestPB, ...)
uses the information in RPCErrorInfoPB (if one is set) to blacklist
the target node. While RPCErrorInfoPB::dest_node can be set to the address
of the Coordinator, the Coordinator will not blacklist itself. The
Coordinator only blacklists the node if the RPC failed with a specific
error code (currently one of ENOTCONN, ECONNREFUSED, or ESHUTDOWN).
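
The decision rule described above can be sketched roughly as follows.
This is a Python illustration, not the Coordinator's actual C++ code:
the function name, its parameters, and the plain-string node addresses
are all hypothetical, and only the three error codes and the
"never blacklist itself" rule are taken from this message.

```python
import errno

# Error codes named in the commit message as triggering a blacklist.
BLACKLISTABLE_ERRNOS = frozenset(
    {errno.ENOTCONN, errno.ECONNREFUSED, errno.ESHUTDOWN}
)


def should_blacklist(dest_node, posix_error_code, coordinator_address):
    """Return True if a failed RPC to dest_node should blacklist that node.

    dest_node and posix_error_code stand in for the contents of an
    RPCErrorInfoPB; both are None when no RPC error info was reported.
    """
    if dest_node is None or posix_error_code is None:
        return False  # no RPC error info attached to the status report
    if dest_node == coordinator_address:
        return False  # the coordinator does not blacklist itself
    return posix_error_code in BLACKLISTABLE_ERRNOS
```

For example, a report of ECONNREFUSED against a remote executor would
qualify, while the same error against the coordinator's own address, or
a different error such as ETIMEDOUT, would not.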

Testing:
* Ran core tests
* Added new test to test_blacklist.py

Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Reviewed-on: http://gerrit.cloudera.org:8080/14677
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-12-20 02:50:46 +00:00
Thomas Tauber-Marshall
dfc968dff1 IMPALA-8339: Add local executor blacklist to coordinators
This patch adds the concept of a blacklist of executors to the
coordinator, which removes executors from consideration for query
scheduling. Blacklisting decisions are local to a given coordinator
and are not included in statestore updates.

The intention is to allow coordinators to be more aggressive about
deciding that an executor is unhealthy or unavailable, to minimize
failed queries in environments where cluster membership may be more
variable, rather than having to wait on the statestore heartbeat
mechanism to decide that the executor is down.

For the first patch, executors will only be blacklisted if the KRPC
status for Exec() is an error. Followup work will add blacklisting of
executors in more complex scenarios, e.g. if an executor appears to be
a straggler.

When a query is scheduled and there are currently some blacklisted
executors, a new line 'Blacklisted Executors:' is added to the profile
listing the hostnames of all such executors.
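
A minimal sketch of the idea, assuming a simple in-memory set kept by
each coordinator. The class and method names are hypothetical, expiry of
blacklist entries is left out, and the exact formatting of the profile
line beyond the 'Blacklisted Executors:' prefix is an assumption.

```python
class LocalExecutorBlacklist:
    """Coordinator-local blacklist; never shared via the statestore."""

    def __init__(self):
        self._blacklisted = set()

    def blacklist(self, hostname):
        """Record an executor as blacklisted, e.g. after a failed Exec() RPC."""
        self._blacklisted.add(hostname)

    def schedulable(self, executors):
        """Return the executors still eligible for query scheduling."""
        return [e for e in executors if e not in self._blacklisted]

    def profile_line(self):
        """Return the profile line, or None when nothing is blacklisted."""
        if not self._blacklisted:
            return None
        return "Blacklisted Executors: " + ", ".join(sorted(self._blacklisted))


blacklist = LocalExecutorBlacklist()
blacklist.blacklist("exec-2")
candidates = blacklist.schedulable(["exec-1", "exec-2", "exec-3"])
line = blacklist.profile_line()
```

Keeping the set local to each coordinator matches the description
above: one coordinator can stop scheduling on an executor immediately
without waiting for cluster-wide membership to change.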

Testing:
- Added a case to the cluster mgr BE unit test that uses blacklisting.
- Added e2e test cases for killing and restarting an impalad.
- Manual randomized testing locally with iptables.
TODO:
- Add an e2e test case where an impalad becomes briefly unreachable.
- Manual/stress tests on a real cluster.

Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Reviewed-on: http://gerrit.cloudera.org:8080/13868
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-07-30 10:38:03 +00:00