IMPALA-9224: Blacklist nodes with faulty disk for spilling

This patch extends the blacklist functionality by adding an executor
node to the blacklist if a query fails because of a disk failure during
spill-to-disk. It also classifies disk error codes and defines a set of
blacklistable errors for non-transient disk errors. The coordinator
blacklists an executor only if that executor hit a blacklistable error
during spill-to-disk.
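
As a rough illustration of the rule above (not the patch's actual code,
which lives in the Impala backend), the decision could be sketched as
follows. The errno values in the set and the function name are
assumptions chosen for the example; the real blacklistable set may
differ.

```python
import errno

# Hypothetical set of non-transient disk errors. The errno values listed here
# are assumptions for illustration, not the set defined by the patch.
BLACKLISTABLE_DISK_ERRNOS = frozenset({errno.EIO, errno.ENODEV, errno.EROFS})


def should_blacklist(err_no, during_spill_to_disk):
    """Return True only for a blacklistable disk error hit while spilling."""
    return during_spill_to_disk and err_no in BLACKLISTABLE_DISK_ERRNOS
```

In this sketch an error outside the set (ENOSPC is used as an example of
one), or a disk error hit outside spill-to-disk, does not lead to
blacklisting.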

Adds a new debug action to simulate a disk write error during
spill-to-disk. To use it, specify it in the query options as:
  'debug_action': 'IMPALA_TMP_FILE_WRITE:<hostname>:<port>:<action>'

  where <hostname> and <port> identify the impalad that executes the
  fragment instances; <port> is the BE krpc port (default 27000).

Adds new test cases for blacklist and query-retry to cover the code
changes.

Testing:
 - Passed new test cases.
 - Passed exhaustive test.
 - Manually simulated disk failures in scratch directories on nodes
   of a cluster, verified that the nodes were blacklisted as
   expected.

Change-Id: I04bfcb7f2e0b1ef24a5b4350f270feecd8c47437
Reviewed-on: http://gerrit.cloudera.org:8080/16949
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
wzhou-code
2021-01-12 14:00:21 -08:00
committed by Impala Public Jenkins
parent 91fd8fd130
commit b5e2a0ce2e
14 changed files with 510 additions and 11 deletions

@@ -468,6 +468,9 @@ error_codes = (
   "Query $0 terminated due to join rows produced exceeds the limit of $1 "
   "at node with id $2. Unset or increase JOIN_ROWS_PRODUCED_LIMIT query option "
   "to produce more rows."),

  ("LOCAL_DISK_FAULTY", 152,
   "Query execution failure caused by local disk IO fatal error on backend: $0."),
)

import sys
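
The message templates in this table use positional placeholders ($0, $1,
...). The sketch below shows how such a template could be filled in; the
`substitute` helper is hypothetical and written only for illustration,
and it is not the function Impala itself uses.

```python
import re

LOCAL_DISK_FAULTY_MSG = (
    "Query execution failure caused by local disk IO fatal error on backend: $0.")


def substitute(template, *args):
    """Replace $N placeholders with args[N]; leave unmatched placeholders as-is."""
    def repl(match):
        idx = int(match.group(1))
        return str(args[idx]) if idx < len(args) else match.group(0)
    return re.sub(r"\$(\d+)", repl, template)


# The backend address here is a made-up example value.
message = substitute(LOCAL_DISK_FAULTY_MSG, "host-1:27000")
```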