Previously, the fault injection utility would inject a fault once
every 3 RPC calls for ReportExecStatus() RPCs. As shown in
IMPALA-5588, with an unfortunate sequence in which other RPCs
happen between retries of the ReportExecStatus() RPC in
QueryState::ReportExecStatusAux(), ReportExecStatus() can hit
injected faults 3 times in a row, causing the query to be
cancelled in QueryState::ReportExecStatusAux().
This change fixes the problem by reducing the fault injection
frequency to once every 16 RPC calls for ReportExecStatus(),
CancelQueryFInstances() and ExecQueryFInstances() RPCs.
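For illustration, a minimal sketch of frequency-based fault injection
(hypothetical names, not the actual fault injection utility): a fault
is injected only on every 16th call of a given RPC, so a retried RPC
is unlikely to hit injected faults several times in a row.

  #include <atomic>

  static constexpr int FAULT_FREQUENCY = 16;
  static std::atomic<int> rpc_call_count{0};

  // Returns true if a fault should be injected for this RPC call.
  bool ShouldInjectFault() {
    // Only every 16th call is faulted; the other calls proceed normally.
    return (++rpc_call_count % FAULT_FREQUENCY) == 0;
  }
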
Also incorporated the fix by Michael Brown for a Python bug in
test_rpc_exception.py so that tests hitting an unexpected exception
re-throw it for better diagnosis of test failures.
Change-Id: I0ce4445e8552a22f23371bed1196caf7d0a3f312
Reviewed-on: http://gerrit.cloudera.org:8080/7310
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
Previously, the retry logic in DoRpc() only allowed a retry to
happen if send() didn't complete successfully and the exception
indicated a closed connection. However, send() returning
successfully doesn't guarantee that the bytes have actually
reached the remote peer. According to the man page of send(),
when the message does not fit into the send buffer of the socket,
send() normally blocks. So the payload of the RPC may be buffered in
the kernel if there is room for it. TCP allows a connection to
be half-open. If an Impalad node is restarted, a stale client
connection to that node may still allow send() to appear to succeed
even though the payload wasn't sent. However, upon calling recv()
in the RPC call to fetch the response, the client will get a return
value of 0, in which case thrift throws an exception as the
connection to the remote peer is already closed. The existing retry
logic doesn't handle this case. One can consistently reproduce the
problem by warming the client cache and then restarting one of the
Impalad nodes, which results in a series of query failures due to
stale connections.
This change augments the retry logic to also retry the entire RPC
if the exception string contains the message "No more data to read."
or "SSL_read: Connection reset by peer" to capture the case of stale
connections. Our usage of thrift doesn't involve half-open TCP connections,
so having a broken connection in recv() indicates the remote end has
closed the socket already. The generated thrift code doesn't knowingly
close the socket before an RPC completes unless the process crashes,
the connection is stale (e.g. the remote node was restarted) or the
remote end fails to read from the client. In any of these cases, the entire
RPC should just be retried with a new connection.
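A minimal sketch of this kind of exception-string check, using a
hypothetical helper rather than the actual DoRpc() code:

  #include <string>

  // Returns true if the exception text indicates a stale/half-open
  // connection detected before any response bytes were received.
  bool IsStaleConnectionException(const std::string& err_msg) {
    // recv() returned 0 on a plain TCP socket: the peer closed the connection.
    if (err_msg.find("No more data to read.") != std::string::npos) return true;
    // The same condition reported through the SSL_read() path.
    return err_msg.find("SSL_read: Connection reset by peer") != std::string::npos;
  }

When such a helper returns true, the whole RPC can be re-issued on a
freshly opened connection.
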
This change also fixes QueryState::ReportExecStatusAux() to
unconditionally retry up to 3 times when reporting the exec status
of a fragment instance. Previously, it could break out of the loop
early if the RPC failed with 'retry_is_safe' == true (e.g. due to a
recv() timeout) or if the connection to the coordinator failed
(IMPALA-5576). Declaring the RPC to have failed may cause all
fragment instances of a query to be cancelled locally, triggering a
query hang due to IMPALA-2990. Similarly, the cancellation RPC is
also idempotent, so it should be unconditionally retried up to 3
times with a 100ms sleep in between.
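The unconditional retry pattern, sketched with assumed names (not the
actual ReportExecStatusAux() code):

  #include <chrono>
  #include <functional>
  #include <thread>

  // Retries an idempotent RPC up to 3 times, sleeping 100ms between
  // attempts. Returns true if the RPC eventually succeeded.
  bool RetryIdempotentRpc(const std::function<bool()>& rpc_fn) {
    constexpr int MAX_ATTEMPTS = 3;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; ++attempt) {
      if (rpc_fn()) return true;  // Success; stop retrying.
      // Give the remote end a moment to recover before the next attempt.
      if (attempt < MAX_ATTEMPTS) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
      }
    }
    return false;  // All attempts failed; the caller decides what to do.
  }
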
The status reporting is idempotent as the handler simply ignores the
RPC if it determines that all fragment instances on a given backend
are done, so it should be safe to retry the RPC. This change updates
ApplyExecStatusReport() to handle duplicated status reports with the
done bit set. Previously, we would drop some other fragment instances'
statuses if we received duplicate 'done' statuses from the same
fragment instance(s).
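A rough sketch of such idempotent report handling, with hypothetical
data structures rather than the coordinator's actual types:

  #include <string>
  #include <unordered_set>
  #include <vector>

  struct InstanceStatus {
    std::string instance_id;
    bool done;
  };

  // Applies a status report idempotently: an instance whose 'done' status
  // was already applied is skipped, but the remaining instances in the
  // same report are still processed instead of dropping the whole report.
  void ApplyStatusReport(const std::vector<InstanceStatus>& report,
                         std::unordered_set<std::string>* completed) {
    for (const InstanceStatus& status : report) {
      if (completed->count(status.instance_id) > 0) continue;  // Duplicate.
      // ... apply profile/status updates for this instance here ...
      if (status.done) completed->insert(status.instance_id);
    }
  }
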
Testing done: Warmed up the client cache by running the stress test,
then restarted some Impalad nodes. Queries that used to fail or hang
consistently in the past now work with the patch. Also ran CSL
endurance tests and exhaustive builds.
Change-Id: I4d722c8ad3bf0e78e89887b6cb286c69ca61b8f5
Reviewed-on: http://gerrit.cloudera.org:8080/7284
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
After the fix for IMPALA-5388, any TSSLException thrown was treated
as a fatal error and the query would fail. It turns out that this is
too strict: in a secure cluster under load, queries can easily hit a
timeout waiting for an RPC response.
When running without SSL, we call RetryRpcRecv() to retry the recv
part of an RPC if the TSocket underlying the RPC gets an EAGAIN
during recv(). This change extends that logic to cover secure
connections. In particular, we pattern match against the exception
string "SSL_read: Resource temporarily unavailable" which corresponds
to EAGAIN error code being thrown in the SSL_read() path.
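A minimal sketch of this recv-only retry, with assumed helper names
and error strings:

  #include <exception>
  #include <functional>
  #include <string>

  // Heuristic check for a receive-side timeout (plain TCP and SSL flavours
  // of the same EAGAIN condition).
  bool LooksLikeRecvTimeout(const std::string& err_msg) {
    return err_msg.find("EAGAIN (timed out)") != std::string::npos ||
        err_msg.find("SSL_read: Resource temporarily unavailable")
            != std::string::npos;
  }

  // Waits again for the response of an already-sent request instead of
  // re-sending the whole RPC, which is not safe once the request is out.
  void RecvWithRetry(const std::function<void()>& recv_fn, int max_attempts) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
      try {
        recv_fn();  // Only re-reads the response; nothing is re-sent.
        return;
      } catch (const std::exception& e) {
        if (!LooksLikeRecvTimeout(e.what()) || attempt == max_attempts) throw;
      }
    }
  }
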
Similarly, we will handle a closed connection in the send() path with
a secure connection by pattern matching against the exception string
"TTransportException: Transport not open". To verify that the exception
is thrown during the send part of an RPC call, the RPC client interface
has been augmented to take a bool* argument which is set to true after
the send part of the RPC has completed but before the recv part starts.
If DoRpc() catches an exception and the send part isn't done yet, the
entire RPC is retried if the exception string matches certain substrings
which are safe to retry.
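A minimal sketch of how a send-completion flag can gate the retry
decision, with assumed names rather than the actual DoRpc() signature:

  #include <exception>
  #include <functional>
  #include <string>

  // Runs an RPC and retries it only when the exception was raised before
  // the request left the client and matches a whitelisted, known-safe error.
  void DoRpcWithRetry(const std::function<void(bool*)>& rpc_fn,
                      const std::function<bool(const std::string&)>& is_safe,
                      int max_attempts) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
      bool send_done = false;
      try {
        // The wrapped call sets *send_done = true once the request bytes have
        // been handed off, just before it starts waiting for the response.
        rpc_fn(&send_done);
        return;  // Success.
      } catch (const std::exception& e) {
        if (send_done || !is_safe(e.what()) || attempt == max_attempts) throw;
      }
    }
  }
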
The fault injection utility has also been updated to distinguish between
timeouts and lost connections in order to exercise the different
error handling in the send and recv paths.
Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Reviewed-on: http://gerrit.cloudera.org:8080/7229
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
Previously, DoRpc() blacklisted only a couple of conditions under
which the RPC should not be retried on exception. This is fragile,
as the error could have happened after the payload had been
successfully sent to the destination. Such aggressive retry
behavior can lead to duplicated row batches being sent, causing
wrong results in queries.
This change fixes the problem by whitelisting the conditions
in which the RPC can be retried. Specifically, it pattern-matches
against certain errors in TSocket::write_partial() in the thrift
library and only retries the RPC in those cases. With SSL enabled,
we will never retry. We should investigate whether there are some
cases in which it's safe to retry.
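A minimal sketch of the whitelist check, with a hypothetical helper
and assumed error strings:

  #include <string>
  #include <vector>

  // Returns true only for send-side failures that are known to occur before
  // any bytes of the request could have reached the remote peer.
  bool IsRetryableSendError(const std::string& err_msg, bool ssl_enabled) {
    if (ssl_enabled) return false;  // Conservatively never retry over SSL.
    static const std::vector<std::string> kSafeErrors = {
        "TSocket::write_partial() send()",  // assumed example messages
        "Called write on non-open socket",
    };
    for (const std::string& pattern : kSafeErrors) {
      if (err_msg.find(pattern) != std::string::npos) return true;
    }
    return false;
  }
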
This change also adds fault injection in the TransmitData() RPC
caller's path to emulate different exception cases.
Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Reviewed-on: http://gerrit.cloudera.org:8080/7063
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins