Commit Graph

11 Commits

Author SHA1 Message Date
Joe McDonnell
82bd087fb1 IMPALA-11973: Add absolute_import, division to all eligible Python files
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
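For illustration, the two __future__ imports behave like this on both Python versions (a minimal sketch, not code from the patch):

```python
from __future__ import absolute_import, division, print_function

# With the division import, "/" is true division on Python 2 as on Python 3.
assert 7 / 2 == 3.5

# Code that needs an integer result (indices, record counts, etc.)
# uses the floor-division operator "//" instead.
num_records = 7
mid_index = num_records // 2
assert mid_index == 3
```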

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
wzhou-code
6cdc8b5ce7 IMPALA-6788: Abort ExecFInstance() RPC loop early after query failure
Stops issuing ExecQueryFInstance RPCs and cancels any in flight when a
backend reports a failure.
Adds a new debug action, CONSTRUCT_QUERY_STATE_REPORT, that runs when
constructing a query state report.
Adds a new test case for handling errors reported from query state.

Testing:
 - Ran the following command for the new test case and verified that
   the code works as expected:
     ./bin/impala-py.test tests/custom_cluster/test_rpc_exception.py\
       ::TestRPCException::test_state_report_error \
       --workload_exploration_strategy=functional-query:exhaustive
 - Passed exhaustive tests.

Change-Id: I034788f7720fc97c25c54f006ff72dce6cb199c3
Reviewed-on: http://gerrit.cloudera.org:8080/16192
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-07-21 02:43:56 +00:00
Thomas Tauber-Marshall
1e616774d4 IMPALA-8712: Make ExecQueryFInstances async
This patch refactors the ExecQueryFInstances rpc to be asynchronous.
Previously, Impala would issue all the Exec()s, wait for all of them
to complete, and then check if any of them resulted in an error. We
now stop issuing Exec()s and cancel any that are still in flight as
soon as an error occurs.
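A rough sketch of the new behavior, with a Python thread pool standing in for KRPC's async RPCs (hypothetical names; the real implementation is C++):

```python
from concurrent.futures import FIRST_EXCEPTION, ThreadPoolExecutor, wait

def exec_finstance(backend):
    # Stand-in for the ExecQueryFInstances RPC to one backend.
    if backend == "bad-node":
        raise RuntimeError("Exec() failed on %s" % backend)
    return "ok"

backends = ["node-1", "bad-node", "node-2"]
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(exec_finstance, b) for b in backends]
    # Return as soon as any Exec() fails rather than waiting for all of them.
    done, not_done = wait(futures, return_when=FIRST_EXCEPTION)
    for f in not_done:
        f.cancel()  # cancel Exec()s still in flight
    errors = [f.exception() for f in done if f.exception() is not None]

assert any("bad-node" in str(e) for e in errors)
```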

It also performs some cleanup around the thread safety of
Coordinator::BackendState, including adding comments and DCHECKS.

=== Exec RPC Thread Pool ===
This patch also removes the 'exec_rpc_thread_pool_' from ExecEnv. This
thread pool was used to partially simulate async Exec() prior to the
switch to KRPC, which provides built-in async rpc capabilities.

Removing this thread pool has potential performance implications, as
it means that the Exec() parameters are serialized serially rather
than in parallel (with the level of parallelism determined by the size
of the thread pool, which was configurable by an Advanced flag and
defaulted to 12).

To ensure we don't regress query startup times, I did some performance
testing. All tests were done on a 10 node cluster. The baseline used
for the tests did not include IMPALA-9181, a perf optimization for
query startup done to facilitate this work.

I ran TPCH 100 at concurrency levels of 1, 4, and 8 and extracted the
query startup times from the profiles. For each concurrency level, the
average regression in query startup time was < 2ms. Because query e2e
running time was much longer than this, there was no noticeable change
in total query time.

I also ran a 'worst case scenario' with a table with 10,000 partitions
to create a very large Exec() payload to serialize (~1.21MB vs.
~10KB-30KB for TPCH 100). Again, the change in query startup time was
negligible.
============================

Testing:
- Added an e2e test that verifies that a query where an Exec() fails
  doesn't wait for all Exec()s to complete before cancelling and
  returning the error to the client.

Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32
Reviewed-on: http://gerrit.cloudera.org:8080/15154
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-25 01:22:34 +00:00
Thomas Tauber-Marshall
d72fd9a999 IMPALA-9209: Fix flakiness in test_end_data_stream_error
TestRPCException.execute_test_query is a helper function that is used
by tests that set an RPC debug action to repeatedly run a query until
the debug action is hit.

Previously, it required that either the query is expected to always
succeed, or it must always fail if the debug action is hit and an
expected error is provided. However, the two tests that have an
expected error, test_end_data_stream_error and
test_transmit_data_error, both set two debug actions - one that will
cause a query failure and one that won't (because we always retry
'reject too busy' errors).

If only the debug action that doesn't cause query failure is hit, the
query won't fail and 'execute_test_query' will fail on the assert that
expects that the query must fail if the action was hit. This is rare,
as both debug actions have a high probability of being hit on a given
run of the query.

The solution is to remove the requirement from the tests that the
query must fail if an expected error is provided and the debug action
is hit.
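The relaxed requirement can be sketched like this (a hypothetical simplification, not the actual helper code):

```python
def check_run(query_failed, action_hit, expected_error, error_text=""):
    """Relaxed check: a hit debug action no longer has to fail the query."""
    if query_failed:
        # A failed query must still carry the expected error.
        return expected_error is not None and expected_error in error_text
    # action_hit with a successful query is now acceptable (e.g. only the
    # always-retried 'reject too busy' action fired on this run).
    return True

# Query succeeded even though an action was hit: previously an assert failure.
assert check_run(query_failed=False, action_hit=True, expected_error="rpc error")
# A failed query with the expected error is still accepted.
assert check_run(True, True, "rpc error", "rpc error: service too busy")
```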

Change-Id: I499955b2d61c6b806f78e124c7ab919b242921bc
Reviewed-on: http://gerrit.cloudera.org:8080/14870
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-12-10 21:54:25 +00:00
Thomas Tauber-Marshall
f998427cf0 IMPALA-8138: Reintroduce rpc debugging options
In the past, Impala had a very simple 'fault injection' framework for
simulating failed rpcs between impalads. With the move to KRPC, that
framework was not carried over, and we lost the ability to test
certain failure scenarios.

This patch reintroduces this functionality. It removes the prior fault
injection framework in favor of the existing debug action framework,
which is more flexible.

To facilitate this, a few modifications are made to the debug action
framework:
- In addition to matching on a label, debug actions may now match on
  optional arguments. In this patch, the debug action
  IMPALA_SERVICE_POOL takes the arguments 'host', 'port', and
  'rpc name' to allow simulating the failure of specific rpcs to
  specific impalads.
- The FAIL action now takes an optional 'error message' parameter. In
  this patch, the debug action IMPALA_SERVICE_POOL uses this to
  simulate different types of rpc errors, e.g. 'service too busy'.
- The FAIL action increments a metric, 'impala.debug_action.fail', so
  that tests can check that it has actually been hit. Prior to this
  patch, the tests in test_rpc_exception.py were all passing
  spuriously, as the faults they were supposed to be testing were no
  longer being injected.

This patch uses these new mechanisms to introduce tests that simulate
failures in DataStreamService rpcs. Follow up patches will add test
cases for ControlService rpcs.
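The argument matching and the FAIL metric can be sketched as follows (hypothetical structure; the real framework lives in Impala's C++ debug action code):

```python
from collections import defaultdict

metrics = defaultdict(int)

def debug_action(label, args, configured):
    """Return an injected error message if a configured action matches."""
    action = configured.get(label)
    if action is None:
        return None
    # Every configured argument must match the call site's arguments.
    if any(args.get(k) != v for k, v in action.get("match", {}).items()):
        return None
    if action["kind"] == "FAIL":
        metrics["impala.debug_action.fail"] += 1  # lets tests verify the hit
        return action.get("error_message", "debug action failure")
    return None

configured = {"IMPALA_SERVICE_POOL": {
    "kind": "FAIL",
    "match": {"host": "host-2", "rpc_name": "TransmitData"},
    "error_message": "service too busy",
}}

# Matching host and rpc name: fault injected, metric incremented.
err = debug_action("IMPALA_SERVICE_POOL",
                   {"host": "host-2", "port": 27000, "rpc_name": "TransmitData"},
                   configured)
assert err == "service too busy"
# Different host: no fault.
assert debug_action("IMPALA_SERVICE_POOL",
                    {"host": "host-1", "rpc_name": "TransmitData"},
                    configured) is None
assert metrics["impala.debug_action.fail"] == 1
```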

Change-Id: I9c047ebce6d32c5ae461f70279391fa2df4c2029
Reviewed-on: http://gerrit.cloudera.org:8080/14641
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-11-27 22:27:58 +00:00
Michael Ho
8d7f638654 IMPALA-7212: Removes --use_krpc flag and remove old DataStream services
This change removes the flag --use_krpc, which allowed users
to fall back to the Thrift based implementation of the DataStream
services. This flag was originally added during development of
IMPALA-2567. It has served its purpose.

As we port more ImpalaInternalServices to use KRPC, it's becoming
increasingly burdensome to maintain parallel implementation of the
RPC handlers. Therefore, going forward, KRPC is always enabled.
This change removes the Thrift based implementation of DataStreamServices
and also simplifies some of the tests which were skipped when KRPC
was disabled.

Testing done: core debug build.

Change-Id: Icfed200751508478a3d728a917448f2dabfc67c3
Reviewed-on: http://gerrit.cloudera.org:8080/10835
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-07-24 02:36:50 +00:00
Lars Volker
a8fc9f0fc7 IMPALA-6508: add KRPC test flag
This change adds a flag "--use_krpc" to start-impala-cluster.py. The
flag is currently passed as an argument to the impalad daemon. In the
future it will also enable KRPC for the catalogd and statestored
daemons.

This change also adds a flag "--test_krpc" to pytest. When running tests
using "impala-py.test --test_krpc", the test cluster will be started
by passing "--use_krpc" to start-impala-cluster.py (see above).

This change also adds a SkipIf to skip tests based on whether the
cluster was started with KRPC support or not.

- SkipIf.not_krpc can be used to mark a test that depends on KRPC.
- SkipIf.not_thrift can be used to mark a test that depends on Thrift
  RPC.

This change adds a meta test to make sure that the new SkipIf decorators
work correctly. The test should be removed as soon as real tests have
been added with the new decorators.
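The decorators can be sketched with the stdlib's unittest.skipIf standing in for the pytest-based helpers Impala actually uses (the flag value here is assumed):

```python
import unittest

# Hypothetical: would be derived from how the test cluster was started.
cluster_uses_krpc = False

class SkipIf:
    """Sketch of the KRPC-dependent skip decorators named in the commit."""
    not_krpc = unittest.skipIf(not cluster_uses_krpc,
                               "test depends on KRPC")
    not_thrift = unittest.skipIf(cluster_uses_krpc,
                                 "test depends on Thrift RPC")

class RpcTests(unittest.TestCase):
    @SkipIf.not_krpc
    def test_krpc_only(self):
        self.assertTrue(cluster_uses_krpc)

    @SkipIf.not_thrift
    def test_thrift_only(self):
        self.assertFalse(cluster_uses_krpc)
```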

Change-Id: Ie01a5de2afac4a0f43d5fceff283f6108ad6a3ab
Reviewed-on: http://gerrit.cloudera.org:8080/9291
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-16 09:26:01 +00:00
Michael Ho
fb8ea6a9cc IMPALA-5588: Reduce the frequency of fault injection
Previously, the fault injection utility would inject a fault
on every third RPC call for ReportExecStatus() RPCs. As shown
in IMPALA-5588, with an unfortunate sequence in which other
RPCs happen between the retries of the ReportExecStatus() RPC in
QueryState::ReportExecStatusAux(), ReportExecStatus() can
hit injected faults 3 times in a row, causing the query
to be cancelled in QueryState::ReportExecStatusAux().

This change fixes the problem by reducing the fault injection
frequency to once every 16 RPC calls for ReportExecStatus(),
CancelQueryFInstances() and ExecQueryFInstances() RPCs.
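The counter-based injection frequency can be sketched as follows (a hypothetical helper; the real utility is in the C++ fault injection code):

```python
FAULT_FREQ = 16  # inject a fault on every 16th call (was every 3rd)

class FaultInjector:
    """Deterministic counter-based injector sketch."""
    def __init__(self, freq=FAULT_FREQ):
        self.freq = freq
        self.calls = 0

    def maybe_fail(self):
        self.calls += 1
        if self.calls % self.freq == 0:
            raise RuntimeError("injected RPC fault")

inj = FaultInjector()
failures = 0
for _ in range(48):
    try:
        inj.maybe_fail()
    except RuntimeError:
        failures += 1
assert failures == 3  # faults on calls 16, 32, and 48
```

With the old frequency of 3, three consecutive retries of the same RPC could all land on injected faults; at 16 that sequence is no longer possible within the 3-retry window.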

Also incorporated the fix by Michael Brown for a python bug in
test_rpc_exception.py so tests hitting unexpected exception
will re-throw that exception for better diagnosis on test failure.

Change-Id: I0ce4445e8552a22f23371bed1196caf7d0a3f312
Reviewed-on: http://gerrit.cloudera.org:8080/7310
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-30 09:27:47 +00:00
Michael Ho
e0ba5ef6a2 IMPALA-5558/IMPALA-5576: Reopen stale client connection
Previously, the retry logic in DoRpc() only allows retry to
happen if send() didn't complete successfully and the exception
indicates a closed connection. However, send() returning
successfully doesn't guarantee that the bytes have actually
reached the remote peer. According to the man page of send(),
when the message does not fit into the send buffer of the socket,
send() normally blocks. So the payload of RPC may be buffered in
the kernel if there is room for it. TCP allows a connection to
be half-open. If an Impalad node is restarted, a stale client
connection to that node may still allow send() to appear to succeed
even though the payload wasn't sent. However, upon calling recv()
in the RPC call to fetch the response, the client will get a return
value of 0, in which case Thrift will throw an exception as the
connection to the remote peer is already closed. Apparently, the
existing retry logic doesn't quite handle this case. One can
consistently reproduce the problem by warming the client cache and
then restarting one of the Impalad nodes, which results in a series
of query failures due to stale connections.

This change augments the retry logic to also retry the entire RPC
if the exception string contains the messages "No more data to read."
or "SSL_read: Connection reset by peer" to capture the case of stale
connections. Our usage of thrift doesn't involve half-open TCP
connections, so a broken connection in recv() indicates the remote end has
closed the socket already. The generated thrift code doesn't knowingly
close the socket before an RPC completes unless the process crashes,
the connection is stale (e.g. the remote node was restarted) or the
remote end fails to read from the client. In either case, the entire
RPC should just be retried with a new connection.
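The substring-based retry can be sketched as follows (hypothetical helper names; DoRpc() itself is C++):

```python
RETRYABLE_MESSAGES = (
    "No more data to read.",
    "SSL_read: Connection reset by peer",
)

def do_rpc_with_retry(rpc, open_connection, max_attempts=3):
    """Retry the whole RPC on a fresh connection if it looks stale."""
    last_exc = None
    for _ in range(max_attempts):
        conn = open_connection()  # a new connection on each attempt
        try:
            return rpc(conn)
        except IOError as e:
            last_exc = e
            if not any(m in str(e) for m in RETRYABLE_MESSAGES):
                raise  # not a stale-connection error: don't retry
    raise last_exc

attempts = {"n": 0}
def flaky_rpc(conn):
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise IOError("No more data to read.")  # stale connection on first try
    return "response"

assert do_rpc_with_retry(flaky_rpc, lambda: object()) == "response"
assert attempts["n"] == 2
```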

This change also fixes QueryState::ReportExecStatusAux() to
unconditionally retry up to 3 times when reporting the exec status of
a fragment instance. Previously, it could break out of the loop early
if RPC fails with 'retry_is_safe' == true (e.g. due to recv() timeout)
or if the connection to coordinator fails (IMPALA-5576). Declaring the
RPC to have failed may cause all fragment instances of a query to be
cancelled locally, triggering query hang due to IMPALA-2990. Similarly,
the cancellation RPC is also idempotent so it should be unconditionally
retried up to 3 times with 100ms sleep time in between.

The status reporting is idempotent, as the handler simply ignores the
RPC if it determines that all fragment instances on a given backend
are done, so it should be safe to retry the RPC. This change updates
ApplyExecStatusReport() to handle duplicated status reports with
done bit set. Previously we would drop some other fragment instances'
statuses if we received duplicate 'done' statuses from the same
fragment instance(s).

Testing done: Warmed up the client cache by running the stress test,
then restarted some Impalad nodes. Queries that used to fail or hang
consistently now work with the patch. Also ran CSL endurance tests
and exhaustive builds.

Change-Id: I4d722c8ad3bf0e78e89887b6cb286c69ca61b8f5
Reviewed-on: http://gerrit.cloudera.org:8080/7284
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-26 09:06:38 +00:00
Michael Ho
2356516688 IMPALA-5537: Retry RPC on some exceptions with SSL connection
After the fix for IMPALA-5388, any TSSLException thrown is
treated as a fatal error and the query fails. It turns out that
this is too strict: in a secure cluster under load, queries
can easily hit a timeout waiting for an RPC response.

When running without SSL, we call RetryRpcRecv() to retry the recv
part of an RPC if the TSocket underlying the RPC gets an EAGAIN
during recv(). This change extends that logic to cover secure
connection. In particular, we pattern match against the exception
string "SSL_read: Resource temporarily unavailable" which corresponds
to EAGAIN error code being thrown in the SSL_read() path.

Similarly, we will handle a closed connection in the send() path with
a secure connection by pattern matching against the exception string
"TTransportException: Transport not open". To verify that the
exception is thrown during the send part of an RPC call, the RPC
client interface has been augmented to take a bool* argument which is
set to true after the send part of the RPC has completed but before
the recv part starts. If DoRPC() catches an exception and the send
part isn't done yet, the entire RPC is retried if the exception string
matches certain substrings which are safe to retry.
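The send-completed flag can be sketched as follows (hypothetical shape; the real DoRpc() passes a bool* through the Thrift client interface):

```python
SEND_RETRYABLE = ("TTransportException: Transport not open",)

def do_rpc(send, recv):
    """Only retry if the failure happened before send() finished."""
    send_done = [False]          # stands in for the bool* argument
    try:
        send()
        send_done[0] = True      # set after send completes, before recv
        return recv()
    except IOError as e:
        if not send_done[0] and any(m in str(e) for m in SEND_RETRYABLE):
            # Payload never left, so retrying cannot duplicate row batches.
            send()
            send_done[0] = True
            return recv()
        raise

calls = {"send": 0}
def send():
    calls["send"] += 1
    if calls["send"] == 1:
        raise IOError("TTransportException: Transport not open")

assert do_rpc(send, lambda: "reply") == "reply"
assert calls["send"] == 2
```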

The fault injection utility has also been updated to distinguish between
time out and lost connection to exercise different error handling paths
in the send and recv paths.

Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Reviewed-on: http://gerrit.cloudera.org:8080/7229
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-21 10:04:15 +00:00
Michael Ho
7db2d30646 IMPALA-5388: Only retry RPC on lost connection in send call
Previously, DoRpc() blacklisted only a couple of conditions
under which the RPC should not be retried on exception. This is
fragile, as the errors could have happened after the payload had
already been successfully sent to the destination. Such aggressive
retry behavior can lead to duplicated row batches being sent, causing
wrong results in queries.

This change fixes the problem by whitelisting the conditions
in which the RPC can be retried. Specifically, it pattern-matches
against certain errors in TSocket::write_partial() in the thrift
library and only retries the RPC in those cases. With SSL enabled,
we will never retry. We should investigate whether there are some
cases in which it's safe to retry.

This change also adds fault injection in the TransmitData() RPC
caller's path to emulate different exception cases.

Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Reviewed-on: http://gerrit.cloudera.org:8080/7063
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-08 10:09:18 +00:00