This adds trace events for data stream RPCs and
dumps them when they take longer than
--impala_slow_rpc_threshold_ms.
I needed to modify the KRPC code to do this because it
currently only dumps traces for RPCs with deadlines.
I plan to add some version of this upstream in Kudu
so that our KRPC implementation does not diverge.
Example output from test_exchange_small_buffer:
I1111 08:38:53.732910 26509 rpcz_store.cc:265] Call impala.DataStreamService.TransmitData from 127.0.0.1:42434 (request call id 43) took 7799ms. Request Metrics: {}
I1111 08:38:53.732928 26509 rpcz_store.cc:269] Trace:
1111 08:38:45.933412 (+ 0us) impala-service-pool.cc:167] Inserting onto call queue
1111 08:38:45.933449 (+ 37us) impala-service-pool.cc:254] Handling call
1111 08:38:45.933470 (+ 21us) krpc-data-stream-mgr.cc:227] Added early sender
1111 08:38:47.906542 (+1973072us) krpc-data-stream-recvr.cc:327] Enqueuing deferred RPC
1111 08:38:53.732858 (+5826316us) krpc-data-stream-recvr.cc:506] Processing deferred RPC
1111 08:38:53.732860 (+ 2us) krpc-data-stream-recvr.cc:399] Deserializing batch
1111 08:38:53.732888 (+ 28us) krpc-data-stream-recvr.cc:426] Enqueuing deserialized batch
1111 08:38:53.732895 (+ 7us) inbound_call.cc:162] Queueing success response
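
The idea of recording trace events and dumping them only for slow calls can be sketched as below. This is a minimal illustration in Python, not the KRPC code: `RpcTrace`, `maybe_dump_slow_rpc` and the injectable `clock` are hypothetical names, and elapsed time is assumed to be measured from the first to the last recorded event.

```python
import time


class RpcTrace:
    """Collects timestamped trace events for one RPC (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the sketch can be tested deterministically
        self._events = []     # list of (timestamp_seconds, message)

    def event(self, message):
        """Record a trace event at the current clock reading."""
        self._events.append((self._clock(), message))

    def elapsed_ms(self):
        """Time between the first and the last recorded event, in milliseconds."""
        if not self._events:
            return 0.0
        return (self._events[-1][0] - self._events[0][0]) * 1000.0

    def dump(self):
        """Format each event with the delta (in microseconds) since the previous one."""
        lines = []
        prev = None
        for ts, msg in self._events:
            delta_us = 0 if prev is None else int(round((ts - prev) * 1e6))
            lines.append("(+%7dus) %s" % (delta_us, msg))
            prev = ts
        return lines


def maybe_dump_slow_rpc(trace, threshold_ms):
    """Return the trace lines if the RPC exceeded threshold_ms, else an empty list."""
    if trace.elapsed_ms() > threshold_ms:
        return trace.dump()
    return []
```

Because the trace is only formatted when the threshold is exceeded, fast RPCs in this sketch pay only for appending events.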
Disabled +-clang-diagnostic-gnu-zero-variadic-macro-arguments because it
had false positives on the TRACE_TO invocations.
Testing:
* Ran exhaustive and ASAN tests
* Ran stress test
Change-Id: Ic7af4b45c43ec731d742d3696112c5f800849947
Reviewed-on: http://gerrit.cloudera.org:8080/14668
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This includes some optimisations and a bulk move of tests
to exhaustive.
Move a bunch of custom cluster tests to exhaustive. I selected
these partially based on runtime (i.e. I looked most carefully
at the tests that ran for over a minute) and the likelihood
of them catching a precommit bug. Regression tests for specific
edge cases and tests for parts of the code that are very stable
were prime candidates.
Remove an unnecessary cluster restart in test_breakpad.
Merge test_scheduler_error into test_failpoints to avoid an unnecessary
cluster restart.
Speed up cluster starts by ensuring that the default statestore args are
applied even when _start_impala_cluster() is called directly. This
shaves a couple of seconds off each restart. We made the default args
use a faster update frequency - see IMPALA-7185 - but they did not
take effect in all tests.
Change-Id: Ib2e3e7ebc9695baec4d69183387259958df10f62
Reviewed-on: http://gerrit.cloudera.org:8080/13967
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The startup flag stress_datastream_recvr_delay_ms is only available in
development builds. Skip the test in non-development builds.
Testing done: Ran the test with a release build and verified that it's skipped.
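
A build-type skip of this kind can be sketched with the standard library's `unittest`. This is only an illustration: `IS_DEV_BUILD` and the test class are hypothetical stand-ins, not Impala's actual test infrastructure or skip markers.

```python
import unittest

# Hypothetical: in a real test suite this would be derived from the build
# under test. Here it is hard-coded to simulate a release build.
IS_DEV_BUILD = False


class TestDeferredRpcDelay(unittest.TestCase):
    @unittest.skipIf(not IS_DEV_BUILD,
                     "stress_datastream_recvr_delay_ms needs a development build")
    def test_with_recvr_delay(self):
        # Would start the cluster with the stress flag and run the query.
        self.assertTrue(IS_DEV_BUILD)
```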
Change-Id: I5caaa6fa39d6c97f313b675838c27740af9aa1d5
Reviewed-on: http://gerrit.cloudera.org:8080/12610
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Previously, when a row batch failed to be deserialized in the
data stream receiver, we returned the error status to the
sender of the row batch without inserting the row batch. The
receiver continued to operate without flagging any error.
The assumption was that the sender would eventually cancel the
query upon receiving the failed status.
Normally, when a caller of GetBatch() successfully dequeues a row
batch from the batch queue, it will kick off the draining of the
row batches from the deferred queue into the normal batch queue,
which will further continue the cycle of draining the deferred queue
upon the next call to GetBatch() until the deferred queue becomes empty.
If an error is hit while deserializing a deferred batch to be inserted
into the batch queue, the existing code simply does not insert the row
batch or flag any error. This breaks the cycle of deferred queue
draining because the batch queue may stay empty forever. The caller of
GetBatch() will block indefinitely until the query is cancelled.
The existing code works because the expectation is that the query will
be cancelled once the sender receives the error status from the RPC
response. However, this behavior is still not ideal as it lets a query
that has hit a fatal error hold on to resources for an extended period
of time.
This patch fixes the problem by explicitly recording any error during
row batch insertion in an error status object. Callers of GetBatch()
now also poll this status object while waiting for a row batch
to show up and bail out early if there is any error.
A new test case has been added to simulate the problematic case above.
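
The fix can be illustrated with a small toy model. This is a sketch in Python under stated assumptions, not the receiver's C++ code: `BatchQueue`, `add_batch` and `get_batch` are hypothetical names, and the choice to check the error before any queued batch is an assumption made for the illustration.

```python
import collections
import threading


class BatchQueue:
    """Toy receiver queue: get_batch() waits for a batch or a recorded error."""

    def __init__(self):
        self._cv = threading.Condition()
        self._batches = collections.deque()
        self._error = None  # first error hit while inserting a batch, if any

    def add_batch(self, payload, deserialize):
        """Deserialize and enqueue a batch; on failure, record the error instead."""
        try:
            batch = deserialize(payload)
        except Exception as e:
            with self._cv:
                if self._error is None:
                    self._error = e
                # Wake up waiters so they notice the error rather than block.
                self._cv.notify_all()
            return False
        with self._cv:
            self._batches.append(batch)
            self._cv.notify_all()
        return True

    def get_batch(self, timeout=None):
        """Return the next batch, raise if an error was recorded, None on timeout."""
        with self._cv:
            self._cv.wait_for(
                lambda: bool(self._batches) or self._error is not None, timeout)
            if self._error is not None:
                raise RuntimeError("batch insertion failed: %s" % self._error)
            if self._batches:
                return self._batches.popleft()
            return None
```

Without the `_error` field and the wake-up in the failure path, a consumer blocked in `get_batch()` would keep waiting on an empty queue, which mirrors the hang described above.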
Change-Id: Iaa74937b046d95484887533be548249e96078755
Reviewed-on: http://gerrit.cloudera.org:8080/12567
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Thomas Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>