This adds trace events for data stream RPCs and
dumps them when they take longer than
--impala_slow_rpc_threshold_ms.
I needed to modify the KRPC code to do this because it
currently only dumps traces for RPCs with deadlines.
I plan to add some version of this upstream in Kudu
so that we don't diverge our KRPC implementation.
Example output from test_exchange_small_buffer:
I1111 08:38:53.732910 26509 rpcz_store.cc:265] Call impala.DataStreamService.TransmitData from 127.0.0.1:42434 (request call id 43) took 7799ms. Request Metrics: {}
I1111 08:38:53.732928 26509 rpcz_store.cc:269] Trace:
1111 08:38:45.933412 (+ 0us) impala-service-pool.cc:167] Inserting onto call queue
1111 08:38:45.933449 (+ 37us) impala-service-pool.cc:254] Handling call
1111 08:38:45.933470 (+ 21us) krpc-data-stream-mgr.cc:227] Added early sender
1111 08:38:47.906542 (+1973072us) krpc-data-stream-recvr.cc:327] Enqueuing deferred RPC
1111 08:38:53.732858 (+5826316us) krpc-data-stream-recvr.cc:506] Processing deferred RPC
1111 08:38:53.732860 (+ 2us) krpc-data-stream-recvr.cc:399] Deserializing batch
1111 08:38:53.732888 (+ 28us) krpc-data-stream-recvr.cc:426] Enqueuing deserialized batch
1111 08:38:53.732895 (+ 7us) inbound_call.cc:162] Queueing success response
Disabled +-clang-diagnostic-gnu-zero-variadic-macro-arguments because it
had false positives on the TRACE_TO invocations.
Testing:
* Ran exhaustive and ASAN tests
* Ran stress test
Change-Id: Ic7af4b45c43ec731d742d3696112c5f800849947
Reviewed-on: http://gerrit.cloudera.org:8080/14668
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Exposes a list of build flags via the impalad web UI. The build flags
can be viewed on the root page under the "Version" section. They can
be accessed via other tests through the debug version of the root page
(e.g. adding &json to the URL). The build flags are listed in a JSON
array so that they can be parsed easily. This should help run Impala
tests against a remote Impala cluster.
The build flags are read in CMakeLists.txt and then stored in
preprocessor variables.
Three build flags are exposed as part of this commit:
- Is_NDEBUG = [true, false]
- Whether NDEBUG was true or false at compile time
- CMake_Build_Type = [DEBUG, RELEASE, ADDRESS_SANITIZER, TIDY, UBSAN,
UBSAN_FULL, TSAN, CODE_COVERAGE_RELEASE, CODE_COVERAGE_DEBUG]
- The value of CMAKE_BUILD_TYPE at compile time
- Library_Link_Type = [DYNAMIC, STATIC]
- Derived from the compile time value of BUILD_SHARED_LIBS
There are a few other minor changes that are apart of this commit:
* The patch modifies environ.py so that it supports fetching build metadata
for both local and remote clusters.
* The tests under the tests/webserver directory were not being run because
'webserver' was not whitelisted in tests/run-tests.py. This patch fixes
that and addresses several test failures in run-tests.py.
* It reverts part of IMPALA-6947 so that their is no dependency from
start-impala-cluster.py to environ.py. The timeout discussed IMPALA-6947
is now set at compile time.
Testing:
Added new tests to webserver/test_web_pages.py to ensure that the build
flags are being set. Some tests are only run when run against a local
cluster because we have no way of getting the build info from a remote
cluster, whereas local clusters contain a .cmake_build_type file.
Change-Id: I47e3ad4cbf844909bdaf22a6f9d7bd915dce3f19
Reviewed-on: http://gerrit.cloudera.org:8080/11410
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Isilon has been failing on the exchange-delays-zero-rows
test case due to slow scans. Running this part of the
test with a larger value for stress_datastream_recvr_delay_ms
solved the issue.
Since this part of the test is sensitive to slow scans,
I pulled this out into its own test. The new test can
apply an extra delay for platforms with slow scans.
This subsumes the fix for IMPALA-6811 and treats S3,
ADLS, and ISILON as platforms with slow scans.
test_exchange_small_delay and test_exchange_large_delay
(minus exchange-delays-zero-rows) go back to the timeouts
they were using before IMPALA-6811. These tests did not
see issues with those timeouts.
The new arrangement with test_exchange_large_delay_zero_rows
does not change any timeouts except for Isilon. This applies
a slow scan extra delay on top of Isilon's already slow
build.
Change-Id: I2e919a4e502b1e6a4156aafbbe4b5ddfe679ed89
Reviewed-on: http://gerrit.cloudera.org:8080/10208
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The exchange-delays-zero-rows portion of test_exchange_delays
tests a RPC timeout when sending an EOS with no rows. In
order to send the EOS, the executor must have completed
the scan (which returns zero rows). On S3, the scan has
been slow and can exceed the current receiver delay of
10 seconds, leading to a failure to timeout. In manual
testing, the sender timeout is being applied quickly
once the scan finishes.
This increases the receiver delay for S3 and ADLS to
20 seconds. None of the existing symptoms show the
scan taking that long. Most failures have a scan that
barely exceeds 10 seconds, the current receiver delay.
This has only been seen on S3, but ADLS is included as a
precaution.
Change-Id: I967e6eb336c801219c77d657655c42984910b479
Reviewed-on: http://gerrit.cloudera.org:8080/9995
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The timeout is doubled on build types known to be slow to reduce the
odds of hitting the flakiness.
Testing:
I wasn't able to reproduce the original problem locally. However I was
able to reproduce a similar problem by decreasing the timeout to 2000ms,
which suggests that > 1s hiccups are possible with an ASAN build, so
increasing the timeout could help.
Change-Id: Ib00cb3c63c123b7ff263e21c7ee41fa6af12606f
Reviewed-on: http://gerrit.cloudera.org:8080/8357
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
Avoid running the problematic query with short delays. This combination
doesn't add coverage - the short delay was meant to test behaviour when
multiple batches were sent, but there are deliberately no batches sent
with this query.
Testing:
Ran a build against Isilon, which succeeded. Ran the test in a loop
locally overnight.
Change-Id: Ia75c42be2de600344de7af5a917d7843880ea6de
Reviewed-on: http://gerrit.cloudera.org:8080/8111
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files_txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Impala compiled with the address sanitizer, or compiled with code
coverage, runs through code paths much slower. This can cause end-to-end
tests that pass on a non-ASAN or non-code coverage build to fail. Some
examples include IMPALA-2721, IMPALA-2973, and IMPALA-3501. These
classes of failures tend always to involve some time-sensitive condition
that fails to succeed under such "slow builds".
The works-around in the past have been to simply increase the timeout.
The problem with this approach is that it relaxes conditions for tests
on builds that see the field--i.e., release builds--for builds that
never will--i.e., ASAN and code coverage.
This patch fixes that problem by allowing test authors to set timeout
values based on a *specific* build type. The author may choose timeouts
with a default value, and different timeouts for either or both
so-called "slow builds": ASAN and code coverage.
We detect the so-called "specific build type" by inspecting the binary
expected to be at the path under test. This removes the need to make
alterations to Impala itself. The inspection done is to read the DWARF
information in the binary, specifically the first compile unit's
DW_AT_producer and DW_AT_name DIE attributes. We employ a heuristic
based on these attributes' values to guess the build type. If we can't
determine the build type, we will assume it's a debug build. More
information on this is in IMPALA-3501.
A quick summary of the changes follows:
1. Move some of the logic in tests.common.skip to tests.common.environ
and rework some skip marks to be more precise.
2. Add Pyelftools for convenient deserialization of DWARF
3. Our Pyelftools usage requires collections.OrderedDict, which isn't in
python2.6; also add Monkeypatch to handle this.
4. Add ImpalaBuild and specific_build_type_timeout, the core of the new
functionality
5. Fix the statestore tests that only fail under code coverage (the
basis for IMPALA-3501)
Testing:
The tests that were previously, reliably failing under code coverage now
pass. I also ran perfunctory tests of debug, release, and ASAN builds to
ensure our detection of build type is working. This patch will *not*
turn the code coverage builds green; there are other tests that fail,
and fixing all of them here is out of the scope of this patch.
Change-Id: I2b675c04c54e36d404fd9e5a6cf085fb8d6d0e47
Reviewed-on: http://gerrit.cloudera.org:8080/3156
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
Scans on S3 seem to take longer to start up, so the parameters selected
for tests running on HDFS may not make sense. I increase the times by a
factor of 10 so that the same scenario is exercised, just slower.
Change-Id: Ifab74668658212246e5216610bac8b9595b29e5f
Reviewed-on: http://gerrit.cloudera.org:8080/2425
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Silvius Rus <srus@cloudera.com>
Tested-by: Internal Jenkins
Add a custom cluster test that tests for delays in registering data
stream receivers. We add a stress option to artificially delay this
registration to ensure that it can be handled correctly.
Change-Id: Id5f5746b6023c301bacfa305c525846cdde822c9
Reviewed-on: http://gerrit.cloudera.org:8080/2306
Tested-by: Internal Jenkins
Reviewed-by: Silvius Rus <srus@cloudera.com>