The change tweaks the HTML template for the timeline summary
to make it slightly more readable:
- Adds legend strings to the CPU graphs
- Inserts the test run name into the CPU chart title to clarify
which chart shows which build/test phase
- Stretches the CPU charts a bit wider
- Identifies the common prefix of the phase/container names (the build
name) and deletes it from the chart labels. This increases legibility
by cutting down on noise and growing the chart real estate.
To support this change, the Python drivers are also changed:
the build name parameter, which is the common prefix, is passed
to monitor.py and written to the JSON output
- The name of the build and data load phase container is suffixed with
"-build" so that it shares the naming convention with the other
containers.
- The timeline graph section is sized explicitly by computing the height
from the number of distinct tasks. This avoids having a second scrollbar
for the timeline, which is annoying.
The formula is pretty crude: it uses empirical constants, but produces
an OK layout for the default font sizes in Chrome (both on Linux
and the Mac).
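The height computation can be sketched as follows; the constants here are illustrative stand-ins, not the values actually committed:

```python
def timeline_height_px(num_tasks, row_px=24, padding_px=80):
    """Compute an explicit pixel height for the timeline section.

    row_px and padding_px stand in for the empirical constants; the
    real template uses values tuned for Chrome's default font sizes.
    """
    return num_tasks * row_px + padding_px
```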
Tested so far by tweaking the HTML template and an HTML result file
from an earlier build.
Change-Id: I7a41bea762b0e33f3d71b0be57eedbacb19c680c
Reviewed-on: http://gerrit.cloudera.org:8080/11578
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
I've observed that builds of test-with-docker that have "suite
parallelism" sometimes hang when the Docker containers are
being created. (The implementation had multiple threads calling
"docker create" simultaneously.) Trolling the mailing lists,
it's maybe a bug in Docker or the kernel. I've never caught
it live enough to strace it.
A hopeful workaround is to serialize the docker create calls, which is
easy and harmless, given that "docker create" is usually pretty quick
(subsecond) and the overall run time here is hours+.
With this change, I was able to run test-with-docker with
--suite-concurrency=6 on a c5.9xlarge in AWS, with a total runtime of
1h35m.
The hangs are intermittent and, in the typical case, cause
inconsistent runtimes, because less parallelism is available while
one of the "docker create" calls hangs. (I've seen them resume after one of the other
containers finishes.) We'll find out with time whether this stabilizes
it or has no effect.
Change-Id: I3e44db7a6ce08a42d6fe574d7348332578cd9e51
Reviewed-on: http://gerrit.cloudera.org:8080/11481
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In the test-with-docker context, rpc-mgr-kerberized-test was failing,
ultimately due to the fact that the hostname was resolving to 127.0.0.1
and then back to 'localhost'.
This commit applies a workaround of adding "127.0.0.1 $(hostname)"
to /etc/hosts, which allows the test to pass, and is what's done in
bootstrap_system.sh. In the Docker context, /etc/hosts needs to be
updated on every container start, because Docker tries to provide an
/etc/hosts for you. Previously, we were doing a different customization
(adding "hostname" to the existing "127.0.0.1" line), which wasn't good
enough for rpc-mgr-kerberized-test.
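The per-start fix amounts to something like the following sketch (the function name is made up; the real logic lives in the container entrypoint):

```python
def ensure_loopback_hostname(hostname, hosts_path="/etc/hosts"):
    """Append "127.0.0.1 <hostname>" to /etc/hosts if not present.

    Docker rewrites /etc/hosts on every container start, so this has
    to run at start time rather than being baked into the image.
    """
    line = "127.0.0.1 %s" % hostname
    with open(hosts_path) as f:
        if line in f.read().splitlines():
            return
    with open(hosts_path, "a") as f:
        f.write(line + "\n")
```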
The original workaround, which is in "bootstrap_system.sh", is still
in place, but I've added more documentation about reproduction there.
I've also filed HDFS-13797 to track the HDFS issue.
Change-Id: I91003cbc86177feb7f99563f61297f7da7fabab4
Reviewed-on: http://gerrit.cloudera.org:8080/11113
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
ExprTest.TimestampFunctions,
query_test.test_scanners.TestOrc.test_type_conversions, and
query_test.test_queries.TestHdfsQueries.test_hdfs_scan_node were all
failing when using test-with-docker with mismatched dates.
As it turns out, there is code that calls readlink(/etc/localtime)
and parses the output to identify the current timezone name.
This is described in localtime(5) on Ubuntu16:
"It should be an absolute or relative symbolic link pointing to
/usr/share/zoneinfo/, followed by a timezone identifier such as
"Europe/Berlin" or "Etc/UTC". ... Because the timezone identifier is
extracted from the symlink target name of /etc/localtime, this file
may not be a normal file or hardlink."
To honor this requirement, and to make the tests pass, I re-jiggered
how I pass the time zone information from the host into the container.
The previously failing tests now pass.
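The extraction rule that localtime(5) describes can be sketched in Python; this illustrates the constraint, not the code the tests use:

```python
import os

def timezone_from_localtime(path="/etc/localtime"):
    """Recover a timezone identifier such as "Etc/UTC" from the
    /etc/localtime symlink target, per localtime(5).

    Raises OSError if path is not a symlink, and ValueError if the
    target does not point under a zoneinfo directory.
    """
    target = os.readlink(path)
    marker = "zoneinfo/"
    return target[target.rindex(marker) + len(marker):]
```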
Change-Id: Ia9facfd9741806e7dbb868d8d06d9296bf86e52f
Reviewed-on: http://gerrit.cloudera.org:8080/11106
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
TestKuduOperations, when run using test-with-docker, failed with errors
like:
Remote error: Service unavailable: Timed out: could not wait for desired
snapshot timestamp to be consistent: Tablet is lagging too much to be able to
serve snapshot scan. Lagging by: 1985348 ms, (max is 30000 ms):
The underlying issue, as discovered by Thomas Tauber-Marshall, is that Kudu
serializes the hostnames of Kudu tablet servers, and different containers
in test-with-docker use different hostnames. This was exposed after "IMPALA-6812:
Fix flaky Kudu scan tests" switched to using READ_AT_SNAPSHOT for Kudu reads.
Using the same hostname for all the containers is easy and harmless;
this change does just that.
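The fix boils down to passing the same --hostname to every "docker create"; a sketch (the fixed hostname value here is illustrative):

```python
def docker_create_args(container_name, hostname="test-with-docker"):
    """Build "docker create" arguments that pin every container to
    the same hostname, so hostnames that Kudu serializes into its
    metadata resolve identically in whichever container reads them.
    """
    return ["docker", "create", "--name", container_name,
            "--hostname", hostname]
```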
Change-Id: Iea8c5096b515a79601be2e919d32585fb2796b3d
Reviewed-on: http://gerrit.cloudera.org:8080/11082
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
It is common for developers to specify JAVA_HOME in
bin/impala-config-local.sh, so wait until after it is
sourced to validate JAVA_HOME.
Also, try harder to auto-detect the system's JAVA_HOME
in case it has not been specified in the environment.
Here is a run-through of the different scenarios:
1. Not set in environment, not set in impala-config-local.sh:
Didn't work before, now tries to autodetect by looking
for javac on the PATH
2. Set in environment, not set in impala-config-local.sh:
No change
3. Not set in environment, set in impala-config-local.sh:
Didn't work before, now works
4. Set in environment and set in impala-config-local.sh:
This used to be potentially inconsistent (i.e. JAVA comes
from the environment's JAVA_HOME, but JAVA_HOME is
overwritten by impala-config-local.sh), now it always
uses the value from impala-config-local.sh.
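The autodetection idea (the committed logic is bash in bin/impala-config.sh; this Python sketch shows the equivalent: resolve javac from the PATH and strip the trailing bin/javac):

```python
import os
import shutil

def guess_java_home():
    """Guess JAVA_HOME from javac's location on the PATH.

    javac is typically a symlink chain ending at
    <JAVA_HOME>/bin/javac, so resolving it and stripping two path
    components yields JAVA_HOME. Returns None if javac is absent.
    """
    javac = shutil.which("javac")
    if javac is None:
        return None
    real = os.path.realpath(javac)
    return os.path.dirname(os.path.dirname(real))
```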
Change-Id: Idf3521b4f44fdbdc841a90fd00c477c9423a75bb
Reviewed-on: http://gerrit.cloudera.org:8080/10702
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
* Adds -ASAN suites to test-with-docker.
* Adds --tail flag, which starts a tail subprocess. This
isn't pretty (there's potential for overlap), but it's a dead simple
way to keep an eye on what's going on.
* Fixes a bug wherein I could call "docker rm <container>" twice
simultaneously, which would make Docker fail the second call,
and then fail the related "docker rmi". It's better to serialize,
and I did that with a simple lock.
Change-Id: I51451cdf1352fc0f9516d729b9a77700488d993f
Reviewed-on: http://gerrit.cloudera.org:8080/10319
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit adds a little git-wrangling to allow test-with-docker to
work when invoked from git directories managed by "git worktree". These
are different in that they reference another git directory elsewhere on
the file system, which also needs to be mounted into the container.
Change-Id: I9186e0b6f068aacc25f8d691508165c04329fa8b
Reviewed-on: http://gerrit.cloudera.org:8080/10335
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
If the build was failing, test-with-docker wouldn't recognize
it and would continue with the script; this fixes that.
The bash puzzle I learned here is that
bash -c "set -e; function f() { false; echo f; }; if f; then echo x; fi"
will print "f" and "x", despite the set -e, even if f is put into
a sub-shell with parentheses. (set -e is suspended for any command,
including a function call, that runs as the condition of an if.)
Change-Id: I285e2f4d07e34898d73beba857e9ac325ed4e6db
Reviewed-on: http://gerrit.cloudera.org:8080/10318
Tested-by: Philip Zeyliger <philip@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
This commit tackles a few additions and improvements to
test-with-docker. In general, I'm adding workloads (e.g., exhaustive,
rat-check), tuning memory setting and parallelism, and trying to speed
things up.
Bug fixes:
* Embarrassingly, I was still skipping thrift-server-test in the backend
tests. This was a mistake in handling feedback from my last review.
* I made the timeline a little bit taller to clip less.
Adding workloads:
* I added the RAT licensing check.
* I added exhaustive runs. This led me to model the suites a little
bit more in Python, with a class representing a suite with a
bunch of data about the suite. It's not perfect and still
coupled with the entrypoint.sh shell script, but it feels
workable. As part of adding exhaustive tests, I had
to re-work the timeout handling, since now different
suites meaningfully have different timeouts.
Speed ups:
* To speed up test runs, I added a mechanism to split py.test suites into
multiple shards with a py.test argument. This involved a little bit of work in
conftest.py, and exposing $RUN_CUSTOM_CLUSTER_TESTS_ARGS in run-all-tests.sh.
Furthermore, I moved a bit more logic about managing the
list of suites into Python.
* Doing the full build with "-notests" and only building
the backend tests in the relevant target that needs them. This speeds
up "docker commit" significantly by removing about 20GB from the
container. I had to indicate that expr-codegen-test depends on
expr-codegen-test-ir, which was missing.
* I sped up copying the Kudu data: previously I did
both a move and a copy; now I'm doing a move followed by a move. One
of the moves is cross-filesystem and therefore slow, but this halves
the amount of copying.
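The sharding mechanism can be sketched as a helper that a pytest_collection_modifyitems hook in conftest.py would call; the name and the round-robin assignment here are illustrative:

```python
def filter_shard(items, shard, num_shards):
    """Keep only the collected tests belonging to this shard.

    Assigns tests to shards round-robin by collection order, which
    is deterministic for a fixed test tree.
    """
    return [item for i, item in enumerate(items)
            if i % num_shards == shard]
```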
Memory usage:
* I tweaked the memlimit_gb settings to have a higher default. I've been
fighting empirically to have the tests run well on c4.8xlarge and
m4.10xlarge.
The more memory a minicluster and test suite run uses, the fewer parallel
suites we can run. By observing the peak processes at the tail of a run (with a
new "memory_usage" function that uses a ps/sort/awk trick) and by observing
peak container total_rss, I found that we had several JVMs that
didn't have Xmx settings set. I added Xms/Xmx settings in a few
places:
* The non-first Impalad does very little JVM work, so having
an Xmx keeps it small, even in the parallel tests.
* Datanodes do work, but they essentially were never garbage
collecting, because JVM defaults let them use up to 1/4th
the machine memory. (I observed this based on RSS at the
end of the run; nothing fancier.) Adding Xms/Xmx settings
helped.
* Similarly, I piped the settings through to HBase.
A few daemons still run without resource limitations, but they don't
seem to be a problem.
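The ps/sort/awk trick boils down to summing RSS per command name and sorting; a rough Python analogue (the committed "memory_usage" helper is a shell function):

```python
import subprocess
from collections import defaultdict

def memory_usage_by_command(ps_output=None):
    """Aggregate RSS (KiB) per command name, largest first.

    Useful for spotting JVMs running without Xmx limits at the tail
    of a run. ps_output is injectable for testing; by default it
    shells out to ps.
    """
    if ps_output is None:
        ps_output = subprocess.check_output(
            ["ps", "-e", "-o", "rss=,comm="]).decode()
    totals = defaultdict(int)
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0].isdigit():
            totals[fields[1].strip()] += int(fields[0])
    return sorted(totals.items(), key=lambda kv: -kv[1])
```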
Change-Id: I43fe124f00340afa21ad1eeb6432d6d50151ca7c
Reviewed-on: http://gerrit.cloudera.org:8080/10123
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Allows running the tests that make up the "core" suite in about 2 hours.
By comparison, https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/buildTimeTrend
tends to run in about 3.5 hours.
This commit:
* Adds "echo" statements in a few places, to facilitate timing.
* Adds --skip-parallel/--skip-serial flags to run-tests.py,
and exposes them in run-all-tests.sh.
* Marks TestRuntimeFilters as a serial test. This test runs
queries that need > 1GB of memory, and, combined with
other tests running in parallel, can kill the parallel test
suite.
* Adds "test-with-docker.py", which runs a full build, data load,
and executes tests inside of Docker containers, generating
a timeline at the end. In short, one container is used
to do the build and data load, and then this container is
re-used to run various tests in parallel. All logs are
left on the host system.
Besides the obvious win of getting test results more quickly, this
commit serves as an example of how to get various bits of Impala
development working inside of Docker containers. For example, Kudu
relies on atomic rename of directories, which isn't available in most
Docker filesystems, and entrypoint.sh works around it.
In addition, the timeline generated by the build suggests where further
optimizations can be made. Most obviously, dataload eats up a precious
~30-50 minutes, on a largely idle machine.
This work is significantly CPU and memory hungry. It was developed on a
32-core, 120GB RAM Google Compute Engine machine. I've worked out
parallelism configurations such that it runs nicely on 60GB of RAM
(c4.8xlarge) and over 100GB (e.g., m4.10xlarge, which has 160GB). There is
some simple logic to guess at some knobs, and there are knobs. By and
large, EC2 and GCE price machines linearly, so, if CPU usage can be kept
up, it's not wasteful to run on bigger machines.
Change-Id: I82052ef31979564968effef13a3c6af0d5c62767
Reviewed-on: http://gerrit.cloudera.org:8080/9085
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>