The change tweaks the HTML template for the timeline summary
to make it slightly more readable:
- Adds legend strings to the CPU graphs
- Inserts the test run name into the CPU chart title to clarify
which chart shows which build/test phase
- Stretches the CPU charts a bit wider
- Identifies the common prefix of the phase/container names (the build
name) and deletes it from the chart labels. This increases legibility
by cutting down on noise and growing the chart real estate.
To support this change, the Python drivers are also changed:
the build name parameter, which is the common prefix, is passed
to monitor.py and written to the JSON output
- The name of the build and data load phase container is suffixed with
"-build" so that it shares the naming convention with the other
containers.
- The timeline graph section is sized explicitly by computing the height
from the number of distinct tasks. This avoids having a second scrollbar
for the timeline, which is annoying.
The formula is pretty crude: it uses empirical constants, but produces
an OK layout for the default font sizes in Chrome (both on Linux
and the Mac).
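The height computation can be sketched as follows; the constants here are illustrative stand-ins, not the values actually committed:

```python
def timeline_height_px(num_tasks, row_px=24, padding_px=80):
    """Compute an explicit pixel height for the timeline section.

    row_px and padding_px stand in for the empirical constants; the
    real template uses values tuned for Chrome's default font sizes.
    """
    return num_tasks * row_px + padding_px
```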
Tested so far by tweaking the HTML template and an HTML result file
from an earlier build.
Change-Id: I7a41bea762b0e33f3d71b0be57eedbacb19c680c
Reviewed-on: http://gerrit.cloudera.org:8080/11578
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
I've observed that builds of test-with-docker that have "suite
parallelism" sometimes hang when the Docker containers are
being created. (The implementation had multiple threads calling
"docker create" simultaneously.) Trolling the mailing lists,
it's maybe a bug in Docker or the kernel. I've never caught
it live enough to strace it.
A hopeful workaround is to serialize the docker create calls, which is
easy and harmless, given that "docker create" is usually pretty quick
(subsecond) and the overall run time here is hours+.
With this change, I was able to run test-with-docker with
--suite-concurrency=6 on a c5.9xlarge in AWS, with a total runtime of
1h35m.
The hangs are intermittent and, in the typical case, cause
inconsistent runtimes, because less parallelism is available while
one of the "docker create" calls hangs. (I've seen them resume after one of the other
containers finishes.) We'll find out with time whether this stabilizes
it or has no effect.
Change-Id: I3e44db7a6ce08a42d6fe574d7348332578cd9e51
Reviewed-on: http://gerrit.cloudera.org:8080/11481
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In the test-with-docker context, rpc-mgr-kerberized-test was failing,
ultimately due to the fact that the hostname was resolving to 127.0.0.1
and then back to 'localhost'.
This commit applies a workaround of adding "127.0.0.1 $(hostname)"
to /etc/hosts, which allows the test to pass, and is what's done in
bootstrap_system.sh. In the Docker context, /etc/hosts needs to be
updated on every container start, because Docker tries to provide an
/etc/hosts for you. Previously, we were doing a different customization
(adding "hostname" to the existing "127.0.0.1" line), which wasn't good
enough for rpc-mgr-kerberized-test.
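The per-start fix amounts to something like the following sketch (the function name is made up; the real logic lives in the container entrypoint):

```python
def ensure_loopback_hostname(hostname, hosts_path="/etc/hosts"):
    """Append "127.0.0.1 <hostname>" to /etc/hosts if not present.

    Docker rewrites /etc/hosts on every container start, so this has
    to run at start time rather than being baked into the image.
    """
    line = "127.0.0.1 %s" % hostname
    with open(hosts_path) as f:
        if line in f.read().splitlines():
            return
    with open(hosts_path, "a") as f:
        f.write(line + "\n")
```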
The original workaround, which is in "bootstrap_system.sh", is still
in place, but I've added more documentation about reproduction there.
I've also filed HDFS-13797 to track the HDFS issue.
Change-Id: I91003cbc86177feb7f99563f61297f7da7fabab4
Reviewed-on: http://gerrit.cloudera.org:8080/11113
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
ExprTest.TimestampFunctions,
query_test.test_scanners.TestOrc.test_type_conversions, and
query_test.test_queries.TestHdfsQueries.test_hdfs_scan_node were all
failing when using test-with-docker with mismatched dates.
As it turns out, there is code that calls readlink(/etc/localtime)
and parses the output to identify the current timezone name.
This is described in localtime(5) on Ubuntu16:
"It should be an absolute or relative symbolic link pointing to
/usr/share/zoneinfo/, followed by a timezone identifier such as
"Europe/Berlin" or "Etc/UTC". ... Because the timezone identifier is
extracted from the symlink target name of /etc/localtime, this file
may not be a normal file or hardlink."
To honor this requirement, and to make the tests pass, I re-jiggered
how I pass the time zone information from the host into the container.
The previously failing tests now pass.
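The extraction rule that localtime(5) describes can be sketched in Python; this illustrates the constraint, not the code the tests use:

```python
import os

def timezone_from_localtime(path="/etc/localtime"):
    """Recover a timezone identifier such as "Etc/UTC" from the
    /etc/localtime symlink target, per localtime(5).

    Raises OSError if path is not a symlink, and ValueError if the
    target does not point under a zoneinfo directory.
    """
    target = os.readlink(path)
    marker = "zoneinfo/"
    return target[target.rindex(marker) + len(marker):]
```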
Change-Id: Ia9facfd9741806e7dbb868d8d06d9296bf86e52f
Reviewed-on: http://gerrit.cloudera.org:8080/11106
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
TestKuduOperations, when run using test-with-docker, failed with errors
like:
Remote error: Service unavailable: Timed out: could not wait for desired
snapshot timestamp to be consistent: Tablet is lagging too much to be able to
serve snapshot scan. Lagging by: 1985348 ms, (max is 30000 ms):
The underlying issue, as discovered by Thomas Tauber-Marshall, is that Kudu
serializes the hostnames of Kudu tablet servers, and different containers
in test-with-docker use different hostnames. This was exposed after "IMPALA-6812:
Fix flaky Kudu scan tests" switched to using READ_AT_SNAPSHOT for Kudu reads.
Using the same hostname for all the containers is easy and harmless;
this change does just that.
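The fix boils down to passing the same --hostname to every "docker create"; a sketch (the fixed hostname value here is illustrative):

```python
def docker_create_args(container_name, hostname="test-with-docker"):
    """Build "docker create" arguments that pin every container to
    the same hostname, so hostnames that Kudu serializes into its
    metadata resolve identically in whichever container reads them.
    """
    return ["docker", "create", "--name", container_name,
            "--hostname", hostname]
```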
Change-Id: Iea8c5096b515a79601be2e919d32585fb2796b3d
Reviewed-on: http://gerrit.cloudera.org:8080/11082
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
It is common for developers to specify JAVA_HOME in
bin/impala-config-local.sh, so wait until after it is
sourced to validate JAVA_HOME.
Also, try harder to auto-detect the system's JAVA_HOME
in case it has not been specified in the environment.
Here is a run-through of the different scenarios:
1. Not set in environment, not set in impala-config-local.sh:
Didn't work before, now tries to autodetect by looking
for javac on the PATH
2. Set in environment, not set in impala-config-local.sh:
No change
3. Not set in environment, set in impala-config-local.sh:
Didn't work before, now works
4. Set in environment and set in impala-config-local.sh:
This used to be potentially inconsistent (i.e. JAVA comes
from the environment's JAVA_HOME, but JAVA_HOME is
overwritten by impala-config-local.sh), now it always
uses the value from impala-config-local.sh.
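The autodetection idea (the committed logic is bash in bin/impala-config.sh; this Python sketch shows the equivalent: resolve javac from the PATH and strip the trailing bin/javac):

```python
import os
import shutil

def guess_java_home():
    """Guess JAVA_HOME from javac's location on the PATH.

    javac is typically a symlink chain ending at
    <JAVA_HOME>/bin/javac, so resolving it and stripping two path
    components yields JAVA_HOME. Returns None if javac is absent.
    """
    javac = shutil.which("javac")
    if javac is None:
        return None
    real = os.path.realpath(javac)
    return os.path.dirname(os.path.dirname(real))
```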
Change-Id: Idf3521b4f44fdbdc841a90fd00c477c9423a75bb
Reviewed-on: http://gerrit.cloudera.org:8080/10702
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
* Adds -ASAN suites to test-with-docker.
* Adds --tail flag, which starts a tail subprocess. This
isn't pretty (there's potential for overlap), but it's a dead simple
way to keep an eye on what's going on.
* Fixes a bug wherein I could call "docker rm <container>" twice
simultaneously, which would make Docker fail the second call,
and then fail the related "docker rmi". It's better to serialize,
and I did that with a simple lock.
Change-Id: I51451cdf1352fc0f9516d729b9a77700488d993f
Reviewed-on: http://gerrit.cloudera.org:8080/10319
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit adds a little git-wrangling to allow test-with-docker to
work when invoked from git directories managed by "git worktree". These
are different in that they reference another git directory elsewhere on
the file system, which also needs to be mounted into the container.
Change-Id: I9186e0b6f068aacc25f8d691508165c04329fa8b
Reviewed-on: http://gerrit.cloudera.org:8080/10335
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
If the build was failing, test-with-docker wouldn't recognize
it and would continue with the script; this fixes that.
The bash puzzle I learned here is that
bash -c "set -e; function f() { false; echo f; }; if f; then echo x; fi"
will print "f" and "x", despite the set -e, even if f is put into
a sub-shell with parentheses. (set -e is suspended for any command,
including a function call, that runs as the condition of an if.)
Change-Id: I285e2f4d07e34898d73beba857e9ac325ed4e6db
Reviewed-on: http://gerrit.cloudera.org:8080/10318
Tested-by: Philip Zeyliger <philip@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
This commit tackles a few additions and improvements to
test-with-docker. In general, I'm adding workloads (e.g., exhaustive,
rat-check), tuning memory setting and parallelism, and trying to speed
things up.
Bug fixes:
* Embarrassingly, I was still skipping thrift-server-test in the backend
tests. This was a mistake in handling feedback from my last review.
* I made the timeline a little bit taller to clip less.
Adding workloads:
* I added the RAT licensing check.
* I added exhaustive runs. This led me to model the suites a little
bit more in Python, with a class representing a suite with a
bunch of data about the suite. It's not perfect and still
coupled with the entrypoint.sh shell script, but it feels
workable. As part of adding exhaustive tests, I had
to re-work the timeout handling, since now different
suites meaningfully have different timeouts.
Speed ups:
* To speed up test runs, I added a mechanism to split py.test suites into
multiple shards with a py.test argument. This involved a little bit of work in
conftest.py, and exposing $RUN_CUSTOM_CLUSTER_TESTS_ARGS in run-all-tests.sh.
Furthermore, I moved a bit more logic about managing the
list of suites into Python.
* Doing the full build with "-notests" and only building
the backend tests in the relevant target that needs them. This speeds
up "docker commit" significantly by removing about 20GB from the
container. I had to indicate that expr-codegen-test depends on
expr-codegen-test-ir, which was missing.
* I sped up copying the Kudu data: previously I did
both a move and a copy; now I'm doing a move followed by a move. One
of the moves is cross-filesystem and therefore slow, but this halves
the amount of copying.
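The sharding mechanism can be sketched as a helper that a pytest_collection_modifyitems hook in conftest.py would call; the name and the round-robin assignment here are illustrative:

```python
def filter_shard(items, shard, num_shards):
    """Keep only the collected tests belonging to this shard.

    Assigns tests to shards round-robin by collection order, which
    is deterministic for a fixed test tree.
    """
    return [item for i, item in enumerate(items)
            if i % num_shards == shard]
```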
Memory usage:
* I tweaked the memlimit_gb settings to have a higher default. I've been
fighting empirically to have the tests run well on c4.8xlarge and
m4.10xlarge.
The more memory a minicluster and test suite run uses, the fewer parallel
suites we can run. By observing the peak processes at the tail of a run (with a
new "memory_usage" function that uses a ps/sort/awk trick) and by observing
peak container total_rss, I found that we had several JVMs that
didn't have Xmx settings set. I added Xms/Xmx settings in a few
places:
* The non-first Impalad does very little JVM work, so having
an Xmx keeps it small, even in the parallel tests.
* Datanodes do work, but they essentially were never garbage
collecting, because JVM defaults let them use up to 1/4th
the machine memory. (I observed this based on RSS at the
end of the run; nothing fancier.) Adding Xms/Xmx settings
helped.
* Similarly, I piped the settings through to HBase.
A few daemons still run without resource limitations, but they don't
seem to be a problem.
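The ps/sort/awk trick boils down to summing RSS per command name and sorting; a rough Python analogue (the committed "memory_usage" helper is a shell function):

```python
import subprocess
from collections import defaultdict

def memory_usage_by_command(ps_output=None):
    """Aggregate RSS (KiB) per command name, largest first.

    Useful for spotting JVMs running without Xmx limits at the tail
    of a run. ps_output is injectable for testing; by default it
    shells out to ps.
    """
    if ps_output is None:
        ps_output = subprocess.check_output(
            ["ps", "-e", "-o", "rss=,comm="]).decode()
    totals = defaultdict(int)
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0].isdigit():
            totals[fields[1].strip()] += int(fields[0])
    return sorted(totals.items(), key=lambda kv: -kv[1])
```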
Change-Id: I43fe124f00340afa21ad1eeb6432d6d50151ca7c
Reviewed-on: http://gerrit.cloudera.org:8080/10123
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Allows running the tests that make up the "core" suite in about 2 hours.
By comparison, https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/buildTimeTrend
tends to run in about 3.5 hours.
This commit:
* Adds "echo" statements in a few places, to facilitate timing.
* Adds --skip-parallel/--skip-serial flags to run-tests.py,
and exposes them in run-all-tests.sh.
* Marks TestRuntimeFilters as a serial test. This test runs
queries that need > 1GB of memory, and, combined with
other tests running in parallel, can kill the parallel test
suite.
* Adds "test-with-docker.py", which runs a full build, data load,
and executes tests inside of Docker containers, generating
a timeline at the end. In short, one container is used
to do the build and data load, and then this container is
re-used to run various tests in parallel. All logs are
left on the host system.
Besides the obvious win of getting test results more quickly, this
commit serves as an example of how to get various bits of Impala
development working inside of Docker containers. For example, Kudu
relies on atomic rename of directories, which isn't available in most
Docker filesystems, and entrypoint.sh works around it.
In addition, the timeline generated by the build suggests where further
optimizations can be made. Most obviously, dataload eats up a precious
~30-50 minutes, on a largely idle machine.
This work is significantly CPU and memory hungry. It was developed on a
32-core, 120GB RAM Google Compute Engine machine. I've worked out
parallelism configurations such that it runs nicely on 60GB of RAM
(c4.8xlarge) and over 100GB (e.g., m4.10xlarge, which has 160GB). There is
some simple logic to guess at some knobs, and there are knobs. By and
large, EC2 and GCE price machines linearly, so, if CPU usage can be kept
up, it's not wasteful to run on bigger machines.
Change-Id: I82052ef31979564968effef13a3c6af0d5c62767
Reviewed-on: http://gerrit.cloudera.org:8080/9085
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>