IMPALA-2605: prevent long-running child processes from keeping TCP connection open

The problem: By default, all file descriptors opened by a process,
including sockets, are inherited by any forked child processes. This
includes the connection socket created at the beginning of each test
in ImpalaTestSuite.setup_class(). In
TestHiveMetaStoreFailure.test_hms_service_dies(), the Hive Metastore
is stopped and restarted, meaning the metastore is now a child process
of the test process. This causes the client connection not to be
closed when the parent process (the test) exits, meaning that one of a
finite number of connections (64) to Impala is left permanently in
use.

This would be barely noticeable, except that run-tests.py runs the mini
stress test with 4 * <num CPUs> concurrent clients by default. On our
build machines, this is 64 clients, which is also the default max
number of connections for an impalad. When a test process tries to
make the 65th connection (since the leaked connection is still there),
it blocks until a connection is freed up. Due to a quirk of the xdist
py.test plugin that I don't fully understand, the test framework will
not clean up test classes (and close the connections) until a number
of tests complete, causing the test process to deadlock.

The solution: use the close_fds argument to make sure the TCP socket
is closed in the spawned child process. This is also done in
CustomClusterTestSuite._start_impala_cluster() when it starts the new
cluster.
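
The effect of close_fds can be sketched with a small standalone script.
This is an illustration only, not code from this patch: child_sees_fd()
and demo() are hypothetical helpers, and the script assumes Python 3 on a
POSIX system. Python 3 already marks new descriptors non-inheritable and
defaults close_fds to True, so the sketch explicitly opts the socket back
into inheritance to mimic the Python 2 behaviour described above.

```python
import os
import socket
import subprocess
import sys

# Code run in the child: exit 0 if the given fd number is open there, 1 if not.
CHILD_CODE = """
import os, sys
try:
    os.fstat(int(sys.argv[1]))
except OSError:
    sys.exit(1)
sys.exit(0)
"""


def child_sees_fd(fd, close_fds):
    """Hypothetical helper: True if a spawned child still has `fd` open."""
    rc = subprocess.call(
        [sys.executable, "-c", CHILD_CODE, str(fd)], close_fds=close_fds)
    return rc == 0


def demo():
    """Return (leaked_without_close_fds, closed_with_close_fds)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Assumption: re-enable inheritance to reproduce Python 2 semantics,
        # where every open descriptor was passed on to child processes.
        os.set_inheritable(sock.fileno(), True)
        leaked = child_sees_fd(sock.fileno(), close_fds=False)
        closed = not child_sees_fd(sock.fileno(), close_fds=True)
        return leaked, closed
    finally:
        sock.close()


if __name__ == "__main__":
    print(demo())
```

With close_fds=False the child keeps the parent's socket open, which is
the leak described above; with close_fds=True the descriptor is closed
before the child starts running.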

This patch also switches test_hms_failure.py to use check_call()
instead of call(), and explicitly caps the number of stress clients at
64.

Change-Id: I03feae922883a0624df1422ffb6ba5f1d83fb869
Reviewed-on: http://gerrit.cloudera.org:8080/1853
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
Skye Wanderman-Milne
2016-01-21 18:59:08 -08:00
committed by Internal Jenkins
parent d3ab94162c
commit 9c4eb9fc61
3 changed files with 16 additions and 7 deletions

@@ -40,9 +40,13 @@ NUM_CONCURRENT_TESTS = multiprocessing.cpu_count()
 if 'NUM_CONCURRENT_TESTS' in os.environ:
   NUM_CONCURRENT_TESTS = int(os.environ['NUM_CONCURRENT_TESTS'])
-# Default the number of stress clinets to 4x the number of CPUs
+# Default the number of stress clinets to 4x the number of CPUs (but not exceeding the
+# default max # of concurrent connections)
 # This can be overridden by setting the NUM_STRESS_CLIENTS environment variable.
-NUM_STRESS_CLIENTS = multiprocessing.cpu_count() * 4
+# TODO: fix the stress test so it can start more clients than available connections
+# without deadlocking (e.g. close client after each test instead of on test class
+# teardown).
+NUM_STRESS_CLIENTS = min(multiprocessing.cpu_count() * 4, 64)
 if 'NUM_STRESS_CLIENTS' in os.environ:
   NUM_STRESS_CLIENTS = int(os.environ['NUM_STRESS_CLIENTS'])