impala/tests/failure
Tim Armstrong 7fd046d00f IMPALA-9611: fix hang when cancelling join builder
The error could occur in the following scenario, where
thread A is executing a join build fragment and thread
B is cancelling the fragment instance.
1. Thread A is in HandoffToProbesAndWait(), reads is_cancelled_
  and sees false.
2. Thread B in RuntimeState::Cancel() sets is_cancelled_ = true,
  acquires cancellation_cvs_lock_, then calls NotifyAll() on the
  condition variable.
3. Thread A calls Wait() on the condition variable and blocks forever
  because the notification in #2 was already delivered.

The fix is for thread B to also acquire the lock that thread A is
holding before it signals. That prevents the race because #1 and #3
above are in the same critical section, so thread B cannot signal the
condition variable until thread A has released that lock (Wait()
releases it atomically while blocking).
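
The locking pattern described above can be sketched in a few lines of
Python. This is a toy model only, not Impala code: the class
JoinBuilderModel, its attributes, the probes_done flag and the helper
run_once() are hypothetical names chosen for illustration, and
threading.Condition stands in for the backend's condition variable.

```python
import threading


class JoinBuilderModel:
    """Toy stand-in for the join builder; hypothetical, not Impala code."""

    def __init__(self):
        self.lock = threading.Lock()              # lock held by the waiting thread
        self.cv = threading.Condition(self.lock)  # condition variable bound to that lock
        self.is_cancelled = False
        self.probes_done = False                  # hypothetical "normal completion" flag

    def handoff_to_probes_and_wait(self):
        # Steps 1 and 3 share one critical section: the flag is read and the
        # wait begins without the lock being released in between.
        with self.lock:
            while not self.is_cancelled and not self.probes_done:
                self.cv.wait()  # releases the lock atomically while blocked
            return "cancelled" if self.is_cancelled else "done"

    def cancel(self):
        self.is_cancelled = True
        # Taking the waiter's lock before notifying means the notification
        # cannot land between the waiter's check and its wait.
        with self.lock:
            self.cv.notify_all()


def run_once():
    """Run one waiter and one canceller; return True if the waiter woke up."""
    builder = JoinBuilderModel()
    result = []
    waiter = threading.Thread(
        target=lambda: result.append(builder.handoff_to_probes_and_wait()),
        daemon=True,
    )
    waiter.start()
    builder.cancel()
    waiter.join(timeout=5)
    return (not waiter.is_alive()) and result == ["cancelled"]
```

If cancel() runs before the waiter takes the lock, the waiter sees the
flag and returns without waiting. If it runs afterwards, cancel() blocks
on the lock until the waiter is inside wait(), so the notification is
not lost in either ordering.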

Testing:
Added metric check to test_failpoints to make it easier to detect
hangs caused by those tests in future.
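
A metric check of this kind could look roughly like the sketch below.
It is an assumption about the general shape of such a check, not the
code added to test_failpoints: wait_for_metric, read_metric and the
timeout values are hypothetical, and the real test may use a different
helper and metric.

```python
import time


def wait_for_metric(read_metric, expected, timeout_s=60.0, interval_s=0.5):
    """Poll read_metric() until it returns expected or the timeout expires.

    read_metric is any zero-argument callable; a hypothetical example would
    be a function returning the number of fragment instances still running.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        value = read_metric()
        if value == expected:
            return value
        if time.monotonic() >= deadline:
            # A metric that never settles points at a hung fragment instance.
            raise AssertionError(
                "metric stuck at %r, expected %r after %.1fs"
                % (value, expected, timeout_s)
            )
        time.sleep(interval_s)
```

Failing with an explicit assertion after a bounded wait turns a silent
hang into a test failure that names the stuck value.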

Looped test_failpoints.py overnight, which was previously enough
to reproduce the failure within a couple of hours.

Ran exhaustive tests.

Change-Id: I996ad2055d6542eb57e12c663b89de5f84208f77
Reviewed-on: http://gerrit.cloudera.org:8080/15672
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-07 23:26:14 +00:00