The error could occur in the following scenario, where thread A is
executing a join build fragment and thread B is cancelling the fragment
instance.
1. Thread A is in HandoffToProbesAndWait(), reads is_cancelled_ and sees
   false.
2. Thread B in RuntimeState::Cancel() sets is_cancelled_ = true, acquires
   cancellation_cvs_lock_, then calls NotifyAll() on the condition
   variable.
3. Thread A calls Wait() on the condition variable and blocks forever
   because the cancellation notification was already sent.

The fix is for thread B to acquire the lock that thread A is holding.
That prevents the race because #1 and #3 above are in the same critical
section, and thread B won't be able to signal the condition variable
until thread A has released the lock.

Testing:
Added metric check to test_failpoints to make it easier to detect hangs
caused by those tests in the future.

Looped test_failpoints.py overnight, which was previously enough to
reproduce the failure within a couple of hours.

Ran exhaustive tests.

Change-Id: I996ad2055d6542eb57e12c663b89de5f84208f77
Reviewed-on: http://gerrit.cloudera.org:8080/15672
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
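
Illustration (not part of the change): the locking pattern described above
can be sketched with standard C++ primitives. This is a minimal sketch
under stated assumptions, not Impala's code. CancellableWaiter,
WaitUntilCancelled() and RunCancelDemo() are hypothetical names, and
std::mutex / std::condition_variable stand in for Impala's own lock and
condition variable types.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the shared state; not Impala's actual classes.
class CancellableWaiter {
 public:
  // Thread A: the flag check (#1) and the wait (#3) happen while holding
  // lock_, so they form a single critical section.
  void WaitUntilCancelled() {
    std::unique_lock<std::mutex> l(lock_);
    while (!is_cancelled_.load()) {
      cv_.wait(l);  // Atomically releases lock_ while blocked.
    }
  }

  // Thread B: sets the flag, then takes the SAME lock that thread A holds
  // before notifying. It therefore cannot notify between A's check and
  // A's wait: either A has not checked yet and will see true, or A is
  // already blocked in wait() and receives the notification.
  void Cancel() {
    is_cancelled_.store(true);
    std::lock_guard<std::mutex> l(lock_);
    cv_.notify_all();
  }

  bool is_cancelled() const { return is_cancelled_.load(); }

 private:
  std::mutex lock_;
  std::condition_variable cv_;
  std::atomic<bool> is_cancelled_{false};
};

// Runs one waiter and one canceller concurrently. Returns once both
// threads have finished; with the lost-wakeup bug this could hang.
bool RunCancelDemo() {
  CancellableWaiter w;
  std::thread a([&w] { w.WaitUntilCancelled(); });
  std::thread b([&w] { w.Cancel(); });
  a.join();
  b.join();
  return w.is_cancelled();
}
```

If Cancel() instead notified under a different lock (or no lock), thread A
could read false, lose the notification, and then block in wait()
indefinitely, which is the hang described in steps 1 to 3.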