Several recent runs have resulted in a boost mutex
invalid argument exception. The mutex in question is
the one that guards individual lib_cache entries
(LibCacheEntry::lock).
The exception is thrown due to the entry being deleted
by another thread prior to the current thread acquiring
the entry lock. This scenario happens when:
1) thread t1 looks up an existing entry e
a. the lookup is guarded by a cache lock
b. the cache lock is released (L356)
c. e's lock is acquired on the next line and propagated
up the call stack (look for the swaps)
d. while e is locked, its use-count is incremented.
2) thread t2 deletes entry e
a. the cache lock is acquired and e is looked up
b. e's lock is acquired
c. if e's usecount is 0 and was marked for removal, e is deleted.
If t2 runs following (1b), then t1 will acquire a lock on a deleted
entry (1c), causing the mutex exception.
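The racy order can be sketched as follows. This is a minimal, simplified model using std::mutex; the names (Entry, Cache, RacyLookup) are illustrative and not the actual Impala LibCache API.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

struct Entry {
  std::mutex lock;            // models LibCacheEntry::lock
  int use_count = 0;
  bool should_remove = false;
};

struct Cache {
  std::mutex cache_lock;
  std::map<std::string, Entry*> entries;

  // Buggy order: the cache lock is released (step 1b) before the entry
  // lock is taken (step 1c). If another thread deletes *e in that window
  // (steps 2a-2c), e->lock.lock() operates on freed memory.
  Entry* RacyLookup(const std::string& key) {
    std::unique_lock<std::mutex> l(cache_lock);  // 1a: guarded lookup
    Entry* e = entries.at(key);
    l.unlock();                                  // 1b: window opens here
    e->lock.lock();                              // 1c: may be use-after-free
    ++e->use_count;                              // 1d
    return e;                                    // caller holds e->lock
  }
};
```

Single-threaded the code is well behaved; the bug only manifests when a deleting thread interleaves between 1b and 1c.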
There are two parts to the fix in this change:
(1) don't crash and (2) don't let concurrency regress.
1) eliminate the window at 1b: keep the cache lock held while acquiring e's lock.
The cache lock is then released and all other operations proceed as before.
Note that current lock ordering is still maintained (coarse to fine),
but now, the cache lock can be held while other threads block access
to the entry lock. Once an entry's files have been copied, these
operations are short. While files are still loading, however, accesses
to the same entry can serialize access to the entire cache.
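The fixed lock order might look like the following sketch (again with illustrative std::mutex-based names, not the real API): the entry lock is acquired while the cache lock is still held, so no other thread can delete the entry in between, and the coarse-to-fine order (cache before entry) is preserved.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

struct Entry {
  std::mutex lock;
  int use_count = 0;
};

struct Cache {
  std::mutex cache_lock;
  std::map<std::string, Entry*> entries;

  Entry* Lookup(const std::string& key) {
    std::unique_lock<std::mutex> l(cache_lock);
    Entry* e = entries.at(key);
    e->lock.lock();  // acquired under cache_lock: e cannot be deleted here
    l.unlock();      // only now is the cache lock released
    ++e->use_count;
    return e;        // caller must eventually unlock e->lock
  }
};
```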
2) add the cache entry only after it is loaded.
Currently, on a cache miss, a new entry is created, locked,
added to the cache, and finally the cache is unlocked. The intent
is to allow other threads to concurrently load files for other entries.
However, now that the cache lock is held while the entry lock is
acquired, a loading thread would block other threads attempting to
acquire the same entry, which in turn would block access to the entire
cache. The workaround is to release the cache lock while loading a new
entry and to add the entry to the cache only once it is loaded. This
avoids expensive work while holding the cache lock, but may do wasted
work if multiple threads miss on the same entry and load it in parallel.
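The miss path could be sketched as below (illustrative names and a trivial stand-in for the actual library-file copy, not the real Impala code): the expensive load happens with no locks held, and the entry is published only once loaded. If two threads miss on the same key, the loser discards its copy, trading some wasted work for cache-wide concurrency.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct Entry {
  std::mutex lock;
  int use_count = 0;
  bool loaded = false;
};

struct Cache {
  std::mutex cache_lock;
  std::map<std::string, Entry*> entries;

  Entry* GetOrLoad(const std::string& key) {
    {
      std::unique_lock<std::mutex> l(cache_lock);
      auto it = entries.find(key);
      if (it != entries.end()) {  // hit: entry lock taken under cache lock
        Entry* e = it->second;
        e->lock.lock();
        l.unlock();
        ++e->use_count;
        return e;
      }
    }
    // Miss: load with no locks held, so other entries stay accessible.
    std::unique_ptr<Entry> fresh(new Entry);
    fresh->loaded = true;  // stands in for copying the lib file

    std::unique_lock<std::mutex> l(cache_lock);
    auto it = entries.find(key);
    if (it != entries.end()) {  // another thread loaded it first
      Entry* e = it->second;    // use the winner; our copy is wasted work
      e->lock.lock();
      l.unlock();
      ++e->use_count;
      return e;
    }
    Entry* e = fresh.release();
    entries[key] = e;  // publish only the fully loaded entry
    e->lock.lock();
    l.unlock();
    ++e->use_count;
    return e;
  }
};
```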
Testing:
- ran core tests + hdfs
- added an end-to-end test that concurrently uses and drops/creates from
  the same lib-cache entry. With the current settings, the use-after-free
error is reliably reproduced.
- manual testing to examine concurrency of the following cases:
- concurrent function creation from multiple lib files
(stresses coordinator)
- concurrent function use from multiple lib files
(stresses backend)
Change-Id: I1f178a2114cb3969dbb920949ddff4e8220b742e
Reviewed-on: http://gerrit.cloudera.org:8080/9626
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins