The test 'test_jvm_pause_monitor_logs_entries' stops and starts an
impalad, and confirms that the JVM pause monitor detects the pause
by looking for a specific message in the log. In one test run, the test
failed to find the correct message after sleeping for 1.2 seconds.
Because the test notes the last message that it sees in the log, we can
observe that the test would have found the correct message if it had
waited for just a few more milliseconds.
This change increases the time that the test waits to 2 seconds.
TESTING:
Ran end-to-end tests cleanly and checked that
test_jvm_pause_monitor_logs_entries ran OK.
Change-Id: I735c0c0ecfd3a9099c9cef332c5e79854bec7b8d
Reviewed-on: http://gerrit.cloudera.org:8080/12475
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Following up on IMPALA-6857, it's useful for monitoring tools to see whether
the pause monitor is getting triggered, and to see other GC metrics.
The Java side here, and the Thrift side, were easy enough.
However, the Impala metric implementation here caused us to call into
the frontend to read through the JMX memory beans 72 times, because each
call to GetValue() fetched all of the data for its pool. This structure
made it hard to add additional, non-pool metrics, and it felt wasteful.
To combat this, I added a 10-second cache for the metrics fetched from
the Frontend, so the counters will typically re-use the same data.
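The cache is just a timestamp check wrapped around the call into the
frontend. Here is a minimal sketch of the idea; the type and function
names (TJvmMetrics, FetchJvmMetricsFromFrontend) are hypothetical
stand-ins for the real Thrift struct and JNI call:

#include <chrono>
#include <mutex>

// Hypothetical stand-in for the Thrift struct returned by the frontend.
struct TJvmMetrics {
  long long gc_count = 0;
  long long gc_time_ms = 0;
};

// Hypothetical stand-in for the JNI call into the Java frontend, which
// reads the JMX beans and fills in the Thrift struct.
TJvmMetrics FetchJvmMetricsFromFrontend() { return TJvmMetrics(); }

// Serves metric reads from a cached copy, refreshing it at most once
// every 10 seconds so the individual counters re-use the same data.
class CachedJvmMetrics {
 public:
  TJvmMetrics Get() {
    std::lock_guard<std::mutex> l(lock_);
    auto now = std::chrono::steady_clock::now();
    if (!valid_ || now - last_fetch_ >= std::chrono::seconds(10)) {
      cached_ = FetchJvmMetricsFromFrontend();
      last_fetch_ = now;
      valid_ = true;
    }
    return cached_;
  }

 private:
  std::mutex lock_;
  bool valid_ = false;
  std::chrono::steady_clock::time_point last_fetch_;
  TJvmMetrics cached_;
};

Each GetValue() then reads its field from the struct returned by Get().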
There are five metrics here, and to avoid yet another enum class, I used
C++ lambdas to capture which field of the Thrift object I care about. If
folks like the approach, I think it can simplify away the enums for the
pool metrics as well.
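Roughly, each metric holds a small function that pulls its one field out
of the (cached) Thrift struct. A sketch of that pattern, with made-up
metric names and fields rather than the real ones:

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the cached Thrift struct.
struct TJvmMetrics {
  int64_t pause_count = 0;
  int64_t gc_count = 0;
  int64_t gc_time_ms = 0;
};

// Each metric carries a lambda that extracts the one field it reports,
// instead of switching on yet another enum.
struct JvmMetricDef {
  std::string name;
  std::function<int64_t(const TJvmMetrics&)> extract;
};

std::vector<JvmMetricDef> MakeJvmMetricDefs() {
  return {
      {"jvm.pause-count", [](const TJvmMetrics& m) { return m.pause_count; }},
      {"jvm.gc-count", [](const TJvmMetrics& m) { return m.gc_count; }},
      {"jvm.gc-time-ms", [](const TJvmMetrics& m) { return m.gc_time_ms; }},
  };
}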
I measured the cost of calling into the metrics code by
looping the metrics-gathering 100 times and looking at CPU
time for the process using this script:
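# The pid is found via fuser on the impalad web port (25000); fields 14
# and 15 of /proc/<pid>/stat are utime and stime, in clock ticks
# (typically 100 per second).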
START_CPU=$(cat /proc/$(fuser 25000/tcp 2> /dev/null | tr -d ' ')/stat | awk '{ print $14 + $15 }')
for i in $(seq 100); do
curl http://localhost:25000/jsonmetrics?json > /dev/null 2> /dev/null
done
END_CPU=$(cat /proc/$(fuser 25000/tcp 2> /dev/null | tr -d ' ')/stat | awk '{ print $14 + $15 }')
echo $START_CPU $END_CPU $(($END_CPU - $START_CPU))
On a release build on my development machine, gathering metrics 100
times took 0.16 cpu seconds without this change and 0.07 cpu seconds
with this change. The measurement accuracy here is 0.01 seconds (I
spot-checked this using the cpuacct cgroup infrastructure, which gives
you nanos, but it was more painful to script), but this convinces me
that this is a net improvement.
Change-Id: Ia707393962ad94ef715ec015b3fe3bb1769104a2
Reviewed-on: http://gerrit.cloudera.org:8080/11468
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Pause monitor:
=============
This commit adds a stripped-down version of Hadoop's JvmPauseMonitor
class (https://bit.ly/2O6qSwm). The core implementation is borrowed
from the hadoop-common project and the Hadoop dependencies are removed:
- Removed dependency on AbstractService.
- Not relying on Hadoop's Configuration object for reading confs.
- Switched to Guava's implementation of Stopwatch.
This utility class can detect both GC and non-GC pauses. In the case of
GC pauses, the GC metrics for the pause period are logged.
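The detection works by having a thread sleep for a short, fixed interval
and measure how long the sleep actually took: if the overshoot exceeds a
threshold, a pause is reported, and for GC pauses the per-pool GC counts
and times are logged (as in the sample output below). A minimal sketch
of that sleep-and-measure loop, in C++ rather than the Java of the real
class, with assumed interval and threshold values and without the
per-pool GC reporting:

#include <chrono>
#include <iostream>
#include <thread>

// Sketch of the sleep-and-measure loop only; interval and threshold are
// assumed values, and the GC-pool diffing done by the real (Java) class
// is omitted.
void PauseMonitorLoop() {
  using namespace std::chrono;
  const auto kSleepInterval = milliseconds(500);
  const auto kPauseThreshold = milliseconds(1000);
  while (true) {
    auto start = steady_clock::now();
    std::this_thread::sleep_for(kSleepInterval);
    auto extra = duration_cast<milliseconds>(steady_clock::now() - start)
        - kSleepInterval;
    if (extra > kPauseThreshold) {
      std::cout << "Detected pause in JVM or host machine (eg GC): pause of "
                << "approximately " << extra.count() << "ms" << std::endl;
    }
  }
}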
Sample Output:
=============
Detected pause in JVM or host machine (eg GC): pause of approximately
2356ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=2241ms
GC pool 'PS Scavenge' had collection(s): count=3 time=352ms
Detected pause in JVM or host machine (eg GC): pause of approximately
1964ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=2082ms
GC pool 'PS Scavenge' had collection(s): count=1 time=251ms
Detected pause in JVM or host machine (eg GC): pause of approximately
2120ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=2454ms
Detected pause in JVM or host machine (eg GC): pause of approximately
2238ms
GC pool 'PS MarkSweep' had collection(s): count=5 time=13464ms
Detected pause in JVM or host machine (eg GC): pause of approximately
2233ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=2733ms
JMX Metrics:
============
JMX metrics are now emitted for the Impala and Catalog JVMs at the web
endpoint /jmx:
- Impalad: http(s)://<impalad-host>:25000/jmx
- Catalogd: http(s)://<catalogd-host>:25020/jmx
Misc:
====
Renamed JvmMetric -> JvmMemoryMetric to make the intent clearer. It
doesn't relate to the functionality of the patch in any way.
Testing:
=======
- Tested it manually with kill -SIGSTOP/-SIGCONT <pid>. Made sure that
the non-GC JVM pauses are logged.
- The class's functionality is tested manually by invoking its main()
- Injected a memory leak into the Catalog server code and made sure the
GC pauses are detected.
Change-Id: I30d897b7e063846ad6d8f88243e2c04264da0341
Reviewed-on: http://gerrit.cloudera.org:8080/10998
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>