impala/docker/timeline.html.template
Philip Zeyliger 2e6a63e31e IMPALA-6070: Further improvements to test-with-docker.
This commit tackles a few additions and improvements to
test-with-docker. In general, I'm adding workloads (e.g., exhaustive,
rat-check), tuning memory setting and parallelism, and trying to speed
things up.

Bug fixes:

* Embarrassingly, I was still skipping thrift-server-test in the backend
  tests. This was a mistake in handling feedback from my last review.

* I made the timeline a little taller so that it clips less.

Adding workloads:

* I added the RAT licensing check.

* I added exhaustive runs. This led me to model the suites a little
  more in Python, with a class representing each suite and the
  data associated with it (a sketch follows below). It's not perfect
  and still coupled with the entrypoint.sh shell script, but it feels
  workable. As part of adding exhaustive tests, I had to re-work the
  timeout handling, since different suites now meaningfully have
  different timeouts.
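
  For illustration, the modeling amounts to something like the
  following (a hypothetical sketch; the actual class and field names
  in test-with-docker may differ):

      # Hypothetical sketch of the per-suite modeling; names are illustrative.
      class Suite(object):
        def __init__(self, name, timeout_minutes=120, shards=1):
          self.name = name                        # handed off to entrypoint.sh
          self.timeout_minutes = timeout_minutes  # suites have their own timeouts
          self.shards = shards                    # parallel shards for py.test suites

      ALL_SUITES = [
        Suite("BE_TEST", timeout_minutes=60),
        Suite("EE_TEST_SERIAL"),
        Suite("EE_TEST_SERIAL_EXHAUSTIVE", timeout_minutes=240),
      ]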

Speed ups:

* To speed up test runs, I added a mechanism to split py.test suites into
  multiple shards with a py.test argument (a conftest.py sketch follows
  below). This involved a little bit of work in conftest.py, and exposing
  $RUN_CUSTOM_CLUSTER_TESTS_ARGS in run-all-tests.sh.

  Furthermore, I moved a bit more logic about managing the
  list of suites into Python.
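
  For reference, sharding of this sort can be implemented with a small
  conftest.py hook along these lines (the option name is hypothetical
  and may not match the actual flag):

      # Hypothetical conftest.py sharding hook; the option name is illustrative.
      def pytest_addoption(parser):
        parser.addoption("--shard", default=None,
                         help="Run only shard N of M, given as N/M (1-based).")

      def pytest_collection_modifyitems(config, items):
        shard = config.getoption("--shard")
        if not shard:
          return
        n, m = (int(x) for x in shard.split("/"))
        selected, deselected = [], []
        for i, item in enumerate(items):
          # Keep every m-th collected test, offset by the shard number.
          (selected if i % m == n - 1 else deselected).append(item)
        config.hook.pytest_deselected(items=deselected)
        items[:] = selected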

* I now do the full build with "-notests" and build the backend tests
  only in the relevant target that needs them. This speeds
  up "docker commit" significantly by removing about 20GB from the
  container. I had to indicate that expr-codegen-test depends on
  expr-codegen-test-ir, which was missing.

* I sped up copying the Kudu data: previously I did both a move and a
  copy; now I do two moves. One of the moves crosses filesystems and is
  therefore slow, but this halves the amount of data copied.

Memory usage:

* I tweaked the memlimit_gb settings to have a higher default. I've been
  tuning them empirically to have the tests run well on c4.8xlarge and
  m4.10xlarge.

The more memory a minicluster and test suite run uses, the fewer parallel
suites we can run. By observing the peak processes at the tail of a run (with a
new "memory_usage" function that uses a ps/sort/awk trick; a Python equivalent
is sketched after the list below) and by observing peak container total_rss, I
found several JVMs that didn't have Xmx settings set. I added Xms/Xmx settings
in a few places:

 * Impalads other than the first do very little JVM work, so an Xmx
   setting keeps them small, even in the parallel tests.
 * Datanodes do work, but they were essentially never garbage
   collecting, because the JVM defaults let them use up to a quarter
   of the machine's memory. (I observed this based on RSS at the
   end of the run; nothing fancier.) Adding Xms/Xmx settings
   helped.
 * Similarly, I piped the settings through to HBase.

A few daemons still run without resource limitations, but they don't
seem to be a problem.
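
For reference, a rough Python equivalent of the ps/sort/awk trick behind
"memory_usage" (my sketch, not the actual shell helper):

    # Print the processes with the largest resident sets, largest first.
    import subprocess

    def memory_usage(top_n=20):
      out = subprocess.check_output(["ps", "-eo", "rss=,args="],
                                    universal_newlines=True)
      rows = []
      for line in out.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
          rows.append((int(parts[0]), parts[1]))  # (RSS in KiB, command line)
      rows.sort(reverse=True)
      for rss_kib, cmd in rows[:top_n]:
        print("%8.1f MiB  %s" % (rss_kib / 1024.0, cmd[:120]))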

Change-Id: I43fe124f00340afa21ad1eeb6432d6d50151ca7c
Reviewed-on: http://gerrit.cloudera.org:8080/10123
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-26 20:47:29 +00:00


<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!--
Template/header for a timeline visualization of a multi-container build.
The timelines represent interesting log lines, with one row per container.
The charts represent CPU usage within those containers.
To use this file, concatenate it with a '<script>' block defining
a global variable named data.
The expected format of data is shown in the example below,
and is tightly coupled with the code generating it in monitor.py.
The intention of this unfriendly file format is to do as much munging
as possible in Python.
To make the visualization relative to the start time (i.e., so that all
builds start at 00:00), the timestamps are all seconds since the build
began. To make the visualization work with them, the timestamps are then
converted into local time, where they display reasonably. This is a
workaround for the fact that the visualization library for the timelines
does not accept any data type that represents a duration, but we still
want timestamp-style formatting.
var data = {
  // maximum timestamp seen, in seconds since the start of the build
  "max_ts": 8153.0,
  // map of container name to an array of metric points
  "metrics": {
    "i-20180312-140548-ee-test-serial": [
      // a single metric point is an array of [timestamp, user CPU, system CPU];
      // CPU is the percent of one CPU used since the previous timestamp.
      [
        4572.0,
        0.11,
        0.07
      ]
    ]
  },
  // array of timeline entries
  "timeline": [
    // a timeline entry contains a name (for the entire row of the timeline),
    // the message (for a segment of the timeline), and start and end
    // timestamps for the segment.
    [
      "i-20180312-140548",
      "+ echo '>>> build' '4266 (begin)'",
      0.0,
      0.0
    ]
  ]
}
-->
<script type="text/javascript" src="https://www.gstatic.com/charts/loader.js"></script>
<script type="text/javascript">
google.charts.load("current", {packages:["timeline", "corechart"]});
google.charts.setOnLoadCallback(drawChart);
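/* Converts a duration in seconds into an [hours, minutes, seconds] array,
 * the "timeofday" representation used by the CPU line charts below; the
 * seconds component may be fractional. */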
function ts_to_hms(secs) {
  var s = secs % 60;
  var m = Math.floor(secs / 60) % 60;
  var h = Math.floor(secs / (60 * 60));
  return [h, m, s];
}
/* Returns a Date object corresponding to secs seconds since the epoch, in
* localtime. Date(x) and Date(0, 0, 0, 0, 0, 0, 0, x) differ in that the
* former returns UTC whereas the latter returns the browser local time.
* For consistent handling within this visualization, we use localtime.
*
* Beware that local time can be discontinuous around time changes.
*/
function ts_to_date(secs) {
  // secs may be a float, so we use millis as a common denominator unit
  var millis = 1000 * secs;
  return new Date(1970 /* yr; beginning of unix epoch */, 0 /* mo */, 0 /* d */,
                  0 /* hr */, 0 /* min */, 0 /* sec */, millis);
}
function drawChart() {
  var timelineContainer = document.getElementById('timelineContainer');
  var chart = new google.visualization.Timeline(timelineContainer);
  var dataTable = new google.visualization.DataTable();
  dataTable.addColumn({ type: 'string', id: 'Position' });
  dataTable.addColumn({ type: 'string', id: 'Name' });
  // timeofday isn't supported here
  dataTable.addColumn({ type: 'datetime', id: 'Start' });
  dataTable.addColumn({ type: 'datetime', id: 'End' });
  // Timeline
  for (var i = 0; i < data.timeline.length; ++i) {
    var row = data.timeline[i];
    dataTable.addRow([ row[0], row[1], ts_to_date(row[2]), ts_to_date(row[3]) ]);
  }
  chart.draw(dataTable, { height: "400px" } );
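  // One CPU line chart per container. Each metric point is
  // [timestamp, user CPU, system CPU]; CPU is the percent of one CPU
  // used since the previous sample (see the data format above).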
  var lineChartContainer = document.getElementById('lineChartContainer');
  for (const k of Object.keys(data.metrics)) {
    var lineChart = document.createElement("div");
    lineChartContainer.appendChild(lineChart);
    var dataTable = new google.visualization.DataTable();
    dataTable.addColumn({ type: 'timeofday', id: 'Time' });
    dataTable.addColumn({ type: 'number', id: 'User' });
    dataTable.addColumn({ type: 'number', id: 'System' });
    for (const row of data.metrics[k]) {
      dataTable.addRow([ ts_to_hms(row[0]), row[1], row[2] ]);
    }
    var options = {
      title: 'CPU',
      legend: { position: 'bottom' },
      hAxis: {
        minValue: [0, 0, 0],
        maxValue: ts_to_hms(data.max_ts)
      }
    };
    var chart = new google.visualization.LineChart(lineChart);
    chart.draw(dataTable, options);
  }
}
</script>
<div id="timelineContainer" style="height: 400px;"></div>
<div id="lineChartContainer" style="height: 200px;"></div>