Commit Graph

7 Commits

Author SHA1 Message Date
Joe McDonnell
5acce4d200 IMPALA-9472,IMPALA-9473: Add per-partition metrics for data cache
This adds two sets of metrics. The first is per-partition metrics
to track the performance of the underlying filesystem for the
data cache. It keeps histograms of read, write, and eviction
latency for each data cache partition along with another metric
recording the path for the partition. These are exposed as the
following metrics:
impala-server.io-mgr.remote-data-cache-partition-$0.path
impala-server.io-mgr.remote-data-cache-partition-$0.read-latency
impala-server.io-mgr.remote-data-cache-partition-$0.write-latency
impala-server.io-mgr.remote-data-cache-partition-$0.eviction-latency

This also adds metrics to keep counts of hits, misses, and entries
in the data cache. Since reducing the latency of IO is an important
feature of the data cache, the absolute count of hits and misses
is as important as the hit bytes and miss bytes. This adds the
following metrics:
impala-server.io-mgr.remote-data-cache-hit-count
impala-server.io-mgr.remote-data-cache-miss-count
impala-server.io-mgr.remote-data-cache-num-entries

To track metrics around inserts, this also adds the following
metrics:
impala-server.io-mgr.remote-data-cache-num-inserts
impala-server.io-mgr.remote-data-cache-dropped-entries
impala-server.io-mgr.remote-data-cache-instant-evictions
An instant eviction happens when inserting an entry into the cache
fails and the entry is immediately evicted during insert. This is
currently only possible for LIRS when the entry's size is larger
than the unprotected capacity. This manifests when the cache
size is very small. For example, for an 8MB entry, this would
manifest when a cache shard is smaller than 160MB. This metric
is primarily for debugging.

Testing:
 - Hand testing to verify the per-partition latency histograms
 - Modified custom_cluster/test_data_cache.py to also test
   the counts.

Change-Id: I56a57d75ff11f00ebc85b85bcaf104fb8108c478
Reviewed-on: http://gerrit.cloudera.org:8080/15382
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-01 14:37:58 +00:00
Joe McDonnell
01691b998a IMPALA-8690: Add LIRS cache eviction algorithm
One concern for the data cache is that the LRU eviction
algorithm is suceptible to being flushed by large
scans of low priority data. This implements the LIRS algorithm
described in "LIRS: An Efficient Low Inter-reference Recency
Set Replacement Policy to Improve Buffer Cache Performance"
by Song Jiang / Xiaodon Xhang 2002. LIRS is a scan-resistent
eviction algorithm with low performance penalty to LRU.

This introduces the startup flag data_cache_eviction_policy to
control which eviction policy to use. The only two options are
LRU and LIRS, with the default continuing to be LRU.

To accomodate the new algorithm and associated tests, some
code moved around:
1. The RLCacheShard implementation moved from util/cache/cache.cc
   to util/cache/rl-cache.cc.
2. The backend cache tests were split into multiple files.
   util/cache/cache-test.h contains shared cache testing code.
   util/cache/cache-test.cc contains generic tests that should
   work for any algorithm.
   util/cache/rl-cache-test.cc are RLCacheShard specific tests
   util/cache/lirs-cache-test.cc are LIRS specific tests
3. To make it easy for clients of the cache code to customize
   the cache eviction algorithm, the public interface changed
   from using a template to taking the policy as an argument.
4. Cache::MemoryType is removed.
5. Cache adds an Init() method to verify the validity of
   startup flags

Testing:
 - Added LIRS specific backend cache tests (lirs-cache-test)
 - Ran TPC-DS with a very small cache and concurrency to test
   corner cases with the LIRS eviction policy
 - Parameterized data-cache-test to run for both LRU and LIRS
 - Added LIRS equivalents for tests in custom_cluster/test_data_cache.py
 - Ran cache-bench with LRU and LIRS. The results are:
   Test case           | Algorithm | Lookups / sec | Hit rate
   ZIPFIAN ratio=1.00x | LRU       | 11.31M        | 99.9%
   ZIPFIAN ratio=1.00x | LIRS      | 10.09M        | 99.8%
   ZIPFIAN ratio=3.00x | LRU       | 11.36M        | 95.9%
   ZIPFIAN ratio=3.00x | LIRS      |  9.27M        | 96.4%
   UNIFORM ratio=1.00x | LRU       |  7.46M        | 99.8%
   UNIFORM ratio=1.00x | LIRS      |  6.93M        | 99.8%
   UNIFORM ratio=3.00x | LRU       |  5.63M        | 33.3%
   UNIFORM ratio=3.00x | LIRS      |  3.24M        | 33.3%
   The takeaway is that LIRS is a bit slower on lookups and
   quite a bit slower on inserts. However, they both are still
   doing millions of operations per second, so it should not
   be a bottleneck for the data cache.

Change-Id: I670fa4b2b7c93998130dc4e8b2546bb93e9a84f8
Reviewed-on: http://gerrit.cloudera.org:8080/15306
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-31 16:08:02 +00:00
Michael Ho
9874ce37a9 IMPALA-8691: Query option to disable data cache
This change adds a query option to disable the data cache for
a given session. By default, this option is set to false. When
it's set to true, all queries will by-pass the data cache. This
allows users to avoid polluting the cache for accesses to tables
which they don't want to cache. A follow-up change will add
a per-table query hint to allow caching disabled for a given
table only.

There is some small refactoring in the code to make it clearer
the type of caching being referred to in the code. As the code
stands now, we have both HDFS caching (for local reads) and the
data cache (for remote reads). BufferOpts has been extended to
allow users to explicitly state intention for using either/both
of the caches.

Change-Id: I39122ac38435cedf94b2b39145863764d0b5b6c8
Reviewed-on: http://gerrit.cloudera.org:8080/14015
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-08-23 18:25:44 +00:00
Bikramjeet Vig
dc0f8a1eee IMPALA-8570: Fix flakiness in test_restart_statestore_query_resilience
The test relies on scheduling decisions made on a 3 node minicluster
without erasure coding. This patch ensures that this test is skipped
if those conditions are not met by adding a new
SkipIfNotHdfsMinicluster.scheduling marker for the same. Existing
tests that rely on the same conditions were also updated to use the
marker.

Change-Id: I0a54b6e149c42b696c954b5240d6de61453bf7f9
Reviewed-on: http://gerrit.cloudera.org:8080/13406
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-05-31 04:26:02 +00:00
Michael Ho
460aef657a IMPALA-8512: Disable certain tests on Centos6
The data cache related tests rely on data cache files being created
successfully on local filesystem. The cache initialization may fail
if the cache directory resides on a ext filesystem which is affected
by KUDU-1508 (metadata corruption after hole punching in some files).
On some older versions of Centos6, the tests fail as a result of
this bug.

This change skips these tests if they detect that it's running on
an old system affected by KUDU-1508. This patch also disables a
filesystem-util test which relies on readdir() returning the correct
entries' types. On some older platforms such as Centos6, this feature
may not be fully supported on all filesystems.

Change-Id: Ifbff15415bc690f779a09ec93a7ded8b394eca10
Reviewed-on: http://gerrit.cloudera.org:8080/13271
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2019-05-08 18:16:28 +00:00
Michael Ho
25db9ea8f3 IMPALA-8496: Fix flakiness of test_data_cache.py
test_data_cache.py was added as part of IMPALA-8341 to verify
that the DataCache hit / miss counts and the DataCache metrics
are working as expected. The test seems to fail intermittently
due to unexpected cache misses.

Part of the test creates a temp table from tpch_parquet.lineitem
and uses it to join against tpch_parquet.lineitem itself on the
l_orderkey column. The test expects a complete cache hit for
tpch_parquet.lineitem when joining against the temp table as it
should be cached entirely as part of CTAS statement. However, this
doesn't work as expected all the time. In particular, the data cache
internally divides up the key space into multiple shards and a key
is hashed to determine the shard it belongs to. By default, the
number of shards is the same as number of CPU cores (e.g. 16 for AWS
m5-4xlarge instance). Since the cache size is set to 500MB, each shard
will have a capacity of 31MB only. In some cases, it's possible that
some rows of l_orderkey are evicted if the shard they belong to grow
beyond 31MB. The problem is not deterministic as part of the cache key
is the modification time of the file, which changes from run-to-run as
it's essentially determined by the data loading time of the job. This
leads to flakiness of the test.

To fix this problem, this patch forces the data cache to use a single
shard only for determinisim. In addition, the test is also skipped for
non-HDFS and HDFS erasure encoding builds as it's dependent on the scan
range assignment. To exercise the cache more extensively, the plan is
to enable it by default for S3 builds instead of relying on BE and E2E
tests only.

Testing done:
- Ran test_data_cache.py 10+ times, each with different mtime
  for tpch_parquet.lineitem; Used to fail 2 out of 3 runs.

Change-Id: I98d5b8fa1d3fb25682a64bffaf56d751a140e4c9
Reviewed-on: http://gerrit.cloudera.org:8080/13242
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-05-05 02:11:55 +00:00
Michael Ho
2ece4c9b2e IMPALA-8341: Data cache for remote reads
This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache which is backed by
local storage. It implicitly relies on the OS page cache
management to shuffle data between memory and the storage
device. This is useful for caching data read from remote
filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string which is a list of directories, separated
by comma, followed by the storage capacity per directory.
An example configuration string is like the following:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space, with 150GB max for /data/0 and /data/1 respectively.

Each partition has a meta-data cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file's name, file's modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges. In other words,
if the cache contains an entry of a file for range [m, m+4MB), a lookup
for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen
this as a problem but this may require further evaluation in the future.

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the current backing file in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. As a
backing file reaches a certain size (by default 4TB), new data will
stop being appended to it and a new file will be created instead. Note
that due to hole punching, the backing file is actually sparse. When
the number of backing files per partition exceeds,
--data_cache_max_files_per_partition, files are deleted in the order
in which they are created. Stale cache entries referencing deleted
files are erased lazily or evicted due to inactivity.

Optionally, checksumming can be enabled to verify read from the cache
is consistent with what was inserted and to verify that multiple attempted
insertions with the same cache key have the same cache content.
Checksumming is enabled by default for debug builds.

To probe for cached data in the cache, the interface Lookup() is used;
To insert data into the cache, the interface Store() is used. Please note
that eviction happens inline currently during Store().

This patch also added two startup flags for start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each Impalad
creates the caching directory
'--data_cache_size' specifies the capacity string for each cache directory.

Testing done:
- added a new BE and EE test
- exhaustive (debug, release) builds with cache enabled
- core ASAN build with cache enabled

Perf:
- 16-streams TPCDS at 3TB in a 20 node S3 cluster shows about 30% improvement
over runs without the cache. Each node has a cache size of 150GB per node.
The performance is at parity with a configuration of a HDFS cluster using
EBS as the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Reviewed-on: http://gerrit.cloudera.org:8080/12987
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-05-03 19:39:42 +00:00