<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.4" id="hdfs_caching">

<title>Using HDFS Caching with Impala (<keyword keyref="impala21"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>HDFS Caching</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Scalability"/>
<data name="Category" value="HDFS"/>
<data name="Category" value="HDFS Caching"/>
<data name="Category" value="Memory"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
HDFS caching provides performance and scalability benefits in production environments where Impala queries
and other Hadoop jobs operate on quantities of data much larger than the physical RAM on the DataNodes,
making it impractical to rely on the Linux OS cache, which only keeps the most recently used data in memory.
Data read from the HDFS cache avoids the overhead of checksumming and memory-to-memory copying involved when
using data from the Linux OS cache.
</p>

<note>
<p>
On a small or lightly loaded cluster, HDFS caching might not produce any speedup. It might even lead to
slower queries if I/O read operations that were performed in parallel across the entire cluster are replaced
by in-memory operations on a smaller number of hosts. The hosts where the HDFS blocks are cached can become
bottlenecks because they experience high CPU load while processing the cached data blocks, while other hosts remain idle.
Therefore, always compare performance with and without this feature enabled, using a realistic workload.
</p>
<p rev="2.2.0">
In <keyword keyref="impala22_full"/> and higher, you can spread the CPU load more evenly by specifying the <codeph>WITH REPLICATION</codeph>
clause of the <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements.
This clause lets you control the replication factor for
HDFS caching for a specific table or partition. By default, each cached block is
only present on a single host, which can lead to CPU contention if the same host
processes each cached block. Increasing the replication factor lets Impala choose
different hosts to process different cached blocks, to better distribute the CPU load.
Always use a <codeph>WITH REPLICATION</codeph> setting of at least 3, and adjust upward
if necessary to match the replication factor for the underlying HDFS data files.
</p>
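<p rev="2.2.0">
For example, you might cache a table with a cache replication factor of 3, along these lines (the table
and pool names here are placeholders only):
</p>
<codeblock rev="2.2.0">alter table census set cached in '<varname>pool_name</varname>' with replication = 3;
</codeblock>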
<p rev="2.5.0">
In <keyword keyref="impala25_full"/> and higher, Impala automatically randomizes which host processes
a cached HDFS block, to avoid CPU hotspots. For tables where HDFS caching is not applied,
Impala designates which host processes a data block by using an algorithm that estimates
the load on each host. If CPU hotspots still arise during queries,
you can enable additional randomization for the scheduling algorithm for non-HDFS cached data
by setting the <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option.
</p>
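<p rev="2.5.0">
For example, a session that still sees hotspots on non-cached data might enable the extra randomization
like this (a minimal illustration only):
</p>
<codeblock rev="2.5.0">set schedule_random_replica=true;
</codeblock>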
</note>

<p outputclass="toc inpage"/>

<!-- Could conref this background link; haven't decided yet the best place or if it's needed twice. -->

<p>
For background information about how to set up and manage HDFS caching for a <keyword keyref="distro"/> cluster, see
<xref keyref="setup_hdfs_caching"/>.
</p>
</conbody>

<concept id="hdfs_caching_overview">

<title>Overview of HDFS Caching for Impala</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
</metadata>
</prolog>

<conbody>

<p>
In <keyword keyref="impala14_full"/> and higher, Impala can use the HDFS caching feature to make more effective use of RAM, so that
repeated queries can take advantage of data <q>pinned</q> in memory regardless of how much data is
processed overall. The HDFS caching feature lets you designate a subset of frequently accessed data to be
pinned permanently in memory, remaining in the cache across multiple queries and never being evicted. This
technique is suitable for tables or partitions that are frequently accessed and are small enough to fit
entirely within the HDFS memory cache. For example, you might designate several dimension tables to be
pinned in the cache, to speed up many different join queries that reference them. Or in a partitioned
table, you might pin a partition holding data from the most recent time period because that data will be
queried intensively; then when the next set of data arrives, you could unpin the previous partition and pin
the partition holding the new data.
</p>

<p>
Because this Impala performance feature relies on HDFS infrastructure, it only applies to Impala tables
that use HDFS data files. HDFS caching for Impala does not apply to HBase tables, S3 tables,
Kudu tables,
or Isilon tables.
</p>

</conbody>
</concept>

<concept id="hdfs_caching_prereqs">

<title>Setting Up HDFS Caching for Impala</title>

<conbody>

<p>
To use HDFS caching with Impala, first set up that feature for your <keyword keyref="distro"/> cluster:
</p>

<ul>
<li>
<p>
Decide how much memory to devote to the HDFS cache on each host. Remember that the total memory available
for cached data is the sum of the cache sizes on all the hosts. By default, any data block is only cached on one
host, although you can cache a block across multiple hosts by increasing the replication factor.
<!-- Obsoleted in Impala 2.2 and higher by IMPALA-1587.
Once a data block is cached on one host, all requests to process that block are routed to that same host.)
-->
</p>
</li>

<li>
<p>
Issue <cmdname>hdfs cacheadmin</cmdname> commands to set up one or more cache pools, owned by the same
user as the <cmdname>impalad</cmdname> daemon (typically <codeph>impala</codeph>). For example:
<codeblock>hdfs cacheadmin -addPool four_gig_pool -owner impala -limit 4000000000
</codeblock>
For details about the <cmdname>hdfs cacheadmin</cmdname> command, see
<xref keyref="setup_hdfs_caching"/>.
</p>
</li>
</ul>

<p>
Once HDFS caching is enabled and one or more pools are available, see
<xref href="impala_perf_hdfs_caching.xml#hdfs_caching_ddl"/> for how to choose which Impala data to load
into the HDFS cache. On the Impala side, you specify the cache pool name defined by the <codeph>hdfs
cacheadmin</codeph> command in the Impala DDL statements that enable HDFS caching for a table or partition,
such as <codeph>CREATE TABLE ... CACHED IN <varname>pool</varname></codeph> or <codeph>ALTER TABLE ... SET
CACHED IN <varname>pool</varname></codeph>.
</p>
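<p>
For example, with the <codeph>four_gig_pool</codeph> pool created above, the corresponding Impala
statements might look like the following sketch (the table names are illustrative only):
</p>
<codeblock>-- Cache a new table from the moment it is created.
create table t1 (x int, s string) cached in 'four_gig_pool';

-- Cache an existing table.
alter table t2 set cached in 'four_gig_pool';
</codeblock>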
</conbody>
</concept>

<concept id="hdfs_caching_ddl">

<title>Enabling HDFS Caching for Impala Tables and Partitions</title>

<conbody>

<p>
Begin by choosing which tables or partitions to cache. For example, these might be lookup tables that are
accessed by many different join queries, or partitions corresponding to the most recent time period that
are analyzed by different reports or ad hoc queries.
</p>

<p>
In your SQL statements, you specify logical divisions such as tables and partitions to be cached. Impala
translates these requests into HDFS-level directives that apply to particular directories and files. For
example, given a partitioned table <codeph>CENSUS</codeph> with a partition key column
<codeph>YEAR</codeph>, you could choose to cache all or part of the data as follows:
</p>

<p conref="../shared/impala_common.xml#common/impala_cache_replication_factor"/>

<codeblock>-- Cache the entire table (all partitions).
alter table census set cached in '<varname>pool_name</varname>';

-- Remove the entire table from the cache.
alter table census set uncached;

-- Cache a portion of the table (a single partition).
-- If the table is partitioned by multiple columns (such as year, month, day),
-- the ALTER TABLE command must specify values for all those columns.
alter table census partition (year=1960) set cached in '<varname>pool_name</varname>';

<ph rev="2.2.0">-- Cache the data from one partition on up to 4 hosts, to minimize CPU load on any
-- single host when the same data block is processed multiple times.
alter table census partition (year=1970)
  set cached in '<varname>pool_name</varname>' with replication = 4;</ph>

-- At each stage, check the volume of cached data.
-- For large tables or partitions, the background loading might take some time,
-- so you might have to wait and reissue the statement until all the data
-- has finished being loaded into the cache.
show table stats census;
+-------+-------+--------+------+--------------+--------+
| year  | #Rows | #Files | Size | Bytes Cached | Format |
+-------+-------+--------+------+--------------+--------+
| 1900  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
| 1940  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
| 1960  | -1    | 1      | 11B  | 11B          | TEXT   |
| 1970  | -1    | 1      | 11B  | NOT CACHED   | TEXT   |
| Total | -1    | 4      | 44B  | 11B          |        |
+-------+-------+--------+------+--------------+--------+
</codeblock>

<p>
<b>CREATE TABLE considerations:</b>
</p>

<p>
The HDFS caching feature affects the Impala <codeph>CREATE TABLE</codeph> statement as follows:
</p>

<ul>
<li>
<p>
You can put a <codeph>CACHED IN '<varname>pool_name</varname>'</codeph> clause
<ph rev="2.2.0">and optionally a <codeph>WITH REPLICATION = <varname>number_of_hosts</varname></codeph> clause</ph>
at the end of a
<codeph>CREATE TABLE</codeph> statement to automatically cache the entire contents of the table,
including any partitions added later. The <varname>pool_name</varname> is a pool that you previously set
up with the <cmdname>hdfs cacheadmin</cmdname> command.
</p>
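<p>
For example, a frequently joined dimension table might be created fully cached from the start, along the
lines of this sketch (the table name, columns, and file format are illustrative; the pool must already
exist):
</p>
<codeblock>create table lookup_dim (id bigint, descr string)
  stored as parquet
  cached in '<varname>pool_name</varname>' with replication = 3;
</codeblock>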
</li>

<li>
<p>
Once a table is designated for HDFS caching through the <codeph>CREATE TABLE</codeph> statement, if new
partitions are added later through <codeph>ALTER TABLE ... ADD PARTITION</codeph> statements, the data in
those new partitions is automatically cached in the same pool.
</p>
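<p>
For example, if a partitioned table was created with a <codeph>CACHED IN</codeph> clause, a later
<codeph>ADD PARTITION</codeph> needs no separate caching step (the table name is illustrative):
</p>
<codeblock>alter table census_cached add partition (year=1980);
-- The data files for the new partition are cached in the same pool automatically.
</codeblock>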
</li>

<li>
<p>
If you want to perform repetitive queries on a subset of data from a large table, and it is not practical
to designate the entire table or specific partitions for HDFS caching, you can create a new cached table
with just a subset of the data by using <codeph>CREATE TABLE ... CACHED IN '<varname>pool_name</varname>'
AS SELECT ... WHERE ...</codeph>. When you are finished generating reports from this subset of data,
drop the table; both the data files and the data cached in RAM are automatically removed.
</p>
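<p>
A minimal sketch of that workflow, with illustrative table and pool names:
</p>
<codeblock>create table census_recent cached in '<varname>pool_name</varname>' as
  select * from census where year >= 1970;

-- ... run the repetitive reporting queries against census_recent ...

drop table census_recent;
-- Both the data files and the cached data are removed along with the table.
</codeblock>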
</li>
</ul>

<p>
See <xref href="impala_create_table.xml#create_table"/> for the full syntax.
</p>

<p>
<b>Other memory considerations:</b>
</p>

<p>
Certain DDL operations, such as <codeph>ALTER TABLE ... SET LOCATION</codeph>, are blocked while the
underlying HDFS directories contain cached files. You must uncache the files first, before changing the
location, dropping the table, and so on.
</p>
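<p>
For example, before relocating a cached table <codeph>t1</codeph> (an illustrative name), you might first
remove its cache directives and then change the location (the path is also illustrative):
</p>
<codeblock>alter table t1 set uncached;
alter table t1 set location '/user/impala/relocated_data';
</codeblock>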

<p>
When data is requested to be pinned in memory, that process happens in
the background without blocking access to the data while the caching is
in progress. Loading the data from disk could take some time. Impala
reads each HDFS data block from memory if it has been pinned already, or
from disk if it has not been pinned yet.
</p>

<p>
The amount of data that you can pin on each node through the HDFS caching mechanism is subject to a quota
that is enforced by the underlying HDFS service. Before requesting to pin an Impala table or partition in
memory, check that its size does not exceed this quota.
</p>

<note>
Because the HDFS cache consists of combined memory from all the DataNodes in the cluster, cached tables or
partitions can be bigger than the amount of HDFS cache memory on any single host.
</note>
</conbody>
</concept>

<concept id="hdfs_caching_etl">

<title>Loading and Removing Data with HDFS Caching Enabled</title>
<prolog>
<metadata>
<data name="Category" value="ETL"/>
</metadata>
</prolog>

<conbody>

<p>
When HDFS caching is enabled, extra processing happens in the background when you add or remove data
through statements such as <codeph>INSERT</codeph> and <codeph>DROP TABLE</codeph>.
</p>

<p>
<b>Inserting or loading data:</b>
</p>

<ul>
<li>
When Impala performs an <codeph><xref href="impala_insert.xml#insert">INSERT</xref></codeph> or
<codeph><xref href="impala_load_data.xml#load_data">LOAD DATA</xref></codeph> statement for a table or
partition that is cached, the new data files are cached automatically, and Impala recognizes that fact
without any additional action on your part.
</li>

<li>
If you perform an <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> through Hive, as always, Impala
only recognizes the new data files after a <codeph>REFRESH <varname>table_name</varname></codeph>
statement in Impala.
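For example, after loading data through Hive into the cached <codeph>census</codeph> table from the
earlier examples, you might issue the following in Impala:
<codeblock>refresh census;
</codeblock>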
</li>

<li>
If the cache pool is entirely full, or becomes full before all the requested data can be cached, the
Impala DDL statement returns an error. This is to avoid situations where only some of the requested data
could be cached.
</li>

<li>
When HDFS caching is enabled for a table or partition, new data files are cached automatically when they
are added to the appropriate directory in HDFS, without the need for a <codeph>REFRESH</codeph> statement
in Impala. Impala automatically performs a <codeph>REFRESH</codeph> once the new data is loaded into the
HDFS cache.
</li>
</ul>

<p>
<b>Dropping tables, partitions, or cache pools:</b>
</p>

<p>
The HDFS caching feature interacts with the Impala
<codeph><xref href="impala_drop_table.xml#drop_table">DROP TABLE</xref></codeph> and
<codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE ... DROP PARTITION</xref></codeph>
statements as follows:
</p>

<ul>
<li>
When you issue a <codeph>DROP TABLE</codeph> for a table that is entirely cached, or has some partitions
cached, the <codeph>DROP TABLE</codeph> succeeds and all the cache directives Impala submitted for that
table are removed from the HDFS cache system.
</li>

<li>
The same applies to <codeph>ALTER TABLE ... DROP PARTITION</codeph>. The operation succeeds and any cache
directives are removed.
</li>

<li>
As always, the underlying data files are removed if the dropped table is an internal table, or the
dropped partition is in its default location underneath an internal table. The data files are left alone
if the dropped table is an external table, or if the dropped partition is in a non-default location.
</li>

<li>
If you designated the data files as cached through the <cmdname>hdfs cacheadmin</cmdname> command, and
the data files are left behind as described in the previous item, the data files remain cached. Impala
only removes the cache directives submitted by Impala through the <codeph>CREATE TABLE</codeph> or
<codeph>ALTER TABLE</codeph> statements. It is OK to have multiple redundant cache directives pertaining
to the same files; the directives all have unique IDs and owners so that the system can tell them apart.
</li>

<li>
If you drop an HDFS cache pool through the <cmdname>hdfs cacheadmin</cmdname> command, all the Impala
data files are preserved, just no longer cached. After a subsequent <codeph>REFRESH</codeph>,
<codeph>SHOW TABLE STATS</codeph> reports 0 bytes cached for each associated Impala table or partition.
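For example, assuming the <codeph>census</codeph> table from the earlier examples was cached in a pool
that has since been dropped, the sequence might look like this:
<codeblock>refresh census;
show table stats census;   -- The Bytes Cached column now reports 0 bytes.
</codeblock>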
</li>
</ul>

<p>
<b>Relocating a table or partition:</b>
</p>

<p>
The HDFS caching feature interacts with the Impala
<codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE ... SET LOCATION</xref></codeph>
statement as follows:
</p>

<ul>
<li>
If you have designated a table or partition as cached through the <codeph>CREATE TABLE</codeph> or
<codeph>ALTER TABLE</codeph> statements, subsequent attempts to relocate the table or partition through
an <codeph>ALTER TABLE ... SET LOCATION</codeph> statement will fail. You must issue an <codeph>ALTER
TABLE ... SET UNCACHED</codeph> statement for the table or partition first. Otherwise, Impala would lose
track of some cached data files and have no way to uncache them later.
</li>
</ul>
</conbody>
</concept>

<concept id="hdfs_caching_admin">

<title>Administration for HDFS Caching with Impala</title>

<conbody>

<p>
Here are the guidelines and steps to check or change the status of HDFS caching for Impala data:
</p>

<p>
<b>hdfs cacheadmin command:</b>
</p>

<ul>
<li>
If you drop a cache pool with the <cmdname>hdfs cacheadmin</cmdname> command, Impala queries against the
associated data files will still work, by falling back to reading the files from disk. After performing a
<codeph>REFRESH</codeph> on the table, Impala reports the number of bytes cached as 0 for all associated
tables and partitions.
</li>

<li>
You might use <cmdname>hdfs cacheadmin</cmdname> to get a list of existing cache directives, or detailed
information about those directives, as follows:
<codeblock scale="60">hdfs cacheadmin -listDirectives    # Basic info
Found 122 entries
  ID POOL       REPL EXPIRY  PATH
 123 testPool      1 never   /user/hive/warehouse/tpcds.store_sales
 124 testPool      1 never   /user/hive/warehouse/tpcds.store_sales/ss_date=1998-01-15
 125 testPool      1 never   /user/hive/warehouse/tpcds.store_sales/ss_date=1998-02-01
...

hdfs cacheadmin -listDirectives -stats    # More details
Found 122 entries
  ID POOL       REPL EXPIRY  PATH                                                       BYTES_NEEDED  BYTES_CACHED  FILES_NEEDED  FILES_CACHED
 123 testPool      1 never   /user/hive/warehouse/tpcds.store_sales                                0             0             0             0
 124 testPool      1 never   /user/hive/warehouse/tpcds.store_sales/ss_date=1998-01-15        143169        143169             1             1
 125 testPool      1 never   /user/hive/warehouse/tpcds.store_sales/ss_date=1998-02-01        112447        112447             1             1
...
</codeblock>
</li>
</ul>

<p>
<b>Impala SHOW statement:</b>
</p>

<ul>
<li>
For each table or partition, the <codeph>SHOW TABLE STATS</codeph> or <codeph>SHOW PARTITIONS</codeph>
statement displays the number of bytes currently cached by the HDFS caching feature. If there are no
cache directives in place for that table or partition, the result set displays <codeph>NOT
CACHED</codeph>. A value of 0, or a smaller number than the overall size of the table or partition,
indicates that the cache request has been submitted but the data has not been entirely loaded into memory
yet. See <xref href="impala_show.xml#show"/> for details.
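For example, to check the caching status of the <codeph>census</codeph> table from the earlier examples:
<codeblock>show table stats census;
show partitions census;
</codeblock>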
</li>
</ul>

<p>
<b>Impala memory limits:</b>
</p>

<p>
The Impala HDFS caching feature interacts with the Impala memory limits as follows:
</p>

<ul>
<li>
The maximum size of each HDFS cache pool is specified externally to Impala, through the <cmdname>hdfs
cacheadmin</cmdname> command.
</li>

<li>
All the memory used for HDFS caching is separate from the <cmdname>impalad</cmdname> daemon address space
and does not count towards the limits of the <codeph>--mem_limit</codeph> startup option,
<codeph>MEM_LIMIT</codeph> query option, or further limits imposed through YARN resource management or
the Linux <codeph>cgroups</codeph> mechanism.
</li>

<li>
Because accessing HDFS cached data avoids a memory-to-memory copy operation, queries involving cached
data require less memory on the Impala side than the equivalent queries on uncached data. In addition to
any performance benefits in a single-user environment, the reduced memory helps to improve scalability
under high-concurrency workloads.
</li>
</ul>
</conbody>
</concept>

<concept id="hdfs_caching_performance">

<title>Performance Considerations for HDFS Caching with Impala</title>

<conbody>

<p>
In Impala 1.4.0 and higher, Impala supports efficient reads from data that is pinned in memory through HDFS
caching. Impala takes advantage of the HDFS API and reads the data from memory rather than from disk,
whether the data files are pinned using Impala DDL statements or using the command-line mechanism where
you specify HDFS paths.
</p>

<p>
When you examine the output of the <cmdname>impala-shell</cmdname> <cmdname>SUMMARY</cmdname> command, or
look in the metrics report for the <cmdname>impalad</cmdname> daemon, you see how many bytes are read from
the HDFS cache. For example, this excerpt from a query profile illustrates that all the data read during a
particular phase of the query came from the HDFS cache, because the <codeph>BytesRead</codeph> and
<codeph>BytesReadDataNodeCache</codeph> values are identical.
</p>

<codeblock>HDFS_SCAN_NODE (id=0):(Total: 11s114ms, non-child: 11s114ms, % non-child: 100.00%)
   - AverageHdfsReadThreadConcurrency: 0.00
   - AverageScannerThreadConcurrency: 32.75
<b>   - BytesRead: 10.47 GB (11240756479)
   - BytesReadDataNodeCache: 10.47 GB (11240756479)</b>
   - BytesReadLocal: 10.47 GB (11240756479)
   - BytesReadShortCircuit: 10.47 GB (11240756479)
   - DecompressionTime: 27s572ms
</codeblock>

<p>
For queries involving smaller amounts of data, or in single-user workloads, you might not notice a
significant difference in query response time with or without HDFS caching. Even with HDFS caching turned
off, the data for the query might still be in the Linux OS buffer cache. The benefits become clearer as
data volume increases, and especially as the system processes more concurrent queries. HDFS caching
improves the scalability of the overall system. That is, it prevents query performance from declining when
the workload outstrips the capacity of the Linux OS cache.
</p>

<p conref="../shared/impala_common.xml#common/hdfs_caching_encryption_caveat"/>

<p>
<b>SELECT considerations:</b>
</p>

<p>
The Impala HDFS caching feature interacts with the
<codeph><xref href="impala_select.xml#select">SELECT</xref></codeph> statement and query performance as
follows:
</p>

<ul>
<li>
Impala automatically reads from memory any data that has been designated as cached and actually loaded
into the HDFS cache. (It could take some time after the initial request to fully populate the cache for a
large table or a table with many partitions.) The speedup comes from two aspects: reading from RAM instead
of disk, and accessing the data straight from the cache area instead of copying from one RAM area to
another. This second aspect yields further performance improvement over the standard OS caching
mechanism, which still results in memory-to-memory copying of cached data.
</li>

<li>
For small amounts of data, the query speedup might not be noticeable in terms of wall clock time. The
performance might be roughly the same with HDFS caching turned on or off, due to recently used data being
held in the Linux OS cache. The difference is more pronounced with:
<ul>
<li>
Data volumes (for all queries running concurrently) that exceed the size of the Linux OS cache.
</li>

<li>
A busy cluster running many concurrent queries, where the reduction in memory-to-memory copying and
overall memory usage during queries results in greater scalability and throughput.
</li>

<li>
Thus, to really exercise and benchmark this feature in a development environment, you might need to
simulate realistic workloads and concurrent queries that match your production environment.
</li>

<li>
One way to simulate a heavy workload on a lightly loaded system is to flush the OS buffer cache (on
each DataNode) between iterations of queries against the same tables or partitions:
<codeblock>$ sync
$ echo 1 > /proc/sys/vm/drop_caches
</codeblock>
</li>
</ul>
</li>

<li>
Impala queries take advantage of HDFS cached data regardless of whether the cache directive was issued by
Impala or externally through the <cmdname>hdfs cacheadmin</cmdname> command, for example for an external
table where the cached data files might be accessed by several different Hadoop components.
</li>

<li>
If your query returns a large result set, the time reported for the query could be dominated by the time
needed to print the results on the screen. To measure the time for the underlying query processing, query
the <codeph>COUNT()</codeph> of the big result set, which does all the same processing but only prints a
single line to the screen.
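For example, rather than timing a query that prints many rows, you might wrap it in a
<codeph>COUNT()</codeph> query along these lines (illustrative only):
<codeblock>select count(*) from (select * from census where year >= 1960) t;
</codeblock>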
</li>
</ul>
</conbody>
</concept>
</concept>