<?xml version="1.0" encoding="UTF-8"?>
|
||
<!--
|
||
Licensed to the Apache Software Foundation (ASF) under one
|
||
or more contributor license agreements. See the NOTICE file
|
||
distributed with this work for additional information
|
||
regarding copyright ownership. The ASF licenses this file
|
||
to you under the Apache License, Version 2.0 (the
|
||
"License"); you may not use this file except in compliance
|
||
with the License. You may obtain a copy of the License at
|
||
|
||
http://www.apache.org/licenses/LICENSE-2.0
|
||
|
||
Unless required by applicable law or agreed to in writing,
|
||
software distributed under the License is distributed on an
|
||
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||
KIND, either express or implied. See the License for the
|
||
specific language governing permissions and limitations
|
||
under the License.
|
||
-->
|
||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||
<concept id="scalability">
|
||
|
||
<title>Scalability Considerations for Impala</title>
|
||
|
||
<titlealts audience="PDF">
|
||
|
||
<navtitle>Scalability Considerations</navtitle>
|
||
|
||
</titlealts>
|
||
|
||
<prolog>
|
||
<metadata>
|
||
<data name="Category" value="Performance"/>
|
||
<data name="Category" value="Impala"/>
|
||
<data name="Category" value="Planning"/>
|
||
<data name="Category" value="Querying"/>
|
||
<data name="Category" value="Developers"/>
|
||
<data name="Category" value="Memory"/>
|
||
<data name="Category" value="Scalability"/>
|
||
<!-- Using domain knowledge about Impala, sizing, etc. to decide what to mark as 'Proof of Concept'. -->
|
||
<data name="Category" value="Proof of Concept"/>
|
||
</metadata>
|
||
</prolog>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
This section explains how the size of your cluster and the volume of data influences SQL
|
||
performance and schema design for Impala tables. Typically, adding more cluster capacity
|
||
reduces problems due to memory limits or disk throughput. On the other hand, larger
|
||
clusters are more likely to have other kinds of scalability issues, such as a single slow
|
||
node that causes performance problems for queries.
|
||
</p>
|
||
|
||
<p outputclass="toc inpage"/>
|
||
|
||
<p conref="../shared/impala_common.xml#common/cookbook_blurb"/>
|
||
|
||
</conbody>
|
||
|
||
<concept audience="hidden" id="scalability_memory">
|
||
|
||
<title>Overview and Guidelines for Impala Memory Usage</title>
|
||
|
||
<prolog>
|
||
<metadata>
|
||
<data name="Category" value="Memory"/>
|
||
<data name="Category" value="Concepts"/>
|
||
<data name="Category" value="Best Practices"/>
|
||
<data name="Category" value="Guidelines"/>
|
||
</metadata>
|
||
</prolog>
|
||
|
||
<conbody>
|
||
|
||
<!--
|
||
Outline adapted from Alan Choi's "best practices" and/or "performance cookbook" papers.
|
||
-->
|
||
|
||
<codeblock>Memory Usage – the Basics
|
||
* Memory is used by:
|
||
* Hash join – RHS tables after decompression, filtering and projection
|
||
* Group by – proportional to the #groups
|
||
* Parquet writer buffer – 1GB per partition
|
||
* IO buffer (shared across queries)
|
||
* Metadata cache (no more than 1GB typically)
|
||
* Memory held and reused by later query
|
||
* Impala releases memory from time to time starting in 1.4.
|
||
|
||
Memory Usage – Estimating Memory Usage
|
||
* Use Explain Plan
|
||
* Requires statistics! Mem estimate without stats is meaningless.
|
||
* Reports per-host memory requirement for this cluster size.
|
||
* Re-run if you’ve re-sized the cluster!
|
||
[image of explain plan]
|
||
|
||
Memory Usage – Estimating Memory Usage
|
||
* EXPLAIN’s memory estimate issues
|
||
* Can be way off – much higher or much lower.
|
||
* group by’s estimate can be particularly off – when there’s a large number of group by columns.
|
||
* Mem estimate = NDV of group by column 1 * NDV of group by column 2 * ... NDV of group by column n
|
||
* Ignore EXPLAIN’s estimate if it’s too high!
* Do your own estimate for group by
|
||
* GROUP BY mem usage = (total number of groups * size of each row) + (total number of groups * size of each row) / num nodes
|
||
|
||
Memory Usage – Finding Actual Memory Usage
|
||
* Search for “Per Node Peak Memory Usage” in the profile.
|
||
This is accurate. Use it for production capacity planning.
|
||
|
||
Memory Usage – Actual Memory Usage
|
||
* For complex queries, how do I know which part of my query is using too much memory?
|
||
* Use the ExecSummary from the query profile!
|
||
- But is that "Peak Mem" number aggregate or per-node?
|
||
[image of executive summary]
|
||
|
||
Memory Usage – Hitting Mem-limit
|
||
* Top causes (in order) of hitting mem-limit even when running a single query:
|
||
1. Lack of statistics
|
||
2. Lots of joins within a single query
|
||
3. Big-table joining big-table
|
||
4. Gigantic group by
|
||
|
||
Memory Usage – Hitting Mem-limit
|
||
Lack of stats
|
||
* Wrong join order, wrong join strategy, wrong insert strategy
|
||
* Explain Plan tells you that!
|
||
[image of explain plan]
|
||
* Fix: Compute Stats table
|
||
|
||
Memory Usage – Hitting Mem-limit
|
||
Lots of joins within a single query
|
||
* select...from fact, dim1, dim2,dim3,...dimN where ...
|
||
* Each dim tbl can fit in memory, but not all of them together
|
||
* As of Impala 1.4, Impala might choose the wrong plan – BROADCAST
|
||
FIX 1: use shuffle hint
|
||
select ... from fact join [shuffle] dim1 on ... join [shuffle] dim2 ...
|
||
FIX 2: pre-join the dim tables (if possible)
|
||
- How about an example to illustrate that technique?
|
||
* fewer joins => better perf!
|
||
|
||
Memory Usage: Hitting Mem-limit
|
||
Big-table joining big-table
|
||
* Big-table (after decompression, filtering, and projection) is a table that is bigger than total cluster memory size.
|
||
* Impala 2.0 will do this (via disk-based join). Consider using Hive for now.
|
||
* (Advanced) For a simple query, you can try this advanced workaround – per-partition join
|
||
* Requires the partition key be part of the join key
|
||
select ... from BigTbl_A a join BigTbl_B b on a.part_key = b.part_key where a.part_key in (1,2,3)
|
||
union all
|
||
select ... from BigTbl_A a join BigTbl_B b on a.part_key = b.part_key where a.part_key in (4,5,6)
|
||
|
||
Memory Usage: Hitting Mem-limit
|
||
Gigantic group by
|
||
* The total number of distinct groups is huge, such as group by userid.
|
||
* Impala 2.0 will do this (via disk-based agg). Consider using Hive for now.
|
||
- Is this one of the cases where people were unhappy we recommended Hive?
|
||
* (Advanced) For a simple query, you can try this advanced workaround – per-partition agg
|
||
* Requires the partition key be part of the group by
|
||
select part_key, col1, col2, ...agg(..) from tbl where
part_key in (1,2,3)
group by part_key, col1, col2, ...
union all
select part_key, col1, col2, ...agg(..) from tbl where
part_key in (4,5,6)
group by part_key, col1, col2, ...
|
||
|
||
Memory Usage: Additional Notes
|
||
* Use explain plan for estimate; use profile for accurate measure
|
||
* Data skew can cause uneven memory usage
|
||
* Review previous common issues on out-of-memory
|
||
* Note: Even with disk-based joins, you'll want to review these steps to speed up queries and use memory more efficiently
|
||
</codeblock>
|
||
|
||
</conbody>
|
||
|
||
</concept>
|
||
|
||
<concept id="scalability_catalog">
|
||
|
||
<title>Impact of Many Tables or Partitions on Impala Catalog Performance and Memory Usage</title>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
Because Hadoop I/O is optimized for reading and writing large files, Impala is optimized
|
||
for tables containing relatively few, large data files. Schemas containing thousands of
|
||
tables, or tables containing thousands of partitions, can encounter performance issues
|
||
during startup or during DDL operations such as <codeph>ALTER TABLE</codeph> statements.
|
||
</p>
|
||
|
||
<note type="important" rev="TSB-168">
|
||
<p>
|
||
Because of a change in the default heap size for the <cmdname>catalogd</cmdname>
|
||
daemon in <keyword
|
||
keyref="impala25_full"/> and higher, the following
|
||
procedure to increase the <cmdname>catalogd</cmdname> memory limit might be required
|
||
following an upgrade to <keyword keyref="impala25_full"/> even if not needed
|
||
previously.
|
||
</p>
|
||
</note>
|
||
|
||
<p conref="../shared/impala_common.xml#common/increase_catalogd_heap_size"
|
||
/>
|
||
|
||
</conbody>
|
||
|
||
</concept>
|
||
<concept id="catalog_file_metadata" rev="5.0.0_IMPALA-11402">
|
||
<title>Limiting file metadata fetched in Catalog requests (<keyword keyref="impala50_full"/> and
|
||
higher)</title>
|
||
<conbody>
|
||
<p>To prevent Catalog service (Catalogd) out-of-memory (OOM) errors when coordinators fetch metadata
for tables with millions of files, the <codeph>catalog_partial_fetch_max_files</codeph>
configuration flag has been introduced.</p>
|
||
<p>This flag limits the maximum number of file descriptors returned in a single Catalog fetch
response, that is, the response to the <codeph>GetPartialCatalogObject</codeph> RPC used in
local catalog mode. See <xref href="impala_metadata.xml"/>.</p>
|
||
<p><b>Default:</b> 1,000,000 files</p>
|
||
<p>If a request exceeds this limit, Catalogd truncates the response at the partition level.
|
||
The Impala coordinator then automatically sends subsequent requests to fetch the remaining
|
||
metadata, and it detects any version changes to force a query replan, ensuring metadata
|
||
consistency.</p>
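<p>
As an illustration (the value shown is arbitrary and should be tuned for your own metadata
sizes), you might lower the per-response limit by adding a startup flag such as the
following to the <cmdname>catalogd</cmdname> configuration:
</p>
<codeblock>--catalog_partial_fetch_max_files=500000</codeblock>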
|
||
</conbody>
|
||
</concept>
|
||
|
||
<concept rev="2.1.0" id="statestore_scalability">
|
||
|
||
<title>Scalability Considerations for the Impala Statestore</title>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
Before <keyword keyref="impala21_full"/>, the statestore sent only one kind of message
|
||
to its subscribers. This message contained all updates for any topics that a subscriber
|
||
had subscribed to. It also served to let subscribers know that the statestore had not
|
||
failed, and conversely the statestore used the success of sending a heartbeat to a
|
||
subscriber to decide whether or not the subscriber had failed.
|
||
</p>
|
||
|
||
<p>
|
||
Combining topic updates and failure detection in a single message led to bottlenecks in
|
||
clusters with large numbers of tables, partitions, and HDFS data blocks. When the
|
||
statestore was overloaded with metadata updates to transmit, heartbeat messages were
|
||
sent less frequently, sometimes causing subscribers to time out their connection with
|
||
the statestore. Increasing the subscriber timeout and decreasing the frequency of
|
||
statestore heartbeats worked around the problem, but reduced responsiveness when the
|
||
statestore failed or restarted.
|
||
</p>
|
||
|
||
<p>
|
||
As of <keyword keyref="impala21_full"/>, the statestore now sends topic updates and
|
||
heartbeats in separate messages. This allows the statestore to send and receive a steady
|
||
stream of lightweight heartbeats, and removes the requirement to send topic updates
|
||
according to a fixed schedule, reducing statestore network overhead.
|
||
</p>
|
||
|
||
<p>
|
||
The statestore now has the following relevant configuration flags for the
|
||
<cmdname>statestored</cmdname> daemon:
|
||
</p>
|
||
|
||
<dl>
|
||
<dlentry id="statestore_num_update_threads">
|
||
|
||
<dt>
|
||
<codeph>-statestore_num_update_threads</codeph>
|
||
</dt>
|
||
|
||
<dd>
|
||
The number of threads inside the statestore dedicated to sending topic updates. You
|
||
should not typically need to change this value.
|
||
<p>
|
||
<b>Default:</b> 10
|
||
</p>
|
||
</dd>
|
||
|
||
</dlentry>
|
||
|
||
<dlentry id="statestore_update_frequency_ms">
|
||
|
||
<dt>
|
||
<codeph>-statestore_update_frequency_ms</codeph>
|
||
</dt>
|
||
|
||
<dd>
|
||
The frequency, in milliseconds, with which the statestore tries to send topic
|
||
updates to each subscriber. This is a best-effort value; if the statestore is unable
|
||
to meet this frequency, it sends topic updates as fast as it can. You should not
|
||
typically need to change this value.
|
||
<p>
|
||
<b>Default:</b> 2000
|
||
</p>
|
||
</dd>
|
||
|
||
</dlentry>
|
||
|
||
<dlentry id="statestore_num_heartbeat_threads">
|
||
|
||
<dt>
|
||
<codeph>-statestore_num_heartbeat_threads</codeph>
|
||
</dt>
|
||
|
||
<dd>
|
||
The number of threads inside the statestore dedicated to sending heartbeats. You
|
||
should not typically need to change this value.
|
||
<p>
|
||
<b>Default:</b> 10
|
||
</p>
|
||
</dd>
|
||
|
||
</dlentry>
|
||
|
||
<dlentry id="statestore_heartbeat_frequency_ms">
|
||
|
||
<dt>
|
||
<codeph>-statestore_heartbeat_frequency_ms</codeph>
|
||
</dt>
|
||
|
||
<dd>
|
||
The frequency, in milliseconds, with which the statestore tries to send heartbeats
|
||
to each subscriber. This value should be good for large catalogs and clusters up to
|
||
approximately 150 nodes. Beyond that, you might need to increase this value to make
|
||
the interval longer between heartbeat messages.
|
||
<p>
|
||
<b>Default:</b> 1000 (one heartbeat message every second)
|
||
</p>
|
||
</dd>
|
||
|
||
</dlentry>
|
||
|
||
<dlentry id="statestore_heartbeat_tcp_timeout_seconds">
|
||
|
||
<dt>
|
||
<codeph>-statestore_heartbeat_tcp_timeout_seconds</codeph>
|
||
</dt>
|
||
|
||
<dd>
|
||
The time after which a heartbeat RPC to a subscriber times out. This setting
protects against badly hung machines that are not able to respond to the heartbeat
RPC in short order. Increase this value if there are intermittent heartbeat RPC timeouts
shown in the statestore's log. You can use the maximum value of the
"statestore.priority-topic-update-durations" metric on the statestore as a guide to a
reasonable setting. Note that priority topic updates are assumed to be small amounts
of data that take a small amount of time to process (similar in complexity to a
heartbeat).
|
||
<p>
|
||
<b>Default:</b> 3
|
||
</p>
|
||
</dd>
|
||
|
||
</dlentry>
|
||
|
||
<dlentry id="statestore_max_missed_heartbeats">
|
||
|
||
<dt>
|
||
<codeph>-statestore_max_missed_heartbeats</codeph>
|
||
</dt>
|
||
|
||
<dd>
|
||
Maximum number of consecutive heartbeat messages an impalad can miss before being
|
||
declared failed by the statestore. You should not typically need to change this
|
||
value.
|
||
<p>
|
||
<b>Default:</b> 10
|
||
</p>
|
||
</dd>
|
||
|
||
</dlentry>
|
||
|
||
<dlentry id="statestore_subscriber_timeout_secs">
|
||
|
||
<dt>
|
||
<codeph>-statestore_subscriber_timeout_secs</codeph>
|
||
</dt>
|
||
|
||
<dd>
|
||
The amount of time (in seconds) that may elapse before the connection with the
statestore is considered lost by subscribers (impalad/catalogd). When this happens, an
impalad reregisters itself with the statestore, which may cause it to be absent from
the next round of cluster membership updates and lead to query failures such as
"Cancelled due to unreachable impalad(s)". The value of this flag should be comparable to
<codeph>
(statestore_heartbeat_frequency_ms / 1000 + statestore_heartbeat_tcp_timeout_seconds)
* statestore_max_missed_heartbeats</codeph>,
so that subscribers do not reregister themselves too early and instead allow the
statestore time to resend heartbeats. You can also use the maximum value of the
"statestore-subscriber.heartbeat-interval-time" metric on the impalads as a guide to a
reasonable setting. A worked example with the default values follows this list.
|
||
<p>
|
||
<b>Default:</b> 30
|
||
</p>
|
||
</dd>
|
||
|
||
</dlentry>
|
||
</dl>
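<p>
For illustration, with the default settings described above that expression works out to
<codeph>(1000 / 1000 + 3) * 10 = 40</codeph> seconds. If you increase the heartbeat
frequency, the TCP timeout, or the missed-heartbeat count, recompute this value and adjust
<codeph>-statestore_subscriber_timeout_secs</codeph> so that it remains comparable.
</p>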
|
||
|
||
<p>
|
||
If it takes a very long time for a cluster to start up, and
|
||
<cmdname>impala-shell</cmdname> consistently displays <codeph>This Impala daemon is not
|
||
ready to accept user requests</codeph>, the statestore might be taking too long to send
|
||
the entire catalog topic to the cluster. In this case, consider adding
|
||
<codeph>--load_catalog_in_background=false</codeph> to your catalog service
|
||
configuration. This setting prevents the catalog service from loading the entire catalog into
memory at cluster startup. Instead, metadata for each table is loaded when the table is
|
||
accessed for the first time.
|
||
</p>
|
||
|
||
</conbody>
|
||
|
||
</concept>
|
||
|
||
<concept id="scalability_buffer_pool" rev="2.10.0 IMPALA-3200">
|
||
|
||
<title>Effect of Buffer Pool on Memory Usage (<keyword keyref="impala210"/> and higher)</title>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
The buffer pool feature, available in <keyword keyref="impala210"/> and higher, changes
|
||
the way Impala allocates memory during a query. Most of the memory needed is reserved at
|
||
the beginning of the query, avoiding cases where a query might run for a long time
|
||
before failing with an out-of-memory error. The actual memory estimates and memory
|
||
buffers are typically smaller than before, so that more queries can run concurrently or
|
||
process larger volumes of data than previously.
|
||
</p>
|
||
|
||
<p>
|
||
The buffer pool feature includes some query options that you can fine-tune:
|
||
<xref keyref="buffer_pool_limit"/>,
|
||
<xref
|
||
keyref="default_spillable_buffer_size"/>,
|
||
<xref keyref="max_row_size"
|
||
/>, and <xref keyref="min_spillable_buffer_size"/>.
|
||
</p>
|
||
|
||
<p>
|
||
Most of the effects of the buffer pool are transparent to you as an Impala user. Memory
|
||
use during spilling is now steadier and more predictable, instead of increasing rapidly
|
||
as more data is spilled to disk. The main change from a user perspective is the need to
|
||
increase the <codeph>MAX_ROW_SIZE</codeph> query option setting when querying tables
|
||
with columns containing long strings, many columns, or other combinations of factors
|
||
that produce very large rows. If Impala encounters rows that are too large to process
|
||
with the default query option settings, the query fails with an error message suggesting
|
||
to increase the <codeph>MAX_ROW_SIZE</codeph> setting.
|
||
</p>
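<p>
For example, before running a query against a table known to produce unusually wide rows,
you might raise the limit for the session. The value shown is illustrative:
</p>
<codeblock>SET MAX_ROW_SIZE=4mb;</codeblock>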
|
||
|
||
</conbody>
|
||
|
||
</concept>
|
||
|
||
<concept audience="hidden" id="scalability_cluster_size">
|
||
|
||
<title>Scalability Considerations for Impala Cluster Size and Topology</title>
|
||
|
||
<conbody>
|
||
|
||
<p/>
|
||
|
||
</conbody>
|
||
|
||
</concept>
|
||
|
||
<concept audience="hidden" id="concurrent_connections">
|
||
|
||
<title>Scaling the Number of Concurrent Connections</title>
|
||
|
||
<conbody>
|
||
|
||
<p/>
|
||
|
||
</conbody>
|
||
|
||
</concept>
|
||
|
||
<concept rev="2.0.0" id="spill_to_disk">
|
||
|
||
<title>SQL Operations that Spill to Disk</title>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
Certain memory-intensive operations write temporary data to disk (known as
|
||
<term>spilling</term> to disk) when Impala is close to exceeding its memory limit on a
|
||
particular host.
|
||
</p>
|
||
|
||
<p>
|
||
The result is a query that completes successfully, rather than failing with an
|
||
out-of-memory error. The tradeoff is decreased performance due to the extra disk I/O to
|
||
write the temporary data and read it back in. The slowdown could potentially be
|
||
significant. Thus, while this feature improves reliability, you should optimize your
|
||
queries, system parameters, and hardware configuration to make this spilling a rare
|
||
occurrence.
|
||
</p>
|
||
|
||
<note rev="2.10.0 IMPALA-3200">
|
||
<p>
|
||
In <keyword keyref="impala210"/> and higher, also see
|
||
<xref
|
||
keyref="scalability_buffer_pool"/> for changes to Impala memory
|
||
allocation that might change the details of which queries spill to disk, and how much
|
||
memory and disk space is involved in the spilling operation.
|
||
</p>
|
||
</note>
|
||
|
||
<p>
|
||
<b>What kinds of queries might spill to disk:</b>
|
||
</p>
|
||
|
||
<p>
|
||
Several SQL clauses and constructs require memory allocations that could activate the
|
||
spilling mechanism:
|
||
</p>
|
||
|
||
<ul>
|
||
<li>
|
||
<p>
|
||
When a query uses a <codeph>GROUP BY</codeph> clause for columns with millions or
|
||
billions of distinct values, Impala keeps a similar number of temporary results in
|
||
memory, to accumulate the aggregate results for each value in the group.
|
||
</p>
|
||
</li>
|
||
|
||
<li>
|
||
<p>
|
||
When large tables are joined together, Impala keeps the values of the join columns
|
||
from one table in memory, to compare them to incoming values from the other table.
|
||
</p>
|
||
</li>
|
||
|
||
<li>
|
||
<p>
|
||
When a large result set is sorted by the <codeph>ORDER BY</codeph> clause, each node
|
||
sorts its portion of the result set in memory.
|
||
</p>
|
||
</li>
|
||
|
||
<li>
|
||
<p>
|
||
The <codeph>DISTINCT</codeph> and <codeph>UNION</codeph> operators build in-memory
|
||
data structures to represent all values found so far, to eliminate duplicates as the
|
||
query progresses.
|
||
</p>
|
||
</li>
|
||
|
||
<!-- JIRA still in open state as of 5.8 / 2.6, commenting out.
|
||
<li>
|
||
<p rev="IMPALA-3471">
|
||
In <keyword keyref="impala26_full"/> and higher, <term>top-N</term> queries (those with
|
||
<codeph>ORDER BY</codeph> and <codeph>LIMIT</codeph> clauses) can also spill.
|
||
Impala allocates enough memory to hold as many rows as specified by the <codeph>LIMIT</codeph>
|
||
clause, plus enough memory to hold as many rows as specified by any <codeph>OFFSET</codeph> clause.
|
||
</p>
|
||
</li>
|
||
-->
|
||
</ul>
|
||
|
||
<p
|
||
conref="../shared/impala_common.xml#common/spill_to_disk_vs_dynamic_partition_pruning"/>
|
||
|
||
<p>
|
||
<b>How Impala handles scratch disk space for spilling:</b>
|
||
</p>
|
||
|
||
<p rev="obwl"
|
||
conref="../shared/impala_common.xml#common/order_by_scratch_dir"/>
|
||
|
||
<p>
|
||
<b>Memory usage for SQL operators:</b>
|
||
</p>
|
||
|
||
<p rev="2.10.0 IMPALA-3200">
|
||
In <keyword keyref="impala210_full"/> and higher, the way SQL operators such as
|
||
<codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins transition between
using additional memory and activating the spill-to-disk feature changed. The memory
|
||
required to spill to disk is reserved up front, and you can examine it in the
|
||
<codeph>EXPLAIN</codeph> plan when the <codeph>EXPLAIN_LEVEL</codeph> query option is
|
||
set to 2 or higher.
|
||
</p>
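<p>
For example, to see the per-operator memory reservations in the plan, you might run the
following in <cmdname>impala-shell</cmdname> (the table and column names are placeholders):
</p>
<codeblock>SET EXPLAIN_LEVEL=2;
EXPLAIN SELECT c1, COUNT(*) FROM t1 GROUP BY c1;</codeblock>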
|
||
|
||
<p>
|
||
The infrastructure of the spilling feature affects the way the affected SQL operators,
|
||
such as <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use memory. On
|
||
each host that participates in the query, each such operator in a query requires memory
|
||
to store rows of data and other data structures. Impala reserves a certain amount of
|
||
memory up front for each operator that supports spill-to-disk that is sufficient to
|
||
execute the operator. If an operator accumulates more data than can fit in the reserved
|
||
memory, it can either reserve more memory to continue processing data in memory or start
|
||
spilling data to temporary scratch files on disk. Thus, operators with spill-to-disk
|
||
support can adapt to different memory constraints by using however much memory is
|
||
available to speed up execution, yet tolerate low memory conditions by spilling data to
|
||
disk.
|
||
</p>
|
||
|
||
<p>
|
||
The amount of data depends on the portion of the data being handled by that host, and thus
|
||
the operator may end up consuming different amounts of memory on different hosts.
|
||
</p>
|
||
|
||
<!--
|
||
<p>
|
||
The infrastructure of the spilling feature affects the way the affected SQL operators, such as
|
||
<codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use memory.
|
||
On each host that participates in the query, each such operator in a query accumulates memory
|
||
while building the data structure to process the aggregation or join operation. The amount
|
||
of memory used depends on the portion of the data being handled by that host, and thus might
|
||
be different from one host to another. When the amount of memory being used for the operator
|
||
on a particular host reaches a threshold amount, Impala reserves an additional memory buffer
|
||
to use as a work area in case that operator causes the query to exceed the memory limit for
|
||
that host. After allocating the memory buffer, the memory used by that operator remains
|
||
essentially stable or grows only slowly, until the point where the memory limit is reached
|
||
and the query begins writing temporary data to disk.
|
||
</p>
|
||
|
||
<p rev="2.2.0">
|
||
Prior to Impala 2.2, the extra memory buffer for an operator that might spill to disk
|
||
was allocated when the data structure used by the applicable SQL operator reaches 16 MB in size,
|
||
and the memory buffer itself was 512 MB. In Impala 2.2, these values are halved: the threshold value
|
||
is 8 MB and the memory buffer is 256 MB. <ph rev="2.3.0">In <keyword keyref="impala23_full"/> and higher, the memory for the buffer
|
||
is allocated in pieces, only as needed, to avoid sudden large jumps in memory usage.</ph> A query that uses
|
||
multiple such operators might allocate multiple such memory buffers, as the size of the data structure
|
||
for each operator crosses the threshold on a particular host.
|
||
</p>
|
||
|
||
<p>
|
||
Therefore, a query that processes a relatively small amount of data on each host would likely
|
||
never reach the threshold for any operator, and would never allocate any extra memory buffers. A query
|
||
that did process millions of groups, distinct values, join keys, and so on might cross the threshold,
|
||
causing its memory requirement to rise suddenly and then flatten out. The larger the cluster, less data is processed
|
||
on any particular host, thus reducing the chance of requiring the extra memory allocation.
|
||
</p>
|
||
-->
|
||
|
||
<p>
|
||
<b>Added in:</b> This feature was added to the <codeph>ORDER BY</codeph> clause in
|
||
Impala 1.4. This feature was extended to cover join queries, aggregation functions, and
|
||
analytic functions in Impala 2.0. The size of the memory work area required by each
|
||
operator that spills was reduced from 512 megabytes to 256 megabytes in Impala 2.2.
|
||
<ph
|
||
rev="2.10.0 IMPALA-3200">The spilling mechanism was reworked to take
|
||
advantage of the Impala buffer pool feature and be more predictable and stable in
|
||
<keyword keyref="impala210_full"/>.</ph>
|
||
</p>
|
||
|
||
<p>
|
||
<b>Avoiding queries that spill to disk:</b>
|
||
</p>
|
||
|
||
<p>
|
||
Because the extra I/O can impose significant performance overhead on these types of
|
||
queries, try to avoid this situation by using the following steps:
|
||
</p>
|
||
|
||
<ol>
|
||
<li>
|
||
Detect how often queries spill to disk, and how much temporary data is written. Refer
|
||
to the following sources:
|
||
<ul>
|
||
<li>
|
||
The output of the <codeph>PROFILE</codeph> command in the
|
||
<cmdname>impala-shell</cmdname> interpreter. This data shows the memory usage for
|
||
each host and in total across the cluster. The <codeph>WriteIoBytes</codeph>
|
||
counter reports how much data was written to disk for each operator during the
|
||
query. (In <keyword
|
||
keyref="impala29_full"/>, the counter was
|
||
named <codeph>ScratchBytesWritten</codeph>; in
|
||
<keyword
|
||
keyref="impala28_full"/> and earlier, it was named
|
||
<codeph>BytesWritten</codeph>.)
|
||
</li>
|
||
|
||
<li>
|
||
The <uicontrol>Queries</uicontrol> tab in the Impala debug web user interface.
|
||
Select the query to examine and click the corresponding
|
||
<uicontrol>Profile</uicontrol> link. This data breaks down the memory usage for a
|
||
single host within the cluster, the host whose web interface you are connected to.
|
||
</li>
|
||
</ul>
|
||
</li>
|
||
|
||
<li>
|
||
Use one or more techniques to reduce the possibility of the queries spilling to disk:
|
||
<ul>
|
||
<li>
|
||
Increase the Impala memory limit if practical, for example, if you can increase
|
||
the available memory by more than the amount of temporary data written to disk on
|
||
a particular node. Remember that in Impala 2.0 and later, you can issue
|
||
<codeph>SET MEM_LIMIT</codeph> as a SQL statement, which lets you fine-tune the
|
||
memory usage for queries from JDBC and ODBC applications.
|
||
</li>
|
||
|
||
<li>
|
||
Increase the number of nodes in the cluster, to increase the aggregate memory
|
||
available to Impala and reduce the amount of memory required on each node.
|
||
</li>
|
||
|
||
<li>
|
||
Add more memory to the hosts running Impala daemons.
|
||
</li>
|
||
|
||
<li>
|
||
On a cluster with resources shared between Impala and other Hadoop components, use
|
||
resource management features to allocate more memory for Impala. See
|
||
<xref
|
||
href="impala_resource_management.xml#resource_management"/>
|
||
for details.
|
||
</li>
|
||
|
||
<li>
|
||
If the memory pressure is due to running many concurrent queries rather than a few
|
||
memory-intensive ones, consider using the Impala admission control feature to
|
||
lower the limit on the number of concurrent queries. By spacing out the most
|
||
resource-intensive queries, you can avoid spikes in memory usage and improve
|
||
overall response times. See
|
||
<xref
|
||
href="impala_admission.xml#admission_control"/> for details.
|
||
</li>
|
||
|
||
<li>
|
||
Tune the queries with the highest memory requirements, using one or more of the
|
||
following techniques:
|
||
<ul>
|
||
<li>
|
||
Run the <codeph>COMPUTE STATS</codeph> statement for all tables involved in
|
||
large-scale joins and aggregation queries.
|
||
</li>
|
||
|
||
<li>
|
||
Minimize your use of <codeph>STRING</codeph> columns in join columns. Prefer
|
||
numeric values instead.
|
||
</li>
|
||
|
||
<li>
|
||
Examine the <codeph>EXPLAIN</codeph> plan to understand the execution strategy
|
||
being used for the most resource-intensive queries. See
|
||
<xref href="impala_explain_plan.xml#perf_explain"
|
||
/> for
|
||
details.
|
||
</li>
|
||
|
||
<li>
|
||
If Impala still chooses a suboptimal execution strategy even with statistics
|
||
available, or if it is impractical to keep the statistics up to date for huge
|
||
or rapidly changing tables, add hints to the most resource-intensive queries
|
||
to select the right execution strategy. See
|
||
<xref
|
||
href="impala_hints.xml#hints"/> for details.
|
||
</li>
|
||
</ul>
|
||
</li>
|
||
|
||
<li>
|
||
If your queries experience substantial performance overhead due to spilling,
|
||
enable the <codeph>DISABLE_UNSAFE_SPILLS</codeph> query option. This option
|
||
prevents queries whose memory usage is likely to be exorbitant from spilling to
|
||
disk. See
|
||
<xref
|
||
href="impala_disable_unsafe_spills.xml#disable_unsafe_spills"/>
|
||
for details. As you tune problematic queries using the preceding steps, fewer and
|
||
fewer will be cancelled by this option setting.
|
||
</li>
|
||
</ul>
|
||
</li>
|
||
</ol>
|
||
|
||
<p>
|
||
<b>Testing performance implications of spilling to disk:</b>
|
||
</p>
|
||
|
||
<p>
|
||
To artificially provoke spilling, to test this feature and understand the performance
|
||
implications, use a test environment with a memory limit of at least 2 GB. Issue the
|
||
<codeph>SET</codeph> command with no arguments to check the current setting for the
|
||
<codeph>MEM_LIMIT</codeph> query option. Set the query option
|
||
<codeph>DISABLE_UNSAFE_SPILLS=true</codeph>. This option limits the spill-to-disk
|
||
feature to prevent runaway disk usage from queries that are known in advance to be
|
||
suboptimal. Within <cmdname>impala-shell</cmdname>, run a query that you expect to be
|
||
memory-intensive, based on the criteria explained earlier. A self-join of a large table
|
||
is a good candidate:
|
||
</p>
|
||
|
||
<codeblock>select count(*) from big_table a join big_table b using (column_with_many_values);
|
||
</codeblock>
|
||
|
||
<p>
|
||
Issue the <codeph>PROFILE</codeph> command to get a detailed breakdown of the memory
|
||
usage on each node during the query.
|
||
<!--
|
||
The crucial part of the profile output concerning memory is the <codeph>BlockMgr</codeph>
|
||
portion. For example, this profile shows that the query did not quite exceed the memory limit.
|
||
-->
|
||
</p>
|
||
|
||
<!-- Commenting out because now stale due to changes from the buffer pool (IMPALA-3200).
|
||
To do: Revisit these details later if indicated by user feedback.
|
||
|
||
<codeblock>BlockMgr:
|
||
- BlockWritesIssued: 1
|
||
- BlockWritesOutstanding: 0
|
||
- BlocksCreated: 24
|
||
- BlocksRecycled: 1
|
||
- BufferedPins: 0
|
||
- MaxBlockSize: 8.00 MB (8388608)
|
||
<b>- MemoryLimit: 200.00 MB (209715200)</b>
|
||
<b>- PeakMemoryUsage: 192.22 MB (201555968)</b>
|
||
- TotalBufferWaitTime: 0ns
|
||
- TotalEncryptionTime: 0ns
|
||
- TotalIntegrityCheckTime: 0ns
|
||
- TotalReadBlockTime: 0ns
|
||
</codeblock>
|
||
|
||
<p>
|
||
In this case, because the memory limit was already below any recommended value, I increased the volume of
|
||
data for the query rather than reducing the memory limit any further.
|
||
</p>
|
||
-->
|
||
|
||
<p>
|
||
Set the <codeph>MEM_LIMIT</codeph> query option to a value that is smaller than the peak
|
||
memory usage reported in the profile output. Now try the memory-intensive query again.
|
||
</p>
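<p>
For example, if the profile reported a peak memory usage slightly above 1 GB per node, you
might set a lower limit and rerun the same query. The value is illustrative:
</p>
<codeblock>SET MEM_LIMIT=1gb;
select count(*) from big_table a join big_table b using (column_with_many_values);</codeblock>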
|
||
|
||
<p>
|
||
Check if the query fails with a message like the following:
|
||
</p>
|
||
|
||
<codeblock>WARNINGS: Spilling has been disabled for plans that do not have stats and are not hinted
|
||
to prevent potentially bad plans from using too many cluster resources. Compute stats on
|
||
these tables, hint the plan or disable this behavior via query options to enable spilling.
|
||
</codeblock>
|
||
|
||
<p>
|
||
If so, the query could have consumed substantial temporary disk space, slowing down so
|
||
much that it would not complete in any reasonable time. Rather than rely on the
|
||
spill-to-disk feature in this case, issue the <codeph>COMPUTE STATS</codeph> statement
|
||
for the table or tables in your sample query. Then run the query again, check the peak
|
||
memory usage again in the <codeph>PROFILE</codeph> output, and adjust the memory limit
|
||
again if necessary to be lower than the peak memory usage.
|
||
</p>
|
||
|
||
<p>
|
||
At this point, you have a query that is memory-intensive, but Impala can optimize it
|
||
efficiently so that the memory usage is not exorbitant. You have set an artificial
|
||
constraint through the <codeph>MEM_LIMIT</codeph> option so that the query would
|
||
normally fail with an out-of-memory error. But the automatic spill-to-disk feature means
|
||
that the query should actually succeed, at the expense of some extra disk I/O to read
|
||
and write temporary work data.
|
||
</p>
|
||
|
||
<p>
|
||
Try the query again, and confirm that it succeeds. Examine the <codeph>PROFILE</codeph>
|
||
output again. This time, look for lines of this form:
|
||
</p>
|
||
|
||
<codeblock>- SpilledPartitions: <varname>N</varname>
|
||
</codeblock>
|
||
|
||
<p>
|
||
If you see any such lines with <varname>N</varname> greater than 0, that indicates the
|
||
query would have failed in Impala releases prior to 2.0, but now it succeeded because of
|
||
the spill-to-disk feature. Examine the total time taken by the
|
||
<codeph>AGGREGATION_NODE</codeph> or other query fragments containing non-zero
|
||
<codeph>SpilledPartitions</codeph> values. Compare the times to similar fragments that
|
||
did not spill, for example in the <codeph>PROFILE</codeph> output when the same query is
|
||
run with a higher memory limit. This gives you an idea of the performance penalty of the
|
||
spill operation for a particular query with a particular memory limit. If you make the
|
||
memory limit just a little lower than the peak memory usage, the query only needs to
|
||
write a small amount of temporary data to disk. The lower you set the memory limit, the
|
||
more temporary data is written and the slower the query becomes.
|
||
</p>
|
||
|
||
<p>
|
||
Now repeat this procedure for actual queries used in your environment. Use the
|
||
<codeph>DISABLE_UNSAFE_SPILLS</codeph> setting to identify cases where queries used more
|
||
memory than necessary due to lack of statistics on the relevant tables and columns, and
|
||
issue <codeph>COMPUTE STATS</codeph> where necessary.
|
||
</p>
|
||
|
||
<p>
|
||
<b>When to use DISABLE_UNSAFE_SPILLS:</b>
|
||
</p>
|
||
|
||
<p>
|
||
You might wonder, why not leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned on all the
|
||
time. Whether and how frequently to use this option depends on your system environment
|
||
and workload.
|
||
</p>
|
||
|
||
<p>
|
||
<codeph>DISABLE_UNSAFE_SPILLS</codeph> is suitable for an environment with ad hoc
|
||
queries whose performance characteristics and memory usage are not known in advance. It
|
||
prevents <q>worst-case scenario</q> queries that use large amounts of memory
|
||
unnecessarily. Thus, you might turn this option on within a session while developing new
|
||
SQL code, even though it is turned off for existing applications.
|
||
</p>
|
||
|
||
<p>
|
||
Organizations where table and column statistics are generally up-to-date might leave
|
||
this option turned on all the time, again to avoid worst-case scenarios for untested
|
||
queries or if a problem in the ETL pipeline results in a table with no statistics.
|
||
Turning on <codeph>DISABLE_UNSAFE_SPILLS</codeph> lets you <q>fail fast</q> in this case
|
||
and immediately gather statistics or tune the problematic queries.
|
||
</p>
|
||
|
||
<p>
|
||
Some organizations might leave this option turned off. For example, you might have
|
||
tables large enough that the <codeph>COMPUTE STATS</codeph> takes substantial time to
|
||
run, making it impractical to re-run after loading new data. If you have examined the
|
||
<codeph>EXPLAIN</codeph> plans of your queries and know that they are operating
|
||
efficiently, you might leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned off. In that
|
||
case, you know that any queries that spill will not go overboard with their memory
|
||
consumption.
|
||
</p>
|
||
|
||
</conbody>
|
||
|
||
</concept>
|
||
|
||
<concept id="complex_query">
|
||
|
||
<title>Limits on Query Size and Complexity</title>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
There are hardcoded limits on the maximum size and complexity of queries. Currently, the
|
||
maximum number of expressions in a query is 2000. You might exceed the limits with large
|
||
or deeply nested queries produced by business intelligence tools or other query
|
||
generators.
|
||
</p>
|
||
|
||
<p>
|
||
If you have the ability to customize such queries or the query generation logic that
|
||
produces them, replace sequences of repetitive expressions with single operators such as
|
||
<codeph>IN</codeph> or <codeph>BETWEEN</codeph> that can represent multiple values or
|
||
ranges. For example, instead of a large number of <codeph>OR</codeph> clauses:
|
||
</p>
|
||
|
||
<codeblock>WHERE val = 1 OR val = 2 OR val = 6 OR val = 100 ...
|
||
</codeblock>
|
||
|
||
<p>
|
||
use a single <codeph>IN</codeph> clause:
|
||
</p>
|
||
|
||
<codeblock>WHERE val IN (1,2,6,100,...)</codeblock>
|
||
|
||
</conbody>
|
||
|
||
</concept>
|
||
|
||
<concept id="scalability_io">
|
||
|
||
<title>Scalability Considerations for Impala I/O</title>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
Impala parallelizes its I/O operations aggressively, therefore the more disks you can
|
||
attach to each host, the better. Impala retrieves data from disk so quickly using bulk
|
||
read operations on large blocks, that most queries are CPU-bound rather than I/O-bound.
|
||
</p>
|
||
|
||
<p>
|
||
Because the kind of sequential scanning typically done by Impala queries does not
|
||
benefit much from the random-access capabilities of SSDs, spinning disks typically
|
||
provide the most cost-effective kind of storage for Impala data, with little or no
|
||
performance penalty as compared to SSDs.
|
||
</p>
|
||
|
||
<p>
|
||
Resource management features such as YARN, Llama, and admission control typically
|
||
constrain the amount of memory, CPU, or overall number of queries in a high-concurrency
|
||
environment. Currently, there is no throttling mechanism for Impala I/O.
|
||
</p>
|
||
|
||
</conbody>
|
||
|
||
</concept>
|
||
|
||
<concept id="big_tables">
|
||
|
||
<title>Scalability Considerations for Table Layout</title>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
Due to the overhead of retrieving and updating table metadata in the metastore database,
|
||
try to limit the number of columns in a table to a maximum of approximately 2000.
|
||
Although Impala can handle wider tables than this, the metastore overhead can become
|
||
significant, leading to query performance that is slower than expected based on the
|
||
actual data volume.
|
||
</p>
|
||
|
||
<p>
|
||
To minimize overhead related to the metastore database and Impala query planning, try to
|
||
limit the number of partitions for any partitioned table to a few tens of thousands.
|
||
</p>
|
||
|
||
<p rev="IMPALA-5309">
|
||
If the volume of data within a table makes it impractical to run exploratory queries,
|
||
consider using the <codeph>TABLESAMPLE</codeph> clause to limit query processing to only
|
||
a percentage of data within the table. This technique reduces the overhead for query
|
||
startup, I/O to read the data, and the amount of network, CPU, and memory needed to
|
||
process intermediate results during the query. See <xref keyref="tablesample"/> for
|
||
details.
|
||
</p>
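<p>
For example, a rough exploratory count over approximately 10 percent of the data in a table
might look like the following (the table name is a placeholder):
</p>
<codeblock>SELECT COUNT(*) FROM huge_table TABLESAMPLE SYSTEM(10);</codeblock>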
|
||
|
||
</conbody>
|
||
|
||
</concept>
|
||
|
||
<concept rev="" id="kerberos_overhead_cluster_size">
|
||
|
||
<title>Kerberos-Related Network Overhead for Large Clusters</title>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
When Impala starts up, or after each <codeph>kinit</codeph> refresh, Impala sends a
|
||
number of simultaneous requests to the KDC. For a cluster with 100 hosts, the KDC might
|
||
be able to process all the requests within roughly 5 seconds. For a cluster with 1000
|
||
hosts, the time to process the requests would be roughly 500 seconds. Impala also makes
|
||
a number of DNS requests at the same time as these Kerberos-related requests.
|
||
</p>
|
||
|
||
<p>
|
||
While these authentication requests are being processed, any submitted Impala queries
|
||
will fail. During this period, the KDC and DNS may be slow to respond to requests from
|
||
components other than Impala, so other secure services might be affected temporarily.
|
||
</p>
|
||
|
||
<p>
|
||
In <keyword keyref="impala212_full"/> or earlier, to reduce the frequency of the
|
||
<codeph>kinit</codeph> renewal that initiates a new set of authentication requests,
|
||
increase the <codeph>kerberos_reinit_interval</codeph> configuration setting for the
|
||
<codeph>impalad</codeph> daemons. Currently, the default is 60 minutes. Consider using a
|
||
higher value such as 360 (6 hours).
|
||
</p>
|
||
|
||
<p>
|
||
The <codeph>kerberos_reinit_interval</codeph> configuration setting is removed in
|
||
<keyword keyref="impala30_full"/>, and the above step is no longer needed.
|
||
</p>
|
||
|
||
</conbody>
|
||
|
||
</concept>
|
||
|
||
<concept id="scalability_hotspots" rev="2.5.0 IMPALA-2696">
|
||
|
||
<title>Avoiding CPU Hotspots for HDFS Cached Data</title>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
You can use the HDFS caching feature, described in
|
||
<xref
|
||
href="impala_perf_hdfs_caching.xml#hdfs_caching"/>, with Impala to
|
||
reduce I/O and memory-to-memory copying for frequently accessed tables or partitions.
|
||
</p>
|
||
|
||
<p>
|
||
In the early days of this feature, you might have found that enabling HDFS caching
|
||
resulted in little or no performance improvement, because it could result in
|
||
<q>hotspots</q>: instead of the I/O to read the table data being parallelized across the
|
||
cluster, the I/O was reduced but the CPU load to process the data blocks might be
|
||
concentrated on a single host.
|
||
</p>
|
||
|
||
<p>
|
||
To avoid hotspots, include the <codeph>WITH REPLICATION</codeph> clause with the
|
||
<codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements for tables that
|
||
use HDFS caching. This clause allows more than one host to cache the relevant data
|
||
blocks, so the CPU load can be shared, reducing the load on any one host. See
|
||
<xref
|
||
href="impala_create_table.xml#create_table"/> and
|
||
<xref
|
||
href="impala_alter_table.xml#alter_table"/> for details.
|
||
</p>
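<p>
One possible form of the clause is sketched below; the table name, pool name, and replication
factor are illustrative, and the referenced statements describe the full syntax:
</p>
<codeblock>ALTER TABLE census SET CACHED IN 'pool_name' WITH REPLICATION = 4;</codeblock>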
|
||
|
||
<p>
|
||
Hotspots with high CPU load for HDFS cached data could still arise in some cases, due to
|
||
the way that Impala schedules the work of processing data blocks on different hosts. In
|
||
<keyword keyref="impala25_full"/> and higher, scheduling improvements mean that the work
|
||
for HDFS cached data is divided better among all the hosts that have cached replicas for
|
||
a particular data block. When more than one host has a cached replica for a data block,
|
||
Impala assigns the work of processing that block to whichever host has done the least
|
||
work (in terms of number of bytes read) for the current query. If hotspots persist even
|
||
with this load-based scheduling algorithm, you can enable the query option
|
||
<codeph>SCHEDULE_RANDOM_REPLICA=TRUE</codeph> to further distribute the CPU load. This
|
||
setting causes Impala to randomly pick a host to process a cached data block if the
|
||
scheduling algorithm encounters a tie when deciding which host has done the least work.
|
||
</p>
|
||
|
||
</conbody>
|
||
|
||
</concept>
|
||
|
||
<concept id="scalability_file_handle_cache" rev="2.10.0 IMPALA-4623">
|
||
|
||
<title>Scalability Considerations for File Handle Caching</title>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
One scalability aspect that affects heavily loaded clusters is the load on the metadata
|
||
layer from looking up the details as each file is opened. On HDFS, that can lead to
|
||
increased load on the NameNode, and on S3, this can lead to an excessive number of S3
|
||
metadata requests. For example, a query that does a full table scan on a partitioned
|
||
table may need to read thousands of partitions, each partition containing multiple data
|
||
files. Accessing each column of a Parquet file also involves a separate <q>open</q>
|
||
call, further increasing the load on the NameNode. High NameNode overhead can add
|
||
startup time (that is, increase latency) to Impala queries, and reduce overall
|
||
throughput for non-Impala workloads that also require accessing HDFS files.
|
||
</p>
|
||
|
||
<p>
|
||
You can reduce the number of calls made to your file system's metadata layer by enabling
|
||
the file handle caching feature. Data files that are accessed by different queries, or
|
||
even multiple times within the same query, can be accessed without a new <q>open</q>
|
||
call and without fetching the file details multiple times.
|
||
</p>
|
||
|
||
<p>
|
||
Impala supports file handle caching for the following file systems:
|
||
<ul>
|
||
<li>
|
||
HDFS in <keyword keyref="impala210_full"/> and higher
|
||
<p>
|
||
In Impala 3.2 and higher, file handle caching also applies to remote HDFS file
|
||
handles. This is controlled by the <codeph>cache_remote_file_handles</codeph> flag
|
||
for an <codeph>impalad</codeph>. It is recommended that you use the default value
|
||
of <codeph>true</codeph> as this caching prevents your NameNode from overloading
|
||
when your cluster has many remote HDFS reads.
|
||
</p>
|
||
</li>
|
||
|
||
<li>
|
||
S3 in <keyword keyref="impala33_full"/> and higher
|
||
<p>
|
||
The <codeph>cache_s3_file_handles</codeph> <codeph>impalad</codeph> flag controls
|
||
the S3 file handle caching. The feature is enabled by default with the flag set to
|
||
<codeph>true</codeph>.
|
||
</p>
|
||
</li>
|
||
</ul>
|
||
</p>
|
||
|
||
<p>
|
||
The feature is enabled by default with 20,000 file handles to be cached. To change the
|
||
value, set the configuration option <codeph>max_cached_file_handles</codeph> to a
|
||
non-zero value for each <cmdname>impalad</cmdname> daemon. From the initial default
|
||
value of 20000, adjust upward if NameNode request load is still significant, or downward
|
||
if it is more important to reduce the extra memory usage on each host. Each cache entry
|
||
consumes 6 KB, meaning that caching 20,000 file handles requires up to 120 MB on each
|
||
Impala executor. The exact memory usage varies depending on how many file handles have
|
||
actually been cached; memory is freed as file handles are evicted from the cache.
|
||
</p>
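<p>
For example, to enlarge the cache on hosts with memory to spare (the value shown is
illustrative), you might add the following startup flag for each <cmdname>impalad</cmdname>
daemon:
</p>
<codeblock>--max_cached_file_handles=40000</codeblock>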
|
||
|
||
<p>
|
||
If a manual operation moves a file to the trashcan while the file handle is cached,
|
||
Impala still accesses the contents of that file. This is a change from prior behavior.
|
||
Previously, accessing a file that was in the trashcan would cause an error. This
|
||
behavior only applies to non-Impala methods of removing files, not the Impala mechanisms
|
||
such as <codeph>TRUNCATE TABLE</codeph> or <codeph>DROP TABLE</codeph>.
|
||
</p>
|
||
|
||
<p>
|
||
If files are removed, replaced, or appended by operations outside of Impala, the way to
|
||
bring the file information up to date is to run the <codeph>REFRESH</codeph> statement
|
||
on the table.
|
||
</p>
|
||
|
||
<p>
|
||
File handle cache entries are evicted as the cache fills up, or based on a timeout
|
||
period when they have not been accessed for some time.
|
||
</p>
|
||
|
||
<p>
|
||
To evaluate the effectiveness of file handle caching for a particular workload, issue
|
||
the <codeph>PROFILE</codeph> statement in <cmdname>impala-shell</cmdname> or examine
|
||
query profiles in the Impala Web UI. Look for the ratio of
|
||
<codeph>CachedFileHandlesHitCount</codeph> (ideally, should be high) to
|
||
<codeph>CachedFileHandlesMissCount</codeph> (ideally, should be low). Before starting
|
||
any evaluation, run several representative queries to <q>warm up</q> the cache because
|
||
the first time each data file is accessed is always recorded as a cache miss.
|
||
</p>
|
||
|
||
<p>
|
||
To see metrics about file handle caching for each <cmdname>impalad</cmdname> instance,
|
||
examine the following fields on the <uicontrol>/metrics</uicontrol> page in the Impala
|
||
Web UI:
|
||
</p>
|
||
|
||
<ul>
|
||
<li>
|
||
<uicontrol>impala-server.io.mgr.cached-file-handles-miss-count</uicontrol>
|
||
</li>
|
||
|
||
<li>
|
||
<uicontrol>impala-server.io.mgr.num-cached-file-handles</uicontrol>
|
||
</li>
|
||
</ul>
|
||
|
||
</conbody>
|
||
|
||
</concept>
|
||
|
||
</concept>
|