<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="scalability">

  <title>Scalability Considerations for Impala</title>
  <titlealts audience="PDF"><navtitle>Scalability Considerations</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Performance"/>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Planning"/>
      <data name="Category" value="Querying"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Memory"/>
      <data name="Category" value="Scalability"/>
      <!-- Using domain knowledge about Impala, sizing, etc. to decide what to mark as 'Proof of Concept'. -->
      <data name="Category" value="Proof of Concept"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      This section explains how the size of your cluster and the volume of data influence SQL performance and
      schema design for Impala tables. Typically, adding more cluster capacity reduces problems due to memory
      limits or disk throughput. On the other hand, larger clusters are more likely to have other kinds of
      scalability issues, such as a single slow node that causes performance problems for queries.
    </p>

    <p outputclass="toc inpage"/>

    <p conref="../shared/impala_common.xml#common/cookbook_blurb"/>

  </conbody>

  <concept audience="Cloudera" id="scalability_memory">

    <title>Overview and Guidelines for Impala Memory Usage</title>
    <prolog>
      <metadata>
        <data name="Category" value="Memory"/>
        <data name="Category" value="Concepts"/>
        <data name="Category" value="Best Practices"/>
        <data name="Category" value="Guidelines"/>
      </metadata>
    </prolog>

    <conbody>

      <!--
        Outline adapted from Alan Choi's "best practices" and/or "performance cookbook" papers.
      -->

<codeblock>Memory Usage – the Basics
* Memory is used by:
  * Hash join – RHS tables after decompression, filtering and projection
  * Group by – proportional to the #groups
  * Parquet writer buffer – 1GB per partition
  * IO buffer (shared across queries)
  * Metadata cache (no more than 1GB typically)
  * Memory held and reused by later queries
  * Impala releases memory from time to time starting in 1.4.

Memory Usage – Estimating Memory Usage
* Use Explain Plan
  * Requires statistics! Mem estimate without stats is meaningless.
  * Reports per-host memory requirement for this cluster size.
  * Re-run if you’ve re-sized the cluster!
[image of explain plan]

Memory Usage – Estimating Memory Usage
* EXPLAIN’s memory estimate issues
  * Can be way off – much higher or much lower.
  * group by’s estimate can be particularly off – when there’s a large number of group by columns.
  * Mem estimate = NDV of group by column 1 * NDV of group by column 2 * ... NDV of group by column n
* Ignore EXPLAIN’s estimate if it’s too high! Do your own estimate for group by:
  * GROUP BY mem usage = (total number of groups * size of each row) + (total number of groups * size of each row) / num nodes

Memory Usage – Finding Actual Memory Usage
* Search for “Per Node Peak Memory Usage” in the profile.
  This is accurate. Use it for production capacity planning.

Memory Usage – Actual Memory Usage
* For complex queries, how do I know which part of my query is using too much memory?
  * Use the ExecSummary from the query profile!
  - But is that "Peak Mem" number aggregate or per-node?
[image of executive summary]

Memory Usage – Hitting Mem-limit
* Top causes (in order) of hitting mem-limit even when running a single query:
  1. Lack of statistics
  2. Lots of joins within a single query
  3. Big-table joining big-table
  4. Gigantic group by

Memory Usage – Hitting Mem-limit
Lack of stats
* Wrong join order, wrong join strategy, wrong insert strategy
* Explain Plan tells you that!
[image of explain plan]
* Fix: Compute Stats table

Memory Usage – Hitting Mem-limit
Lots of joins within a single query
* select ... from fact, dim1, dim2, dim3, ... dimN where ...
* Each dim tbl can fit in memory, but not all of them together
* As of Impala 1.4, Impala might choose the wrong plan – BROADCAST
FIX 1: use shuffle hint
  select ... from fact join [shuffle] dim1 on ... join dim2 [shuffle] ...
FIX 2: pre-join the dim tables (if possible)
  - How about an example to illustrate that technique?
* Fewer joins => better perf!

Memory Usage: Hitting Mem-limit
Big-table joining big-table
* Big-table (after decompression, filtering, and projection) is a table that is bigger than total cluster memory size.
* Impala 2.0 will do this (via disk-based join). Consider using Hive for now.
* (Advanced) For a simple query, you can try this advanced workaround – per-partition join
  * Requires the partition key be part of the join key
  select ... from BigTbl_A a join BigTbl_B b where a.part_key = b.part_key and a.part_key in (1,2,3)
  union all
  select ... from BigTbl_A a join BigTbl_B b where a.part_key = b.part_key and a.part_key in (4,5,6)

Memory Usage: Hitting Mem-limit
Gigantic group by
* The total number of distinct groups is huge, such as group by userid.
* Impala 2.0 will do this (via disk-based agg). Consider using Hive for now.
  - Is this one of the cases where people were unhappy we recommended Hive?
* (Advanced) For a simple query, you can try this advanced workaround – per-partition agg
  * Requires the partition key be part of the group by
  select part_key, col1, col2, ... agg(..) from tbl where
    part_key in (1,2,3)
  group by part_key, col1, col2, ...
  union all
  select part_key, col1, col2, ... agg(..) from tbl where
    part_key in (4,5,6)
  group by part_key, col1, col2, ...

Memory Usage: Additional Notes
* Use explain plan for estimate; use profile for accurate measure
* Data skew can cause uneven memory usage
* Review previous common issues on out-of-memory
* Note: Even with disk-based joins, you'll want to review these steps to speed up queries and use memory more efficiently
</codeblock>
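
      <p>
        For example, to see the planner's per-host memory estimate before running a query, gather statistics
        and then examine the <codeph>EXPLAIN</codeph> output. (A minimal sketch; <codeph>big_table</codeph>
        and the grouping column are hypothetical names.)
      </p>

<codeblock>-- Without statistics, the memory estimate in the EXPLAIN output is largely meaningless.
compute stats big_table;

-- The EXPLAIN output includes an estimated per-host memory requirement for this query
-- on the current cluster; re-check it after resizing the cluster.
explain select userid, count(*) from big_table group by userid;
</codeblock>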

    </conbody>
  </concept>

  <concept id="scalability_catalog">

    <title>Impact of Many Tables or Partitions on Impala Catalog Performance and Memory Usage</title>

    <conbody>

      <p audience="Cloudera">
        Details to fill in in future: Impact of <q>load catalog in background</q> option.
        Changing timeouts. Related Cloudera Manager settings.
      </p>

      <p>
        Because Hadoop I/O is optimized for reading and writing large files, Impala is optimized for tables
        containing relatively few, large data files. Schemas containing thousands of tables, or tables containing
        thousands of partitions, can encounter performance issues during startup or during DDL operations such as
        <codeph>ALTER TABLE</codeph> statements.
      </p>

      <note type="important" rev="TSB-168">
        <p>
          Because of a change in the default heap size for the <cmdname>catalogd</cmdname> daemon in
          <keyword keyref="impala25_full"/> and higher, the following procedure to increase the <cmdname>catalogd</cmdname>
          memory limit might be required following an upgrade to <keyword keyref="impala25_full"/> even if not
          needed previously.
        </p>
      </note>

      <p conref="../shared/impala_common.xml#common/increase_catalogd_heap_size"/>

    </conbody>
  </concept>

<concept rev="2.1.0" id="statestore_scalability">
|
||
|
||
<title>Scalability Considerations for the Impala Statestore</title>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
Before CDH 5.3, the statestore sent only one kind of message to its subscribers. This message contained all
|
||
updates for any topics that a subscriber had subscribed to. It also served to let subscribers know that the
|
||
statestore had not failed, and conversely the statestore used the success of sending a heartbeat to a
|
||
subscriber to decide whether or not the subscriber had failed.
|
||
</p>
|
||
|
||
<p>
|
||
Combining topic updates and failure detection in a single message led to bottlenecks in clusters with large
|
||
numbers of tables, partitions, and HDFS data blocks. When the statestore was overloaded with metadata
|
||
updates to transmit, heartbeat messages were sent less frequently, sometimes causing subscribers to time
|
||
out their connection with the statestore. Increasing the subscriber timeout and decreasing the frequency of
|
||
statestore heartbeats worked around the problem, but reduced responsiveness when the statestore failed or
|
||
restarted.
|
||
</p>
|
||
|
||
<p>
|
||
As of CDH 5.3, the statestore now sends topic updates and heartbeats in separate messages. This allows the
|
||
statestore to send and receive a steady stream of lightweight heartbeats, and removes the requirement to
|
||
send topic updates according to a fixed schedule, reducing statestore network overhead.
|
||
</p>
|
||
|
||
<p>
|
||
The statestore now has the following relevant configuration flags for the <cmdname>statestored</cmdname>
|
||
daemon:
|
||
</p>
|
||
|
||
<dl>
|
||
<dlentry id="statestore_num_update_threads">
|
||
|
||
<dt>
|
||
<codeph>-statestore_num_update_threads</codeph>
|
||
</dt>
|
||
|
||
<dd>
|
||
The number of threads inside the statestore dedicated to sending topic updates. You should not
|
||
typically need to change this value.
|
||
<p>
|
||
<b>Default:</b> 10
|
||
</p>
|
||
</dd>
|
||
|
||
</dlentry>
|
||
|
||
<dlentry id="statestore_update_frequency_ms">
|
||
|
||
<dt>
|
||
<codeph>-statestore_update_frequency_ms</codeph>
|
||
</dt>
|
||
|
||
<dd>
|
||
The frequency, in milliseconds, with which the statestore tries to send topic updates to each
|
||
subscriber. This is a best-effort value; if the statestore is unable to meet this frequency, it sends
|
||
topic updates as fast as it can. You should not typically need to change this value.
|
||
<p>
|
||
<b>Default:</b> 2000
|
||
</p>
|
||
</dd>
|
||
|
||
</dlentry>
|
||
|
||
<dlentry id="statestore_num_heartbeat_threads">
|
||
|
||
<dt>
|
||
<codeph>-statestore_num_heartbeat_threads</codeph>
|
||
</dt>
|
||
|
||
<dd>
|
||
The number of threads inside the statestore dedicated to sending heartbeats. You should not typically
|
||
need to change this value.
|
||
<p>
|
||
<b>Default:</b> 10
|
||
</p>
|
||
</dd>
|
||
|
||
</dlentry>
|
||
|
||
<dlentry id="statestore_heartbeat_frequency_ms">
|
||
|
||
<dt>
|
||
<codeph>-statestore_heartbeat_frequency_ms</codeph>
|
||
</dt>
|
||
|
||
<dd>
|
||
The frequency, in milliseconds, with which the statestore tries to send heartbeats to each subscriber.
|
||
This value should be good for large catalogs and clusters up to approximately 150 nodes. Beyond that,
|
||
you might need to increase this value to make the interval longer between heartbeat messages.
|
||
<p>
|
||
<b>Default:</b> 1000 (one heartbeat message every second)
|
||
</p>
|
||
</dd>
|
||
|
||
</dlentry>
|
||
</dl>
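
      <p>
        For example, on a system not managed by Cloudera Manager, these flags are typically appended to the
        statestore startup arguments. (A sketch only; it assumes a package-based install where
        <codeph>/etc/default/impala</codeph> defines <codeph>IMPALA_STATE_STORE_ARGS</codeph>, and the values
        shown are illustrative rather than recommendations.)
      </p>

<codeblock>IMPALA_STATE_STORE_ARGS=" \
    -log_dir=${IMPALA_LOG_DIR} \
    -state_store_port=${IMPALA_STATE_STORE_PORT} \
    -statestore_update_frequency_ms=2000 \
    -statestore_heartbeat_frequency_ms=1000"
</codeblock>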

      <p>
        As of CDH 5.3, not all of these flags are present in the Cloudera Manager user interface. Some must be set
        using the <uicontrol>Advanced Configuration Snippet</uicontrol> fields for the statestore component.
      </p>

      <p>
        If it takes a very long time for a cluster to start up, and <cmdname>impala-shell</cmdname> consistently
        displays <codeph>This Impala daemon is not ready to accept user requests</codeph>, the statestore might be
        taking too long to send the entire catalog topic to the cluster. In this case, consider adding
        <codeph>--load_catalog_in_background=false</codeph> to your catalog service configuration. This setting
        stops the catalog service from loading metadata for the entire catalog into memory at cluster startup.
        Instead, metadata for each table is loaded when the table is accessed for the first time.
      </p>
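
      <p>
        For example, on a system not managed by Cloudera Manager, the flag can be added to the catalog service
        startup arguments. (A sketch only; it assumes a package-based install where
        <codeph>/etc/default/impala</codeph> defines <codeph>IMPALA_CATALOG_ARGS</codeph>.)
      </p>

<codeblock>IMPALA_CATALOG_ARGS=" \
    -log_dir=${IMPALA_LOG_DIR} \
    --load_catalog_in_background=false"
</codeblock>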

    </conbody>
  </concept>

  <concept audience="Cloudera" id="scalability_cluster_size">

    <title>Scalability Considerations for Impala Cluster Size and Topology</title>

    <conbody>

      <p>
      </p>
    </conbody>
  </concept>

  <concept audience="Cloudera" id="concurrent_connections">

    <title>Scaling the Number of Concurrent Connections</title>

    <conbody>

      <p></p>
    </conbody>
  </concept>

<concept rev="2.0.0" id="spill_to_disk">
|
||
|
||
<title>SQL Operations that Spill to Disk</title>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
Certain memory-intensive operations write temporary data to disk (known as <term>spilling</term> to disk)
|
||
when Impala is close to exceeding its memory limit on a particular host.
|
||
</p>
|
||
|
||
<p>
|
||
The result is a query that completes successfully, rather than failing with an out-of-memory error. The
|
||
tradeoff is decreased performance due to the extra disk I/O to write the temporary data and read it back
|
||
in. The slowdown could be potentially be significant. Thus, while this feature improves reliability,
|
||
you should optimize your queries, system parameters, and hardware configuration to make this spilling a rare occurrence.
|
||
</p>
|
||
|
||
<p>
|
||
<b>What kinds of queries might spill to disk:</b>
|
||
</p>
|
||
|
||
<p>
|
||
Several SQL clauses and constructs require memory allocations that could activat the spilling mechanism:
|
||
</p>
      <ul>
        <li>
          <p>
            When a query uses a <codeph>GROUP BY</codeph> clause for columns
            with millions or billions of distinct values, Impala keeps a
            similar number of temporary results in memory, to accumulate the
            aggregate results for each value in the group.
          </p>
        </li>
        <li>
          <p>
            When large tables are joined together, Impala keeps the values of
            the join columns from one table in memory, to compare them to
            incoming values from the other table.
          </p>
        </li>
        <li>
          <p>
            When a large result set is sorted by the <codeph>ORDER BY</codeph>
            clause, each node sorts its portion of the result set in memory.
          </p>
        </li>
        <li>
          <p>
            The <codeph>DISTINCT</codeph> and <codeph>UNION</codeph> operators
            build in-memory data structures to represent all values found so
            far, to eliminate duplicates as the query progresses.
          </p>
        </li>
        <!-- JIRA still in open state as of 5.8 / 2.6, commenting out.
        <li>
          <p rev="IMPALA-3471">
            In <keyword keyref="impala26_full"/> and higher, <term>top-N</term> queries (those with
            <codeph>ORDER BY</codeph> and <codeph>LIMIT</codeph> clauses) can also spill.
            Impala allocates enough memory to hold as many rows as specified by the <codeph>LIMIT</codeph>
            clause, plus enough memory to hold as many rows as specified by any <codeph>OFFSET</codeph> clause.
          </p>
        </li>
        -->
      </ul>
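
      <p>
        The following query combines several of these memory-intensive constructs. (A sketch with hypothetical
        table and column names; whether it actually spills depends on data volume, statistics, and the memory
        limit on each host.)
      </p>

<codeblock>-- A join of two large tables, a high-cardinality GROUP BY, and a full ORDER BY
-- all build large in-memory data structures and are therefore candidates for spilling.
select s.customer_id, count(distinct s.order_id) as orders, sum(s.amount) as total
  from sales s join customers c on s.customer_id = c.customer_id
  group by s.customer_id
  order by total desc;
</codeblock>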

      <p conref="../shared/impala_common.xml#common/spill_to_disk_vs_dynamic_partition_pruning"/>

      <p>
        <b>How Impala handles scratch disk space for spilling:</b>
      </p>

      <p rev="obwl" conref="../shared/impala_common.xml#common/order_by_scratch_dir"/>

      <p>
        <b>Memory usage for SQL operators:</b>
      </p>

      <p>
        The infrastructure of the spilling feature affects the way the affected SQL operators, such as
        <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use memory.
        On each host that participates in the query, each such operator in a query accumulates memory
        while building the data structure to process the aggregation or join operation. The amount
        of memory used depends on the portion of the data being handled by that host, and thus might
        be different from one host to another. When the amount of memory being used for the operator
        on a particular host reaches a threshold amount, Impala reserves an additional memory buffer
        to use as a work area in case that operator causes the query to exceed the memory limit for
        that host. After allocating the memory buffer, the memory used by that operator remains
        essentially stable or grows only slowly, until the point where the memory limit is reached
        and the query begins writing temporary data to disk.
      </p>

      <p rev="2.2.0">
        Prior to Impala 2.2 (CDH 5.4), the extra memory buffer for an operator that might spill to disk
        was allocated when the data structure used by the applicable SQL operator reached 16 MB in size,
        and the memory buffer itself was 512 MB. In Impala 2.2, these values are halved: the threshold value
        is 8 MB and the memory buffer is 256 MB. <ph rev="2.3.0">In <keyword keyref="impala23_full"/> and higher, the memory for the buffer
        is allocated in pieces, only as needed, to avoid sudden large jumps in memory usage.</ph> A query that uses
        multiple such operators might allocate multiple such memory buffers, as the size of the data structure
        for each operator crosses the threshold on a particular host.
      </p>

      <p>
        Therefore, a query that processes a relatively small amount of data on each host would likely
        never reach the threshold for any operator, and would never allocate any extra memory buffers. A query
        that did process millions of groups, distinct values, join keys, and so on might cross the threshold,
        causing its memory requirement to rise suddenly and then flatten out. The larger the cluster, the less data
        is processed on any particular host, thus reducing the chance of requiring the extra memory allocation.
      </p>

      <p>
        <b>Added in:</b> This feature was added to the <codeph>ORDER BY</codeph> clause in Impala 1.4 for CDH 4,
        and in CDH 5.1. This feature was extended to cover join queries, aggregation functions, and analytic
        functions in Impala 2.0 for CDH 4, and in CDH 5.2. The size of the memory work area required by
        each operator that spills was reduced from 512 megabytes to 256 megabytes in Impala 2.2 (CDH 5.4).
      </p>

      <p>
        <b>Avoiding queries that spill to disk:</b>
      </p>

      <p>
        Because the extra I/O can impose significant performance overhead on these types of queries, try to avoid
        this situation by using the following steps:
      </p>

      <ol>
        <li>
          Detect how often queries spill to disk, and how much temporary data is written. Refer to the following
          sources:
          <ul>
            <li>
              The output of the <codeph>PROFILE</codeph> command in the <cmdname>impala-shell</cmdname>
              interpreter. This data shows the memory usage for each host and in total across the cluster. The
              <codeph>BlockMgr.BytesWritten</codeph> counter reports how much data was written to disk during the
              query.
            </li>

            <li>
              The <uicontrol>Impala Queries</uicontrol> dialog in Cloudera Manager. You can see the peak memory
              usage for a query, combined across all nodes in the cluster.
            </li>

            <li>
              The <uicontrol>Queries</uicontrol> tab in the Impala debug web user interface. Select the query to
              examine and click the corresponding <uicontrol>Profile</uicontrol> link. This data breaks down the
              memory usage for a single host within the cluster, the host whose web interface you are connected to.
            </li>
          </ul>
        </li>

        <li>
          Use one or more techniques to reduce the possibility of the queries spilling to disk:
          <ul>
            <li>
              Increase the Impala memory limit if practical, for example, if you can increase the available memory
              by more than the amount of temporary data written to disk on a particular node. Remember that in
              Impala 2.0 and later, you can issue <codeph>SET MEM_LIMIT</codeph> as a SQL statement, which lets you
              fine-tune the memory usage for queries from JDBC and ODBC applications.
            </li>

            <li>
              Increase the number of nodes in the cluster, to increase the aggregate memory available to Impala and
              reduce the amount of memory required on each node.
            </li>

            <li>
              Increase the overall memory capacity of each DataNode at the hardware level.
            </li>

            <li>
              On a cluster with resources shared between Impala and other Hadoop components, use resource
              management features to allocate more memory for Impala. See
              <xref href="impala_resource_management.xml#resource_management"/> for details.
            </li>

            <li>
              If the memory pressure is due to running many concurrent queries rather than a few memory-intensive
              ones, consider using the Impala admission control feature to lower the limit on the number of
              concurrent queries. By spacing out the most resource-intensive queries, you can avoid spikes in
              memory usage and improve overall response times. See
              <xref href="impala_admission.xml#admission_control"/> for details.
            </li>

            <li>
              Tune the queries with the highest memory requirements, using one or more of the following techniques:
              <ul>
                <li>
                  Run the <codeph>COMPUTE STATS</codeph> statement for all tables involved in large-scale joins and
                  aggregation queries.
                </li>

                <li>
                  Minimize your use of <codeph>STRING</codeph> columns in join columns. Prefer numeric values
                  instead.
                </li>

                <li>
                  Examine the <codeph>EXPLAIN</codeph> plan to understand the execution strategy being used for the
                  most resource-intensive queries. See <xref href="impala_explain_plan.xml#perf_explain"/> for
                  details.
                </li>

                <li>
                  If Impala still chooses a suboptimal execution strategy even with statistics available, or if it
                  is impractical to keep the statistics up to date for huge or rapidly changing tables, add hints
                  to the most resource-intensive queries to select the right execution strategy. See
                  <xref href="impala_hints.xml#hints"/> for details, and the brief example following this list.
                </li>
              </ul>
            </li>

            <li>
              If your queries experience substantial performance overhead due to spilling, enable the
              <codeph>DISABLE_UNSAFE_SPILLS</codeph> query option. This option prevents queries whose memory usage
              is likely to be exorbitant from spilling to disk. See
              <xref href="impala_disable_unsafe_spills.xml#disable_unsafe_spills"/> for details. As you tune
              problematic queries using the preceding steps, fewer and fewer will be cancelled by this option
              setting.
            </li>
          </ul>
        </li>
      </ol>
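
      <p>
        For example, computing statistics and, if necessary, adding a join hint often reduces the memory
        requirement enough to avoid spilling. (A sketch with hypothetical table names; the
        <codeph>[shuffle]</codeph> hint is only appropriate when a partitioned join really is the better strategy.)
      </p>

<codeblock>compute stats sales;
compute stats customers;

-- If the planner still chooses a broadcast join that exceeds memory on some hosts,
-- hint a partitioned (shuffle) join instead:
select c.region, sum(s.amount)
  from sales s join [shuffle] customers c on s.customer_id = c.customer_id
  group by c.region;
</codeblock>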

      <p>
        <b>Testing performance implications of spilling to disk:</b>
      </p>

      <p>
        To artificially provoke spilling, to test this feature and understand the performance implications, use a
        test environment with a memory limit of at least 2 GB. Issue the <codeph>SET</codeph> command with no
        arguments to check the current setting for the <codeph>MEM_LIMIT</codeph> query option. Set the query
        option <codeph>DISABLE_UNSAFE_SPILLS=true</codeph>. This option limits the spill-to-disk feature to prevent
        runaway disk usage from queries that are known in advance to be suboptimal. Within
        <cmdname>impala-shell</cmdname>, run a query that you expect to be memory-intensive, based on the criteria
        explained earlier. A self-join of a large table is a good candidate:
      </p>

<codeblock>select count(*) from big_table a join big_table b using (column_with_many_values);
</codeblock>

      <p>
        Issue the <codeph>PROFILE</codeph> command to get a detailed breakdown of the memory usage on each node
        during the query. The crucial part of the profile output concerning memory is the <codeph>BlockMgr</codeph>
        portion. For example, this profile shows that the query did not quite exceed the memory limit.
      </p>

<codeblock>BlockMgr:
   - BlockWritesIssued: 1
   - BlockWritesOutstanding: 0
   - BlocksCreated: 24
   - BlocksRecycled: 1
   - BufferedPins: 0
   - MaxBlockSize: 8.00 MB (8388608)
   <b>- MemoryLimit: 200.00 MB (209715200)</b>
   <b>- PeakMemoryUsage: 192.22 MB (201555968)</b>
   - TotalBufferWaitTime: 0ns
   - TotalEncryptionTime: 0ns
   - TotalIntegrityCheckTime: 0ns
   - TotalReadBlockTime: 0ns
</codeblock>

      <p>
        In this case, because the memory limit was already below any recommended value, increase the volume of
        data for the query rather than reducing the memory limit any further.
      </p>

      <p>
        Set the <codeph>MEM_LIMIT</codeph> query option to a value that is smaller than the peak memory usage
        reported in the profile output. Do not specify a memory limit lower than about 300 MB, because with such a
        low limit, queries could fail to start for other reasons. Now try the memory-intensive query again.
      </p>
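
      <p>
        For example, the session setup within <cmdname>impala-shell</cmdname> might look like the following.
        (The values are illustrative only; pick a limit slightly below the peak memory usage you actually
        observed in the profile.)
      </p>

<codeblock>set disable_unsafe_spills=true;
set mem_limit=380m;
select count(*) from big_table a join big_table b using (column_with_many_values);
</codeblock>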

      <p>
        Check if the query fails with a message like the following:
      </p>

<codeblock>WARNINGS: Spilling has been disabled for plans that do not have stats and are not hinted
to prevent potentially bad plans from using too many cluster resources. Compute stats on
these tables, hint the plan or disable this behavior via query options to enable spilling.
</codeblock>

      <p>
        If so, the query could have consumed substantial temporary disk space, slowing down so much that it would
        not complete in any reasonable time. Rather than rely on the spill-to-disk feature in this case, issue the
        <codeph>COMPUTE STATS</codeph> statement for the table or tables in your sample query. Then run the query
        again, check the peak memory usage again in the <codeph>PROFILE</codeph> output, and adjust the memory
        limit again if necessary to be lower than the peak memory usage.
      </p>

      <p>
        At this point, you have a query that is memory-intensive, but Impala can optimize it efficiently so that
        the memory usage is not exorbitant. You have set an artificial constraint through the
        <codeph>MEM_LIMIT</codeph> option so that the query would normally fail with an out-of-memory error. But
        the automatic spill-to-disk feature means that the query should actually succeed, at the expense of some
        extra disk I/O to read and write temporary work data.
      </p>

      <p>
        Try the query again, and confirm that it succeeds. Examine the <codeph>PROFILE</codeph> output again. This
        time, look for lines of this form:
      </p>

<codeblock>- SpilledPartitions: <varname>N</varname>
</codeblock>

      <p>
        If you see any such lines with <varname>N</varname> greater than 0, that indicates the query would have
        failed in Impala releases prior to 2.0, but now it succeeded because of the spill-to-disk feature. Examine
        the total time taken by the <codeph>AGGREGATION_NODE</codeph> or other query fragments containing non-zero
        <codeph>SpilledPartitions</codeph> values. Compare the times to similar fragments that did not spill, for
        example in the <codeph>PROFILE</codeph> output when the same query is run with a higher memory limit. This
        gives you an idea of the performance penalty of the spill operation for a particular query with a
        particular memory limit. If you make the memory limit just a little lower than the peak memory usage, the
        query only needs to write a small amount of temporary data to disk. The lower you set the memory limit, the
        more temporary data is written and the slower the query becomes.
      </p>

      <p>
        Now repeat this procedure for actual queries used in your environment. Use the
        <codeph>DISABLE_UNSAFE_SPILLS</codeph> setting to identify cases where queries used more memory than
        necessary due to lack of statistics on the relevant tables and columns, and issue <codeph>COMPUTE
        STATS</codeph> where necessary.
      </p>

      <p>
        <b>When to use DISABLE_UNSAFE_SPILLS:</b>
      </p>

      <p>
        You might wonder why you would not leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned on all the time.
        Whether and how frequently to use this option depends on your system environment and workload.
      </p>

      <p>
        <codeph>DISABLE_UNSAFE_SPILLS</codeph> is suitable for an environment with ad hoc queries whose performance
        characteristics and memory usage are not known in advance. It prevents <q>worst-case scenario</q> queries
        that use large amounts of memory unnecessarily. Thus, you might turn this option on within a session while
        developing new SQL code, even though it is turned off for existing applications.
      </p>

      <p>
        Organizations where table and column statistics are generally up-to-date might leave this option turned on
        all the time, again to avoid worst-case scenarios for untested queries or if a problem in the ETL pipeline
        results in a table with no statistics. Turning on <codeph>DISABLE_UNSAFE_SPILLS</codeph> lets you <q>fail
        fast</q> in this case and immediately gather statistics or tune the problematic queries.
      </p>

      <p>
        Some organizations might leave this option turned off. For example, you might have tables large enough that
        the <codeph>COMPUTE STATS</codeph> takes substantial time to run, making it impractical to re-run after
        loading new data. If you have examined the <codeph>EXPLAIN</codeph> plans of your queries and know that
        they are operating efficiently, you might leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned off. In that
        case, you know that any queries that spill will not go overboard with their memory consumption.
      </p>

      <!--
      <p>
        <b>Turning off the spill-to-disk feature: (<keyword keyref="impala24_full"/> and lower only)</b>
      </p>

      <p>
        Prior to <keyword keyref="impala25_full"/> certain conditions...
      </p>

      <p>
        You might turn off the spill-to-disk feature if you are in an environment with constraints on disk space,
        or if you prefer for queries that exceed the memory capacity in your cluster to <q>fail fast</q> so that
        you can tune and retry them.
      </p>

      <p>
        To turn off this feature, set the following configuration options for each <cmdname>impalad</cmdname>
        daemon, either through the <cmdname>impalad</cmdname> advanced configuration snippet in Cloudera Manager,
        or during <cmdname>impalad</cmdname> startup on each DataNode on systems not managed by Cloudera Manager:
      </p>

      <codeblock>−−enable_partitioned_aggregation=false
−−enable_partitioned_hash_join=false
      </codeblock>
      -->

    </conbody>
  </concept>

<concept id="complex_query">
|
||
<title>Limits on Query Size and Complexity</title>
|
||
<conbody>
|
||
<p>
|
||
There are hardcoded limits on the maximum size and complexity of queries.
|
||
Currently, the maximum number of expressions in a query is 2000.
|
||
You might exceed the limits with large or deeply nested queries
|
||
produced by business intelligence tools or other query generators.
|
||
</p>
|
||
<p>
|
||
If you have the ability to customize such queries or the query generation
|
||
logic that produces them, replace sequences of repetitive expressions
|
||
with single operators such as <codeph>IN</codeph> or <codeph>BETWEEN</codeph>
|
||
that can represent multiple values or ranges.
|
||
For example, instead of a large number of <codeph>OR</codeph> clauses:
|
||
</p>
|
||
<codeblock>WHERE val = 1 OR val = 2 OR val = 6 OR val = 100 ...
|
||
</codeblock>
|
||
<p>
|
||
use a single <codeph>IN</codeph> clause:
|
||
</p>
|
||
<codeblock>WHERE val IN (1,2,6,100,...)</codeblock>
|
||
</conbody>
|
||
</concept>
|
||
|
||
<concept id="scalability_io">
|
||
<title>Scalability Considerations for Impala I/O</title>
|
||
<conbody>
|
||
<p>
|
||
Impala parallelizes its I/O operations aggressively,
|
||
therefore the more disks you can attach to each host, the better.
|
||
Impala retrieves data from disk so quickly using
|
||
bulk read operations on large blocks, that most queries
|
||
are CPU-bound rather than I/O-bound.
|
||
</p>
|
||
<p>
|
||
Because the kind of sequential scanning typically done by
|
||
Impala queries does not benefit much from the random-access
|
||
capabilities of SSDs, spinning disks typically provide
|
||
the most cost-effective kind of storage for Impala data,
|
||
with little or no performance penalty as compared to SSDs.
|
||
</p>
|
||
<p>
|
||
Resource management features such as YARN, Llama, and admission control
|
||
typically constrain the amount of memory, CPU, or overall number of
|
||
queries in a high-concurrency environment.
|
||
Currently, there is no throttling mechanism for Impala I/O.
|
||
</p>
|
||
</conbody>
|
||
</concept>

  <concept id="big_tables">
    <title>Scalability Considerations for Table Layout</title>
    <conbody>
      <p>
        Due to the overhead of retrieving and updating table metadata
        in the metastore database, try to limit the number of columns
        in a table to a maximum of approximately 2000.
        Although Impala can handle wider tables than this, the metastore overhead
        can become significant, leading to query performance that is slower
        than expected based on the actual data volume.
      </p>
      <p>
        To minimize overhead related to the metastore database and Impala query planning,
        try to limit the number of partitions for any partitioned table to a few tens of thousands.
      </p>
    </conbody>
  </concept>

<concept rev="CDH-38321" id="kerberos_overhead_cluster_size">
|
||
<title>Kerberos-Related Network Overhead for Large Clusters</title>
|
||
<conbody>
|
||
<p>
|
||
When Impala starts up, or after each <codeph>kinit</codeph> refresh, Impala sends a number of
|
||
simultaneous requests to the KDC. For a cluster with 100 hosts, the KDC might be able to process
|
||
all the requests within roughly 5 seconds. For a cluster with 1000 hosts, the time to process
|
||
the requests would be roughly 500 seconds. Impala also makes a number of DNS requests at the same
|
||
time as these Kerberos-related requests.
|
||
</p>
|
||
<p>
|
||
While these authentication requests are being processed, any submitted Impala queries will fail.
|
||
During this period, the KDC and DNS may be slow to respond to requests from components other than Impala,
|
||
so other secure services might be affected temporarily.
|
||
</p>
|
||
<p>
|
||
To reduce the frequency of the <codeph>kinit</codeph> renewal that initiates a new set of
|
||
authentication requests, increase the <codeph>kerberos_reinit_interval</codeph> configuration setting
|
||
for the <cmdname>impalad</cmdname> daemons. Currently, the default for a cluster not managed by
|
||
Cloudera Manager is 60 minutes, while the default under Cloudera Manager is 10 minutes.
|
||
Consider using a higher value such as 360 (6 hours).
|
||
</p>
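
      <p>
        For example, on a system not managed by Cloudera Manager, the setting can be added to the
        <cmdname>impalad</cmdname> startup arguments. (A sketch only; it assumes a package-based install where
        <codeph>/etc/default/impala</codeph> defines <codeph>IMPALA_SERVER_ARGS</codeph>. Under Cloudera Manager,
        use the corresponding advanced configuration snippet instead.)
      </p>

<codeblock># Existing flags omitted; append the Kerberos renewal interval (in minutes).
IMPALA_SERVER_ARGS=" \
    -log_dir=${IMPALA_LOG_DIR} \
    -kerberos_reinit_interval=360"
</codeblock>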

    </conbody>
  </concept>

<concept id="scalability_hotspots" rev="2.5.0 IMPALA-2696">
|
||
<title>Avoiding CPU Hotspots for HDFS Cached Data</title>
|
||
<conbody>
|
||
<p>
|
||
You can use the HDFS caching feature, described in <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>,
|
||
with Impala to reduce I/O and memory-to-memory copying for frequently accessed tables or partitions.
|
||
</p>
|
||
<p>
|
||
In the early days of this feature, you might have found that enabling HDFS caching
|
||
resulted in little or no performance improvement, because it could result in
|
||
<q>hotspots</q>: instead of the I/O to read the table data being parallelized across
|
||
the cluster, the I/O was reduced but the CPU load to process the data blocks
|
||
might be concentrated on a single host.
|
||
</p>
|
||
<p>
|
||
To avoid hotspots, include the <codeph>WITH REPLICATION</codeph> clause with the
|
||
<codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements for tables that use HDFS caching.
|
||
This clause allows more than one host to cache the relevant data blocks, so the CPU load
|
||
can be shared, reducing the load on any one host.
|
||
See <xref href="impala_create_table.xml#create_table"/> and <xref href="impala_alter_table.xml#alter_table"/>
|
||
for details.
|
||
</p>
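
      <p>
        For example, the following statement caches an existing table with more than one cached replica per
        block. (A sketch; <codeph>census</codeph> and the <codeph>pool_name</codeph> cache pool are hypothetical,
        and the replication factor should be chosen based on your cluster.)
      </p>

<codeblock>ALTER TABLE census SET CACHED IN 'pool_name' WITH REPLICATION = 3;
</codeblock>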

      <p>
        Hotspots with high CPU load for HDFS cached data could still arise in some cases, due to
        the way that Impala schedules the work of processing data blocks on different hosts.
        In <keyword keyref="impala25_full"/> and higher, scheduling improvements mean that the work for
        HDFS cached data is divided better among all the hosts that have cached replicas
        for a particular data block. When more than one host has a cached replica for a data block,
        Impala assigns the work of processing that block to whichever host has done the least work
        (in terms of number of bytes read) for the current query. If hotspots persist even with this
        load-based scheduling algorithm, you can enable the query option <codeph>SCHEDULE_RANDOM_REPLICA=TRUE</codeph>
        to further distribute the CPU load. This setting causes Impala to randomly pick a host to process a cached
        data block if the scheduling algorithm encounters a tie when deciding which host has done the
        least work.
      </p>
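
      <p>
        The query option is set within a session, for example in <cmdname>impala-shell</cmdname> before running
        the affected queries. (A minimal sketch.)
      </p>

<codeblock>set schedule_random_replica=true;
</codeblock>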

    </conbody>
  </concept>

</concept>