mirror of
https://github.com/apache/impala.git
synced 2025-12-19 09:58:28 -05:00
Change-Id: I8342bcb47d4a9aa422e234e488dd1dfbdc1694d4 Reviewed-on: http://gerrit.cloudera.org:8080/10657 Reviewed-by: Balazs Jeszenszky <jeszyb@gmail.com> Reviewed-by: Alex Rodoni <arodoni@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
358 lines
18 KiB
XML
358 lines
18 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
|
||
<!--
|
||
Licensed to the Apache Software Foundation (ASF) under one
|
||
or more contributor license agreements. See the NOTICE file
|
||
distributed with this work for additional information
|
||
regarding copyright ownership. The ASF licenses this file
|
||
to you under the Apache License, Version 2.0 (the
|
||
"License"); you may not use this file except in compliance
|
||
with the License. You may obtain a copy of the License at
|
||
|
||
http://www.apache.org/licenses/LICENSE-2.0
|
||
|
||
Unless required by applicable law or agreed to in writing,
|
||
software distributed under the License is distributed on an
|
||
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||
KIND, either express or implied. See the License for the
|
||
specific language governing permissions and limitations
|
||
under the License.
|
||
-->
|
||
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
||
<concept id="perf_cookbook">
|
||
|
||
<title>Impala Performance Guidelines and Best Practices</title>
|
||
<titlealts audience="PDF"><navtitle>Performance Best Practices</navtitle></titlealts>
|
||
<prolog>
|
||
<metadata>
|
||
<data name="Category" value="Performance"/>
|
||
<data name="Category" value="Impala"/>
|
||
<data name="Category" value="Planning"/>
|
||
<data name="Category" value="Proof of Concept"/>
|
||
<data name="Category" value="Guidelines"/>
|
||
<data name="Category" value="Best Practices"/>
|
||
<data name="Category" value="Proof of Concept"/>
|
||
<data name="Category" value="Developers"/>
|
||
<data name="Category" value="Data Analysts"/>
|
||
</metadata>
|
||
</prolog>
|
||
|
||
<conbody>
|
||
|
||
<p>
|
||
Here are performance guidelines and best practices that you can use during planning, experimentation, and
|
||
performance tuning for an Impala-enabled <keyword keyref="distro"/> cluster. All of this information is also available in more
|
||
detail elsewhere in the Impala documentation; it is gathered together here to serve as a cookbook and
|
||
emphasize which performance techniques typically provide the highest return on investment
|
||
</p>
|
||
|
||
<p outputclass="toc inpage"/>
|
||
|
||
<section id="perf_cookbook_file_format">
|
||
|
||
<title>Choose the appropriate file format for the data</title>
|
||
|
||
<p>
|
||
Typically, for large volumes of data (multiple gigabytes per table or partition), the Parquet file format
|
||
performs best because of its combination of columnar storage layout, large I/O request size, and
|
||
compression and encoding. See <xref href="impala_file_formats.xml#file_formats"/> for comparisons of all
|
||
file formats supported by Impala, and <xref href="impala_parquet.xml#parquet"/> for details about the
|
||
Parquet file format.
|
||
</p>
|
||
|
||
<note>
|
||
For smaller volumes of data, a few gigabytes or less for each table or partition, you might not see
|
||
significant performance differences between file formats. At small data volumes, reduced I/O from an
|
||
efficient compressed file format can be counterbalanced by reduced opportunity for parallel execution. When
|
||
planning for a production deployment or conducting benchmarks, always use realistic data volumes to get a
|
||
true picture of performance and scalability.
|
||
</note>
|
||
</section>
|
||
|
||
<section id="perf_cookbook_small_files">
|
||
|
||
<title>Avoid data ingestion processes that produce many small files</title>
|
||
|
||
<p>
|
||
When producing data files outside of Impala, prefer either text format or Avro, where you can build up the
|
||
files row by row. Once the data is in Impala, you can convert it to the more efficient Parquet format and
|
||
split into multiple data files using a single <codeph>INSERT ... SELECT</codeph> statement. Or, if you have
|
||
the infrastructure to produce multi-megabyte Parquet files as part of your data preparation process, do
|
||
that and skip the conversion step inside Impala.
|
||
</p>
|
||
|
||
<p>
|
||
Always use <codeph>INSERT ... SELECT</codeph> to copy significant volumes of data from table to table
|
||
within Impala. Avoid <codeph>INSERT ... VALUES</codeph> for any substantial volume of data or
|
||
performance-critical tables, because each such statement produces a separate tiny data file. See
|
||
<xref href="impala_insert.xml#insert"/> for examples of the <codeph>INSERT ... SELECT</codeph> syntax.
|
||
</p>
|
||
|
||
<p>
|
||
For example, if you have thousands of partitions in a Parquet table, each with less than
|
||
<ph rev="parquet_block_size">256 MB</ph> of data, consider partitioning in a less granular way, such as by
|
||
year / month rather than year / month / day. If an inefficient data ingestion process produces thousands of
|
||
data files in the same table or partition, consider compacting the data by performing an <codeph>INSERT ...
|
||
SELECT</codeph> to copy all the data to a different table; the data will be reorganized into a smaller
|
||
number of larger files by this process.
|
||
</p>
|
||
</section>
|
||
|
||
<section id="perf_cookbook_partitioning">
|
||
|
||
<title>Choose partitioning granularity based on actual data volume</title>
|
||
|
||
<p>
|
||
Partitioning is a technique that physically divides the data based on values of one or more columns, such
|
||
as by year, month, day, region, city, section of a web site, and so on. When you issue queries that request
|
||
a specific value or range of values for the partition key columns, Impala can avoid reading the irrelevant
|
||
data, potentially yielding a huge savings in disk I/O.
|
||
</p>
|
||
|
||
<p>
|
||
When deciding which column(s) to use for partitioning, choose the right level of granularity. For example,
|
||
should you partition by year, month, and day, or only by year and month? Choose a partitioning strategy
|
||
that puts at least <ph rev="parquet_block_size">256 MB</ph> of data in each partition, to take advantage of
|
||
HDFS bulk I/O and Impala distributed queries.
|
||
</p>
|
||
|
||
<p>
|
||
Over-partitioning can also cause query planning to take longer than necessary, as Impala prunes the
|
||
unnecessary partitions. Ideally, keep the number of partitions in the table under 30 thousand.
|
||
</p>
|
||
|
||
<p>
|
||
When preparing data files to go in a partition directory, create several large files rather than many small
|
||
ones. If you receive data in the form of many small files and have no control over the input format,
|
||
consider using the <codeph>INSERT ... SELECT</codeph> syntax to copy data from one table or partition to
|
||
another, which compacts the files into a relatively small number (based on the number of nodes in the
|
||
cluster).
|
||
</p>
|
||
|
||
<p>
|
||
If you need to reduce the overall number of partitions and increase the amount of data in each partition,
|
||
first look for partition key columns that are rarely referenced or are referenced in non-critical queries
|
||
(not subject to an SLA). For example, your web site log data might be partitioned by year, month, day, and
|
||
hour, but if most queries roll up the results by day, perhaps you only need to partition by year, month,
|
||
and day.
|
||
</p>
|
||
|
||
<p>
|
||
If you need to reduce the granularity even more, consider creating <q>buckets</q>, computed values
|
||
corresponding to different sets of partition key values. For example, you can use the
|
||
<codeph>TRUNC()</codeph> function with a <codeph>TIMESTAMP</codeph> column to group date and time values
|
||
based on intervals such as week or quarter. See
|
||
<xref href="impala_datetime_functions.xml#datetime_functions"/> for details.
|
||
</p>
|
||
|
||
<p>
|
||
See <xref href="impala_partitioning.xml#partitioning"/> for full details and performance considerations for
|
||
partitioning.
|
||
</p>
|
||
</section>
|
||
|
||
<section id="perf_cookbook_partition_keys">
|
||
|
||
<title>Use smallest appropriate integer types for partition key columns</title>
|
||
|
||
<p>
|
||
Although it is tempting to use strings for partition key columns, since those values are turned into HDFS
|
||
directory names anyway, you can minimize memory usage by using numeric values for common partition key
|
||
fields such as <codeph>YEAR</codeph>, <codeph>MONTH</codeph>, and <codeph>DAY</codeph>. Use the smallest
|
||
integer type that holds the appropriate range of values, typically <codeph>TINYINT</codeph> for
|
||
<codeph>MONTH</codeph> and <codeph>DAY</codeph>, and <codeph>SMALLINT</codeph> for <codeph>YEAR</codeph>.
|
||
Use the <codeph>EXTRACT()</codeph> function to pull out individual date and time fields from a
|
||
<codeph>TIMESTAMP</codeph> value, and <codeph>CAST()</codeph> the return value to the appropriate integer
|
||
type.
|
||
</p>
|
||
</section>
|
||
|
||
<section id="perf_cookbook_parquet_block_size">
|
||
|
||
<title>Choose an appropriate Parquet block size</title>
|
||
|
||
<p rev="parquet_block_size">
|
||
By default, the Impala <codeph>INSERT ... SELECT</codeph> statement creates Parquet files with a 256 MB
|
||
block size. (This default was changed in Impala 2.0. Formerly, the limit was 1 GB, but Impala made
|
||
conservative estimates about compression, resulting in files that were smaller than 1 GB.)
|
||
</p>
|
||
|
||
<p>
|
||
Each Parquet file written by Impala is a single block, allowing the whole file to be processed as a unit by a single host.
|
||
As you copy Parquet files into HDFS or between HDFS filesystems, use <codeph>hdfs dfs -pb</codeph> to preserve the original
|
||
block size.
|
||
</p>
|
||
|
||
<p>
|
||
If there is only one or a few data block in your Parquet table, or in a partition that is the only one
|
||
accessed by a query, then you might experience a slowdown for a different reason: not enough data to take
|
||
advantage of Impala's parallel distributed queries. Each data block is processed by a single core on one of
|
||
the DataNodes. In a 100-node cluster of 16-core machines, you could potentially process thousands of data
|
||
files simultaneously. You want to find a sweet spot between <q>many tiny files</q> and <q>single giant
|
||
file</q> that balances bulk I/O and parallel processing. You can set the <codeph>PARQUET_FILE_SIZE</codeph>
|
||
query option before doing an <codeph>INSERT ... SELECT</codeph> statement to reduce the size of each
|
||
generated Parquet file. <ph rev="2.0.0">(Specify the file size as an absolute number of bytes, or in Impala
|
||
2.0 and later, in units ending with <codeph>m</codeph> for megabytes or <codeph>g</codeph> for
|
||
gigabytes.)</ph> Run benchmarks with different file sizes to find the right balance point for your
|
||
particular data volume.
|
||
</p>
|
||
</section>
|
||
|
||
<section id="perf_cookbook_stats">
|
||
|
||
<title>Gather statistics for all tables used in performance-critical or high-volume join queries</title>
|
||
|
||
<p>
|
||
Gather the statistics with the <codeph>COMPUTE STATS</codeph> statement. See
|
||
<xref href="impala_perf_joins.xml#perf_joins"/> for details.
|
||
</p>
|
||
</section>
|
||
|
||
<section id="perf_cookbook_network">
|
||
|
||
<title>Minimize the overhead of transmitting results back to the client</title>
|
||
|
||
<p>
|
||
Use techniques such as:
|
||
</p>
|
||
|
||
<ul>
|
||
<li>
|
||
Aggregation. If you need to know how many rows match a condition, the total values of matching values
|
||
from some column, the lowest or highest matching value, and so on, call aggregate functions such as
|
||
<codeph>COUNT()</codeph>, <codeph>SUM()</codeph>, and <codeph>MAX()</codeph> in the query rather than
|
||
sending the result set to an application and doing those computations there. Remember that the size of an
|
||
unaggregated result set could be huge, requiring substantial time to transmit across the network.
|
||
</li>
|
||
|
||
<li>
|
||
Filtering. Use all applicable tests in the <codeph>WHERE</codeph> clause of a query to eliminate rows
|
||
that are not relevant, rather than producing a big result set and filtering it using application logic.
|
||
</li>
|
||
|
||
<li>
|
||
<codeph>LIMIT</codeph> clause. If you only need to see a few sample values from a result set, or the top
|
||
or bottom values from a query using <codeph>ORDER BY</codeph>, include the <codeph>LIMIT</codeph> clause
|
||
to reduce the size of the result set rather than asking for the full result set and then throwing most of
|
||
the rows away.
|
||
</li>
|
||
|
||
<li>
|
||
Avoid overhead from pretty-printing the result set and displaying it on the screen. When you retrieve the
|
||
results through <cmdname>impala-shell</cmdname>, use <cmdname>impala-shell</cmdname> options such as
|
||
<codeph>-B</codeph> and <codeph>--output_delimiter</codeph> to produce results without special
|
||
formatting, and redirect output to a file rather than printing to the screen. Consider using
|
||
<codeph>INSERT ... SELECT</codeph> to write the results directly to new files in HDFS. See
|
||
<xref href="impala_shell_options.xml#shell_options"/> for details about the
|
||
<cmdname>impala-shell</cmdname> command-line options.
|
||
</li>
|
||
</ul>
|
||
</section>
|
||
|
||
<section id="perf_cookbook_explain">
|
||
|
||
<title>Verify that your queries are planned in an efficient logical manner</title>
|
||
|
||
<p>
|
||
Examine the <codeph>EXPLAIN</codeph> plan for a query before actually running it. See
|
||
<xref href="impala_explain.xml#explain"/> and <xref href="impala_explain_plan.xml#perf_explain"/> for
|
||
details.
|
||
</p>
|
||
</section>
|
||
|
||
<section id="perf_cookbook_profile">
|
||
|
||
<title>Verify performance characteristics of queries</title>
|
||
|
||
<p>
|
||
Verify that the low-level aspects of I/O, memory usage, network bandwidth, CPU utilization, and so on are
|
||
within expected ranges by examining the query profile for a query after running it. See
|
||
<xref href="impala_explain_plan.xml#perf_profile"/> for details.
|
||
</p>
|
||
</section>
|
||
|
||
<section id="perf_cookbook_os">
|
||
|
||
<title>Use appropriate operating system settings</title>
|
||
|
||
<p>
|
||
See <xref keyref="cdh_admin_performance"/> for recommendations about operating system
|
||
settings that you can change to influence Impala performance. In particular, you might find
|
||
that changing the <codeph>vm.swappiness</codeph> Linux kernel setting to a non-zero value improves
|
||
overall performance.
|
||
</p>
|
||
</section>
|
||
<section id="perf_cookbook_hotspot">
|
||
<title>Hotspot analysis</title>
|
||
|
||
<p>
|
||
In the context of Impala, a hotspot is defined as “an Impala daemon
|
||
that for a single query or a workload is spending a far greater amount
|
||
of time processing data relative to its neighbours”.
|
||
</p>
|
||
|
||
<p>
|
||
Before discussing the options to tackle this issue some background is
|
||
first required to understand how this problem can occur.
|
||
</p>
|
||
|
||
<p>
|
||
By default, the scheduling of scan based plan fragments is
|
||
deterministic. This means that for multiple queries needing to read the
|
||
same block of data, the same node will be picked to host the scan. The
|
||
default scheduling logic does not take into account node workload from
|
||
prior queries. The complexity of materializing a tuple depends on a few
|
||
factors, namely: decoding and decompression. If the tuples are densely
|
||
packed into data pages due to good encoding/compression ratios, there
|
||
will be more work required when reconstructing the data. Each
|
||
compression codec offers different performance tradeoffs and should be
|
||
considered before writing the data. Due to the deterministic nature of
|
||
the scheduler, single nodes can become bottlenecks for highly concurrent
|
||
queries that use the same tables.
|
||
</p>
|
||
|
||
<p>
|
||
If, for example, a Parquet based dataset is tiny, e.g. a small
|
||
dimension table, such that it fits into a single HDFS block (Impala by
|
||
default will create 256 MB blocks when Parquet is used, each containing
|
||
a single row group) then there are a number of options that can be
|
||
considered to resolve the potential scheduling hotspots when querying
|
||
this data:
|
||
</p>
|
||
|
||
<ul>
|
||
<li>
|
||
In <keyword keyref="impala25"/> and higher, the scheduler’s
|
||
deterministic behaviour can be changed using the following query
|
||
options: <codeph>REPLICA_PREFERENCE</codeph> and
|
||
<codeph>RANDOM_REPLICA</codeph>. For a detailed description of each
|
||
of these modes see <keyword keyref="impala-2696">IMPALA-2696</keyword>.
|
||
</li>
|
||
|
||
<li>
|
||
HDFS caching can be used to cache block replicas. This will cause
|
||
the Impala scheduler to randomly pick (from <keyword keyref="impala22"
|
||
/> and higher) a node that is hosting a cached block replica for the
|
||
scan. Note, although HDFS caching has benefits, it serves only to help
|
||
with the reading of raw block data and not cached tuple data, but with
|
||
the right number of cached replicas (by default, HDFS only caches one
|
||
replica), even load distribution can be achieved for smaller
|
||
datasets.
|
||
</li>
|
||
|
||
<li>
|
||
Do not compress the table data. The uncompressed table data spans more
|
||
nodes and eliminates skew caused by compression.
|
||
</li>
|
||
|
||
<li>
|
||
Reduce the Parquet file size via the
|
||
<codeph>PARQUET_FILE_SIZE</codeph> query option when writing the
|
||
table data. Using this approach the data will span more nodes. However
|
||
it’s not recommended to drop the size below 32 MB.
|
||
</li>
|
||
</ul>
|
||
</section>
|
||
|
||
</conbody>
|
||
</concept>
|