mirror of
https://github.com/apache/impala.git
synced 2025-12-19 09:58:28 -05:00
For this change to land in master, the audience="hidden" code review needs to be completed first. Otherwise, the doc build would still work but the audience="hidden" content would be visible rather than hidden as desired.

Some work happening in parallel might introduce additional instances of audience="Cloudera". I suggest addressing those in a followup CR so this global change can land quickly.

Since the changes apply across so many different files, but are so narrow in scope, I suggest that the way to validate (check that no extraneous changes were introduced accidentally) is to diff just the changed lines:

git diff -U0 HEAD^ HEAD

In patch set 2, I updated other topics marked audience="Cloudera" by CRs that were pushed in the meantime.

Change-Id: Ic93d89da77e1f51bbf548a522d98d0c4e2fb31c8
Reviewed-on: http://gerrit.cloudera.org:8080/5613
Reviewed-by: John Russell <jrussell@cloudera.com>
Tested-by: Impala Public Jenkins
210 lines
7.4 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="performance">
<title>Tuning Impala for Performance</title>
<titlealts audience="PDF"><navtitle>Performance Tuning</navtitle></titlealts>
<prolog>
  <metadata>
    <data name="Category" value="Impala"/>
    <data name="Category" value="Performance"/>
    <data name="Category" value="Databases"/>
    <data name="Category" value="SQL"/>
    <data name="Category" value="Querying"/>
    <data name="Category" value="Developers"/>
    <!-- Like Impala Administration, this page has a fair bit of info already, but it could benefit from wiki-style embedding of intro text from those other pages. -->
    <data name="Category" value="Stub Pages"/>
  </metadata>
</prolog>

<conbody>

<p>
  The following sections explain the factors affecting the performance of Impala features, and procedures for
  tuning, monitoring, and benchmarking Impala queries and other SQL operations.
</p>

<p>
  This section also describes techniques for maximizing Impala scalability. Scalability is tied to performance:
  it means that performance remains high as the system workload increases. For example, reducing the disk I/O
  performed by a query can speed up an individual query, and at the same time improve scalability by making it
  practical to run more queries simultaneously. Sometimes, an optimization technique improves scalability more
  than performance. For example, reducing memory usage for a query might not change the query performance much,
  but might improve scalability by allowing more Impala queries or other kinds of jobs to run at the same time
  without running out of memory.
</p>

<note>
  <p>
    Before starting any performance tuning or benchmarking, make sure your system meets the recommended
    minimum hardware requirements from <xref href="impala_prereqs.xml#prereqs_hardware"/> and uses the
    recommended software settings from <xref href="impala_config_performance.xml#config_performance"/>.
  </p>
</note>

<ul>
  <li>
    <xref href="impala_partitioning.xml#partitioning"/>. This technique physically divides the data based on
    the different values in frequently queried columns, allowing queries to skip reading a large percentage of
    the data in a table.
  </li>

  <li>
    <xref href="impala_perf_joins.xml#perf_joins"/>. Joins are the main class of queries that you can tune at
    the SQL level, as opposed to changing physical factors such as the file format or the hardware
    configuration. The related topics <xref href="impala_perf_stats.xml#perf_column_stats"/> and
    <xref href="impala_perf_stats.xml#perf_table_stats"/> are important primarily for join performance.
  </li>

  <li>
    <xref href="impala_perf_stats.xml#perf_table_stats"/> and
    <xref href="impala_perf_stats.xml#perf_column_stats"/>. Gathering table and column statistics, using the
    <codeph>COMPUTE STATS</codeph> statement, helps Impala automatically optimize the performance for join
    queries, without requiring changes to SQL query statements. (This process is greatly simplified in Impala
    1.2.2 and higher, because the <codeph>COMPUTE STATS</codeph> statement gathers both kinds of statistics in
    one operation, and does not require any setup and configuration as was previously necessary for the
    <codeph>ANALYZE TABLE</codeph> statement in Hive.)
  </li>

  <li>
    <xref href="impala_perf_testing.xml#performance_testing"/>. Do some post-setup testing to ensure Impala is
    using optimal settings for performance, before conducting any benchmark tests.
  </li>

  <li>
    <xref href="impala_perf_benchmarking.xml#perf_benchmarks"/>. The configuration and sample data that you use
    for initial experiments with Impala are often not appropriate for doing performance tests.
  </li>

  <li>
    <xref href="impala_perf_resources.xml#mem_limits"/>. The more memory Impala can use, the better query
    performance you can expect. In a cluster running other kinds of workloads as well, you must make tradeoffs
    to make sure all Hadoop components have enough memory to perform well, so you might cap the memory that
    Impala can use.
  </li>

  <li rev="1.2" audience="hidden">
    <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>. Impala can use the HDFS caching feature to pin
    frequently accessed data in memory, reducing disk I/O.
  </li>

  <li rev="2.2.0">
    <xref href="impala_s3.xml#s3"/>. Queries against data stored in the Amazon Simple Storage Service (S3)
    have different performance characteristics than queries against data stored in HDFS.
  </li>
</ul>

<p outputclass="toc"/>

<p conref="../shared/impala_common.xml#common/cookbook_blurb"/>

</conbody>

<!-- Empty/hidden stub sections that might be worth expanding later. -->

<concept id="perf_network" audience="hidden">

  <title>Network Traffic</title>

  <conbody/>
</concept>

<concept id="perf_partition_schema" audience="hidden">

  <title>Designing Partitioned Tables</title>

  <conbody/>
</concept>

<concept id="perf_partition_query" audience="hidden">

  <title>Queries on Partitioned Tables</title>

  <conbody/>
</concept>

<concept id="perf_monitoring" audience="hidden">

  <title>Monitoring Performance through the Impala Web Interface</title>
  <prolog>
    <metadata>
      <data name="Category" value="Monitoring"/>
    </metadata>
  </prolog>

  <conbody/>
</concept>

<concept id="perf_query_coord" audience="hidden">

  <title>Query Coordination</title>

  <conbody/>
</concept>

<concept id="perf_bottlenecks" audience="hidden">

  <title>Performance Bottlenecks</title>

  <conbody/>
</concept>

<concept id="perf_long_queries" audience="hidden">

  <title>Managing Long-Running Queries</title>

  <conbody/>
</concept>

<concept id="perf_load" audience="hidden">

  <title>Performance Considerations for Loading Data</title>

  <conbody/>
</concept>

<concept id="perf_file_formats" audience="hidden">

  <title>Performance Considerations for File Formats</title>

  <conbody/>
</concept>

<concept id="perf_compression" audience="hidden">

  <title>Performance Considerations for Compression</title>
  <prolog>
    <metadata>
      <data name="Category" value="Compression"/>
    </metadata>
  </prolog>

  <conbody/>
</concept>

<concept id="perf_codegen" audience="hidden">

  <title>Native Code Generation</title>

  <conbody/>
</concept>
</concept>