<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="concepts">

  <title>Impala Concepts and Architecture</title>
  <titlealts audience="PDF"><navtitle>Concepts and Architecture</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Concepts"/>
      <data name="Category" value="Data Analysts"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Stub Pages"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      The following sections provide background information to help you become productive using Impala and
      its features. Where appropriate, the explanations include context to help you understand how aspects of Impala
      relate to other technologies you might already be familiar with, such as relational database management
      systems and data warehouses, or other Hadoop components such as Hive, HDFS, and HBase.
    </p>

    <p outputclass="toc"/>

  </conbody>

  <!-- These other topics are waiting to be filled in. Could become subtopics or top-level topics depending on the depth of coverage in each case. -->

  <concept id="intro_data_lifecycle" audience="hidden">

    <title>Overview of the Data Lifecycle for Impala</title>

    <conbody/>
  </concept>

  <concept id="intro_etl" audience="hidden">

    <title>Overview of the Extract, Transform, Load (ETL) Process for Impala</title>
    <prolog>
      <metadata>
        <data name="Category" value="ETL"/>
        <data name="Category" value="Ingest"/>
        <data name="Category" value="Concepts"/>
      </metadata>
    </prolog>

    <conbody/>
  </concept>

  <concept id="intro_hadoop_data" audience="hidden">

    <title>How Impala Works with Hadoop Data Files</title>

    <conbody/>
  </concept>

  <concept id="intro_web_ui" audience="hidden">

    <title>Overview of the Impala Web Interface</title>

    <conbody/>
  </concept>

  <concept id="intro_bi" audience="hidden">

    <title>Using Impala with Business Intelligence Tools</title>

    <conbody/>
  </concept>

  <concept id="intro_ha" audience="hidden">

    <title>Overview of Impala Availability and Fault Tolerance</title>

    <conbody/>
  </concept>

  <!-- This is pretty much ready to go. Decide if it should go under "Concepts" or "Performance",
       and if it should be split out into a separate file, and then take out the audience= attribute
       to make it visible.
  -->

  <concept id="intro_llvm" audience="hidden">

    <title>Overview of Impala Runtime Code Generation</title>

    <conbody>

      <!-- Adapted from the CIDR15 paper written by the Impala team. -->

      <p>
        Impala uses <term>LLVM</term> (a compiler library and collection of related tools) to perform just-in-time
        (JIT) compilation within the running <cmdname>impalad</cmdname> process. This runtime code generation
        technique improves query execution times by generating native code optimized for the architecture of each
        host in your particular cluster. Performance gains of 5 times or more are typical for representative
        workloads.
      </p>

      <p>
        Impala uses runtime code generation to produce query-specific versions of functions that are critical to
        performance. In particular, code generation is applied to <term>inner loop</term> functions, that is, those
        that are executed many times (for every tuple) in a given query, and thus constitute a large portion of the
        total time the query takes to execute. For example, when Impala scans a data file, it calls a function to
        parse each record into Impala’s in-memory tuple format. For queries scanning large tables, billions of
        records could result in billions of function calls. This function must therefore be extremely efficient for
        good query performance, and removing even a few instructions from each function call can result in large
        query speedups.
      </p>

      <p>
        Overall, JIT compilation has an effect similar to writing custom code to process a query. For example, it
        eliminates branches, unrolls loops, propagates constants, offsets and pointers, and inlines functions.
        Inlining is especially valuable for functions used internally to evaluate expressions, where the function
        call itself is more expensive than the function body (for example, a function that adds two numbers).
        Inlining functions also increases instruction-level parallelism, and allows the compiler to make further
        optimizations such as subexpression elimination across expressions.
      </p>

      <p>
        Impala generates runtime query code automatically, so you do not need to do anything special to get this
        performance benefit. This technique is most effective for complex and long-running queries that process
        large numbers of rows. If you need to issue a series of short, small queries, you might turn off this
        feature to avoid the overhead of compilation time for each query. In this case, issue the statement
        <codeph>SET DISABLE_CODEGEN=true</codeph> to turn off runtime code generation for the duration of the
        current session.
      </p>
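The tradeoff described above can be sketched numerically. The following is a toy break-even model, not measured Impala behavior; the compile-overhead and per-row constants are hypothetical, chosen only to show why codegen pays off for long scans but not for short queries:

```python
# Illustrative break-even sketch for runtime code generation (codegen).
# All constants here are assumptions for illustration, not Impala measurements.

def total_time_ms(rows, codegen):
    compile_overhead_ms = 150.0 if codegen else 0.0   # one-time JIT cost (assumed)
    per_row_us = 0.2 if codegen else 1.0              # per-row work (assumed 5x gain)
    return compile_overhead_ms + rows * per_row_us / 1000.0

# A short query: the one-time compilation overhead dominates, so codegen loses.
short_query = 10_000
# A long scan: per-row savings dominate, so codegen wins.
long_scan = 100_000_000

print(total_time_ms(short_query, codegen=True) > total_time_ms(short_query, codegen=False))   # True
print(total_time_ms(long_scan, codegen=False) > total_time_ms(long_scan, codegen=True))       # True
```

This is the shape of the tradeoff behind the <codeph>DISABLE_CODEGEN</codeph> advice: for a stream of short queries, the fixed compilation cost is paid repeatedly without enough rows to amortize it.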

      <!--
      <p>
        Without code generation, functions tend to be suboptimal because they must handle situations that cannot
        be predicted in advance. For example, a record-parsing function that only handles integer types will be
        faster at parsing an integer-only file than a function that also handles other data types such as strings
        and floating-point numbers. However, the schemas of the files to be scanned are unknown at compile time,
        and so a general-purpose function must be used, even if at runtime it is known that more limited
        functionality is sufficient.
      </p>

      <p>
        Virtual functions are a large source of runtime overhead. Virtual function calls incur a large performance
        penalty, particularly when the called function is very simple, because the calls cannot be inlined.
        If the type of the object instance is known at runtime, code generation can replace the virtual
        function call with a call directly to the correct function, which can then be inlined. This is especially
        valuable when evaluating expression trees. In Impala (as in many systems), expressions are composed of a
        tree of individual operators and functions.
      </p>

      <p>
        Each type of expression that can appear in a query is implemented internally by overriding a virtual function.
        Many of these expression functions are quite simple, for example, adding two numbers.
        The virtual function call can be more expensive than the function body itself. By resolving the virtual
        function calls with code generation and then inlining the resulting function calls, Impala can evaluate expressions
        directly with no function call overhead. Inlining functions also increases
        instruction-level parallelism, and allows the compiler to make further optimizations such as subexpression
        elimination across expressions.
      </p>
      -->
    </conbody>
  </concept>

  <!-- Same as the previous section: adapted from CIDR paper, ready to externalize after deciding where to go. -->

  <concept audience="hidden" id="intro_io">

    <title>Overview of Impala I/O</title>

    <conbody>

      <p>
        Efficiently retrieving data from HDFS is a challenge for all SQL-on-Hadoop systems. To perform
        data scans from both disk and memory at or near hardware speed, Impala uses an HDFS feature called
        <term>short-circuit local reads</term> to bypass the DataNode protocol when reading from local disk. Impala
        can read at almost disk bandwidth (approximately 100 MB/s per disk) and is typically able to saturate all
        available disks. For example, with 12 disks, Impala is typically capable of sustaining I/O at 1.2 GB/sec.
        Furthermore, <term>HDFS caching</term> allows Impala to access memory-resident data at memory bus speed,
        and saves CPU cycles as there is no need to copy or checksum data blocks within memory.
      </p>
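The aggregate figure quoted above is simple arithmetic on the per-disk rate, which makes it easy to re-derive for your own disk count:

```python
# Aggregate sequential scan bandwidth from the per-disk figure quoted above.
per_disk_mb_s = 100            # approximate sequential read rate per spinning disk
disks = 12
aggregate_gb_s = per_disk_mb_s * disks / 1000.0
print(aggregate_gb_s)          # 1.2, matching the 1.2 GB/sec figure in the text
```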

      <p>
        The I/O manager component interfaces with storage devices to read and write data. The I/O manager assigns a
        fixed number of worker threads per physical disk (currently one thread per rotational disk and eight per
        SSD), providing an asynchronous interface to clients (<term>scanner threads</term>).
      </p>
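The per-disk worker-thread pattern described above can be sketched in miniature. This is only an illustration of the pattern (one queue and worker thread per disk, with non-blocking submission from scanner threads), not Impala's actual C++ DiskIoMgr; the class and method names are invented for this sketch:

```python
# Toy model of an asynchronous per-disk I/O manager: each "disk" gets its own
# request queue and worker thread(s); scanner threads submit reads and continue
# without blocking, receiving results through a callback.
import queue
import threading

class DiskIoManager:
    def __init__(self, num_disks, threads_per_disk=1):
        self.queues = [queue.Queue() for _ in range(num_disks)]
        for q in self.queues:
            for _ in range(threads_per_disk):
                threading.Thread(target=self._worker, args=(q,), daemon=True).start()

    def _worker(self, q):
        while True:
            block_id, done = q.get()
            # A real implementation would issue the disk read here; we fake the bytes.
            done(b"data-for-" + block_id)
            q.task_done()

    def submit(self, disk, block_id, done):
        # Scanner threads enqueue a request on the target disk's queue and return.
        self.queues[disk].put((block_id, done))

mgr = DiskIoManager(num_disks=2)
results = queue.Queue()
mgr.submit(0, b"block-1", results.put)
mgr.submit(1, b"block-2", results.put)
print(sorted(results.get() for _ in range(2)))
# [b'data-for-block-1', b'data-for-block-2']
```

Routing requests through a fixed per-disk queue is what keeps each rotational disk doing one sequential read at a time while still serving many scanner threads concurrently.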
    </conbody>
  </concept>

  <!-- Same as the previous section: adapted from CIDR paper, ready to externalize after deciding where to go. -->

  <!-- Although good idea to get some answers from Henry first. -->

  <concept audience="hidden" id="intro_state_distribution">

    <title>State Distribution</title>

    <conbody>

      <p>
        As a massively parallel database that can run on hundreds of nodes, Impala must coordinate and synchronize
        its metadata across the entire cluster. Impala's symmetric-node architecture means that any node can accept
        and execute queries, and thus each node needs up-to-date versions of the system catalog and knowledge of
        which hosts the <cmdname>impalad</cmdname> daemons run on. To avoid the overhead of TCP connections and
        remote procedure calls to retrieve metadata during query planning, Impala implements a simple
        publish-subscribe service called the <term>statestore</term> to push metadata changes to a set of
        subscribers (the <cmdname>impalad</cmdname> daemons running on all the DataNodes).
      </p>

      <p>
        The statestore maintains a set of topics, which are arrays of <codeph>(<varname>key</varname>,
        <varname>value</varname>, <varname>version</varname>)</codeph> triplets called <term>entries</term>, where
        <varname>key</varname> and <varname>value</varname> are byte arrays, and <varname>version</varname> is a
        64-bit integer. A topic is defined by an application, and so the statestore has no understanding of the
        contents of any topic entry. Topics are persistent through the lifetime of the statestore, but are not
        persisted across service restarts. Processes that receive updates to any topic are called
        <term>subscribers</term>, and express their interest by registering with the statestore at startup and
        providing a list of topics. The statestore responds to registration by sending the subscriber an initial
        topic update for each registered topic, which consists of all the entries currently in that topic.
      </p>

      <!-- Henry: OK, but in practice, what is in these topic messages for Impala? -->

      <p>
        After registration, the statestore periodically sends two kinds of messages to each subscriber. The first
        kind of message is a topic update, and consists of all changes to a topic (new entries, modified entries,
        and deletions) since the last update was successfully sent to the subscriber. Each subscriber maintains a
        per-topic most-recent-version identifier, which allows the statestore to send only the delta between
        updates. In response to a topic update, each subscriber sends a list of changes it intends to make to its
        subscribed topics. Those changes are guaranteed to have been applied by the time the next update is
        received.
      </p>
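The versioned-entry scheme described above can be sketched in miniature. This is a toy model for illustration only, not Impala's statestore implementation (which is C++ using Thrift RPC); the class and method names are invented here:

```python
# Toy model of a statestore topic: entries are (key, value, version) triplets,
# and a subscriber's most-recent-version identifier lets the statestore ship
# only the delta on each periodic update.

class Topic:
    def __init__(self):
        self.entries = {}        # key (bytes) -> (value (bytes), version (int))
        self.next_version = 1    # monotonically increasing per-topic version

    def put(self, key, value):
        self.entries[key] = (value, self.next_version)
        self.next_version += 1

    def delta_since(self, version):
        # Everything newer than the subscriber's most-recent-version identifier.
        return {k: (v, ver) for k, (v, ver) in self.entries.items() if ver > version}

topic = Topic()
topic.put(b"host1", b"alive")
topic.put(b"host2", b"alive")

# Initial topic update: the subscriber has seen nothing (version 0),
# so it receives all entries currently in the topic.
seen = 0
update = topic.delta_since(seen)
seen = max(ver for _, ver in update.values())

# Only host2 changes; the next update carries just that one entry.
topic.put(b"host2", b"failed")
print(topic.delta_since(seen))   # {b'host2': (b'failed', 3)}
```

Because the statestore treats keys and values as opaque byte arrays, this delta mechanism works unchanged whether the topic carries catalog metadata, cluster membership, or load information.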

      <p>
        The second kind of statestore message is a <term>heartbeat</term>, formerly sometimes called
        <term>keepalive</term>. The statestore uses heartbeat messages to maintain the connection to each
        subscriber, which would otherwise time out its subscription and attempt to re-register.
      </p>

      <p>
        Prior to Impala 2.0, both kinds of communication were combined in a single kind of message. Because these
        messages could be very large in instances with thousands of tables, partitions, data files, and so on,
        Impala 2.0 and higher divides the types of messages so that the small heartbeat pings can be transmitted
        and acknowledged quickly, increasing the reliability of the statestore mechanism that detects when Impala
        nodes become unavailable.
      </p>

      <p>
        If the statestore detects a failed subscriber (for example, by repeated failed heartbeat deliveries), it
        stops sending updates to that node.
        <!-- Henry: what are examples of these transient topic entries? -->
        Some topic entries are marked as transient, meaning that if their owning subscriber fails, they are
        removed.
      </p>

      <p>
        Although the asynchronous nature of this mechanism means that metadata updates might take some time to
        propagate across the entire cluster, that does not affect the consistency of query planning or results.
        Each query is planned and coordinated by a particular node, so as long as the coordinator node is aware of
        the existence of the relevant tables, data files, and so on, it can distribute the query work to other
        nodes even if those other nodes have not received the latest metadata updates.
        <!-- Henry: need another example here of what's in a topic, e.g. is it the list of available tables? -->
        <!--
        For example, query planning is performed on a single node based on the
        catalog metadata topic, and once a full plan has been computed, all information required to execute that
        plan is distributed directly to the executing nodes.
        There is no requirement that an executing node should
        know about the same version of the catalog metadata topic.
        -->
      </p>

      <p>
        We have found that the statestore process with default settings scales well to medium-sized clusters, and
        can serve our largest deployments with some configuration changes.
        <!-- Henry: elaborate on the configuration changes. -->
      </p>

      <p>
        <!-- Henry: other examples like load information? How is load information used? -->
        The statestore does not persist any metadata to disk: all current metadata is pushed to the statestore by
        its subscribers (for example, load information). Therefore, should the statestore restart, its state can be
        recovered during the initial subscriber registration phase. Alternatively, if the machine that the statestore is
        running on fails, a new statestore process can be started elsewhere, and subscribers can fail over to it.
        There is no built-in failover mechanism in Impala; instead, deployments commonly use a retargetable DNS
        entry to force subscribers to automatically move to the new process instance.
        <!-- Henry: translate that last sentence into instructions / guidelines. -->
      </p>
    </conbody>
  </concept>
</concept>