<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="concepts">

  <title>Impala Concepts and Architecture</title>
  <titlealts audience="PDF"><navtitle>Concepts and Architecture</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Concepts"/>
      <data name="Category" value="Data Analysts"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Stub Pages"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      The following sections provide background information to help you become productive using Impala and
      its features. Where appropriate, the explanations include context to help you understand how aspects of Impala
      relate to other technologies you might already be familiar with, such as relational database management
      systems and data warehouses, or other Hadoop components such as Hive, HDFS, and HBase.
    </p>

    <p outputclass="toc"/>

  </conbody>

  <!-- These other topics are waiting to be filled in. Could become subtopics or top-level topics depending on the depth of coverage in each case. -->

  <concept id="intro_data_lifecycle" audience="hidden">

    <title>Overview of the Data Lifecycle for Impala</title>

    <conbody/>
  </concept>

  <concept id="intro_etl" audience="hidden">

    <title>Overview of the Extract, Transform, Load (ETL) Process for Impala</title>
    <prolog>
      <metadata>
        <data name="Category" value="ETL"/>
        <data name="Category" value="Ingest"/>
        <data name="Category" value="Concepts"/>
      </metadata>
    </prolog>

    <conbody/>
  </concept>

  <concept id="intro_hadoop_data" audience="hidden">

    <title>How Impala Works with Hadoop Data Files</title>

    <conbody/>
  </concept>

  <concept id="intro_web_ui" audience="hidden">

    <title>Overview of the Impala Web Interface</title>

    <conbody/>
  </concept>

  <concept id="intro_bi" audience="hidden">

    <title>Using Impala with Business Intelligence Tools</title>

    <conbody/>
  </concept>

  <concept id="intro_ha" audience="hidden">

    <title>Overview of Impala Availability and Fault Tolerance</title>

    <conbody/>
  </concept>

  <!-- This is pretty much ready to go. Decide if it should go under "Concepts" or "Performance",
       and if it should be split out into a separate file, and then take out the audience= attribute
       to make it visible.
  -->

  <concept id="intro_llvm" audience="hidden">

    <title>Overview of Impala Runtime Code Generation</title>

    <conbody>

      <!-- Adapted from the CIDR15 paper written by the Impala team. -->

      <p>
        Impala uses <term>LLVM</term> (a compiler library and collection of related tools) to perform just-in-time
        (JIT) compilation within the running <cmdname>impalad</cmdname> process. This runtime code generation
        technique improves query execution times by generating native code optimized for the architecture of each
        host in your particular cluster. Performance gains of 5 times or more are typical for representative
        workloads.
      </p>

      <p>
        Impala uses runtime code generation to produce query-specific versions of functions that are critical to
        performance. In particular, code generation is applied to <term>inner loop</term> functions, that is, those
        that are executed many times (for every tuple) in a given query, and thus constitute a large portion of the
        total time the query takes to execute. For example, when Impala scans a data file, it calls a function to
        parse each record into Impala’s in-memory tuple format. For queries scanning large tables, billions of
        records could result in billions of function calls. This function must therefore be extremely efficient for
        good query performance, and removing even a few instructions from each function call can result in large
        query speedups.
      </p>

      <p>
        Overall, JIT compilation has an effect similar to writing custom code to process a query. For example, it
        eliminates branches, unrolls loops, propagates constants, offsets and pointers, and inlines functions.
        Inlining is especially valuable for functions used internally to evaluate expressions, where the function
        call itself is more expensive than the function body (for example, a function that adds two numbers).
        Inlining functions also increases instruction-level parallelism, and allows the compiler to make further
        optimizations such as subexpression elimination across expressions.
      </p>

      <p>
        Impala generates runtime query code automatically, so you do not need to do anything special to get this
        performance benefit. This technique is most effective for complex and long-running queries that process
        large numbers of rows. If you need to issue a series of short, small queries, you might turn off this
        feature to avoid the overhead of compilation time for each query. In this case, issue the statement
        <codeph>SET DISABLE_CODEGEN=true</codeph> to turn off runtime code generation for the duration of the
        current session.
      </p>
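The tradeoff described above can be sketched numerically. The following is a toy break-even model, not measured Impala behavior; the compile-overhead and per-row constants are hypothetical, chosen only to show why codegen pays off for long scans but not for short queries:

```python
# Illustrative break-even sketch for runtime code generation (codegen).
# All constants here are assumptions for illustration, not Impala measurements.

def total_time_ms(rows, codegen):
    compile_overhead_ms = 150.0 if codegen else 0.0   # one-time JIT cost (assumed)
    per_row_us = 0.2 if codegen else 1.0              # per-row work (assumed 5x gain)
    return compile_overhead_ms + rows * per_row_us / 1000.0

# A short query: the one-time compilation overhead dominates, so codegen loses.
short_query = 10_000
# A long scan: per-row savings dominate, so codegen wins.
long_scan = 100_000_000

print(total_time_ms(short_query, codegen=True) > total_time_ms(short_query, codegen=False))   # True
print(total_time_ms(long_scan, codegen=False) > total_time_ms(long_scan, codegen=True))       # True
```

This is the shape of the tradeoff behind the <codeph>DISABLE_CODEGEN</codeph> advice: for a stream of short queries, the fixed compilation cost is paid repeatedly without enough rows to amortize it.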

      <!--
      <p>
        Without code generation, functions tend to be suboptimal because they must handle situations that cannot
        be predicted in advance. For example, a record-parsing function that only handles integer types will be
        faster at parsing an integer-only file than a function that also handles other data types such as strings
        and floating-point numbers. However, the schemas of the files to be scanned are unknown at compile time,
        and so a general-purpose function must be used, even if at runtime it is known that more limited
        functionality is sufficient.
      </p>

      <p>
        Virtual functions are a large source of runtime overhead. Virtual function calls incur a large performance
        penalty, particularly when the called function is very simple, because the calls cannot be inlined.
        If the type of the object instance is known at runtime, code generation can replace the virtual
        function call with a call directly to the correct function, which can then be inlined. This is especially
        valuable when evaluating expression trees. In Impala (as in many systems), expressions are composed of a
        tree of individual operators and functions.
      </p>

      <p>
        Each type of expression that can appear in a query is implemented internally by overriding a virtual function.
        Many of these expression functions are quite simple, for example, adding two numbers.
        The virtual function call can be more expensive than the function body itself. By resolving the virtual
        function calls with code generation and then inlining the resulting function calls, Impala can evaluate expressions
        directly with no function call overhead. Inlining functions also increases
        instruction-level parallelism, and allows the compiler to make further optimizations such as subexpression
        elimination across expressions.
      </p>
      -->
    </conbody>
  </concept>

  <!-- Same as the previous section: adapted from CIDR paper, ready to externalize after deciding where to go. -->

  <concept audience="hidden" id="intro_io">

    <title>Overview of Impala I/O</title>

    <conbody>

      <p>
        Efficiently retrieving data from HDFS is a challenge for all SQL-on-Hadoop systems. To perform
        data scans from both disk and memory at or near hardware speed, Impala uses an HDFS feature called
        <term>short-circuit local reads</term> to bypass the DataNode protocol when reading from local disk. Impala
        can read at almost disk bandwidth (approximately 100 MB/s per disk) and is typically able to saturate all
        available disks. For example, with 12 disks, Impala is typically capable of sustaining I/O at 1.2 GB/sec.
        Furthermore, <term>HDFS caching</term> allows Impala to access memory-resident data at memory bus speed,
        and saves CPU cycles as there is no need to copy or checksum data blocks within memory.
      </p>
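The aggregate figure quoted above is simple arithmetic on the per-disk rate, which makes it easy to re-derive for your own disk count:

```python
# Aggregate sequential scan bandwidth from the per-disk figure quoted above.
per_disk_mb_s = 100            # approximate sequential read rate per spinning disk
disks = 12
aggregate_gb_s = per_disk_mb_s * disks / 1000.0
print(aggregate_gb_s)          # 1.2, matching the 1.2 GB/sec figure in the text
```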

      <p>
        The I/O manager component interfaces with storage devices to read and write data. The I/O manager assigns a
        fixed number of worker threads per physical disk (currently one thread per rotational disk and eight per
        SSD), providing an asynchronous interface to clients (<term>scanner threads</term>).
      </p>
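The per-disk worker-thread pattern described above can be sketched in miniature. This is only an illustration of the pattern (one queue and worker thread per disk, with non-blocking submission from scanner threads), not Impala's actual C++ DiskIoMgr; the class and method names are invented for this sketch:

```python
# Toy model of an asynchronous per-disk I/O manager: each "disk" gets its own
# request queue and worker thread(s); scanner threads submit reads and continue
# without blocking, receiving results through a callback.
import queue
import threading

class DiskIoManager:
    def __init__(self, num_disks, threads_per_disk=1):
        self.queues = [queue.Queue() for _ in range(num_disks)]
        for q in self.queues:
            for _ in range(threads_per_disk):
                threading.Thread(target=self._worker, args=(q,), daemon=True).start()

    def _worker(self, q):
        while True:
            block_id, done = q.get()
            # A real implementation would issue the disk read here; we fake the bytes.
            done(b"data-for-" + block_id)
            q.task_done()

    def submit(self, disk, block_id, done):
        # Scanner threads enqueue a request on the target disk's queue and return.
        self.queues[disk].put((block_id, done))

mgr = DiskIoManager(num_disks=2)
results = queue.Queue()
mgr.submit(0, b"block-1", results.put)
mgr.submit(1, b"block-2", results.put)
print(sorted(results.get() for _ in range(2)))
# [b'data-for-block-1', b'data-for-block-2']
```

Routing requests through a fixed per-disk queue is what keeps each rotational disk doing one sequential read at a time while still serving many scanner threads concurrently.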
    </conbody>
  </concept>

  <!-- Same as the previous section: adapted from CIDR paper, ready to externalize after deciding where to go. -->

  <!-- Although good idea to get some answers from Henry first. -->

  <concept audience="hidden" id="intro_state_distribution">

    <title>State Distribution</title>

    <conbody>

      <p>
        As a massively parallel database that can run on hundreds of nodes, Impala must coordinate and synchronize
        its metadata across the entire cluster. Impala's symmetric-node architecture means that any node can accept
        and execute queries, and thus each node needs up-to-date versions of the system catalog and knowledge of
        which hosts the <cmdname>impalad</cmdname> daemons run on. To avoid the overhead of TCP connections and
        remote procedure calls to retrieve metadata during query planning, Impala implements a simple
        publish-subscribe service called the <term>statestore</term> to push metadata changes to a set of
        subscribers (the <cmdname>impalad</cmdname> daemons running on all the DataNodes).
      </p>

      <p>
        The statestore maintains a set of topics, which are arrays of <codeph>(<varname>key</varname>,
        <varname>value</varname>, <varname>version</varname>)</codeph> triplets called <term>entries</term>, where
        <varname>key</varname> and <varname>value</varname> are byte arrays, and <varname>version</varname> is a
        64-bit integer. A topic is defined by an application, and so the statestore has no understanding of the
        contents of any topic entry. Topics are persistent through the lifetime of the statestore, but are not
        persisted across service restarts. Processes that receive updates to any topic are called
        <term>subscribers</term>, and express their interest by registering with the statestore at startup and
        providing a list of topics. The statestore responds to registration by sending the subscriber an initial
        topic update for each registered topic, which consists of all the entries currently in that topic.
      </p>

      <!-- Henry: OK, but in practice, what is in these topic messages for Impala? -->

      <p>
        After registration, the statestore periodically sends two kinds of messages to each subscriber. The first
        kind of message is a topic update, and consists of all changes to a topic (new entries, modified entries,
        and deletions) since the last update was successfully sent to the subscriber. Each subscriber maintains a
        per-topic most-recent-version identifier, which allows the statestore to send only the delta between
        updates. In response to a topic update, each subscriber sends a list of changes it intends to make to its
        subscribed topics. Those changes are guaranteed to have been applied by the time the next update is
        received.
      </p>
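The versioned-entry scheme described above can be sketched in miniature. This is a toy model for illustration only, not Impala's statestore implementation (which is C++ using Thrift RPC); the class and method names are invented here:

```python
# Toy model of a statestore topic: entries are (key, value, version) triplets,
# and a subscriber's most-recent-version identifier lets the statestore ship
# only the delta on each periodic update.

class Topic:
    def __init__(self):
        self.entries = {}        # key (bytes) -> (value (bytes), version (int))
        self.next_version = 1    # monotonically increasing per-topic version

    def put(self, key, value):
        self.entries[key] = (value, self.next_version)
        self.next_version += 1

    def delta_since(self, version):
        # Everything newer than the subscriber's most-recent-version identifier.
        return {k: (v, ver) for k, (v, ver) in self.entries.items() if ver > version}

topic = Topic()
topic.put(b"host1", b"alive")
topic.put(b"host2", b"alive")

# Initial topic update: the subscriber has seen nothing (version 0),
# so it receives all entries currently in the topic.
seen = 0
update = topic.delta_since(seen)
seen = max(ver for _, ver in update.values())

# Only host2 changes; the next update carries just that one entry.
topic.put(b"host2", b"failed")
print(topic.delta_since(seen))   # {b'host2': (b'failed', 3)}
```

Because the statestore treats keys and values as opaque byte arrays, this delta mechanism works unchanged whether the topic carries catalog metadata, cluster membership, or load information.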

      <p>
        The second kind of statestore message is a <term>heartbeat</term>, formerly sometimes called
        <term>keepalive</term>. The statestore uses heartbeat messages to maintain the connection to each
        subscriber, which would otherwise time out its subscription and attempt to re-register.
      </p>

      <p>
        Prior to Impala 2.0, both kinds of communication were combined in a single kind of message. Because these
        messages could be very large in instances with thousands of tables, partitions, data files, and so on,
        Impala 2.0 and higher divides the types of messages so that the small heartbeat pings can be transmitted
        and acknowledged quickly, increasing the reliability of the statestore mechanism that detects when Impala
        nodes become unavailable.
      </p>

      <p>
        If the statestore detects a failed subscriber (for example, by repeated failed heartbeat deliveries), it
        stops sending updates to that node.
        <!-- Henry: what are examples of these transient topic entries? -->
        Some topic entries are marked as transient, meaning that if their owning subscriber fails, they are
        removed.
      </p>

      <p>
        Although the asynchronous nature of this mechanism means that metadata updates might take some time to
        propagate across the entire cluster, that does not affect the consistency of query planning or results.
        Each query is planned and coordinated by a particular node, so as long as the coordinator node is aware of
        the existence of the relevant tables, data files, and so on, it can distribute the query work to other
        nodes even if those other nodes have not received the latest metadata updates.
        <!-- Henry: need another example here of what's in a topic, e.g. is it the list of available tables? -->
        <!--
        For example, query planning is performed on a single node based on the
        catalog metadata topic, and once a full plan has been computed, all information required to execute that
        plan is distributed directly to the executing nodes.
        There is no requirement that an executing node should
        know about the same version of the catalog metadata topic.
        -->
      </p>

      <p>
        We have found that the statestore process with default settings scales well to medium-sized clusters, and
        can serve our largest deployments with some configuration changes.
        <!-- Henry: elaborate on the configuration changes. -->
      </p>

      <p>
        <!-- Henry: other examples like load information? How is load information used? -->
        The statestore does not persist any metadata to disk: all current metadata is pushed to the statestore by
        its subscribers (for example, load information). Therefore, should the statestore restart, its state can be
        recovered during the initial subscriber registration phase. Alternatively, if the machine that the statestore is
        running on fails, a new statestore process can be started elsewhere, and subscribers can fail over to it.
        There is no built-in failover mechanism in Impala; instead, deployments commonly use a retargetable DNS
        entry to force subscribers to automatically move to the new process instance.
        <!-- Henry: translate that last sentence into instructions / guidelines. -->
      </p>
    </conbody>
  </concept>
</concept>