impala/docs/topics/impala_perf_cookbook.xml

<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="perf_cookbook">

  <title>Impala Performance Guidelines and Best Practices</title>
  <titlealts audience="PDF"><navtitle>Performance Best Practices</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Performance"/>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Planning"/>
      <data name="Category" value="Proof of Concept"/>
      <data name="Category" value="Guidelines"/>
      <data name="Category" value="Best Practices"/>
      <data name="Category" value="Proof of Concept"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Data Analysts"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      Here are performance guidelines and best practices that you can use during planning, experimentation, and
      performance tuning for an Impala-enabled <keyword keyref="distro"/> cluster. All of this information is also available in more
      detail elsewhere in the Impala documentation; it is gathered together here to serve as a cookbook and
      emphasize which performance techniques typically provide the highest return on investment
    </p>

    <p outputclass="toc inpage"/>

    <section id="perf_cookbook_file_format">

      <title>Choose the appropriate file format for the data</title>

      <p>
        Typically, for large volumes of data (multiple gigabytes per table or partition), the Parquet file format
        performs best because of its combination of columnar storage layout, large I/O request size, and
        compression and encoding. See <xref href="impala_file_formats.xml#file_formats"/> for comparisons of all
        file formats supported by Impala, and <xref href="impala_parquet.xml#parquet"/> for details about the
        Parquet file format.
      </p>

      <note>
        For smaller volumes of data, a few gigabytes or less for each table or partition, you might not see
        significant performance differences between file formats. At small data volumes, reduced I/O from an
        efficient compressed file format can be counterbalanced by reduced opportunity for parallel execution. When
        planning for a production deployment or conducting benchmarks, always use realistic data volumes to get a
        true picture of performance and scalability.
      </note>
    </section>

    <section id="perf_cookbook_small_files">

      <title>Avoid data ingestion processes that produce many small files</title>

      <p>
        When producing data files outside of Impala, prefer either text format or Avro, where you can build up the
        files row by row. Once the data is in Impala, you can convert it to the more efficient Parquet format and
        split into multiple data files using a single <codeph>INSERT ... SELECT</codeph> statement. Or, if you have
        the infrastructure to produce multi-megabyte Parquet files as part of your data preparation process, do
        that and skip the conversion step inside Impala.
      </p>

      <p>
        Always use <codeph>INSERT ... SELECT</codeph> to copy significant volumes of data from table to table
        within Impala. Avoid <codeph>INSERT ... VALUES</codeph> for any substantial volume of data or
        performance-critical tables, because each such statement produces a separate tiny data file. See
        <xref href="impala_insert.xml#insert"/> for examples of the <codeph>INSERT ... SELECT</codeph> syntax.
      </p>

      <p>
        For example, if you have thousands of partitions in a Parquet table, each with less than
        <ph rev="parquet_block_size">256 MB</ph> of data, consider partitioning in a less granular way, such as by
        year / month rather than year / month / day. If an inefficient data ingestion process produces thousands of
        data files in the same table or partition, consider compacting the data by performing an <codeph>INSERT ...
        SELECT</codeph> to copy all the data to a different table; the data will be reorganized into a smaller
        number of larger files by this process.
      </p>
    </section>

    <section id="perf_cookbook_partitioning">

      <title>Choose partitioning granularity based on actual data volume</title>

      <p>
        Partitioning is a technique that physically divides the data based on values of one or more columns, such
        as by year, month, day, region, city, section of a web site, and so on. When you issue queries that request
        a specific value or range of values for the partition key columns, Impala can avoid reading the irrelevant
        data, potentially yielding a huge savings in disk I/O.
      </p>

      <p>
        When deciding which column(s) to use for partitioning, choose the right level of granularity. For example,
        should you partition by year, month, and day, or only by year and month? Choose a partitioning strategy
        that puts at least <ph rev="parquet_block_size">256 MB</ph> of data in each partition, to take advantage of
        HDFS bulk I/O and Impala distributed queries.
      </p>

      <p>
        Over-partitioning can also cause query planning to take longer than necessary, as Impala prunes the
        unnecessary partitions. Ideally, keep the number of partitions in the table under 30 thousand.
      </p>

      <p>
        When preparing data files to go in a partition directory, create several large files rather than many small
        ones. If you receive data in the form of many small files and have no control over the input format,
        consider using the <codeph>INSERT ... SELECT</codeph> syntax to copy data from one table or partition to
        another, which compacts the files into a relatively small number (based on the number of nodes in the
        cluster).
      </p>

      <p>
        If you need to reduce the overall number of partitions and increase the amount of data in each partition,
        first look for partition key columns that are rarely referenced or are referenced in non-critical queries
        (not subject to an SLA). For example, your web site log data might be partitioned by year, month, day, and
        hour, but if most queries roll up the results by day, perhaps you only need to partition by year, month,
        and day.
      </p>

      <p>
        If you need to reduce the granularity even more, consider creating <q>buckets</q>, computed values
        corresponding to different sets of partition key values. For example, you can use the
        <codeph>TRUNC()</codeph> function with a <codeph>TIMESTAMP</codeph> column to group date and time values
        based on intervals such as week or quarter. See
        <xref href="impala_datetime_functions.xml#datetime_functions"/> for details.
      </p>

      <p>
        See <xref href="impala_partitioning.xml#partitioning"/> for full details and performance considerations for
        partitioning.
      </p>
    </section>

    <section id="perf_cookbook_partition_keys">

      <title>Use smallest appropriate integer types for partition key columns</title>

      <p>
        Although it is tempting to use strings for partition key columns, since those values are turned into HDFS
        directory names anyway, you can minimize memory usage by using numeric values for common partition key
        fields such as <codeph>YEAR</codeph>, <codeph>MONTH</codeph>, and <codeph>DAY</codeph>. Use the smallest
        integer type that holds the appropriate range of values, typically <codeph>TINYINT</codeph> for
        <codeph>MONTH</codeph> and <codeph>DAY</codeph>, and <codeph>SMALLINT</codeph> for <codeph>YEAR</codeph>.
        Use the <codeph>EXTRACT()</codeph> function to pull out individual date and time fields from a
        <codeph>TIMESTAMP</codeph> value, and <codeph>CAST()</codeph> the return value to the appropriate integer
        type.
      </p>
    </section>

    <section id="perf_cookbook_parquet_block_size">

      <title>Choose an appropriate Parquet block size</title>

      <p rev="parquet_block_size">
        By default, the Impala <codeph>INSERT ... SELECT</codeph> statement creates Parquet files with a 256 MB
        block size. (This default was changed in Impala 2.0. Formerly, the limit was 1 GB, but Impala made
        conservative estimates about compression, resulting in files that were smaller than 1 GB.)
      </p>

      <p>
        Each Parquet file written by Impala is a single block, allowing the whole file to be processed as a unit by a single host.
        As you copy Parquet files into HDFS or between HDFS filesystems, use <codeph>hdfs dfs -pb</codeph> to preserve the original
        block size.
      </p>

      <p>
        If there is only one or a few data block in your Parquet table, or in a partition that is the only one
        accessed by a query, then you might experience a slowdown for a different reason: not enough data to take
        advantage of Impala's parallel distributed queries. Each data block is processed by a single core on one of
        the DataNodes. In a 100-node cluster of 16-core machines, you could potentially process thousands of data
        files simultaneously. You want to find a sweet spot between <q>many tiny files</q> and <q>single giant
        file</q> that balances bulk I/O and parallel processing. You can set the <codeph>PARQUET_FILE_SIZE</codeph>
        query option before doing an <codeph>INSERT ... SELECT</codeph> statement to reduce the size of each
        generated Parquet file. <ph rev="2.0.0">(Specify the file size as an absolute number of bytes, or in Impala
        2.0 and later, in units ending with <codeph>m</codeph> for megabytes or <codeph>g</codeph> for
        gigabytes.)</ph> Run benchmarks with different file sizes to find the right balance point for your
        particular data volume.
      </p>
    </section>

    <section id="perf_cookbook_stats">

      <title>Gather statistics for all tables used in performance-critical or high-volume join queries</title>

      <p>
        Gather the statistics with the <codeph>COMPUTE STATS</codeph> statement. See
        <xref href="impala_perf_joins.xml#perf_joins"/> for details.
      </p>
    </section>

    <section id="perf_cookbook_network">

      <title>Minimize the overhead of transmitting results back to the client</title>

      <p>
        Use techniques such as:
      </p>

      <ul>
        <li>
          Aggregation. If you need to know how many rows match a condition, the total values of matching values
          from some column, the lowest or highest matching value, and so on, call aggregate functions such as
          <codeph>COUNT()</codeph>, <codeph>SUM()</codeph>, and <codeph>MAX()</codeph> in the query rather than
          sending the result set to an application and doing those computations there. Remember that the size of an
          unaggregated result set could be huge, requiring substantial time to transmit across the network.
        </li>

        <li>
          Filtering. Use all applicable tests in the <codeph>WHERE</codeph> clause of a query to eliminate rows
          that are not relevant, rather than producing a big result set and filtering it using application logic.
        </li>

        <li>
          <codeph>LIMIT</codeph> clause. If you only need to see a few sample values from a result set, or the top
          or bottom values from a query using <codeph>ORDER BY</codeph>, include the <codeph>LIMIT</codeph> clause
          to reduce the size of the result set rather than asking for the full result set and then throwing most of
          the rows away.
        </li>

        <li>
          Avoid overhead from pretty-printing the result set and displaying it on the screen. When you retrieve the
          results through <cmdname>impala-shell</cmdname>, use <cmdname>impala-shell</cmdname> options such as
          <codeph>-B</codeph> and <codeph>--output_delimiter</codeph> to produce results without special
          formatting, and redirect output to a file rather than printing to the screen. Consider using
          <codeph>INSERT ... SELECT</codeph> to write the results directly to new files in HDFS. See
          <xref href="impala_shell_options.xml#shell_options"/> for details about the
          <cmdname>impala-shell</cmdname> command-line options.
        </li>
      </ul>
    </section>

    <section id="perf_cookbook_explain">

      <title>Verify that your queries are planned in an efficient logical manner</title>

      <p>
        Examine the <codeph>EXPLAIN</codeph> plan for a query before actually running it. See
        <xref href="impala_explain.xml#explain"/> and <xref href="impala_explain_plan.xml#perf_explain"/> for
        details.
      </p>
    </section>

    <section id="perf_cookbook_profile">

      <title>Verify performance characteristics of queries</title>

      <p>
        Verify that the low-level aspects of I/O, memory usage, network bandwidth, CPU utilization, and so on are
        within expected ranges by examining the query profile for a query after running it. See
        <xref href="impala_explain_plan.xml#perf_profile"/> for details.
      </p>
    </section>

    <section id="perf_cookbook_os">

      <title>Use appropriate operating system settings</title>

      <p>
        See <xref keyref="cdh_admin_performance"/> for recommendations about operating system
        settings that you can change to influence Impala performance. In particular, you might find
        that changing the <codeph>vm.swappiness</codeph> Linux kernel setting to a non-zero value improves
        overall performance.
      </p>
    </section>
    <section id="perf_cookbook_hotspot">
      <title>Hotspot analysis</title>

      <p>
        In the context of Impala, a hotspot is defined as “an Impala daemon
        that for a single query or a workload is spending a far greater amount
        of time processing data relative to its neighbours”.
      </p>

      <p>
        Before discussing the options to tackle this issue some background is
        first required to understand how this problem can occur.
      </p>

      <p>
        By default, the scheduling of scan based plan fragments is
        deterministic. This means that for multiple queries needing to read the
        same block of data, the same node will be picked to host the scan. The
        default scheduling logic does not take into account node workload from
        prior queries. The complexity of materializing a tuple depends on a few
        factors, namely: decoding and decompression. If the tuples are densely
        packed into data pages due to good encoding/compression ratios, there
        will be more work required when reconstructing the data. Each
        compression codec offers different performance tradeoffs and should be
        considered before writing the data. Due to the deterministic nature of
        the scheduler, single nodes can become bottlenecks for highly concurrent
        queries that use the same tables.
      </p>

      <p>
        If, for example, a Parquet based dataset is tiny, e.g. a small
        dimension table, such that it fits into a single HDFS block (Impala by
        default will create 256 MB blocks when Parquet is used, each containing
        a single row group) then there are a number of options that can be
        considered to resolve the potential scheduling hotspots when querying
        this data:
      </p>

      <ul>
        <li>
          In <keyword keyref="impala25"/> and higher, the scheduler’s
          deterministic behaviour can be changed using the following query
          options: <codeph>REPLICA_PREFERENCE</codeph> and
            <codeph>RANDOM_REPLICA</codeph>. For a detailed description of each
          of these modes see <keyword keyref="impala-2696">IMPALA-2696</keyword>.
        </li>

        <li>
          HDFS caching can be used to cache block replicas. This will cause
          the Impala scheduler to randomly pick (from <keyword keyref="impala22"
          /> and higher) a node that is hosting a cached block replica for the
          scan. Note, although HDFS caching has benefits, it serves only to help
          with the reading of raw block data and not cached tuple data, but with
          the right number of cached replicas (by default, HDFS only caches one
          replica), even load distribution can be achieved for smaller
          datasets.
        </li>

        <li>
          Do not compress the table data. The uncompressed table data spans more
          nodes and eliminates skew caused by compression.
        </li>

        <li>
          Reduce the Parquet file size via the
            <codeph>PARQUET_FILE_SIZE</codeph> query option when writing the
          table data. Using this approach the data will span more nodes. However
          it’s not recommended to drop the size below 32 MB.
        </li>
      </ul>
    </section>

  </conbody>
</concept>