<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="scalability">

<title>Scalability Considerations for Impala</title>

<titlealts audience="PDF">

<navtitle>Scalability Considerations</navtitle>

</titlealts>

<prolog>
<metadata>
<data name="Category" value="Performance"/>
<data name="Category" value="Impala"/>
<data name="Category" value="Planning"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Memory"/>
<data name="Category" value="Scalability"/>
<!-- Using domain knowledge about Impala, sizing, etc. to decide what to mark as 'Proof of Concept'. -->
<data name="Category" value="Proof of Concept"/>
</metadata>
</prolog>

<conbody>

<p>
This section explains how the size of your cluster and the volume of data influences SQL
performance and schema design for Impala tables. Typically, adding more cluster capacity
reduces problems due to memory limits or disk throughput. On the other hand, larger
clusters are more likely to have other kinds of scalability issues, such as a single slow
node that causes performance problems for queries.
</p>

<p outputclass="toc inpage"/>

<p conref="../shared/impala_common.xml#common/cookbook_blurb"/>

</conbody>

<concept audience="hidden" id="scalability_memory">

<title>Overview and Guidelines for Impala Memory Usage</title>

<prolog>
<metadata>
<data name="Category" value="Memory"/>
<data name="Category" value="Concepts"/>
<data name="Category" value="Best Practices"/>
<data name="Category" value="Guidelines"/>
</metadata>
</prolog>

<conbody>

<!--
Outline adapted from Alan Choi's "best practices" and/or "performance cookbook" papers.
-->

<codeblock>Memory Usage – the Basics
* Memory is used by:
* Hash join – RHS tables after decompression, filtering and projection
* Group by – proportional to the #groups
* Parquet writer buffer – 1GB per partition
* IO buffer (shared across queries)
* Metadata cache (no more than 1GB typically)
* Memory held and reused by later query
* Impala releases memory from time to time starting in 1.4.

Memory Usage – Estimating Memory Usage
* Use Explain Plan
* Requires statistics! Mem estimate without stats is meaningless.
* Reports per-host memory requirement for this cluster size.
* Re-run if you’ve re-sized the cluster!
[image of explain plan]

Memory Usage – Estimating Memory Usage
* EXPLAIN’s memory estimate issues
* Can be way off – much higher or much lower.
* group by’s estimate can be particularly off – when there’s a large number of group by columns.
* Mem estimate = NDV of group by column 1 * NDV of group by column 2 * ... NDV of group by column n
* Ignore EXPLAIN’s estimate if it’s too high! • Do your own estimate for group by
* GROUP BY mem usage = (total number of groups * size of each row) + (total number of groups * size of each row) / num node

Memory Usage – Finding Actual Memory Usage
* Search for “Per Node Peak Memory Usage” in the profile.
This is accurate. Use it for production capacity planning.

Memory Usage – Actual Memory Usage
* For complex queries, how do I know which part of my query is using too much memory?
* Use the ExecSummary from the query profile!
- But is that "Peak Mem" number aggregate or per-node?
[image of executive summary]

Memory Usage – Hitting Mem-limit
* Top causes (in order) of hitting mem-limit even when running a single query:
1. Lack of statistics
2. Lots of joins within a single query
3. Big-table joining big-table
4. Gigantic group by

Memory Usage – Hitting Mem-limit
Lack of stats
* Wrong join order, wrong join strategy, wrong insert strategy
* Explain Plan tells you that!
[image of explain plan]
* Fix: Compute Stats table

Memory Usage – Hitting Mem-limit
Lots of joins within a single query
* select...from fact, dim1, dim2,dim3,...dimN where ...
* Each dim tbl can fit in memory, but not all of them together
* As of Impala 1.4, Impala might choose the wrong plan – BROADCAST
FIX 1: use shuffle hint
select ... from fact join [shuffle] dim1 on ... join dim2 [shuffle] ...
FIX 2: pre-join the dim tables (if possible)
- How about an example to illustrate that technique?
* few join=>better perf!

Memory Usage: Hitting Mem-limit
Big-table joining big-table
* Big-table (after decompression, filtering, and projection) is a table that is bigger than total cluster memory size.
* Impala 2.0 will do this (via disk-based join). Consider using Hive for now.
* (Advanced) For a simple query, you can try this advanced workaround – per-partition join
* Requires the partition key be part of the join key
select ... from BigTbl_A a join BigTbl_B b where a.part_key = b.part_key and a.part_key in (1,2,3)
union all
select ... from BigTbl_A a join BigTbl_B b where a.part_key = b.part_key and a.part_key in (4,5,6)

Memory Usage: Hitting Mem-limit
Gigantic group by
* The total number of distinct groups is huge, such as group by userid.
* Impala 2.0 will do this (via disk-based agg). Consider using Hive for now.
- Is this one of the cases where people were unhappy we recommended Hive?
* (Advanced) For a simple query, you can try this advanced workaround – per-partition agg
* Requires the partition key be part of the group by
select part_key, col1, col2, ...agg(..) from tbl where
part_key in (1,2,3)
Union all
Select part_key, col1, col2, ...agg(..) from tbl where
part_key in (4,5,6)
- But where's the GROUP BY in the preceding query? Need a real example.

Memory Usage: Additional Notes
* Use explain plan for estimate; use profile for accurate measure
* Data skew can cause uneven memory usage
* Review previous common issues on out-of-memory
* Note: Even with disk-based joins, you'll want to review these steps to speed up queries and use memory more efficiently
</codeblock>
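<p>
A sketch of the per-partition aggregation workaround mentioned above, with the
<codeph>GROUP BY</codeph> spelled out. The table name <codeph>tbl</codeph>, its partition key
<codeph>part_key</codeph>, and the columns are placeholders; each branch aggregates only a
subset of partitions, so no single aggregation has to hold all of the groups in memory at once:
</p>

<codeblock>-- Hypothetical example: tbl is partitioned by part_key.
-- Each branch groups over only a few partitions, reducing peak memory.
select part_key, col1, count(*) as cnt, sum(col2) as total
from tbl
where part_key in (1,2,3)
group by part_key, col1
union all
select part_key, col1, count(*) as cnt, sum(col2) as total
from tbl
where part_key in (4,5,6)
group by part_key, col1;
</codeblock>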

</conbody>

</concept>

<concept id="scalability_catalog">

<title>Impact of Many Tables or Partitions on Impala Catalog Performance and Memory Usage</title>

<conbody>

<p>
Because Hadoop I/O is optimized for reading and writing large files, Impala is optimized
for tables containing relatively few, large data files. Schemas containing thousands of
tables, or tables containing thousands of partitions, can encounter performance issues
during startup or during DDL operations such as <codeph>ALTER TABLE</codeph> statements.
</p>

<note type="important" rev="TSB-168">
<p>
Because of a change in the default heap size for the <cmdname>catalogd</cmdname>
daemon in <keyword keyref="impala25_full"/> and higher, the following
procedure to increase the <cmdname>catalogd</cmdname> memory limit might be required
following an upgrade to <keyword keyref="impala25_full"/> even if not needed
previously.
</p>
</note>

<p conref="../shared/impala_common.xml#common/increase_catalogd_heap_size"/>

</conbody>

</concept>

<concept rev="2.1.0" id="statestore_scalability">

<title>Scalability Considerations for the Impala Statestore</title>

<conbody>

<p>
Before <keyword keyref="impala21_full"/>, the statestore sent only one kind of message
to its subscribers. This message contained all updates for any topics that a subscriber
had subscribed to. It also served to let subscribers know that the statestore had not
failed, and conversely the statestore used the success of sending a heartbeat to a
subscriber to decide whether or not the subscriber had failed.
</p>

<p>
Combining topic updates and failure detection in a single message led to bottlenecks in
clusters with large numbers of tables, partitions, and HDFS data blocks. When the
statestore was overloaded with metadata updates to transmit, heartbeat messages were
sent less frequently, sometimes causing subscribers to time out their connection with
the statestore. Increasing the subscriber timeout and decreasing the frequency of
statestore heartbeats worked around the problem, but reduced responsiveness when the
statestore failed or restarted.
</p>

<p>
As of <keyword keyref="impala21_full"/>, the statestore now sends topic updates and
heartbeats in separate messages. This allows the statestore to send and receive a steady
stream of lightweight heartbeats, and removes the requirement to send topic updates
according to a fixed schedule, reducing statestore network overhead.
</p>

<p>
The statestore now has the following relevant configuration flags for the
<cmdname>statestored</cmdname> daemon:
</p>

<dl>
<dlentry id="statestore_num_update_threads">

<dt>
<codeph>-statestore_num_update_threads</codeph>
</dt>

<dd>
The number of threads inside the statestore dedicated to sending topic updates. You
should not typically need to change this value.
<p>
<b>Default:</b> 10
</p>
</dd>

</dlentry>

<dlentry id="statestore_update_frequency_ms">

<dt>
<codeph>-statestore_update_frequency_ms</codeph>
</dt>

<dd>
The frequency, in milliseconds, with which the statestore tries to send topic
updates to each subscriber. This is a best-effort value; if the statestore is unable
to meet this frequency, it sends topic updates as fast as it can. You should not
typically need to change this value.
<p>
<b>Default:</b> 2000
</p>
</dd>

</dlentry>

<dlentry id="statestore_num_heartbeat_threads">

<dt>
<codeph>-statestore_num_heartbeat_threads</codeph>
</dt>

<dd>
The number of threads inside the statestore dedicated to sending heartbeats. You
should not typically need to change this value.
<p>
<b>Default:</b> 10
</p>
</dd>

</dlentry>

<dlentry id="statestore_heartbeat_frequency_ms">

<dt>
<codeph>-statestore_heartbeat_frequency_ms</codeph>
</dt>

<dd>
The frequency, in milliseconds, with which the statestore tries to send heartbeats
to each subscriber. This value should be good for large catalogs and clusters up to
approximately 150 nodes. Beyond that, you might need to increase this value to make
the interval longer between heartbeat messages.
<p>
<b>Default:</b> 1000 (one heartbeat message every second)
</p>
</dd>

</dlentry>

<dlentry id="statestore_heartbeat_tcp_timeout_seconds">

<dt>
<codeph>-statestore_heartbeat_tcp_timeout_seconds</codeph>
</dt>

<dd>
The time after which a heartbeat RPC to a subscriber times out. This setting
protects against badly hung machines that cannot respond to the heartbeat
RPC in short order. Increase this value if there are intermittent heartbeat RPC
timeouts in the statestore's log. The maximum value of the
"statestore.priority-topic-update-durations" metric on the statestore is a useful
reference point for choosing a value. Note that priority topic updates are assumed
to be small amounts of data that take a small amount of time to process (similar to
the heartbeat complexity).
<p>
<b>Default:</b> 3
</p>
</dd>

</dlentry>

<dlentry id="statestore_max_missed_heartbeats">

<dt>
<codeph>-statestore_max_missed_heartbeats</codeph>
</dt>

<dd>
Maximum number of consecutive heartbeat messages an impalad can miss before being
declared failed by the statestore. You should not typically need to change this
value.
<p>
<b>Default:</b> 10
</p>
</dd>

</dlentry>

<dlentry id="statestore_subscriber_timeout_secs">

<dt>
<codeph>-statestore_subscriber_timeout_secs</codeph>
</dt>

<dd>
The amount of time (in seconds) that may elapse before the connection with the
statestore is considered lost by a subscriber (impalad or catalogd). When this happens,
the impalad re-registers itself with the statestore and may be missing from the next
round of cluster membership updates, which can cause query failures such as "Cancelled
due to unreachable impalad(s)". The value of this flag should be comparable to
<codeph>
(statestore_heartbeat_frequency_ms / 1000 + statestore_heartbeat_tcp_timeout_seconds)
* statestore_max_missed_heartbeats</codeph>,
so that subscribers do not re-register themselves too early and the statestore has time
to resend heartbeats. The maximum value of the
"statestore-subscriber.heartbeat-interval-time" metric on the impalads is a useful
reference point for choosing a value.
<p>
<b>Default:</b> 30
</p>
</dd>

</dlentry>
</dl>
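<p>
As an illustration of how these flags relate (the numbers below are hypothetical, not
recommendations): if you raised <codeph>-statestore_heartbeat_frequency_ms</codeph> to
<codeph>5000</codeph> and <codeph>-statestore_heartbeat_tcp_timeout_seconds</codeph> to
<codeph>5</codeph>, and kept <codeph>-statestore_max_missed_heartbeats</codeph> at its
default of <codeph>10</codeph>, the formula above suggests a subscriber timeout of at
least <codeph>(5000 / 1000 + 5) * 10 = 100</codeph> seconds, so you would also raise
<codeph>-statestore_subscriber_timeout_secs</codeph> to roughly that value rather than
leaving it at the default of 30.
</p>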

<p>
If it takes a very long time for a cluster to start up, and
<cmdname>impala-shell</cmdname> consistently displays <codeph>This Impala daemon is not
ready to accept user requests</codeph>, the statestore might be taking too long to send
the entire catalog topic to the cluster. In this case, consider adding
<codeph>--load_catalog_in_background=false</codeph> to your catalog service
configuration. This setting stops the statestore from loading the entire catalog into
memory at cluster startup. Instead, metadata for each table is loaded when the table is
accessed for the first time.
</p>

</conbody>

</concept>

<concept id="scalability_buffer_pool" rev="2.10.0 IMPALA-3200">

<title>Effect of Buffer Pool on Memory Usage (<keyword keyref="impala210"/> and higher)</title>

<conbody>

<p>
The buffer pool feature, available in <keyword keyref="impala210"/> and higher, changes
the way Impala allocates memory during a query. Most of the memory needed is reserved at
the beginning of the query, avoiding cases where a query might run for a long time
before failing with an out-of-memory error. The actual memory estimates and memory
buffers are typically smaller than before, so that more queries can run concurrently or
process larger volumes of data than previously.
</p>

<p>
The buffer pool feature includes some query options that you can fine-tune:
<xref keyref="buffer_pool_limit"/>,
<xref keyref="default_spillable_buffer_size"/>,
<xref keyref="max_row_size"/>, and <xref keyref="min_spillable_buffer_size"/>.
</p>

<p>
Most of the effects of the buffer pool are transparent to you as an Impala user. Memory
use during spilling is now steadier and more predictable, instead of increasing rapidly
as more data is spilled to disk. The main change from a user perspective is the need to
increase the <codeph>MAX_ROW_SIZE</codeph> query option setting when querying tables
with columns containing long strings, many columns, or other combinations of factors
that produce very large rows. If Impala encounters rows that are too large to process
with the default query option settings, the query fails with an error message suggesting
to increase the <codeph>MAX_ROW_SIZE</codeph> setting.
</p>
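<p>
For example, before running a query whose result rows might exceed the default maximum
row size, you could raise the limit for the current session (the 1 MB figure here is only
an illustration, not a recommended setting):
</p>

<codeblock>set max_row_size=1mb;
</codeblock>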

</conbody>

</concept>

<concept audience="hidden" id="scalability_cluster_size">

<title>Scalability Considerations for Impala Cluster Size and Topology</title>

<conbody>

<p/>

</conbody>

</concept>

<concept audience="hidden" id="concurrent_connections">

<title>Scaling the Number of Concurrent Connections</title>

<conbody>

<p/>

</conbody>

</concept>

<concept rev="2.0.0" id="spill_to_disk">

<title>SQL Operations that Spill to Disk</title>

<conbody>

<p>
Certain memory-intensive operations write temporary data to disk (known as
<term>spilling</term> to disk) when Impala is close to exceeding its memory limit on a
particular host.
</p>

<p>
The result is a query that completes successfully, rather than failing with an
out-of-memory error. The tradeoff is decreased performance due to the extra disk I/O to
write the temporary data and read it back in. The slowdown could potentially be
significant. Thus, while this feature improves reliability, you should optimize your
queries, system parameters, and hardware configuration to make this spilling a rare
occurrence.
</p>

<note rev="2.10.0 IMPALA-3200">
<p>
In <keyword keyref="impala210"/> and higher, also see
<xref keyref="scalability_buffer_pool"/> for changes to Impala memory
allocation that might change the details of which queries spill to disk, and how much
memory and disk space is involved in the spilling operation.
</p>
</note>

<p>
<b>What kinds of queries might spill to disk:</b>
</p>

<p>
Several SQL clauses and constructs require memory allocations that could activate the
spilling mechanism:
</p>

<ul>
<li>
<p>
When a query uses a <codeph>GROUP BY</codeph> clause for columns with millions or
billions of distinct values, Impala keeps a similar number of temporary results in
memory, to accumulate the aggregate results for each value in the group.
</p>
</li>

<li>
<p>
When large tables are joined together, Impala keeps the values of the join columns
from one table in memory, to compare them to incoming values from the other table.
</p>
</li>

<li>
<p>
When a large result set is sorted by the <codeph>ORDER BY</codeph> clause, each node
sorts its portion of the result set in memory.
</p>
</li>

<li>
<p>
The <codeph>DISTINCT</codeph> and <codeph>UNION</codeph> operators build in-memory
data structures to represent all values found so far, to eliminate duplicates as the
query progresses.
</p>
</li>

<!-- JIRA still in open state as of 5.8 / 2.6, commenting out.
<li>
<p rev="IMPALA-3471">
In <keyword keyref="impala26_full"/> and higher, <term>top-N</term> queries (those with
<codeph>ORDER BY</codeph> and <codeph>LIMIT</codeph> clauses) can also spill.
Impala allocates enough memory to hold as many rows as specified by the <codeph>LIMIT</codeph>
clause, plus enough memory to hold as many rows as specified by any <codeph>OFFSET</codeph> clause.
</p>
</li>
-->
</ul>

<p conref="../shared/impala_common.xml#common/spill_to_disk_vs_dynamic_partition_pruning"/>

<p>
<b>How Impala handles scratch disk space for spilling:</b>
</p>

<p rev="obwl" conref="../shared/impala_common.xml#common/order_by_scratch_dir"/>

<p>
<b>Memory usage for SQL operators:</b>
</p>

<p rev="2.10.0 IMPALA-3200">
In <keyword keyref="impala210_full"/> and higher, the way SQL operators such as
<codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins transition between
using additional memory or activating the spill-to-disk feature is changed. The memory
required to spill to disk is reserved up front, and you can examine it in the
<codeph>EXPLAIN</codeph> plan when the <codeph>EXPLAIN_LEVEL</codeph> query option is
set to 2 or higher.
</p>
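<p>
For example, to see the per-operator memory reservations for a query before running it
(the table and column names here are placeholders):
</p>

<codeblock>set explain_level=2;
explain select customer_id, count(*) from sales_fact group by customer_id;
</codeblock>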

<p>
The infrastructure of the spilling feature affects the way the affected SQL operators,
such as <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use memory. On
each host that participates in the query, each such operator in a query requires memory
to store rows of data and other data structures. Impala reserves a certain amount of
memory up front for each operator that supports spill-to-disk that is sufficient to
execute the operator. If an operator accumulates more data than can fit in the reserved
memory, it can either reserve more memory to continue processing data in memory or start
spilling data to temporary scratch files on disk. Thus, operators with spill-to-disk
support can adapt to different memory constraints by using however much memory is
available to speed up execution, yet tolerate low memory conditions by spilling data to
disk.
</p>

<p>
The amount of data depends on the portion of the data being handled by that host, and thus
the operator may end up consuming different amounts of memory on different hosts.
</p>

<!--
<p>
The infrastructure of the spilling feature affects the way the affected SQL operators, such as
<codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use memory.
On each host that participates in the query, each such operator in a query accumulates memory
while building the data structure to process the aggregation or join operation. The amount
of memory used depends on the portion of the data being handled by that host, and thus might
be different from one host to another. When the amount of memory being used for the operator
on a particular host reaches a threshold amount, Impala reserves an additional memory buffer
to use as a work area in case that operator causes the query to exceed the memory limit for
that host. After allocating the memory buffer, the memory used by that operator remains
essentially stable or grows only slowly, until the point where the memory limit is reached
and the query begins writing temporary data to disk.
</p>

<p rev="2.2.0">
Prior to Impala 2.2, the extra memory buffer for an operator that might spill to disk
was allocated when the data structure used by the applicable SQL operator reaches 16 MB in size,
and the memory buffer itself was 512 MB. In Impala 2.2, these values are halved: the threshold value
is 8 MB and the memory buffer is 256 MB. <ph rev="2.3.0">In <keyword keyref="impala23_full"/> and higher, the memory for the buffer
is allocated in pieces, only as needed, to avoid sudden large jumps in memory usage.</ph> A query that uses
multiple such operators might allocate multiple such memory buffers, as the size of the data structure
for each operator crosses the threshold on a particular host.
</p>

<p>
Therefore, a query that processes a relatively small amount of data on each host would likely
never reach the threshold for any operator, and would never allocate any extra memory buffers. A query
that did process millions of groups, distinct values, join keys, and so on might cross the threshold,
causing its memory requirement to rise suddenly and then flatten out. The larger the cluster, less data is processed
on any particular host, thus reducing the chance of requiring the extra memory allocation.
</p>
-->

<p>
<b>Added in:</b> This feature was added to the <codeph>ORDER BY</codeph> clause in
Impala 1.4. This feature was extended to cover join queries, aggregation functions, and
analytic functions in Impala 2.0. The size of the memory work area required by each
operator that spills was reduced from 512 megabytes to 256 megabytes in Impala 2.2.
<ph rev="2.10.0 IMPALA-3200">The spilling mechanism was reworked to take
advantage of the Impala buffer pool feature and be more predictable and stable in
<keyword keyref="impala210_full"/>.</ph>
</p>

<p>
<b>Avoiding queries that spill to disk:</b>
</p>

<p>
Because the extra I/O can impose significant performance overhead on these types of
queries, try to avoid this situation by using the following steps:
</p>

<ol>
<li>
Detect how often queries spill to disk, and how much temporary data is written. Refer
to the following sources:
<ul>
<li>
The output of the <codeph>PROFILE</codeph> command in the
<cmdname>impala-shell</cmdname> interpreter. This data shows the memory usage for
each host and in total across the cluster. The <codeph>WriteIoBytes</codeph>
counter reports how much data was written to disk for each operator during the
query. (In <keyword keyref="impala29_full"/>, the counter was
named <codeph>ScratchBytesWritten</codeph>; in
<keyword keyref="impala28_full"/> and earlier, it was named
<codeph>BytesWritten</codeph>.)
</li>

<li>
The <uicontrol>Queries</uicontrol> tab in the Impala debug web user interface.
Select the query to examine and click the corresponding
<uicontrol>Profile</uicontrol> link. This data breaks down the memory usage for a
single host within the cluster, the host whose web interface you are connected to.
</li>
</ul>
</li>

<li>
Use one or more techniques to reduce the possibility of the queries spilling to disk:
<ul>
<li>
Increase the Impala memory limit if practical, for example, if you can increase
the available memory by more than the amount of temporary data written to disk on
a particular node. Remember that in Impala 2.0 and later, you can issue
<codeph>SET MEM_LIMIT</codeph> as a SQL statement, which lets you fine-tune the
memory usage for queries from JDBC and ODBC applications.
</li>

<li>
Increase the number of nodes in the cluster, to increase the aggregate memory
available to Impala and reduce the amount of memory required on each node.
</li>

<li>
Add more memory to the hosts running Impala daemons.
</li>

<li>
On a cluster with resources shared between Impala and other Hadoop components, use
resource management features to allocate more memory for Impala. See
<xref href="impala_resource_management.xml#resource_management"/>
for details.
</li>

<li>
If the memory pressure is due to running many concurrent queries rather than a few
memory-intensive ones, consider using the Impala admission control feature to
lower the limit on the number of concurrent queries. By spacing out the most
resource-intensive queries, you can avoid spikes in memory usage and improve
overall response times. See
<xref href="impala_admission.xml#admission_control"/> for details.
</li>

<li>
Tune the queries with the highest memory requirements, using one or more of the
following techniques:
<ul>
<li>
Run the <codeph>COMPUTE STATS</codeph> statement for all tables involved in
large-scale joins and aggregation queries.
</li>

<li>
Minimize your use of <codeph>STRING</codeph> columns in join columns. Prefer
numeric values instead.
</li>

<li>
Examine the <codeph>EXPLAIN</codeph> plan to understand the execution strategy
being used for the most resource-intensive queries. See
<xref href="impala_explain_plan.xml#perf_explain"/> for
details.
</li>

<li>
If Impala still chooses a suboptimal execution strategy even with statistics
available, or if it is impractical to keep the statistics up to date for huge
or rapidly changing tables, add hints to the most resource-intensive queries
to select the right execution strategy. See
<xref href="impala_hints.xml#hints"/> for details.
</li>
</ul>
</li>

<li>
If your queries experience substantial performance overhead due to spilling,
enable the <codeph>DISABLE_UNSAFE_SPILLS</codeph> query option. This option
prevents queries whose memory usage is likely to be exorbitant from spilling to
disk. See
<xref href="impala_disable_unsafe_spills.xml#disable_unsafe_spills"/>
for details. As you tune problematic queries using the preceding steps, fewer and
fewer will be cancelled by this option setting.
</li>
</ul>
</li>
</ol>

<p>
<b>Testing performance implications of spilling to disk:</b>
</p>

<p>
To artificially provoke spilling, to test this feature and understand the performance
implications, use a test environment with a memory limit of at least 2 GB. Issue the
<codeph>SET</codeph> command with no arguments to check the current setting for the
<codeph>MEM_LIMIT</codeph> query option. Set the query option
<codeph>DISABLE_UNSAFE_SPILLS=true</codeph>. This option limits the spill-to-disk
feature to prevent runaway disk usage from queries that are known in advance to be
suboptimal. Within <cmdname>impala-shell</cmdname>, run a query that you expect to be
memory-intensive, based on the criteria explained earlier. A self-join of a large table
is a good candidate:
</p>

<codeblock>select count(*) from big_table a join big_table b using (column_with_many_values);
</codeblock>

<p>
Issue the <codeph>PROFILE</codeph> command to get a detailed breakdown of the memory
usage on each node during the query.
<!--
The crucial part of the profile output concerning memory is the <codeph>BlockMgr</codeph>
portion. For example, this profile shows that the query did not quite exceed the memory limit.
-->
</p>

<!-- Commenting out because now stale due to changes from the buffer pool (IMPALA-3200).
To do: Revisit these details later if indicated by user feedback.

<codeblock>BlockMgr:
- BlockWritesIssued: 1
- BlockWritesOutstanding: 0
- BlocksCreated: 24
- BlocksRecycled: 1
- BufferedPins: 0
- MaxBlockSize: 8.00 MB (8388608)
<b>- MemoryLimit: 200.00 MB (209715200)</b>
<b>- PeakMemoryUsage: 192.22 MB (201555968)</b>
- TotalBufferWaitTime: 0ns
- TotalEncryptionTime: 0ns
- TotalIntegrityCheckTime: 0ns
- TotalReadBlockTime: 0ns
</codeblock>

<p>
In this case, because the memory limit was already below any recommended value, I increased the volume of
data for the query rather than reducing the memory limit any further.
</p>
-->

<p>
Set the <codeph>MEM_LIMIT</codeph> query option to a value that is smaller than the peak
memory usage reported in the profile output. Now try the memory-intensive query again.
</p>
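<p>
For example, if the profile reported a peak memory usage of roughly 1.5 GB per node
(a hypothetical figure), you might set the limit somewhat below that before re-running the
query:
</p>

<codeblock>set mem_limit=1gb;
</codeblock>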

<p>
Check if the query fails with a message like the following:
</p>

<codeblock>WARNINGS: Spilling has been disabled for plans that do not have stats and are not hinted
to prevent potentially bad plans from using too many cluster resources. Compute stats on
these tables, hint the plan or disable this behavior via query options to enable spilling.
</codeblock>

<p>
If so, the query could have consumed substantial temporary disk space, slowing down so
much that it would not complete in any reasonable time. Rather than rely on the
spill-to-disk feature in this case, issue the <codeph>COMPUTE STATS</codeph> statement
for the table or tables in your sample query. Then run the query again, check the peak
memory usage again in the <codeph>PROFILE</codeph> output, and adjust the memory limit
again if necessary to be lower than the peak memory usage.
</p>

<p>
At this point, you have a query that is memory-intensive, but Impala can optimize it
efficiently so that the memory usage is not exorbitant. You have set an artificial
constraint through the <codeph>MEM_LIMIT</codeph> option so that the query would
normally fail with an out-of-memory error. But the automatic spill-to-disk feature means
that the query should actually succeed, at the expense of some extra disk I/O to read
and write temporary work data.
</p>

<p>
Try the query again, and confirm that it succeeds. Examine the <codeph>PROFILE</codeph>
output again. This time, look for lines of this form:
</p>

<codeblock>- SpilledPartitions: <varname>N</varname>
</codeblock>

<p>
If you see any such lines with <varname>N</varname> greater than 0, that indicates the
query would have failed in Impala releases prior to 2.0, but now it succeeded because of
the spill-to-disk feature. Examine the total time taken by the
<codeph>AGGREGATION_NODE</codeph> or other query fragments containing non-zero
<codeph>SpilledPartitions</codeph> values. Compare the times to similar fragments that
did not spill, for example in the <codeph>PROFILE</codeph> output when the same query is
run with a higher memory limit. This gives you an idea of the performance penalty of the
spill operation for a particular query with a particular memory limit. If you make the
memory limit just a little lower than the peak memory usage, the query only needs to
write a small amount of temporary data to disk. The lower you set the memory limit, the
more temporary data is written and the slower the query becomes.
</p>

<p>
Now repeat this procedure for actual queries used in your environment. Use the
<codeph>DISABLE_UNSAFE_SPILLS</codeph> setting to identify cases where queries used more
memory than necessary due to lack of statistics on the relevant tables and columns, and
issue <codeph>COMPUTE STATS</codeph> where necessary.
</p>

<p>
<b>When to use DISABLE_UNSAFE_SPILLS:</b>
</p>

<p>
You might wonder, why not leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned on all the
time. Whether and how frequently to use this option depends on your system environment
and workload.
</p>

<p>
<codeph>DISABLE_UNSAFE_SPILLS</codeph> is suitable for an environment with ad hoc
queries whose performance characteristics and memory usage are not known in advance. It
prevents <q>worst-case scenario</q> queries that use large amounts of memory
unnecessarily. Thus, you might turn this option on within a session while developing new
SQL code, even though it is turned off for existing applications.
</p>

<p>
Organizations where table and column statistics are generally up-to-date might leave
this option turned on all the time, again to avoid worst-case scenarios for untested
queries or if a problem in the ETL pipeline results in a table with no statistics.
Turning on <codeph>DISABLE_UNSAFE_SPILLS</codeph> lets you <q>fail fast</q> in this case
and immediately gather statistics or tune the problematic queries.
</p>

<p>
Some organizations might leave this option turned off. For example, you might have
tables large enough that the <codeph>COMPUTE STATS</codeph> takes substantial time to
run, making it impractical to re-run after loading new data. If you have examined the
<codeph>EXPLAIN</codeph> plans of your queries and know that they are operating
efficiently, you might leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned off. In that
case, you know that any queries that spill will not go overboard with their memory
consumption.
</p>

</conbody>

</concept>

<concept id="complex_query">

<title>Limits on Query Size and Complexity</title>

<conbody>

<p>
There are hardcoded limits on the maximum size and complexity of queries. Currently, the
maximum number of expressions in a query is 2000. You might exceed the limits with large
or deeply nested queries produced by business intelligence tools or other query
generators.
</p>

<p>
If you have the ability to customize such queries or the query generation logic that
produces them, replace sequences of repetitive expressions with single operators such as
<codeph>IN</codeph> or <codeph>BETWEEN</codeph> that can represent multiple values or
ranges. For example, instead of a large number of <codeph>OR</codeph> clauses:
</p>

<codeblock>WHERE val = 1 OR val = 2 OR val = 6 OR val = 100 ...
</codeblock>

<p>
use a single <codeph>IN</codeph> clause:
</p>

<codeblock>WHERE val IN (1,2,6,100,...)</codeblock>

</conbody>

</concept>

<concept id="scalability_io">

<title>Scalability Considerations for Impala I/O</title>

<conbody>

<p>
Impala parallelizes its I/O operations aggressively, so the more disks you can
attach to each host, the better. Impala retrieves data from disk so quickly, using bulk
read operations on large blocks, that most queries are CPU-bound rather than I/O-bound.
</p>

<p>
Because the kind of sequential scanning typically done by Impala queries does not
benefit much from the random-access capabilities of SSDs, spinning disks typically
provide the most cost-effective kind of storage for Impala data, with little or no
performance penalty as compared to SSDs.
</p>

<p>
Resource management features such as YARN, Llama, and admission control typically
constrain the amount of memory, CPU, or overall number of queries in a high-concurrency
environment. Currently, there is no throttling mechanism for Impala I/O.
</p>

</conbody>

</concept>

<concept id="big_tables">

<title>Scalability Considerations for Table Layout</title>

<conbody>

<p>
Due to the overhead of retrieving and updating table metadata in the metastore database,
try to limit the number of columns in a table to a maximum of approximately 2000.
Although Impala can handle wider tables than this, the metastore overhead can become
significant, leading to query performance that is slower than expected based on the
actual data volume.
</p>

<p>
To minimize overhead related to the metastore database and Impala query planning, try to
limit the number of partitions for any partitioned table to a few tens of thousands.
</p>

<p rev="IMPALA-5309">
If the volume of data within a table makes it impractical to run exploratory queries,
consider using the <codeph>TABLESAMPLE</codeph> clause to limit query processing to only
a percentage of data within the table. This technique reduces the overhead for query
startup, I/O to read the data, and the amount of network, CPU, and memory needed to
process intermediate results during the query. See <xref keyref="tablesample"/> for
details.
</p>
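<p>
For example, an exploratory query might scan only about 10 percent of a large table's
data (the table and column names here are placeholders):
</p>

<codeblock>select count(distinct customer_id) from huge_events_table tablesample system(10);
</codeblock>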

</conbody>

</concept>

<concept rev="" id="kerberos_overhead_cluster_size">

<title>Kerberos-Related Network Overhead for Large Clusters</title>

<conbody>

<p>
When Impala starts up, or after each <codeph>kinit</codeph> refresh, Impala sends a
number of simultaneous requests to the KDC. For a cluster with 100 hosts, the KDC might
be able to process all the requests within roughly 5 seconds. For a cluster with 1000
hosts, the time to process the requests would be roughly 500 seconds. Impala also makes
a number of DNS requests at the same time as these Kerberos-related requests.
</p>

<p>
While these authentication requests are being processed, any submitted Impala queries
will fail. During this period, the KDC and DNS may be slow to respond to requests from
components other than Impala, so other secure services might be affected temporarily.
</p>

<p>
In <keyword keyref="impala212_full"/> or earlier, to reduce the frequency of the
<codeph>kinit</codeph> renewal that initiates a new set of authentication requests,
increase the <codeph>kerberos_reinit_interval</codeph> configuration setting for the
<codeph>impalad</codeph> daemons. Currently, the default is 60 minutes. Consider using a
higher value such as 360 (6 hours).
</p>
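<p>
For example, one way to apply the 6-hour interval (assuming you manage the
<cmdname>impalad</cmdname> startup flags directly; the exact mechanism depends on how your
cluster is deployed) would be to add the following to the daemon startup options:
</p>

<codeblock>--kerberos_reinit_interval=360
</codeblock>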

<p>
The <codeph>kerberos_reinit_interval</codeph> configuration setting is removed in
<keyword keyref="impala30_full"/>, and the above step is no longer needed.
</p>

</conbody>

</concept>

<concept id="scalability_hotspots" rev="2.5.0 IMPALA-2696">

<title>Avoiding CPU Hotspots for HDFS Cached Data</title>

<conbody>

<p>
You can use the HDFS caching feature, described in
<xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>, with Impala to
reduce I/O and memory-to-memory copying for frequently accessed tables or partitions.
</p>

<p>
In the early days of this feature, you might have found that enabling HDFS caching
resulted in little or no performance improvement, because it could result in
<q>hotspots</q>: instead of the I/O to read the table data being parallelized across the
cluster, the I/O was reduced but the CPU load to process the data blocks might be
concentrated on a single host.
</p>

<p>
To avoid hotspots, include the <codeph>WITH REPLICATION</codeph> clause with the
<codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements for tables that
use HDFS caching. This clause allows more than one host to cache the relevant data
blocks, so the CPU load can be shared, reducing the load on any one host. See
<xref href="impala_create_table.xml#create_table"/> and
<xref href="impala_alter_table.xml#alter_table"/> for details.
</p>
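<p>
For example, a sketch of caching an existing table with a replication factor of 3 (the
table name and cache pool name are placeholders):
</p>

<codeblock>alter table census set cached in 'pool_name' with replication = 3;
</codeblock>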

<p>
Hotspots with high CPU load for HDFS cached data could still arise in some cases, due to
the way that Impala schedules the work of processing data blocks on different hosts. In
<keyword keyref="impala25_full"/> and higher, scheduling improvements mean that the work
for HDFS cached data is divided better among all the hosts that have cached replicas for
a particular data block. When more than one host has a cached replica for a data block,
Impala assigns the work of processing that block to whichever host has done the least
work (in terms of number of bytes read) for the current query. If hotspots persist even
with this load-based scheduling algorithm, you can enable the query option
<codeph>SCHEDULE_RANDOM_REPLICA=TRUE</codeph> to further distribute the CPU load. This
setting causes Impala to randomly pick a host to process a cached data block if the
scheduling algorithm encounters a tie when deciding which host has done the least work.
</p>
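<p>
For example, to try this behavior for the current session before making any permanent
configuration change:
</p>

<codeblock>set schedule_random_replica=true;
</codeblock>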

</conbody>

</concept>

<concept id="scalability_file_handle_cache" rev="2.10.0 IMPALA-4623">

<title>Scalability Considerations for File Handle Caching</title>

<conbody>

<p>
One scalability aspect that affects heavily loaded clusters is the load on the metadata
layer from looking up the details as each file is opened. On HDFS, that can lead to
increased load on the NameNode, and on S3, this can lead to an excessive number of S3
metadata requests. For example, a query that does a full table scan on a partitioned
table may need to read thousands of partitions, each partition containing multiple data
files. Accessing each column of a Parquet file also involves a separate <q>open</q>
call, further increasing the load on the NameNode. High NameNode overhead can add
startup time (that is, increase latency) to Impala queries, and reduce overall
throughput for non-Impala workloads that also require accessing HDFS files.
</p>

<p>
You can reduce the number of calls made to your file system's metadata layer by enabling
the file handle caching feature. Data files that are accessed by different queries, or
even multiple times within the same query, can be accessed without a new <q>open</q>
call and without fetching the file details multiple times.
</p>

<p>
Impala supports file handle caching for the following file systems:
<ul>
<li>
HDFS in <keyword keyref="impala210_full"/> and higher
<p>
In Impala 3.2 and higher, file handle caching also applies to remote HDFS file
handles. This is controlled by the <codeph>cache_remote_file_handles</codeph> flag
for an <codeph>impalad</codeph>. It is recommended that you use the default value
of <codeph>true</codeph> as this caching prevents your NameNode from overloading
when your cluster has many remote HDFS reads.
</p>
</li>

<li>
S3 in <keyword keyref="impala33_full"/> and higher
<p>
The <codeph>cache_s3_file_handles</codeph> <codeph>impalad</codeph> flag controls
the S3 file handle caching. The feature is enabled by default with the flag set to
<codeph>true</codeph>.
</p>
</li>
</ul>
</p>

<p>
The feature is enabled by default with 20,000 file handles to be cached. To change the
value, set the configuration option <codeph>max_cached_file_handles</codeph> to a
non-zero value for each <cmdname>impalad</cmdname> daemon. From the initial default
value of 20000, adjust upward if NameNode request load is still significant, or downward
if it is more important to reduce the extra memory usage on each host. Each cache entry
consumes 6 KB, meaning that caching 20,000 file handles requires up to 120 MB on each
Impala executor. The exact memory usage varies depending on how many file handles have
actually been cached; memory is freed as file handles are evicted from the cache.
</p>
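<p>
For example, doubling the cache size to 40,000 handles would raise the worst-case memory
overhead to roughly 40,000 * 6 KB, or about 240 MB, per executor. One way to apply such a
change (assuming you manage the <cmdname>impalad</cmdname> startup flags directly) would
be:
</p>

<codeblock>--max_cached_file_handles=40000
</codeblock>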

<p>
If a manual operation moves a file to the trashcan while the file handle is cached,
Impala still accesses the contents of that file. This is a change from prior behavior.
Previously, accessing a file that was in the trashcan would cause an error. This
behavior only applies to non-Impala methods of removing files, not the Impala mechanisms
such as <codeph>TRUNCATE TABLE</codeph> or <codeph>DROP TABLE</codeph>.
</p>

<p>
If files are removed, replaced, or appended by operations outside of Impala, the way to
bring the file information up to date is to run the <codeph>REFRESH</codeph> statement
on the table.
</p>

<p>
File handle cache entries are evicted as the cache fills up, or based on a timeout
period when they have not been accessed for some time.
</p>

<p>
To evaluate the effectiveness of file handle caching for a particular workload, issue
the <codeph>PROFILE</codeph> statement in <cmdname>impala-shell</cmdname> or examine
query profiles in the Impala Web UI. Look for the ratio of
<codeph>CachedFileHandlesHitCount</codeph> (ideally, should be high) to
<codeph>CachedFileHandlesMissCount</codeph> (ideally, should be low). Before starting
any evaluation, run several representative queries to <q>warm up</q> the cache because
the first time each data file is accessed is always recorded as a cache miss.
</p>

<p>
To see metrics about file handle caching for each <cmdname>impalad</cmdname> instance,
examine the following fields on the <uicontrol>/metrics</uicontrol> page in the Impala
Web UI:
</p>

<ul>
<li>
<uicontrol>impala-server.io.mgr.cached-file-handles-miss-count</uicontrol>
</li>

<li>
<uicontrol>impala-server.io.mgr.num-cached-file-handles</uicontrol>
</li>
</ul>

</conbody>

</concept>

</concept>