<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="cluster_sizing">

  <title>Cluster Sizing Guidelines for Impala</title>
  <titlealts audience="PDF"><navtitle>Cluster Sizing</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Clusters"/>
      <data name="Category" value="Planning"/>
      <data name="Category" value="Sizing"/>
      <data name="Category" value="Deploying"/>
      <!-- Hoist by my own petard. Memory is an important theme of this topic but that's in a <section> title. -->
      <data name="Category" value="Sectionated Pages"/>
      <data name="Category" value="Memory"/>
      <data name="Category" value="Scalability"/>
      <data name="Category" value="Proof of Concept"/>
      <data name="Category" value="Requirements"/>
      <data name="Category" value="Guidelines"/>
      <data name="Category" value="Best Practices"/>
      <data name="Category" value="Administrators"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      <indexterm audience="hidden">cluster sizing</indexterm>
      This document provides a very rough guideline to estimate the size of a cluster needed for a specific
      customer application. You can use this information when planning how much and what type of hardware to
      acquire for a new cluster, or when adding Impala workloads to an existing cluster.
    </p>

    <note>
      Before making purchase or deployment decisions, consult your Cloudera representative to verify the
      conclusions about hardware requirements based on your data volume and workload.
    </note>

    <!-- <p outputclass="toc inpage"/> -->

    <p>
      Always use hosts with identical specifications and capacities for all the nodes in the cluster. Currently,
      Impala divides the work evenly between cluster nodes, regardless of their exact hardware configuration.
      Because work can be distributed in different ways for different queries, if some hosts are overloaded
      compared to others in terms of CPU, memory, I/O, or network, you might experience inconsistent performance
      and overall slowness.
    </p>

    <p>
      For analytic workloads with star/snowflake schemas, and using consistent hardware for all nodes (64 GB RAM,
      12 x 2 TB hard drives, 2 x E5-2630L CPUs with 12 cores total, 10 Gb network), the following table estimates
      the number of DataNodes needed in the cluster based on data size and the number of concurrent queries, for
      workloads similar to TPC-DS benchmark queries:
    </p>

    <table>
      <title>Cluster size estimation based on the number of concurrent queries and data size with a 20 second average query response time</title>
      <tgroup cols="6">
        <colspec colnum="1" colname="col1"/>
        <colspec colnum="2" colname="col2"/>
        <colspec colnum="3" colname="col3"/>
        <colspec colnum="4" colname="col4"/>
        <colspec colnum="5" colname="col5"/>
        <colspec colnum="6" colname="col6"/>
        <thead>
          <row>
            <entry>Data Size</entry>
            <entry>1 query</entry>
            <entry>10 queries</entry>
            <entry>100 queries</entry>
            <entry>1000 queries</entry>
            <entry>2000 queries</entry>
          </row>
        </thead>
        <tbody>
          <row>
            <entry><b>250 GB</b></entry>
            <entry>2</entry>
            <entry>2</entry>
            <entry>5</entry>
            <entry>35</entry>
            <entry>70</entry>
          </row>
          <row>
            <entry><b>500 GB</b></entry>
            <entry>2</entry>
            <entry>2</entry>
            <entry>10</entry>
            <entry>70</entry>
            <entry>135</entry>
          </row>
          <row>
            <entry><b>1 TB</b></entry>
            <entry>2</entry>
            <entry>2</entry>
            <entry>15</entry>
            <entry>135</entry>
            <entry>270</entry>
          </row>
          <row>
            <entry><b>15 TB</b></entry>
            <entry>2</entry>
            <entry>20</entry>
            <entry>200</entry>
            <entry>N/A</entry>
            <entry>N/A</entry>
          </row>
          <row>
            <entry><b>30 TB</b></entry>
            <entry>4</entry>
            <entry>40</entry>
            <entry>400</entry>
            <entry>N/A</entry>
            <entry>N/A</entry>
          </row>
          <row>
            <entry><b>60 TB</b></entry>
            <entry>8</entry>
            <entry>80</entry>
            <entry>800</entry>
            <entry>N/A</entry>
            <entry>N/A</entry>
          </row>
        </tbody>
      </tgroup>
    </table>

    <section id="sizing_factors">

      <title>Factors Affecting Scalability</title>

      <p>
        A typical analytic workload (TPC-DS style queries) using recommended hardware is usually CPU-bound. Each
        node can process roughly 1.6 GB/sec. Both CPU-bound and disk-bound workloads can scale almost linearly with
        cluster size. However, for some workloads, the scalability might be bounded by the network, or even by
        memory.
      </p>

      <p>
        If the workload is already network bound (on a 10 Gb network), increasing the cluster size won’t reduce
        the network load; in fact, a larger cluster could increase network traffic because some queries involve
        <q>broadcast</q> operations to all DataNodes. Therefore, boosting the cluster size does not improve query
        throughput in a network-constrained environment.
      </p>

      <p>
        Let’s look at a memory-bound workload. A workload is memory-bound if Impala cannot run any additional
        concurrent queries because all of the memory allocated to Impala has already been consumed, but neither
        CPU, disk, nor network is saturated yet. This can happen because currently Impala uses only a single core
        per node to process join and aggregation queries. For a node with 128 GB of RAM, if a join node in a query
        takes 50 GB, the system cannot run more than 2 such queries at the same time.
      </p>

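      <p>
        As a rough worked check of that figure (a sketch only; it ignores memory used by the rest of the query):
      </p>

      <codeblock>Concurrent join queries per node = 128 GB / 50 GB per join = 2.56, so at most 2 such queries
</codeblock>
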
      <p>
        Therefore, at most 2 cores are used. Throughput can still scale almost linearly even for a memory-bound
        workload; it is just that the CPU will not be saturated, and per-node throughput will be lower than
        1.6 GB/sec. In this situation, consider increasing the memory per node.
      </p>

      <p>
        As long as the workload is not network-bound or memory-bound, we can use 1.6 GB/second per node as the
        throughput estimate.
      </p>
    </section>

    <section id="sizing_details">

      <title>A More Precise Approach</title>

      <p>
        A more precise sizing estimate would require not only queries per minute (QPM), but also an average data
        size scanned per query (D). With the proper partitioning strategy, D is usually a fraction of the total
        data size. The following equation can be used as a rough guide to estimate the number of nodes (N) needed:
      </p>

      <codeblock>Eq 1: N > QPM * D / 100 GB
</codeblock>

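      <p>
        The 100 GB divisor appears to correspond to the per-node processing rate discussed above: roughly
        1.6 GB/second works out to about 100 GB of data processed per node per minute.
      </p>

      <codeblock>1.6 GB/sec * 60 sec/minute = 96 GB/minute, or roughly 100 GB per node per minute
</codeblock>
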
      <p>
        Here is an example. Suppose, on average, a query scans 50 GB of data and the average response time is
        required to be 15 seconds or less when there are 100 concurrent queries. The QPM is 100 / 15 * 60 = 400.
        We can estimate the number of nodes using the equation above:
      </p>

      <codeblock>N > QPM * D / 100 GB
N > 400 * 50 GB / 100 GB
N > 200
</codeblock>

      <p>
        Because this figure is a rough estimate, the corresponding number of nodes could be between 100 and 500.
      </p>

      <p>
        Depending on the complexity of the query, the processing rate might change. If the query has more joins,
        aggregation functions, or CPU-intensive functions such as string processing or complex UDFs, the
        processing rate will be lower than 1.6 GB/second per node. On the other hand, if the query only does scans
        and filtering on numeric columns, the processing rate can be higher.
      </p>
    </section>

    <section id="sizing_mem_estimate">

      <title>Estimating Memory Requirements</title>
      <!--
      <prolog>
        <metadata>
          <data name="Category" value="Memory"/>
        </metadata>
      </prolog>
      -->

      <p>
        Impala can handle joins between multiple large tables. Make sure that statistics are collected for all the
        joined tables, using the <codeph><xref href="impala_compute_stats.xml#compute_stats">COMPUTE
        STATS</xref></codeph> statement. However, joining big tables does consume more memory. Follow the steps
        below to calculate the minimum memory requirement.
      </p>

|
||
Suppose you are running the following join:
|
||
</p>
|
||
|
||
      <codeblock>select a.*, b.col_1, b.col_2, … b.col_n
from a, b
where a.key = b.key
  and b.col_1 in (1,2,4...)
  and b.col_4 in (....);
</codeblock>

      <p>
        And suppose table <codeph>B</codeph> is smaller than table <codeph>A</codeph> (but still a large table).
      </p>

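      <p>
        As noted above, you would collect statistics on both joined tables first. A minimal example, using the
        table names from this query:
      </p>

      <codeblock>compute stats a;
compute stats b;
</codeblock>
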
      <p>
        The memory requirement for the query is that the size of the right-hand table (<codeph>B</codeph>), after
        decompression, filtering (<codeph>b.col_n in ...</codeph>), and projection (keeping only the referenced
        columns), must be less than the total memory of the entire cluster.
      </p>

      <codeblock>Cluster Total Memory Requirement = Size of the smaller table *
                                   selectivity factor from the predicate *
                                   projection factor * compression ratio
</codeblock>

      <p>
        In this case, assume that table <codeph>B</codeph> is 100 TB in Parquet format with 200 columns. The
        predicate on <codeph>B</codeph> (<codeph>b.col_1 in ... and b.col_4 in ...</codeph>) selects only 10% of
        the rows from <codeph>B</codeph>, and the projection uses only 5 columns out of the 200. Snappy compression
        typically gives about 3 times compression, so we use a factor of 3 for the expansion of the data when it is
        decompressed in memory.
      </p>

      <codeblock>Cluster Total Memory Requirement = Size of the smaller table *
                                   selectivity factor from the predicate *
                                   projection factor * compression ratio
                                 = 100 TB * 10% * 5/200 * 3
                                 = 0.75 TB
                                 = 750 GB
</codeblock>

      <p>
        So, if you have a 10-node cluster where each node has 128 GB of RAM, and you give 80% of that memory to
        Impala, you have about 1 TB of usable memory for Impala, which is more than 750 GB. Therefore, your
        cluster can handle join queries of this magnitude.
      </p>
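
      <p>
        As a quick worked check of that capacity figure:
      </p>

      <codeblock>Usable memory for Impala = 10 nodes * 128 GB * 80% = 1024 GB (about 1 TB), which is greater than 750 GB
</codeblock>
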
    </section>
  </conbody>
</concept>