<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="cluster_sizing">

  <title>Cluster Sizing Guidelines for Impala</title>
  <titlealts audience="PDF"><navtitle>Cluster Sizing</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Clusters"/>
      <data name="Category" value="Planning"/>
      <data name="Category" value="Sizing"/>
      <data name="Category" value="Deploying"/>
      <!-- Hoist by my own petard. Memory is an important theme of this topic but that's in a <section> title. -->
      <data name="Category" value="Sectionated Pages"/>
      <data name="Category" value="Memory"/>
      <data name="Category" value="Scalability"/>
      <data name="Category" value="Proof of Concept"/>
      <data name="Category" value="Requirements"/>
      <data name="Category" value="Guidelines"/>
      <data name="Category" value="Best Practices"/>
      <data name="Category" value="Administrators"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      <indexterm audience="hidden">cluster sizing</indexterm>
      This document provides a very rough guideline to estimate the size of a cluster needed for a specific
      customer application. You can use this information when planning how much and what type of hardware to
      acquire for a new cluster, or when adding Impala workloads to an existing cluster.
    </p>

    <note>
      Before making purchase or deployment decisions, consult your Cloudera representative to verify the
      conclusions about hardware requirements based on your data volume and workload.
    </note>

    <!-- <p outputclass="toc inpage"/> -->

    <p>
      Always use hosts with identical specifications and capacities for all the nodes in the cluster. Currently,
      Impala divides the work evenly between cluster nodes, regardless of their exact hardware configuration.
      Because work can be distributed in different ways for different queries, if some hosts are overloaded
      compared to others in terms of CPU, memory, I/O, or network, you might experience inconsistent performance
      and overall slowness.
    </p>

    <p>
      For analytic workloads with star/snowflake schemas, and using consistent hardware for all nodes (64 GB RAM,
      12 x 2 TB hard drives, 2 x E5-2630L CPUs with 12 cores total, 10 Gb network), the following table estimates
      the number of DataNodes needed in the cluster based on data size and the number of concurrent queries, for
      workloads similar to TPC-DS benchmark queries:
    </p>

    <table>
      <title>Cluster size estimation based on the number of concurrent queries and data size with a 20 second average query response time</title>
      <tgroup cols="6">
        <colspec colnum="1" colname="col1"/>
        <colspec colnum="2" colname="col2"/>
        <colspec colnum="3" colname="col3"/>
        <colspec colnum="4" colname="col4"/>
        <colspec colnum="5" colname="col5"/>
        <colspec colnum="6" colname="col6"/>
        <thead>
          <row>
            <entry>Data Size</entry>
            <entry>1 query</entry>
            <entry>10 queries</entry>
            <entry>100 queries</entry>
            <entry>1000 queries</entry>
            <entry>2000 queries</entry>
          </row>
        </thead>
        <tbody>
          <row>
            <entry><b>250 GB</b></entry>
            <entry>2</entry>
            <entry>2</entry>
            <entry>5</entry>
            <entry>35</entry>
            <entry>70</entry>
          </row>
          <row>
            <entry><b>500 GB</b></entry>
            <entry>2</entry>
            <entry>2</entry>
            <entry>10</entry>
            <entry>70</entry>
            <entry>135</entry>
          </row>
          <row>
            <entry><b>1 TB</b></entry>
            <entry>2</entry>
            <entry>2</entry>
            <entry>15</entry>
            <entry>135</entry>
            <entry>270</entry>
          </row>
          <row>
            <entry><b>15 TB</b></entry>
            <entry>2</entry>
            <entry>20</entry>
            <entry>200</entry>
            <entry>N/A</entry>
            <entry>N/A</entry>
          </row>
          <row>
            <entry><b>30 TB</b></entry>
            <entry>4</entry>
            <entry>40</entry>
            <entry>400</entry>
            <entry>N/A</entry>
            <entry>N/A</entry>
          </row>
          <row>
            <entry><b>60 TB</b></entry>
            <entry>8</entry>
            <entry>80</entry>
            <entry>800</entry>
            <entry>N/A</entry>
            <entry>N/A</entry>
          </row>
        </tbody>
      </tgroup>
    </table>

    <section id="sizing_factors">

      <title>Factors Affecting Scalability</title>

      <p>
        A typical analytic workload (TPC-DS style queries) using recommended hardware is usually CPU-bound. Each
        node can process roughly 1.6 GB/sec. Both CPU-bound and disk-bound workloads can scale almost linearly with
        cluster size. However, for some workloads, the scalability might be bounded by the network, or even by
        memory.
      </p>

      <p>
        If the workload is already network bound (on a 10 Gb network), increasing the cluster size won’t reduce
        the network load; in fact, a larger cluster could increase network traffic because some queries involve
        <q>broadcast</q> operations to all DataNodes. Therefore, boosting the cluster size does not improve query
        throughput in a network-constrained environment.
      </p>

      <p>
        Let’s look at a memory-bound workload. A workload is memory-bound if Impala cannot run any additional
        concurrent queries because all of the memory allocated to Impala has already been consumed, but neither
        CPU, disk, nor network is saturated yet. This can happen because currently Impala uses only a single core
        per node to process join and aggregation queries. For a node with 128 GB of RAM, if a join node in a query
        takes 50 GB, the system cannot run more than 2 such queries at the same time.
      </p>

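      <p>
        As a rough worked check of that figure (a sketch only; it ignores memory used by the rest of the query):
      </p>

      <codeblock>Concurrent join queries per node = 128 GB / 50 GB per join = 2.56, so at most 2 such queries
</codeblock>
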
      <p>
        Therefore, at most 2 cores are used. Throughput can still scale almost linearly even for a memory-bound
        workload; it is just that the CPU will not be saturated, and per-node throughput will be lower than
        1.6 GB/sec. In this situation, consider increasing the memory per node.
      </p>

      <p>
        As long as the workload is not network-bound or memory-bound, we can use 1.6 GB/second per node as the
        throughput estimate.
      </p>
    </section>

    <section id="sizing_details">

      <title>A More Precise Approach</title>

      <p>
        A more precise sizing estimate would require not only queries per minute (QPM), but also an average data
        size scanned per query (D). With the proper partitioning strategy, D is usually a fraction of the total
        data size. The following equation can be used as a rough guide to estimate the number of nodes (N) needed:
      </p>

      <codeblock>Eq 1: N > QPM * D / 100 GB
</codeblock>

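      <p>
        The 100 GB divisor appears to correspond to the per-node processing rate discussed above: roughly
        1.6 GB/second works out to about 100 GB of data processed per node per minute.
      </p>

      <codeblock>1.6 GB/sec * 60 sec/minute = 96 GB/minute, or roughly 100 GB per node per minute
</codeblock>
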
      <p>
        Here is an example. Suppose, on average, a query scans 50 GB of data and the average response time is
        required to be 15 seconds or less when there are 100 concurrent queries. The QPM is 100 / 15 * 60 = 400.
        We can estimate the number of nodes using the equation above:
      </p>

      <codeblock>N > QPM * D / 100 GB
N > 400 * 50 GB / 100 GB
N > 200
</codeblock>

      <p>
        Because this figure is a rough estimate, the corresponding number of nodes could be between 100 and 500.
      </p>

      <p>
        Depending on the complexity of the query, the processing rate might change. If the query has more joins,
        aggregation functions, or CPU-intensive functions such as string processing or complex UDFs, the
        processing rate will be lower than 1.6 GB/second per node. On the other hand, if the query only does scans
        and filtering on numeric columns, the processing rate can be higher.
      </p>
    </section>

    <section id="sizing_mem_estimate">

      <title>Estimating Memory Requirements</title>
      <!--
      <prolog>
        <metadata>
          <data name="Category" value="Memory"/>
        </metadata>
      </prolog>
      -->

      <p>
        Impala can handle joins between multiple large tables. Make sure that statistics are collected for all the
        joined tables, using the <codeph><xref href="impala_compute_stats.xml#compute_stats">COMPUTE
        STATS</xref></codeph> statement. However, joining big tables does consume more memory. Follow the steps
        below to calculate the minimum memory requirement.
      </p>

|
||
Suppose you are running the following join:
|
||
</p>
|
||
|
||
      <codeblock>select a.*, b.col_1, b.col_2, … b.col_n
from a, b
where a.key = b.key
  and b.col_1 in (1,2,4...)
  and b.col_4 in (....);
</codeblock>

      <p>
        And suppose table <codeph>B</codeph> is smaller than table <codeph>A</codeph> (but still a large table).
      </p>

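      <p>
        As noted above, you would collect statistics on both joined tables first. A minimal example, using the
        table names from this query:
      </p>

      <codeblock>compute stats a;
compute stats b;
</codeblock>
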
      <p>
        The memory requirement for the query is that the size of the right-hand table (<codeph>B</codeph>), after
        decompression, filtering (<codeph>b.col_n in ...</codeph>), and projection (keeping only the referenced
        columns), must be less than the total memory of the entire cluster.
      </p>

      <codeblock>Cluster Total Memory Requirement = Size of the smaller table *
                                   selectivity factor from the predicate *
                                   projection factor * compression ratio
</codeblock>

      <p>
        In this case, assume that table <codeph>B</codeph> is 100 TB in Parquet format with 200 columns. The
        predicate on <codeph>B</codeph> (<codeph>b.col_1 in ... and b.col_4 in ...</codeph>) selects only 10% of
        the rows from <codeph>B</codeph>, and the projection uses only 5 columns out of the 200. Snappy compression
        typically gives about 3 times compression, so we use a factor of 3 for the expansion of the data when it is
        decompressed in memory.
      </p>

      <codeblock>Cluster Total Memory Requirement = Size of the smaller table *
                                   selectivity factor from the predicate *
                                   projection factor * compression ratio
                                 = 100 TB * 10% * 5/200 * 3
                                 = 0.75 TB
                                 = 750 GB
</codeblock>

      <p>
        So, if you have a 10-node cluster where each node has 128 GB of RAM, and you give 80% of that memory to
        Impala, you have about 1 TB of usable memory for Impala, which is more than 750 GB. Therefore, your
        cluster can handle join queries of this magnitude.
      </p>
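
      <p>
        As a quick worked check of that capacity figure:
      </p>

      <codeblock>Usable memory for Impala = 10 nodes * 128 GB * 80% = 1024 GB (about 1 TB), which is greater than 750 GB
</codeblock>
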
    </section>
  </conbody>
</concept>