mirror of
https://github.com/apache/impala.git
synced 2025-12-19 09:58:28 -05:00
Separated the memory requirement and the JVM heap requirement for catalogd. Change-Id: Ic4b0d60f0c8d426569386b12f92d8447b535c095 Reviewed-on: http://gerrit.cloudera.org:8080/12081 Reviewed-by: Alex Rodoni <arodoni@cloudera.com> Tested-by: Alex Rodoni <arodoni@cloudera.com>
387 lines
13 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="prereqs">

  <title>Impala Requirements</title>

  <titlealts audience="PDF">

    <navtitle>Requirements</navtitle>

  </titlealts>

  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Requirements"/>
      <data name="Category" value="Planning"/>
      <data name="Category" value="Installing"/>
      <data name="Category" value="Upgrading"/>
      <data name="Category" value="Administrators"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Data Analysts"/>
      <!-- Another instance of a topic pulled into the map twice, resulting in a second HTML page with a *1.html filename. -->
      <data name="Category" value="Duplicate Topics"/>
      <!-- Using a separate category, 'Multimap', to flag those pages that are duplicate because of multiple DITA map references. -->
      <data name="Category" value="Multimap"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      To perform as expected, Impala depends on the availability of the software, hardware, and
      configurations described in the following sections.
    </p>

    <p outputclass="toc inpage"/>

  </conbody>

  <concept id="prereqs_os">

    <title>Supported Operating Systems</title>

    <conbody>

      <p>
        Apache Impala runs on Linux systems only. See the <filepath>README.md</filepath> file
        for more information.
      </p>

    </conbody>

  </concept>

  <concept id="prereqs_hive">

    <title>Hive Metastore and Related Configuration</title>

    <prolog>
      <metadata>
        <data name="Category" value="Metastore"/>
        <data name="Category" value="Hive"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        Impala can interoperate with data stored in Hive, and uses the same infrastructure as
        Hive for tracking metadata about schema objects such as tables and columns. The
        following components are prerequisites for Impala:
      </p>

      <ul>
        <li>
          MySQL or PostgreSQL, to act as a metastore database for both Impala and Hive.
          <p>
            Always configure a <b>Hive metastore service</b> rather than connecting directly to
            the metastore database. The Hive metastore service is required to interoperate
            between different levels of metastore APIs if this is necessary for your
            environment, and using it avoids known issues with connecting directly to the
            metastore database.
          </p>
          <p>
            See below for a summary of the metastore installation process.
          </p>
        </li>

        <li>
          Hive (optional). Although only the Hive metastore database is required for Impala to
          function, you might install Hive on some client machines to create and load data into
          tables that use certain file formats. See
          <xref href="impala_file_formats.xml#file_formats"/> for details. Hive does not need to
          be installed on the same DataNodes as Impala; it just needs access to the same
          metastore database.
        </li>
      </ul>
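
      <p>
        As a rough illustration of using the metastore service rather than a direct database
        connection, a client-side <codeph>hive-site.xml</codeph> fragment might look like the
        following sketch. The host name <codeph>metastore-host.example.com</codeph> is a
        placeholder, and port 9083 is assumed here only because it is the port commonly used
        by the metastore service; substitute the values for your environment.
      </p>

<codeblock>&lt;!-- Hypothetical host name; the port is an assumption, not a value from this guide. --&gt;
&lt;property&gt;
  &lt;name&gt;hive.metastore.uris&lt;/name&gt;
  &lt;value&gt;thrift://metastore-host.example.com:9083&lt;/value&gt;
&lt;/property&gt;</codeblock>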

      <p>
        To install the metastore:
      </p>

      <ol>
        <li>
          Install a MySQL or PostgreSQL database. Start the database if it is not started after
          installation.
        </li>

        <li>
          Download the
          <xref href="http://www.mysql.com/products/connector/" scope="external" format="html">MySQL
          connector</xref> or the
          <xref href="http://jdbc.postgresql.org/download.html" scope="external" format="html">PostgreSQL
          connector</xref> and place it in the <codeph>/usr/share/java/</codeph> directory.
        </li>

        <li>
          Use the appropriate command line tool for your database to create the metastore
          database.
        </li>

        <li>
          Use the appropriate command line tool for your database to grant privileges for the
          metastore database to the <codeph>hive</codeph> user.
        </li>

        <li>
          Modify <codeph>hive-site.xml</codeph> to include information matching your particular
          database: its URL, username, and password. You will copy the
          <codeph>hive-site.xml</codeph> file to the Impala Configuration Directory later in the
          Impala installation process.
        </li>
      </ol>
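
      <p>
        The final step above can be sketched as follows, assuming a MySQL metastore database.
        The host name, the database name <codeph>metastore</codeph>, and the password are
        placeholders rather than values taken from this guide, and the driver class name
        depends on the connector version you downloaded. For PostgreSQL, the URL and driver
        class differ accordingly.
      </p>

<codeblock>&lt;!-- Illustrative values only: replace the host, database name, and credentials. --&gt;
&lt;property&gt;
  &lt;name&gt;javax.jdo.option.ConnectionURL&lt;/name&gt;
  &lt;value&gt;jdbc:mysql://db-host.example.com:3306/metastore&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;!-- Assumed driver class; newer MySQL connectors use a different class name. --&gt;
  &lt;name&gt;javax.jdo.option.ConnectionDriverName&lt;/name&gt;
  &lt;value&gt;com.mysql.jdbc.Driver&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;javax.jdo.option.ConnectionUserName&lt;/name&gt;
  &lt;value&gt;hive&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;javax.jdo.option.ConnectionPassword&lt;/name&gt;
  &lt;value&gt;example-password&lt;/value&gt;
&lt;/property&gt;</codeblock>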

    </conbody>

  </concept>

  <concept id="prereqs_java">

    <title>Java Dependencies</title>

    <prolog>
      <metadata>
        <data name="Category" value="Java"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        Although Impala is primarily written in C++, it does use Java to communicate with
        various Hadoop components:
      </p>

      <ul>
        <li>
          The officially supported JVM for Impala is the Oracle JVM. Other JVMs might cause
          issues, typically resulting in a failure at <cmdname>impalad</cmdname> startup. In
          particular, the JamVM used by default on certain levels of Ubuntu systems can cause
          <cmdname>impalad</cmdname> to fail to start.
        </li>

        <li>
          Internally, the <cmdname>impalad</cmdname> daemon relies on the
          <codeph>JAVA_HOME</codeph> environment variable to locate the system Java libraries.
          Make sure the <cmdname>impalad</cmdname> service is not run from an environment with
          an incorrect setting for this variable.
        </li>

        <li>
          All Java dependencies are packaged in the <codeph>impala-dependencies.jar</codeph>
          file, which is located at <codeph>/usr/lib/impala/lib/</codeph>. These map to
          everything that is built under <codeph>fe/target/dependency</codeph>.
        </li>
      </ul>

    </conbody>

  </concept>

  <concept id="prereqs_network">

    <title>Networking Configuration Requirements</title>

    <prolog>
      <metadata>
        <data name="Category" value="Network"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        As part of ensuring best performance, Impala attempts to complete tasks on local data,
        as opposed to using network connections to work with remote data. To support this goal,
        Impala matches the <b>hostname</b> provided to each Impala daemon with the <b>IP
        address</b> of each DataNode by resolving the hostname flag to an IP address. For
        Impala to work with local data, use a single IP interface for the DataNode and the
        Impala daemon on each machine. Ensure that the Impala daemon's hostname flag resolves to
        the IP address of the DataNode. For single-homed machines, this is usually automatic,
        but for multi-homed machines, ensure that the Impala daemon's hostname resolves to the
        correct interface. Impala tries to detect the correct hostname at startup, and prints
        the derived hostname at the start of the log in a message of the form:
      </p>

<codeblock>Using hostname: impala-daemon-1.example.com</codeblock>

      <p>
        In the majority of cases, this automatic detection works correctly. If you need to
        explicitly set the hostname, do so by setting the <codeph>--hostname</codeph> flag.
      </p>

    </conbody>

  </concept>

  <concept id="prereqs_hardware">

    <title>Hardware Requirements</title>

    <conbody>

      <p>
        The memory allocation should be consistent across Impala executor nodes. A single Impala
        executor with a lower memory limit than the rest can easily become a bottleneck and lead
        to suboptimal performance.
      </p>

      <p>
        This guideline does not apply to coordinator-only nodes.
      </p>

    </conbody>

    <concept id="concept_dqy_n1w_zdb">

      <title>Hardware Requirements for Optimal Join Performance</title>

      <conbody>

        <p>
          During join operations, portions of data from each joined table are loaded into
          memory. Data sets can be very large, so ensure your hardware has sufficient memory to
          accommodate the joins you anticipate completing.
        </p>

        <p>
          While requirements vary according to data set size, the following is generally
          recommended:
        </p>

        <ul>
          <li rev="2.0.0">
            CPU
            <p>
              Impala version 2.2 and higher uses the SSSE3 instruction set, which is included in
              newer processors.
            </p>

            <note>
              This required level of processor is the same as in Impala version 1.x. The Impala
              2.0 and 2.1 releases had a stricter requirement for the SSE4.1 instruction set,
              which has now been relaxed.
            </note>
<!--
        For best performance use:
        <ul>
          <li>
            Intel - Nehalem (released 2008) or later processors.
          </li>

          <li>
            AMD - Bulldozer (released 2011) or later processors.
          </li>
        </ul>
-->
          </li>

          <li rev="1.2">
            Memory
            <p>
              128 GB or more recommended, ideally 256 GB or more. If the intermediate results
              during query processing on a particular node exceed the amount of memory available
              to Impala on that node, the query writes temporary work data to disk, which can
              lead to long query times. Note that because the work is parallelized, and
              intermediate results for aggregate queries are typically smaller than the original
              data, Impala can query and join tables that are much larger than the memory
              available on an individual node.
            </p>
          </li>

          <li>
            JVM Heap Size for Catalog Server
            <p>
              4 GB or more recommended, ideally 8 GB or more, to accommodate the maximum numbers
              of tables, partitions, and data files you are planning to use with Impala.
            </p>
          </li>

          <li>
            Storage
            <p>
              DataNodes with 12 or more disks each. I/O speeds are often the limiting factor for
              disk performance with Impala. Ensure that you have sufficient disk space to store
              the data Impala will be querying.
            </p>
          </li>
        </ul>

      </conbody>

    </concept>

  </concept>

  <concept id="prereqs_account">

    <title>User Account Requirements</title>

    <prolog>
      <metadata>
        <data name="Category" value="Users"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        Impala creates and uses a user and group named <codeph>impala</codeph>. Do not delete
        this account or group and do not modify the account's or group's permissions and rights.
        Ensure no existing systems obstruct the functioning of these accounts and groups. For
        example, if you have scripts that delete user accounts that are not on an allowlist, add
        these accounts to the list of permitted accounts.
      </p>

      <p>
        For correct file deletion during <codeph>DROP TABLE</codeph> operations, Impala must be
        able to move files to the HDFS trashcan. You might need to create an HDFS directory
        <filepath>/user/impala</filepath>, writable by the <codeph>impala</codeph> user, so
        that the trashcan can be created. Otherwise, data files might remain behind after a
        <codeph>DROP TABLE</codeph> statement.
      </p>

      <p>
        Impala should not run as root. Best Impala performance is achieved using direct reads,
        but root is not permitted to use direct reads. Therefore, running Impala as root
        negatively affects performance.
      </p>

      <p>
        By default, any user can connect to Impala and access all the associated databases and
        tables. You can enable authorization and authentication based on the Linux OS user who
        connects to the Impala server, and the associated groups for that user. See
        <xref href="impala_security.xml#security"/> for details. These security features do not
        change the underlying file permission requirements; the <codeph>impala</codeph> user
        still needs to be able to access the data files.
      </p>

    </conbody>

  </concept>

</concept>