mirror of
https://github.com/apache/impala.git
synced 2025-12-30 03:01:44 -05:00
For this change to land in master, the audience="hidden" code review needs to be completed first. Otherwise, the doc build would still work but the audience="hidden" content would be visible rather than hidden as desired. Some work happening in parallel might introduce additional instances of audience="Cloudera". I suggest addressing those in a followup CR so this global change can land quickly. Since the changes apply across so many different files, but are so narrow in scope, I suggest that the way to validate (check that no extraneous changes were introduced accidentally) is to diff just the changed lines: git diff -U0 HEAD^ HEAD In patch set 2, I updated other topics marked audience="Cloudera" by CRs that were pushed in the meantime. Change-Id: Ic93d89da77e1f51bbf548a522d98d0c4e2fb31c8 Reviewed-on: http://gerrit.cloudera.org:8080/5613 Reviewed-by: John Russell <jrussell@cloudera.com> Tested-by: Impala Public Jenkins
376 lines
16 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="prereqs">

<title>Impala Requirements</title>
<titlealts audience="PDF"><navtitle>Requirements</navtitle></titlealts>

<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Requirements"/>
<data name="Category" value="Planning"/>
<data name="Category" value="Installing"/>
<data name="Category" value="Upgrading"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<!-- Another instance of a topic pulled into the map twice, resulting in a second HTML page with a *1.html filename. -->
<data name="Category" value="Duplicate Topics"/>
<!-- Using a separate category, 'Multimap', to flag those pages that are duplicate because of multiple DITA map references. -->
<data name="Category" value="Multimap"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="hidden">prerequisites</indexterm>
<indexterm audience="hidden">requirements</indexterm>
To perform as expected, Impala depends on the availability of the software, hardware, and configurations
described in the following sections.
</p>

<p outputclass="toc inpage"/>
</conbody>

<concept id="product_compatibility_matrix">

<title>Product Compatibility Matrix</title>

<conbody>

<p> The authoritative source for compatibility between
versions of CDH, Cloudera Manager, and the individual CDH components is the <ph
audience="integrated"><xref
href="rn_consolidated_pcm.xml"
>Product Compatibility Matrix for CDH and Cloudera
Manager</xref></ph><ph audience="standalone">online <xref
href="http://www.cloudera.com/documentation/enterprise/latest/topics/rn_consolidated_pcm.html"
format="html" scope="external">Product Compatibility
Matrix</xref></ph>. </p>

<p>
For Impala, see the
<xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/pcm_impala.html" scope="external" format="html">Impala
compatibility matrix page</xref>.
</p>
</conbody>
</concept>

<concept id="prereqs_os">

<title>Supported Operating Systems</title>

<conbody>

<p>
<indexterm audience="hidden">software requirements</indexterm>
<indexterm audience="hidden">Red Hat Enterprise Linux</indexterm>
<indexterm audience="hidden">RHEL</indexterm>
<indexterm audience="hidden">CentOS</indexterm>
<indexterm audience="hidden">SLES</indexterm>
<indexterm audience="hidden">Ubuntu</indexterm>
<indexterm audience="hidden">SUSE</indexterm>
<indexterm audience="hidden">Debian</indexterm> The supported operating systems
and versions for Impala are the same as for the corresponding CDH 5 platforms. For
details, see the <cite>Supported Operating Systems</cite> page for
<ph audience="integrated"><xref href="rn_consolidated_pcm.xml#cdh_cm_supported_os">CDH
5</xref></ph><ph audience="standalone"><xref
href="http://www.cloudera.com/documentation/enterprise/latest/topics/rn_consolidated_pcm.html#cdh_cm_supported_os"
scope="external" format="html">CDH 5</xref></ph>. </p>
</conbody>
</concept>

<concept id="prereqs_hive">

<title>Hive Metastore and Related Configuration</title>
<prolog>
<metadata>
<data name="Category" value="Metastore"/>
<data name="Category" value="Hive"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="hidden">Hive</indexterm>
<indexterm audience="hidden">MySQL</indexterm>
<indexterm audience="hidden">PostgreSQL</indexterm>
Impala can interoperate with data stored in Hive, and uses the same infrastructure as Hive for tracking
metadata about schema objects such as tables and columns. The following components are prerequisites for
Impala:
</p>

<ul>
<li>
MySQL or PostgreSQL, to act as a metastore database for both Impala and Hive.
<note>
<p>
Installing and configuring a Hive metastore is an Impala requirement. Impala does not work without
the metastore database. For the process of installing and configuring the metastore, see
<xref href="impala_install.xml#install"/>.
</p>
<p>
Always configure a <b>Hive metastore service</b> rather than connecting directly to the metastore
database. The Hive metastore service is required to interoperate between possibly different levels of
metastore APIs used by CDH and Impala, and avoids known issues with connecting directly to the
metastore database. The Hive metastore service is set up for you by default if you install through
Cloudera Manager 4.5 or higher.
</p>
<p>
A summary of the metastore installation process is as follows:
</p>
<ul>
<li>
Install a MySQL or PostgreSQL database. Start the database if it is not started after installation.
</li>

<li>
Download the
<xref href="http://www.mysql.com/products/connector/" scope="external" format="html">MySQL
connector</xref> or the
<xref href="http://jdbc.postgresql.org/download.html" scope="external" format="html">PostgreSQL
connector</xref> and place it in the <codeph>/usr/share/java/</codeph> directory.
</li>

<li>
Use the appropriate command line tool for your database to create the metastore database.
</li>

<li>
Use the appropriate command line tool for your database to grant privileges for the metastore
database to the <codeph>hive</codeph> user.
</li>

<li>
Modify <codeph>hive-site.xml</codeph> to include information matching your particular database: its
URL, username, and password. You will copy the <codeph>hive-site.xml</codeph> file to the Impala
Configuration Directory later in the Impala installation process.
</li>
</ul>
</note>
</li>

<li>
<b>Optional:</b> Hive. Although only the Hive metastore database is required for Impala to function, you
might install Hive on some client machines to create and load data into tables that use certain file
formats. See <xref href="impala_file_formats.xml#file_formats"/> for details. Hive does not need to be
installed on the same DataNodes as Impala; it just needs access to the same metastore database.
</li>
</ul>
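<p>
The database steps in the installation summary above can be sketched for MySQL as follows. This is an
illustrative outline under stated assumptions, not a definitive procedure: the database name
<codeph>metastore</codeph>, the host <codeph>localhost</codeph>, and the placeholder password
<codeph>CHANGE_ME</codeph> are hypothetical, and the other steps in the summary are not shown.
</p>

<codeblock># Hypothetical names: database "metastore", password "CHANGE_ME".
# Create the metastore database and grant privileges on it to the hive user.
mysql -u root -p -e "
  CREATE DATABASE metastore;
  CREATE USER 'hive'@'localhost' IDENTIFIED BY 'CHANGE_ME';
  GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'localhost';"</codeblock>

<p>
The URL, username, and password that you put in <codeph>hive-site.xml</codeph> must match whatever
values you actually choose here.
</p>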
</conbody>
</concept>

<concept id="prereqs_java">

<title>Java Dependencies</title>
<prolog>
<metadata>
<data name="Category" value="Java"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="hidden">Java</indexterm>
<indexterm audience="hidden">impala-dependencies.jar</indexterm>
Although Impala is primarily written in C++, it does use Java to communicate with various Hadoop
components:
</p>

<ul>
<li>
The officially supported JVM for Impala is the Oracle JVM. Other JVMs might cause issues, typically
resulting in a failure at <cmdname>impalad</cmdname> startup. In particular, the JamVM used by default on
certain Ubuntu releases can cause <cmdname>impalad</cmdname> to fail to start.
<!-- To do:
Could say something here about JDK 6 vs. JDK 7 in CDH 5. Since we didn't specify the JDK version before,
don't know the impact from the user perspective so not calling it out at the moment.
-->
</li>

<li>
Internally, the <cmdname>impalad</cmdname> daemon relies on the <codeph>JAVA_HOME</codeph> environment
variable to locate the system Java libraries. Make sure the <cmdname>impalad</cmdname> service is not run
from an environment with an incorrect setting for this variable.
</li>

<li>
All Java dependencies are packaged in the <codeph>impala-dependencies.jar</codeph> file, which is located
at <codeph>/usr/lib/impala/lib/</codeph>. These map to everything that is built under
<codeph>fe/target/dependency</codeph>.
</li>
</ul>
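<p>
The points above can be spot-checked from a shell on an Impala host, as in the following sketch. It
assumes that the shell environment resembles the one the <cmdname>impalad</cmdname> service starts
from, which is not guaranteed. If the service is launched by an init script or a management tool, check
the environment that tool provides instead.
</p>

<codeblock># Show the JAVA_HOME value that a process started from this shell would inherit.
echo "JAVA_HOME=${JAVA_HOME:-(not set)}"

# Print the vendor and version of the JVM found under JAVA_HOME.
"${JAVA_HOME}/bin/java" -version

# Confirm that the packaged Java dependencies are present.
ls -l /usr/lib/impala/lib/impala-dependencies.jar</codeblock>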
</conbody>
</concept>

<concept id="prereqs_network">

<title>Networking Configuration Requirements</title>
<prolog>
<metadata>
<data name="Category" value="Network"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="hidden">network configuration</indexterm>
As part of ensuring best performance, Impala attempts to complete tasks on local data, as opposed to using
network connections to work with remote data. To support this goal, Impala matches
the <b>hostname</b> provided to each Impala daemon with the <b>IP address</b> of each DataNode by
resolving the hostname flag to an IP address. For Impala to work with local data, use a single IP interface
for the DataNode and the Impala daemon on each machine. Ensure that the Impala daemon's hostname flag
resolves to the IP address of the DataNode. For single-homed machines, this is usually automatic, but for
multi-homed machines, ensure that the Impala daemon's hostname resolves to the correct interface. Impala
tries to detect the correct hostname at start-up, and prints the derived hostname at the start of the log
in a message of the form:
</p>

<codeblock>Using hostname: impala-daemon-1.example.com</codeblock>

<p>
In the majority of cases, this automatic detection works correctly. If you need to explicitly set the
hostname, do so by setting the <codeph>--hostname</codeph> flag.
</p>
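<p>
One way to check hostname resolution on a particular machine is sketched below. The log file location
<filepath>/var/log/impala/impalad.INFO</filepath> is an assumption and might differ on your system.
Compare the address printed by the second command with the address the DataNode on the same machine
uses.
</p>

<codeblock># Fully qualified hostname of this machine.
hostname -f

# IP address that the hostname resolves to.
getent hosts "$(hostname -f)"

# Hostname that impalad derived at startup (log path is an assumption).
grep "Using hostname" /var/log/impala/impalad.INFO</codeblock>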
</conbody>
</concept>

<concept id="prereqs_hardware">

<title>Hardware Requirements</title>

<conbody>

<p>
<indexterm audience="hidden">hardware requirements</indexterm>
<indexterm audience="hidden">capacity</indexterm>
<indexterm audience="hidden">RAM</indexterm>
<indexterm audience="hidden">memory</indexterm>
<indexterm audience="hidden">CPU</indexterm>
<indexterm audience="hidden">processor</indexterm>
<indexterm audience="hidden">Intel</indexterm>
<indexterm audience="hidden">AMD</indexterm>
During join operations, portions of data from each joined table are loaded into memory. Data sets can be
very large, so ensure your hardware has sufficient memory to accommodate the joins you anticipate
completing.
</p>

<p>
While requirements vary according to data set size, the following is generally recommended:
</p>

<ul>
<li rev="2.0.0">
CPU - Impala version 2.2 and higher uses the SSSE3 instruction set, which is included in newer processors.
<note>
This required level of processor is the same as in Impala version 1.x. The Impala 2.0 and 2.1 releases
had a stricter requirement for the SSE4.1 instruction set, which has now been relaxed.
</note>
<!--
For best performance use:
<ul>
<li>
Intel - Nehalem (released 2008) or later processors.
</li>

<li>
AMD - Bulldozer (released 2011) or later processors.
</li>
</ul>
-->
</li>

<li rev="1.2">
Memory - 128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query
processing on a particular node exceed the amount of memory available to Impala on that node, the query
writes temporary work data to disk, which can lead to long query times. Note that because the work is
parallelized, and intermediate results for aggregate queries are typically smaller than the original
data, Impala can query and join tables that are much larger than the memory available on an individual
node.
</li>

<li>
Storage - DataNodes with 12 or more disks each. I/O speeds are often the limiting factor for disk
performance with Impala. Ensure that you have sufficient disk space to store the data Impala will be
querying.
</li>
</ul>
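<p>
The following sketch shows one way to inspect a Linux host against the recommendations above. It uses
only standard Linux utilities and is meant as a quick sanity check, not as a sizing tool.
</p>

<codeblock># CPU: look for the ssse3 flag that Impala 2.2 and higher relies on.
if grep -q -w ssse3 /proc/cpuinfo; then
  echo "SSSE3 supported"
else
  echo "SSSE3 not reported by this CPU"
fi

# Memory: total and available RAM in gigabytes.
free -g

# Storage: list whole disks (not partitions) and their sizes.
lsblk -d -o NAME,SIZE,TYPE</codeblock>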
</conbody>
</concept>

<concept id="prereqs_account">

<title>User Account Requirements</title>
<prolog>
<metadata>
<data name="Category" value="Users"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="hidden">impala user</indexterm>
<indexterm audience="hidden">impala group</indexterm>
<indexterm audience="hidden">root user</indexterm>
Impala creates and uses a user and group named <codeph>impala</codeph>. Do not delete this account or group
and do not modify the account's or group's permissions and rights. Ensure no existing systems obstruct the
functioning of these accounts and groups. For example, if you have scripts that delete user accounts not in
a whitelist, add these accounts to the list of permitted accounts.
</p>

<!-- Taking out because no longer applicable in CDH 5.5 and up. -->
<p id="impala_hdfs_group" rev="1.2" audience="hidden">
For the resource management feature to work (in combination with CDH 5 and the YARN and Llama components),
the <codeph>impala</codeph> user must be a member of the <codeph>hdfs</codeph> group. This setup is
performed automatically during a new install, but not when upgrading from earlier Impala releases to Impala
1.2. If you are upgrading a node to CDH 5 that already had Impala 1.1 or 1.0 installed, manually add the
<codeph>impala</codeph> user to the <codeph>hdfs</codeph> group.
</p>

<p>
For correct file deletion during <codeph>DROP TABLE</codeph> operations, Impala must be able to move files
to the HDFS trashcan. You might need to create an HDFS directory <filepath>/user/impala</filepath>,
writable by the <codeph>impala</codeph> user, so that the trashcan can be created. Otherwise, data files
might remain behind after a <codeph>DROP TABLE</codeph> statement.
</p>
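<p>
Creating that directory might look like the following sketch. It assumes that the HDFS superuser is the
<codeph>hdfs</codeph> account and that you can use <cmdname>sudo</cmdname> to run commands as that
account. Adjust both to match your cluster.
</p>

<codeblock># Run as the HDFS superuser (assumed here to be the "hdfs" account).
sudo -u hdfs hdfs dfs -mkdir -p /user/impala
sudo -u hdfs hdfs dfs -chown impala:impala /user/impala

# Verify the owner and permissions of the new directory.
sudo -u hdfs hdfs dfs -ls -d /user/impala</codeblock>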

<p>
Impala should not run as root. Best Impala performance is achieved using direct reads, but root is not
permitted to use direct reads. Therefore, running Impala as root negatively affects performance.
</p>

<p>
By default, any user can connect to Impala and access all the associated databases and tables. You can
enable authorization and authentication based on the Linux OS user who connects to the Impala server, and
the associated groups for that user. See <xref href="impala_security.xml#security"/> for details. These
security features do not change the underlying file permission requirements; the <codeph>impala</codeph>
user still needs to be able to access the data files.
</p>
</conbody>
</concept>
</concept>