impala/docs/topics/impala_prereqs.xml

<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="prereqs">

  <title>Impala Requirements</title>

  <titlealts audience="PDF">

    <navtitle>Requirements</navtitle>

  </titlealts>

  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Requirements"/>
      <data name="Category" value="Planning"/>
      <data name="Category" value="Installing"/>
      <data name="Category" value="Upgrading"/>
      <data name="Category" value="Administrators"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Data Analysts"/>
<!-- Another instance of a topic pulled into the map twice, resulting in a second HTML page with a *1.html filename. -->
      <data name="Category" value="Duplicate Topics"/>
<!-- Using a separate category, 'Multimap', to flag those pages that are duplicate because of multiple DITA map references. -->
      <data name="Category" value="Multimap"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      To perform as expected, Impala depends on the availability of the software, hardware, and
      configurations described in the following sections.
    </p>

    <p outputclass="toc inpage"/>

  </conbody>

  <concept id="prereqs_os">

    <title>Supported Operating Systems</title>

    <conbody>

      <p>
        Apache Impala runs on Linux systems only. See the <filepath>README.md</filepath> file
        for more information.
      </p>

    </conbody>

  </concept>

  <concept id="prereqs_hive">

    <title>Hive Metastore and Related Configuration</title>

    <prolog>
      <metadata>
        <data name="Category" value="Metastore"/>
        <data name="Category" value="Hive"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        Impala can interoperate with data stored in Hive, and uses the same infrastructure as
        Hive for tracking metadata about schema objects such as tables and columns. The
        following components are prerequisites for Impala:
      </p>

      <ul>
        <li>
          MySQL or PostgreSQL, to act as a metastore database for both Impala and Hive.
          <p>
            Always configure a <b>Hive metastore service</b> rather than connecting directly to
            the metastore database. The Hive metastore service is required to interoperate
            between different levels of metastore APIs if this is necessary for your
            environment, and using it avoids known issues with connecting directly to the
            metastore database.
          </p>
          <p>
            See below for a summary of the metastore installation process.
          </p>
        </li>

        <li>
          Hive (optional). Although only the Hive metastore database is required for Impala to
          function, you might install Hive on some client machines to create and load data into
          tables that use certain file formats. See
          <xref href="impala_file_formats.xml#file_formats"/> for details. Hive does not need to
          be installed on the same DataNodes as Impala; it just needs access to the same
          metastore database.
        </li>
      </ul>
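
      <p>
        As an illustration of the first point, clients typically reach a Hive metastore service
        through the <codeph>hive.metastore.uris</codeph> property in
        <codeph>hive-site.xml</codeph>, rather than through JDBC settings that point at the
        database itself. The following fragment is a minimal sketch, not a required
        configuration: the host name <codeph>metastore-host.example.com</codeph> is a
        placeholder, and port 9083 is assumed here because it is the usual default for the
        metastore service, which your environment might override.
      </p>

<codeblock>&lt;!-- Hypothetical host name; substitute the machine that runs your metastore service. --&gt;
&lt;property&gt;
  &lt;name&gt;hive.metastore.uris&lt;/name&gt;
  &lt;value&gt;thrift://metastore-host.example.com:9083&lt;/value&gt;
&lt;/property&gt;</codeblock>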

      <p>
        To install the metastore:
      </p>

      <ol>
        <li>
          Install a MySQL or PostgreSQL database. Start the database if it is not started after
          installation.
        </li>

        <li>
          Download the
          <xref href="http://www.mysql.com/products/connector/"
            scope="external" format="html">MySQL
          connector</xref> or the
          <xref
            href="http://jdbc.postgresql.org/download.html" scope="external"
            format="html">PostgreSQL
          connector</xref> and place it in the <codeph>/usr/share/java/</codeph> directory.
        </li>

        <li>
          Use the appropriate command line tool for your database to create the metastore
          database.
        </li>

        <li>
          Use the appropriate command line tool for your database to grant privileges for the
          metastore database to the <codeph>hive</codeph> user.
        </li>

        <li>
          Modify <codeph>hive-site.xml</codeph> to include information matching your particular
          database: its URL, username, and password. You will copy the
          <codeph>hive-site.xml</codeph> file to the Impala Configuration Directory later in the
          Impala installation process.
        </li>
      </ol>
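
      <p>
        For the last step, the database settings in <codeph>hive-site.xml</codeph> might
        resemble the following fragment for a MySQL-backed metastore. Treat it as a sketch under
        stated assumptions: the host name, the database name <codeph>metastore</codeph>, and the
        password are placeholders, port 3306 is the MySQL default, and the driver class name
        depends on the connector version you downloaded. For PostgreSQL, the URL and driver
        class differ accordingly.
      </p>

<codeblock>&lt;!-- Placeholder values throughout; replace them with the details of your own database. --&gt;
&lt;property&gt;
  &lt;name&gt;javax.jdo.option.ConnectionURL&lt;/name&gt;
  &lt;value&gt;jdbc:mysql://db-host.example.com:3306/metastore&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;!-- Newer MySQL connector releases may use a different driver class name. --&gt;
  &lt;name&gt;javax.jdo.option.ConnectionDriverName&lt;/name&gt;
  &lt;value&gt;com.mysql.jdbc.Driver&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;javax.jdo.option.ConnectionUserName&lt;/name&gt;
  &lt;value&gt;hive&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;javax.jdo.option.ConnectionPassword&lt;/name&gt;
  &lt;value&gt;example_password&lt;/value&gt;
&lt;/property&gt;</codeblock>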

    </conbody>

  </concept>

  <concept id="prereqs_java">

    <title>Java Dependencies</title>

    <prolog>
      <metadata>
        <data name="Category" value="Java"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        Although Impala is written primarily in C++, it uses Java to communicate with various
        Hadoop components. Keep the following points in mind:
      </p>

      <ul>
        <li>
          The officially supported JVM for Impala is the Oracle JVM. Other JVMs might cause
          issues, typically resulting in a failure at <cmdname>impalad</cmdname> startup. In
          particular, the JamVM used by default on certain versions of Ubuntu can cause
          <cmdname>impalad</cmdname> to fail to start.
        </li>

        <li>
          Internally, the <cmdname>impalad</cmdname> daemon relies on the
          <codeph>JAVA_HOME</codeph> environment variable to locate the system Java libraries.
          Make sure the <cmdname>impalad</cmdname> service is not run from an environment with
          an incorrect setting for this variable.
        </li>

        <li>
          All Java dependencies are packaged in the <codeph>impala-dependencies.jar</codeph>
          file, which is located in <codeph>/usr/lib/impala/lib/</codeph>. The file contains
          everything that is built under <codeph>fe/target/dependency</codeph>.
        </li>
      </ul>

    </conbody>

  </concept>

  <concept id="prereqs_network">

    <title>Networking Configuration Requirements</title>

    <prolog>
      <metadata>
        <data name="Category" value="Network"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        As part of ensuring best performance, Impala attempts to complete tasks on local data,
        as opposed to using network connections to work with remote data. To support this goal,
        Impala matches the <b>hostname</b> provided to each Impala daemon with the <b>IP
        address</b> of each DataNode by resolving the hostname flag to an IP address. For
        Impala to work with local data, use a single IP interface for the DataNode and the
        Impala daemon on each machine. Ensure that the Impala daemon's hostname flag resolves to
        the IP address of the DataNode. For single-homed machines, this is usually automatic,
        but for multi-homed machines, ensure that the Impala daemon's hostname resolves to the
        correct interface. Impala tries to detect the correct hostname at start-up, and prints
        the derived hostname at the start of the log in a message of the form:
      </p>

<codeblock>Using hostname: impala-daemon-1.example.com</codeblock>

      <p>
        In the majority of cases, this automatic detection works correctly. If you need to
        explicitly set the hostname, do so by setting the <codeph>--hostname</codeph> flag.
      </p>

    </conbody>

  </concept>

  <concept id="prereqs_hardware">

    <title>Hardware Requirements</title>

    <conbody>

      <p>
        The memory allocation should be consistent across Impala executor nodes. A single Impala
        executor with a lower memory limit than the rest can easily become a bottleneck and lead
        to suboptimal performance.
      </p>

      <p>
        This guideline does not apply to coordinator-only nodes.
      </p>

    </conbody>

    <concept id="concept_dqy_n1w_zdb">

      <title>Hardware Requirements for Optimal Join Performance</title>

      <conbody>

        <p>
          During join operations, portions of data from each joined table are loaded into
          memory. Data sets can be very large, so ensure your hardware has sufficient memory to
          accommodate the joins you anticipate completing.
        </p>

        <p>
          While requirements vary according to data set size, the following is generally
          recommended:
        </p>

        <ul>
          <li rev="2.0.0">
            CPU
            <p>
              Impala version 2.2 and higher uses the SSSE3 instruction set, which is included in
              newer processors.
            </p>

            <note>
              This required level of processor is the same as in Impala version 1.x. The Impala
              2.0 and 2.1 releases had a stricter requirement for the SSE4.1 instruction set,
              which has now been relaxed.
            </note>
<!--
          For best performance use:
          <ul>
            <li>
              Intel - Nehalem (released 2008) or later processors.
            </li>

            <li>
              AMD - Bulldozer (released 2011) or later processors.
            </li>
          </ul>
-->
          </li>

          <li rev="1.2">
            Memory
            <p>
              128 GB or more recommended, ideally 256 GB or more. If the intermediate results
              during query processing on a particular node exceed the amount of memory available
              to Impala on that node, the query writes temporary work data to disk, which can
              lead to long query times. Note that because the work is parallelized, and
              intermediate results for aggregate queries are typically smaller than the original
              data, Impala can query and join tables that are much larger than the memory
              available on an individual node.
            </p>
          </li>

          <li>
            JVM Heap Size for Catalog Server
            <p>
              4 GB or more recommended, ideally 8 GB or more, to accommodate the maximum numbers
              of tables, partitions, and data files you are planning to use with Impala.
            </p>
          </li>

          <li>
            Storage
            <p>
              DataNodes with 12 or more disks each. Disk I/O speed is often the limiting factor
              for Impala query performance. Ensure that you have sufficient disk space to store
              the data that Impala will query.
            </p>
          </li>
        </ul>

      </conbody>

    </concept>

  </concept>

  <concept id="prereqs_account">

    <title>User Account Requirements</title>

    <prolog>
      <metadata>
        <data name="Category" value="Users"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        Impala creates and uses a user and group named <codeph>impala</codeph>. Do not delete
        this account or group, and do not modify the permissions and rights of either. Ensure
        that no existing systems obstruct the functioning of this account and group. For
        example, if you have scripts that delete user accounts that are not on an allowlist, add
        the <codeph>impala</codeph> account to the list of permitted accounts.
      </p>

      <p>
        For correct file deletion during <codeph>DROP TABLE</codeph> operations, Impala must be
        able to move files to the HDFS trashcan. You might need to create an HDFS directory
        <filepath>/user/impala</filepath>, writable by the <codeph>impala</codeph> user, so
        that the trashcan can be created. Otherwise, data files might remain behind after a
        <codeph>DROP TABLE</codeph> statement.
      </p>

      <p>
        Impala should not run as root. Best Impala performance is achieved using direct reads,
        but root is not permitted to use direct reads. Therefore, running Impala as root
        negatively affects performance.
      </p>

      <p>
        By default, any user can connect to Impala and access all the associated databases and
        tables. You can enable authorization and authentication based on the Linux OS user who
        connects to the Impala server, and the associated groups for that user. See
        <xref href="impala_security.xml#security"/> for details. These security features do not
        change the underlying file permission requirements; the <codeph>impala</codeph> user
        still needs to be able to access the data files.

    </conbody>

  </concept>

</concept>