<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="schema_design">

  <title>Guidelines for Designing Impala Schemas</title>
  <titlealts audience="PDF"><navtitle>Designing Schemas</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Planning"/>
      <data name="Category" value="Sectionated Pages"/>
      <data name="Category" value="Proof of Concept"/>
      <data name="Category" value="Checklists"/>
      <data name="Category" value="Guidelines"/>
      <data name="Category" value="Best Practices"/>
      <data name="Category" value="Performance"/>
      <data name="Category" value="Compression"/>
      <data name="Category" value="Tables"/>
      <data name="Category" value="Schemas"/>
      <data name="Category" value="SQL"/>
      <data name="Category" value="Porting"/>
      <data name="Category" value="Administrators"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Data Analysts"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      The guidelines in this topic help you construct an optimized and scalable schema that
      integrates well with your existing data management processes. Use these guidelines as a
      checklist when doing any proof-of-concept work or porting exercise, and before deploying
      to production.
    </p>

    <p>
      If you are adapting an existing database or Hive schema for use with Impala, read the
      guidelines in this section and then see <xref href="impala_porting.xml#porting"/> for
      specific porting and compatibility tips.
    </p>

    <p outputclass="toc inpage"/>

    <section id="schema_design_text_vs_binary">

      <title>Prefer binary file formats over text-based formats.</title>

      <p>
        To save space and improve memory usage and query performance, use binary file formats
        for any large or intensively queried tables. The Parquet file format is the most
        efficient for data warehouse-style analytic queries. Avro is the other binary file
        format that Impala supports; you might already use it as part of a Hadoop ETL pipeline.
      </p>

      <p>
        Although Impala can create and query tables that use the RCFile and SequenceFile file
        formats, such tables are relatively bulky and are not optimized for data
        warehouse-style analytic queries. Impala also does not support <codeph>INSERT</codeph>
        operations for tables with these file formats.
      </p>

      <p>
        Guidelines:
      </p>

      <ul>
        <li>
          For an efficient and scalable format for large, performance-critical tables, use the
          Parquet file format.
        </li>

        <li>
          To deliver intermediate data during the ETL process, in a format that can also be
          used by other Hadoop components, Avro is a reasonable choice.
        </li>

        <li>
          For convenient import of raw data, use a text table instead of RCFile or
          SequenceFile, and convert to Parquet in a later stage of the ETL process, as sketched
          after this list.
        </li>
      </ul>
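
      <p>
        As a minimal sketch of that last guideline (the table and column names here are
        hypothetical, not from any real schema), the conversion from a raw text staging table
        to Parquet can be a single statement:
      </p>

<codeblock>-- Staging table over raw comma-delimited text files loaded by the ETL pipeline.
CREATE EXTERNAL TABLE raw_events (event_ts STRING, user_id BIGINT, detail STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/staging/raw_events';

-- Convert to Parquet in one step; analytic queries then go against EVENTS_PARQUET.
CREATE TABLE events_parquet STORED AS PARQUET AS SELECT * FROM raw_events;</codeblock>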
    </section>

    <section id="schema_design_compression">

      <title>Use Snappy compression where practical.</title>

      <p>
        Snappy compression involves low CPU overhead to decompress, while still providing
        substantial space savings. In cases where you have a choice of compression codecs, such
        as with the Parquet and Avro file formats, use Snappy compression unless you find a
        compelling reason to use a different codec.
      </p>
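
      <p>
        For example, here is a sketch of choosing the codec for Parquet files that Impala
        writes (the table names are hypothetical). The <codeph>COMPRESSION_CODEC</codeph> query
        option already defaults to Snappy; it is shown explicitly for clarity:
      </p>

<codeblock>set COMPRESSION_CODEC=snappy;  -- the default; other values include gzip and none
INSERT INTO events_parquet SELECT * FROM raw_events;</codeblock>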
    </section>

    <section id="schema_design_numeric_types">

      <title>Prefer numeric types over strings.</title>

      <p>
        If you have numeric values that you could treat as either strings or numbers (such as
        <codeph>YEAR</codeph>, <codeph>MONTH</codeph>, and <codeph>DAY</codeph> for partition
        key columns), define them as the smallest applicable integer types. For example,
        <codeph>YEAR</codeph> can be <codeph>SMALLINT</codeph>, while <codeph>MONTH</codeph>
        and <codeph>DAY</codeph> can be <codeph>TINYINT</codeph>. Although you might not see
        any difference in the way partitioned tables or text files are laid out on disk, using
        numeric types saves space in binary formats such as Parquet, and in memory when doing
        queries, particularly resource-intensive queries such as joins.
      </p>
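
      <p>
        A small sketch of this guideline (the table is hypothetical), using the smallest
        integer types that can hold each partition key value:
      </p>

<codeblock>CREATE TABLE logs (msg STRING)
  PARTITIONED BY (year SMALLINT, month TINYINT, day TINYINT)
  STORED AS PARQUET;</codeblock>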
    </section>

    <!-- Alan suggests not making this recommendation.
    <section id="schema_design_decimal">
      <title>Prefer DECIMAL types over FLOAT and DOUBLE.</title>
      <p>
      </p>
    </section>
    -->
<section id="schema_design_partitioning">
|
|
|
|
<title>Partition, but do not over-partition.</title>
|
|
|
|
<p>
|
|
Partitioning is an important aspect of performance tuning for Impala. Follow the procedures in
|
|
<xref href="impala_partitioning.xml#partitioning"/> to set up partitioning for your biggest, most
|
|
intensively queried tables.
|
|
</p>
|
|
|
|
<p>
|
|
If you are moving to Impala from a traditional database system, or just getting started in the Big Data
|
|
field, you might not have enough data volume to take advantage of Impala parallel queries with your
|
|
existing partitioning scheme. For example, if you have only a few tens of megabytes of data per day,
|
|
partitioning by <codeph>YEAR</codeph>, <codeph>MONTH</codeph>, and <codeph>DAY</codeph> columns might be
|
|
too granular. Most of your cluster might be sitting idle during queries that target a single day, or each
|
|
node might have very little work to do. Consider reducing the number of partition key columns so that each
|
|
partition directory contains several gigabytes worth of data.
|
|
</p>
|
|
|
|
<p rev="parquet_block_size">
|
|
For example, consider a Parquet table where each data file is 1 HDFS block, with a maximum block size of 1
|
|
GB. (In Impala 2.0 and later, the default Parquet block size is reduced to 256 MB. For this exercise, let's
|
|
assume you have bumped the size back up to 1 GB by setting the query option
|
|
<codeph>PARQUET_FILE_SIZE=1g</codeph>.) if you have a 10-node cluster, you need 10 data files (up to 10 GB)
|
|
to give each node some work to do for a query. But each core on each machine can process a separate data
|
|
block in parallel. With 16-core machines on a 10-node cluster, a query could process up to 160 GB fully in
|
|
parallel. If there are only a few data files per partition, not only are most cluster nodes sitting idle
|
|
during queries, so are most cores on those machines.
|
|
</p>
|
|
|
|
<p>
|
|
You can reduce the Parquet block size to as low as 128 MB or 64 MB to increase the number of files per
|
|
partition and improve parallelism. But also consider reducing the level of partitioning so that analytic
|
|
queries have enough data to work with.
|
|
</p>
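
      <p>
        As a sketch of the tuning knob mentioned above (the table names are hypothetical), the
        Parquet block size for files written by an <codeph>INSERT</codeph> is controlled
        per-session in <cmdname>impala-shell</cmdname>:
      </p>

<codeblock>set PARQUET_FILE_SIZE=128m;  -- or 64m for more, smaller files; 1g to match the example above
INSERT OVERWRITE TABLE sales_parquet PARTITION (year, month)
  SELECT id, amount, year, month FROM sales_staging;</codeblock>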
    </section>

    <section id="schema_design_compute_stats">

      <title>Always compute stats after loading data.</title>

      <p>
        Impala makes extensive use of statistics about data in the overall table and in each
        column, to help plan resource-intensive operations such as join queries and inserting
        into partitioned Parquet tables. Because this information is only available after data
        is loaded, run the <codeph>COMPUTE STATS</codeph> statement on a table after loading or
        replacing data in a table or partition.
      </p>

      <p>
        Having accurate statistics can make the difference between a successful operation and
        one that fails due to an out-of-memory error or a timeout. When you encounter
        performance or capacity issues, always use the <codeph>SHOW TABLE STATS</codeph> and
        <codeph>SHOW COLUMN STATS</codeph> statements to check whether statistics are present
        and up-to-date for all tables in the query.
      </p>

      <p>
        When doing a join query, Impala consults the statistics for each joined table to
        determine their relative sizes and to estimate the number of rows produced in each join
        stage. When doing an <codeph>INSERT</codeph> into a Parquet table, Impala consults the
        statistics for the source table to determine how to distribute the work of constructing
        the data files for each partition.
      </p>

      <p>
        See <xref href="impala_compute_stats.xml#compute_stats"/> for the syntax of the
        <codeph>COMPUTE STATS</codeph> statement, and
        <xref href="impala_perf_stats.xml#perf_stats"/> for all the performance considerations
        for table and column statistics.
      </p>
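
      <p>
        A minimal sketch of the workflow, with a hypothetical table name:
      </p>

<codeblock>-- After loading or replacing data in the table or one of its partitions:
COMPUTE STATS sales;

-- Confirm that table and column statistics are present and current:
SHOW TABLE STATS sales;
SHOW COLUMN STATS sales;</codeblock>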
    </section>

    <section id="schema_design_explain">

      <title>Verify sensible execution plans with EXPLAIN and SUMMARY.</title>

      <p>
        Before executing a resource-intensive query, use the <codeph>EXPLAIN</codeph> statement
        to get an overview of how Impala intends to parallelize the query and distribute the
        work. If you see that the query plan is inefficient, you can take tuning steps such as
        changing file formats, using partitioned tables, running the <codeph>COMPUTE
        STATS</codeph> statement, or adding query hints. For information about all of these
        techniques, see <xref href="impala_performance.xml#performance"/>.
      </p>

      <p>
        After you run a query, you can see performance-related information about how it
        actually ran by issuing the <codeph>SUMMARY</codeph> command in
        <cmdname>impala-shell</cmdname>. Prior to Impala 1.4, you would use the
        <codeph>PROFILE</codeph> command, but its highly technical output was only useful for
        the most experienced users. <codeph>SUMMARY</codeph>, new in Impala 1.4, summarizes the
        most useful information for all stages of execution, aggregated across all nodes rather
        than splitting out figures for each node.
      </p>
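
      <p>
        For example (a sketch; the query and table names are hypothetical), in
        <cmdname>impala-shell</cmdname>:
      </p>

<codeblock>EXPLAIN SELECT c.country, SUM(o.total) FROM orders o JOIN customers c ON (o.cust_id = c.id) GROUP BY c.country;
-- Inspect the plan and apply any tuning steps, then run the query itself...
SELECT c.country, SUM(o.total) FROM orders o JOIN customers c ON (o.cust_id = c.id) GROUP BY c.country;
-- ...and review per-stage times, row counts, and memory usage:
SUMMARY;</codeblock>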
    </section>

    <!--
    <section id="schema_design_mem_limits">
      <title>Allocate resources between Impala and batch jobs (MapReduce, Hive, Pig).</title>
      <p>
      </p>
    </section>
    -->
  </conbody>
</concept>