impala/docs/topics/impala_perf_stats.xml

<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="perf_stats">

  <title>Table and Column Statistics</title>

  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Performance"/>
      <data name="Category" value="Querying"/>
      <data name="Category" value="Concepts"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Data Analysts"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      Impala can do better optimization for complex or multi-table queries when it has access to
      statistics about the volume of data and how the values are distributed. Impala uses this
      information to help parallelize and distribute the work for a query. For example,
      optimizing join queries requires a way of determining if one table is <q>bigger</q> than
      another, which is a function of the number of rows and the average row size for each
      table. The following sections describe the categories of statistics Impala can work with,
      and how to produce them and keep them up to date.
    </p>

    <p outputclass="toc inpage all"/>

  </conbody>

  <concept id="perf_table_stats">

    <title id="table_stats">Overview of Table Statistics</title>

    <prolog>
      <metadata>
        <data name="Category" value="Concepts"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        The Impala query planner can make use of statistics about entire tables and partitions.
        This information includes physical characteristics such as the number of rows, number of
        data files, the total size of the data files, and the file format. For partitioned
        tables, the numbers are calculated per partition, and as totals for the whole table.
        This metadata is stored in the metastore database, and can be updated by either Impala
        or Hive. If a number is not available, the value -1 is used as a placeholder. Some
        numbers, such as number and total sizes of data files, are always kept up to date
        because they can be calculated cheaply, as part of gathering HDFS block metadata.
      </p>

      <p>
        The following example shows table stats for an unpartitioned Parquet table. The values
        for the number and sizes of files are always available. Initially, the number of rows is
        not known, because it requires a potentially expensive scan through the entire table,
        and so that value is displayed as -1. The <codeph>COMPUTE STATS</codeph> statement fills
        in any unknown table stats values.
      </p>

<codeblock>
show table stats parquet_snappy;
+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| #Rows | #Files | Size    | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                   |
+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| -1    | 96     | 23.35GB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://&lt;server>:8020/&lt;path>/parquet_snappy |
+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+

compute stats parquet_snappy;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 6 column(s). |
+-----------------------------------------+


show table stats parquet_snappy;
+------------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| #Rows      | #Files | Size    | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                   |
+------------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| 1000000000 | 96     | 23.35GB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://&lt;server>:8020/&lt;path>/parquet_snappy |
+------------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
</codeblock>

      <p>
        Impala performs some optimizations using this metadata on its own, and other
        optimizations by using a combination of table and column statistics.
      </p>

      <p rev="1.2.1">
        To check that table statistics are available for a table, and see the details of those
        statistics, use the statement <codeph>SHOW TABLE STATS
        <varname>table_name</varname></codeph>. See <xref href="impala_show.xml#show"/> for
        details.
      </p>

      <p>
        If you use the Hive-based methods of gathering statistics, see
        <xref href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" scope="external" format="html">the
        Hive wiki</xref> for information about the required configuration on the Hive side.
        Where practical, use the Impala <codeph>COMPUTE STATS</codeph> statement to avoid
        potential configuration and scalability issues with the statistics-gathering process.
      </p>

      <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/>

    </conbody>

  </concept>

  <concept id="perf_column_stats">

    <title id="column_stats">Overview of Column Statistics</title>

    <conbody>

      <p>
        The Impala query planner can make use of statistics about individual columns when that
        metadata is available in the metastore database. This technique is most valuable for
        columns compared across tables in <xref href="impala_perf_joins.xml#perf_joins">join
        queries</xref>, to help estimate how many rows the query will retrieve from each table.
        <ph rev="2.0.0"> These statistics are also important for correlated subqueries using the
        <codeph>EXISTS()</codeph> or <codeph>IN()</codeph> operators, which are processed
        internally the same way as join queries.</ph>
      </p>

      <p>
        The following example shows column stats for an unpartitioned Parquet table. The values
        for the maximum and average sizes of some types are always available, because those
        figures are constant for numeric and other fixed-size types. Initially, the number of
        distinct values is not known, because it requires a potentially expensive scan through
        the entire table, and so that value is displayed as -1. The same applies to maximum and
        average sizes of variable-sized types, such as <codeph>STRING</codeph>. The
        <codeph>COMPUTE STATS</codeph> statement fills in most unknown column stats values. (It
        does not record the number of <codeph>NULL</codeph> values, because currently Impala
        does not use that figure for query optimization.)
      </p>

<codeblock>
show column stats parquet_snappy;
+-------------+----------+------------------+--------+----------+----------+
| Column      | Type     | #Distinct Values | #Nulls | Max Size | Avg Size |
+-------------+----------+------------------+--------+----------+----------+
| id          | BIGINT   | -1               | -1     | 8        | 8        |
| val         | INT      | -1               | -1     | 4        | 4        |
| zerofill    | STRING   | -1               | -1     | -1       | -1       |
| name        | STRING   | -1               | -1     | -1       | -1       |
| assertion   | BOOLEAN  | -1               | -1     | 1        | 1        |
| location_id | SMALLINT | -1               | -1     | 2        | 2        |
+-------------+----------+------------------+--------+----------+----------+

compute stats parquet_snappy;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 6 column(s). |
+-----------------------------------------+

show column stats parquet_snappy;
+-------------+----------+------------------+--------+----------+-------------------+
| Column      | Type     | #Distinct Values | #Nulls | Max Size | Avg Size          |
+-------------+----------+------------------+--------+----------+-------------------+
| id          | BIGINT   | 183861280        | -1     | 8        | 8                 |
| val         | INT      | 139017           | -1     | 4        | 4                 |
| zerofill    | STRING   | 101761           | -1     | 6        | 6                 |
| name        | STRING   | 145636240        | -1     | 22       | 13.00020027160645 |
| assertion   | BOOLEAN  | 2                | -1     | 1        | 1                 |
| location_id | SMALLINT | 339              | -1     | 2        | 2                 |
+-------------+----------+------------------+--------+----------+-------------------+
</codeblock>

      <note>
        <p>
          For column statistics to be effective in Impala, you also need to have table
          statistics for the applicable tables, as described in
          <xref href="impala_perf_stats.xml#perf_table_stats"/>. When you use the Impala
          <codeph>COMPUTE STATS</codeph> statement, both table and column statistics are
          automatically gathered at the same time, for all columns in the table.
        </p>
      </note>

      <note conref="../shared/impala_common.xml#common/compute_stats_nulls"/>

      <p rev="1.2.1">
        To check whether column statistics are available for a particular set of columns, use
        the <codeph>SHOW COLUMN STATS <varname>table_name</varname></codeph> statement, or check
        the extended <codeph>EXPLAIN</codeph> output for a query against that table that refers
        to those columns. See <xref href="impala_show.xml#show"/> and
        <xref href="impala_explain.xml#explain"/> for details.
      </p>

      <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/>

    </conbody>

  </concept>

  <concept id="perf_stats_partitions">

    <title id="stats_partitions">How Table and Column Statistics Work for Partitioned Tables</title>

    <conbody>

      <p>
        When you use Impala for <q>big data</q>, you are highly likely to use partitioning for
        your biggest tables, the ones representing data that can be logically divided based on
        dates, geographic regions, or similar criteria. The table and column statistics are
        especially useful for optimizing queries on such tables. For example, a query involving
        one year might involve substantially more or less data than a query involving a
        different year, or a range of several years. Each query might be optimized differently
        as a result.
      </p>

      <p>
        The following examples show how table and column stats work with a partitioned table.
        The table for this example is partitioned by year, month, and day. For simplicity, the
        sample data consists of 5 partitions, all from the same year and month. Table stats are
        collected independently for each partition. (In fact, the <codeph>SHOW
        PARTITIONS</codeph> statement displays exactly the same information as <codeph>SHOW
        TABLE STATS</codeph> for a partitioned table.) Column stats apply to the entire table,
        not to individual partitions. Because the partition key column values are represented as
        HDFS directories, their characteristics are typically known in advance, even when the
        values for non-key columns are shown as -1.
      </p>

<codeblock>
show partitions year_month_day;
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| year  | month | day | #Rows | #Files | Size    | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                   |
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| 2013  | 12    | 1   | -1    | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://&lt;server>:8020/&lt;path>/parquet_snappy |
| 2013  | 12    | 2   | -1    | 1      | 2.53MB  | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://&lt;server>:8020/&lt;path>/parquet_snappy |
| 2013  | 12    | 3   | -1    | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://&lt;server>:8020/&lt;path>/parquet_snappy |
| 2013  | 12    | 4   | -1    | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://&lt;server>:8020/&lt;path>/parquet_snappy |
| 2013  | 12    | 5   | -1    | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://&lt;server>:8020/&lt;path>/parquet_snappy |
| Total |       |     | -1    | 5      | 12.58MB | 0B           |                   |         |                   |                                            |
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+

show table stats year_month_day;
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| year  | month | day | #Rows | #Files | Size    | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                   |
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| 2013  | 12    | 1   | -1    | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://&lt;server>:8020/&lt;path>/parquet_snappy |
| 2013  | 12    | 2   | -1    | 1      | 2.53MB  | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://&lt;server>:8020/&lt;path>/parquet_snappy |
| 2013  | 12    | 3   | -1    | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://&lt;server>:8020/&lt;path>/parquet_snappy |
| 2013  | 12    | 4   | -1    | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://&lt;server>:8020/&lt;path>/parquet_snappy |
| 2013  | 12    | 5   | -1    | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://&lt;server>:8020/&lt;path>/parquet_snappy |
| Total |       |     | -1    | 5      | 12.58MB | 0B           |                   |         |                   |                                            |
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+

show column stats year_month_day;
+-----------+---------+------------------+--------+----------+----------+
| Column    | Type    | #Distinct Values | #Nulls | Max Size | Avg Size |
+-----------+---------+------------------+--------+----------+----------+
| id        | INT     | -1               | -1     | 4        | 4        |
| val       | INT     | -1               | -1     | 4        | 4        |
| zfill     | STRING  | -1               | -1     | -1       | -1       |
| name      | STRING  | -1               | -1     | -1       | -1       |
| assertion | BOOLEAN | -1               | -1     | 1        | 1        |
| year      | INT     | 1                | 0      | 4        | 4        |
| month     | INT     | 1                | 0      | 4        | 4        |
| day       | INT     | 5                | 0      | 4        | 4        |
+-----------+---------+------------------+--------+----------+----------+

compute stats year_month_day;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 5 partition(s) and 5 column(s). |
+-----------------------------------------+

show table stats year_month_day;
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| year  | month | day | #Rows  | #Files | Size    | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                   |
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
| 2013  | 12    | 1   | 93606  | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://&lt;server>:8020/&lt;path>/parquet_snappy |
| 2013  | 12    | 2   | 94158  | 1      | 2.53MB  | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://&lt;server>:8020/&lt;path>/parquet_snappy |
| 2013  | 12    | 3   | 94122  | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://&lt;server>:8020/&lt;path>/parquet_snappy |
| 2013  | 12    | 4   | 93559  | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://&lt;server>:8020/&lt;path>/parquet_snappy |
| 2013  | 12    | 5   | 93845  | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://&lt;server>:8020/&lt;path>/parquet_snappy |
| Total |       |     | 469290 | 5      | 12.58MB | 0B           |                   |         |                   |                                            |
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+

show column stats year_month_day;
+-----------+---------+------------------+--------+----------+-------------------+
| Column    | Type    | #Distinct Values | #Nulls | Max Size | Avg Size          |
+-----------+---------+------------------+--------+----------+-------------------+
| id        | INT     | 511129           | -1     | 4        | 4                 |
| val       | INT     | 364853           | -1     | 4        | 4                 |
| zfill     | STRING  | 311430           | -1     | 6        | 6                 |
| name      | STRING  | 471975           | -1     | 22       | 13.00160026550293 |
| assertion | BOOLEAN | 2                | -1     | 1        | 1                 |
| year      | INT     | 1                | 0      | 4        | 4                 |
| month     | INT     | 1                | 0      | 4        | 4                 |
| day       | INT     | 5                | 0      | 4        | 4                 |
+-----------+---------+------------------+--------+----------+-------------------+
</codeblock>

      <p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/>

    </conbody>

  </concept>

  <concept id="perf_generating_stats">

    <title>Generating Table and Column Statistics</title>

    <conbody>

      <p>
        Use the <codeph>COMPUTE STATS</codeph> family of commands to collect table and
        column statistics. The <codeph>COMPUTE STATS</codeph> variants offer
        different tradeoffs between computation cost, staleness, and maintenance
        workflows which are explained below.
      </p>

      <note type="important">
        <p conref="../shared/impala_common.xml#common/cs_or_cis"/>
      </note>

      <!-- TODO: Commented out because it is inaccurate and confusing. Leaving this
      material for future refactoring into a Hive-compatibility section.
      <p>
        If you use Hive as part of your ETL workflow, you can also use Hive to generate table
        and column statistics. You might need to do extra configuration within Hive itself, the
        metastore, or even set up a separate database to hold Hive-generated statistics. You
        might need to run multiple statements to generate all the necessary statistics.
        Therefore, prefer the Impala <codeph>COMPUTE STATS</codeph> statement where that
        technique is practical. For details about collecting statistics through Hive, see
        <xref href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" scope="external" format="html">the
        Hive wiki</xref>.
      </p>
      -->

    </conbody>

    <concept id="concept_y2f_nfl_mdb">

      <title>COMPUTE STATS</title>

      <conbody>

        <p>
          The <codeph>COMPUTE STATS</codeph> command collects and sets the table-level
          and partition-level row counts as well as all column statistics for a given
          table. The collection process is CPU-intensive and can take a long time to
          complete for very large tables.
        </p>
        <p>
          To speed up <codeph>COMPUTE STATS</codeph> consider the following options
          which can be combined.
          <ul>
          <li>
            <p>
              Limit the number of columns for which statistics are collected to increase
              the efficiency of COMPUTE STATS. Queries benefit from statistics for those
              columns involved in filters, join conditions, group by or partition by
              clauses. Other columns are good candidates to exclude from COMPUTE STATS.
              This feature is available since Impala 2.12.
            </p>
          </li>
          <li>
            <p>
              Set the MT_DOP query option to use more threads within each participating
              impalad to compute the statistics faster - but not more efficiently. Note
              that computing stats on a large table with a high MT_DOP value can
              negatively affect other queries running at the same time if the
              COMPUTE STATS claims most CPU cycles.
              This feature is available since Impala 2.8.
            </p>
          </li>
          <li>
            <p>
              Consider the experimental extrapolation and sampling features (see below)
              to further increase the efficiency of computing stats.
            </p>
          </li>
          </ul>
        </p>

        <p>
          <codeph>COMPUTE STATS</codeph> is intended to be run periodically,
          e.g. weekly, or on-demand when the contents of a table have changed
          significantly. Due to the high resource utilization and long repsonse
          time of t<codeph>COMPUTE STATS</codeph>, it is most practical to run it
          in a scheduled maintnance window where the Impala cluster is idle
          enough to accommodate the expensive operation. The degree of change that
          qualifies as <q>significant</q> depends on the query workload, but typically,
          if 30% of the rows have changed then it is recommended to recompute
          statistics.
        </p>

        <p>
          If you reload a complete new set of data for a table, but the number of rows and
          number of distinct values for each column is relatively unchanged from before, you
          do not need to recompute stats for the table.
        </p>

      </conbody>

      <concept id="experimental_stats_features">
        <title>Experimental: Extrapolation and Sampling</title>
        <conbody>
          <p>
            Impala 2.12 and higher includes two experimental features to alleviate
            common issues for computing and maintaining statistics on very large tables.
            The following shortcomings are improved upon:
            <ul>
            <li>
              <p>
                Newly added partitions do not have row count statistics. Table scans
                that only access those new partitions are treated as not having stats.
                Similarly, table scans that access both new and old partitions estimate
                the scan cardinality based on those old partitions that have stats, and
                the new partitions without stats are treated as having 0 rows.
              </p>
            </li>
            <li>
              <p>
                The row counts of existing partitions become stale when data is added
                or dropped.
              </p>
            </li>
            <li>
              <p>
                Computing stats for tables with a 100,000 or more partitions might fail
                or be very slow due to the high cost of updating the partition metadata
                in the Hive Metastore.
              </p>
            </li>
            <li>
              <p>
                With transient compute resources it is important to minimize the time
                from starting a new cluster to successfully running queries.
                Since the cluster might be relatively short-lived, users might prefer to
                quickly collect stats that are "good enough" as opposed to spending
                a lot of time and resouces on computing full-fidelity stats.
              </p>
            </li>
            </ul>
            For very large tables, it is often wasteful or impractical to run a full
            COMPUTE STATS to address the scenarios above on a frequent basis.
          </p>
          <p>
            The sampling feature makes COMPUTE STATS more efficient by processing a
            fraction of the table data, and the extrapolation feature aims to reduce
            the frequency at which COMPUTE STATS needs to be re-run by estimating
            the row count of new and modified partitions.
          </p>
          <p>
            The sampling and extrapolation features are disabled by default.
            They can be enabled globally or for specific tables, as follows.
            Set the impalad start-up configuration "--enable_stats_extrapolation" to
            enable the features globally. To enable them only for a specific table, set
            the "impala.enable.stats.extrapolation" table property to "true" for the
            desired table. The table-level property overrides the global setting, so
            it is also possible to enable sampling and extrapolation globally, but
            disable it for specific tables by setting the table property to "false".
            Example:
            ALTER TABLE mytable test_table SET TBLPROPERTIES("impala.enable.stats.extrapolation"="true")
          </p>
          <note>
            Why are these features experimental? Due to their probabilistic nature
            it is possible that these features perform pathologically poorly on tables
            with extreme data/file/size distributions. Since it is not feasible for us
            to test all possible scenarios we only cautiously advertise these new
            capabilities. That said, the features have been thoroughly tested and
            are considered functionally stable. If you decide to give these features
            a try, please tell us about your experience at user@impala.apache.org!
            We rely on user feedback to guide future inprovements in statistics
            collection.
          </note>
        </conbody>

        <concept id="experimental_stats_extrapolation">
          <title>Stats Extrapolation</title>
          <conbody>
            <p>
              The main idea of stats extrapolation is to estimate the row count of new
              and modified partitions based on the result of the last COMPUTE STATS.
              Enabling stats extrapolation changes the behavior of COMPUTE STATS,
              as well as the cardinality estimation of table scans. COMPUTE STATS no
              longer computes and stores per-partition row counts, and instead, only
              computes a table-level row count together with the total number of file
              bytes in the table at that time. No partition metadata is modified. The
              input cardinality of a table scan is estimated by converting the data
              volume of relevant partitions to a row count, based on the table-level
              row count and file bytes statistics. It is assumed that within the same
              table, different sets of files with the same data volume correspond
              to the similar number of rows on average. With extrapolation enabled,
              the scan cardinality estimation ignores per-partition row counts. It
              only relies on the table-level statistics and the scanned data volume.
            </p>
            <p>
              The SHOW TABLE STATS and EXPLAIN commands distinguish between row counts
              stored in the Hive Metastore, and the row counts extrapolated based on the
              above process. Consult the SHOW TABLE STATS and EXPLAIN documentation
              for more details.
            </p>
          </conbody>
        </concept>

        <concept id="experimental_stats_sampling">
          <title>Sampling</title>
          <conbody>
            <p>
              A TABLESAMPLE clause may be added to COMPUTE STATS to limit the
              percentage of data to be processed. The final statistics are obtained
              by extrapolating the statistics from the data sample over the entire table.
              The extrapolated statistics are stored in the Hive Metastore, just as if no
              sampling was used. The following example runs COMPUTE STATS over a 10 percent
              data sample: COMPUTE STATS test_table TABLESAMPLE SYSTEM(10)
            </p>
            <p>
            We have found that a 10 percent sampling rate typically offers a good
            tradeoff between statistics accuracy and execution cost. A sampling rate
            well below 10 percent has shown poor results and is not recommended.
            </p>
            <note type="important">
              Sampling-based techniques sacrifice result accuracy for execution
              efficiency, so your mileage may vary for different tables and columns
              depending on their data distribution. The extrapolation procedure Impala
              uses for estimating the number of distinct values per column is inherently
              non-detetministic, so your results may even vary between runs of
              COMPUTE STATS TABLESAMPLE, even if no data has changed.
            </note>
          </conbody>
        </concept>
      </concept>
    </concept>

    <concept id="concept_bmk_pfl_mdb">

      <title>COMPUTE INCREMENTAL STATS</title>

      <conbody>

        <p>
          In Impala 2.1.0 and higher, you can use the
          <codeph>COMPUTE INCREMENTAL STATS</codeph> and
          <codeph>DROP INCREMENTAL STATS</codeph> commands.
          The <codeph>INCREMENTAL</codeph> clauses work with incremental statistics,
          a specialized feature for partitioned tables.
        </p>

        <p>
          When you compute incremental statistics for a partitioned table, by default Impala only
          processes those partitions that do not yet have incremental statistics. By processing
          only newly added partitions, you can keep statistics up to date without incurring the
          overhead of reprocessing the entire table each time.
        </p>

        <p>
          You can also compute or drop statistics for a specified subset of partitions by
          including a <codeph>PARTITION</codeph> clause in the
          <codeph>COMPUTE INCREMENTAL STATS</codeph> or <codeph>DROP INCREMENTAL STATS</codeph>
          statement.
        </p>

        <note type="important">
          <p
            conref="../shared/impala_common.xml#common/incremental_stats_caveats"/>
          <p
            conref="../shared/impala_common.xml#common/incremental_stats_after_full"/>
        </note>

        <p>
          The metadata for incremental statistics is handled differently from the original style
          of statistics:
        </p>

        <ul>
          <li>
            <p>
              Issuing a <codeph>COMPUTE INCREMENTAL STATS</codeph> without a partition
              clause causes Impala to compute incremental stats for all partitions that
              do not already have incremental stats. This might be the entire table when
              running the command for the first time, but subsequent runs should only
              update new partitions. You can force updating a partition that already has
              incremental stats by issuing a <codeph>DROP INCREMENTAL STATS</codeph>
              before running <codeph>COMPUTE INCREMENTAL STATS</codeph>.
            </p>
          </li>

          <li>
            <p>
              The <codeph>SHOW TABLE STATS</codeph> and <codeph>SHOW PARTITIONS</codeph>
              statements now include an additional column showing whether incremental statistics
              are available for each column. A partition could already be covered by the original
              type of statistics based on a prior <codeph>COMPUTE STATS</codeph> statement, as
              indicated by a value other than <codeph>-1</codeph> under the <codeph>#Rows</codeph>
              column. Impala query planning uses either kind of statistics when available.
            </p>
          </li>

          <li>
            <p>
              <codeph>COMPUTE INCREMENTAL STATS</codeph> takes more time than <codeph>COMPUTE
              STATS</codeph> for the same volume of data. Therefore it is most suitable for tables
              with large data volume where new partitions are added frequently, making it
              impractical to run a full <codeph>COMPUTE STATS</codeph> operation for each new
              partition. For unpartitioned tables, or partitioned tables that are loaded once and
              not updated with new partitions, use the original <codeph>COMPUTE STATS</codeph>
              syntax.
            </p>
          </li>

          <li>
            <p>
              <codeph>COMPUTE INCREMENTAL STATS</codeph> uses some memory in the
                <cmdname>catalogd</cmdname> process, proportional to the number
              of partitions and number of columns in the applicable table. The
              memory overhead is approximately 400 bytes for each column in each
              partition. This memory is reserved in the
                <cmdname>catalogd</cmdname> daemon, the
                <cmdname>statestored</cmdname> daemon, and in each instance of
              the impalad daemon. </p>
          </li>

          <li>
            <p>
              In cases where new files are added to an existing partition, issue a
              <codeph>REFRESH</codeph> statement for the table, followed by a <codeph>DROP
              INCREMENTAL STATS</codeph> and <codeph>COMPUTE INCREMENTAL STATS</codeph> sequence
              for the changed partition.
            </p>
          </li>

          <li>
            <p>
              The <codeph>DROP INCREMENTAL STATS</codeph> statement operates only on a single
              partition at a time. To remove statistics (whether incremental or not) from all
              partitions of a table, issue a <codeph>DROP STATS</codeph> statement with no
              <codeph>INCREMENTAL</codeph> or <codeph>PARTITION</codeph> clauses.
            </p>
          </li>
        </ul>

        <p>
          The following considerations apply to incremental statistics when the structure of an
          existing table is changed (known as <term>schema evolution</term>):
        </p>

        <ul>
          <li>
            <p>
              If you use an <codeph>ALTER TABLE</codeph> statement to drop a column, the existing
              statistics remain valid and <codeph>COMPUTE INCREMENTAL STATS</codeph> does not
              rescan any partitions.
            </p>
          </li>

          <li>
            <p>
              If you use an <codeph>ALTER TABLE</codeph> statement to add a column, Impala rescans
              all partitions and fills in the appropriate column-level values the next time you
              run <codeph>COMPUTE INCREMENTAL STATS</codeph>.
            </p>
          </li>

          <li>
            <p>
              If you use an <codeph>ALTER TABLE</codeph> statement to change the data type of a
              column, Impala rescans all partitions and fills in the appropriate column-level
              values the next time you run <codeph>COMPUTE INCREMENTAL STATS</codeph>.
            </p>
          </li>

          <li>
            <p>
              If you use an <codeph>ALTER TABLE</codeph> statement to change the file format of a
              table, the existing statistics remain valid and a subsequent <codeph>COMPUTE
              INCREMENTAL STATS</codeph> does not rescan any partitions.
            </p>
          </li>
        </ul>

        <p>
          See <xref href="impala_compute_stats.xml#compute_stats"/> and
          <xref
            href="impala_drop_stats.xml#drop_stats"/> for syntax details.
        </p>

      </conbody>
      <concept id="inc_stats_size_limit_bytes">
        <title>Maximum Serialized Stats Size</title>
        <conbody>
          <p>In Impala 3.0 and lower, when executing <codeph>COMPUTE INCREMENTAL
              STATS</codeph> on very large tables, use the configuration setting
              <codeph>--inc_stats_size_limit_bytes</codeph> to prevent Impala
            from running out of memory while updating table metadata. If this
            limit is reached, Impala will stop loading the table and return an
            error. The error serves as an indication that <codeph>COMPUTE
              INCREMENTAL STATS</codeph> should not be used on the particular
            table. Consider spitting the table and using regular <codeph>COMPUTE
              STATS</codeph> ]if possible. </p>

          <p> The <codeph>--inc_stats_size_limit_bytes</codeph> limit is set as
            a safety check, to prevent Impala from hitting the maximum limit for
            the table metadata. Note that this limit is only one part of the
            entire table's metadata all of which together must be below 2 GB. </p>

          <p> The default value for
              <codeph>--inc_stats_size_limit_bytes</codeph> is 209715200, 200
            MB. </p>

          <p> To change the <codeph>--inc_stats_size_limit_bytes</codeph> value,
            restart impalad and catalogd with the new value specified in bytes,
            for example, 1048576000 for 1 GB. See <xref
              href="impala_config_options.xml#config_options"/> for the steps to
            change the option and restart Impala daemons. </p>

          <note type="attention"> The
              <codeph>--inc_stats_size_limit_bytes</codeph> setting should be
            increased with care. A big value for the setting, such as 1 GB or
            more, can result in a spike in heap usage as well as a crash of
            Impala. </note>
          <p>In Impala 3.1 and higher, Impala improved how metadata is updated
            when executing <codeph>COMPUTE INCREMENTAL STATS</codeph>,
            significantly reducing the need for
              <codeph>--inc_stats_size_limit_bytes</codeph>. </p>
        </conbody>
      </concept>

    </concept>

  </concept>

  <concept rev="2.1.0" id="perf_stats_checking">

    <title>Detecting Missing Statistics</title>

    <conbody>

      <p>
        You can check whether a specific table has statistics using the <codeph>SHOW TABLE
        STATS</codeph> statement (for any table) or the <codeph>SHOW PARTITIONS</codeph>
        statement (for a partitioned table). Both statements display the same information. If a
        table or a partition does not have any statistics, the <codeph>#Rows</codeph> field
        contains <codeph>-1</codeph>. Once you compute statistics for the table or partition,
        the <codeph>#Rows</codeph> field changes to an accurate value.
      </p>

      <p>
        The following example shows a table that initially does not have any statistics. The
        <codeph>SHOW TABLE STATS</codeph> statement displays different values for
        <codeph>#Rows</codeph> before and after the <codeph>COMPUTE STATS</codeph> operation.
      </p>

<codeblock>[localhost:21000] &gt; create table no_stats (x int);
[localhost:21000] &gt; show table stats no_stats;
+-------+--------+------+--------------+--------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+--------+------+--------------+--------+-------------------+
| -1    | 0      | 0B   | NOT CACHED   | TEXT   | false             |
+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] &gt; compute stats no_stats;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
[localhost:21000] &gt; show table stats no_stats;
+-------+--------+------+--------------+--------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+--------+------+--------------+--------+-------------------+
| 0     | 0      | 0B   | NOT CACHED   | TEXT   | false             |
+-------+--------+------+--------------+--------+-------------------+
</codeblock>

      <p>
        The following example shows a similar progression with a partitioned table. Initially,
        <codeph>#Rows</codeph> is <codeph>-1</codeph>. After a <codeph>COMPUTE STATS</codeph>
        operation, <codeph>#Rows</codeph> changes to an accurate value. Any newly added
        partition starts with no statistics, meaning that you must collect statistics after
        adding a new partition.
      </p>

<codeblock>[localhost:21000] &gt; create table no_stats_partitioned (x int) partitioned by (year smallint);
[localhost:21000] &gt; show table stats no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year  | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| Total | -1    | 0      | 0B   | 0B           |        |                   |
+-------+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] &gt; show partitions no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year  | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| Total | -1    | 0      | 0B   | 0B           |        |                   |
+-------+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] &gt; alter table no_stats_partitioned add partition (year=2013);
[localhost:21000] &gt; compute stats no_stats_partitioned;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
[localhost:21000] &gt; alter table no_stats_partitioned add partition (year=2014);
[localhost:21000] &gt; show partitions no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year  | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| 2013  | 0     | 0      | 0B   | NOT CACHED   | TEXT   | false             |
| 2014  | -1    | 0      | 0B   | NOT CACHED   | TEXT   | false             |
| Total | 0     | 0      | 0B   | 0B           |        |                   |
+-------+-------+--------+------+--------------+--------+-------------------+
</codeblock>

      <note>
        Because the default <codeph>COMPUTE STATS</codeph> statement creates and updates
        statistics for all partitions in a table, if you expect to frequently add new
        partitions, use the <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax instead, which
        lets you compute stats for a single specified partition, or only for those partitions
        that do not already have incremental stats.
      </note>

      <p>
        If checking each individual table is impractical, due to a large number of tables or
        views that hide the underlying base tables, you can also check for missing statistics
        for a particular query. Use the <codeph>EXPLAIN</codeph> statement to preview query
        efficiency before actually running the query. Use the query profile output available
        through the <codeph>PROFILE</codeph> command in <cmdname>impala-shell</cmdname> or the
        web UI to verify query execution and timing after running the query. Both the
        <codeph>EXPLAIN</codeph> plan and the <codeph>PROFILE</codeph> output display a warning
        if any tables or partitions involved in the query do not have statistics.
      </p>

<codeblock>[localhost:21000] &gt; create table no_stats (x int);
[localhost:21000] &gt; explain select count(*) from no_stats;
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=10.00MB VCores=1                           |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| incremental_stats.no_stats                                                         |
|                                                                                    |
| 03:AGGREGATE [FINALIZE]                                                            |
| |  output: count:merge(*)                                                          |
| |                                                                                  |
| 02:EXCHANGE [UNPARTITIONED]                                                        |
| |                                                                                  |
| 01:AGGREGATE                                                                       |
| |  output: count(*)                                                                |
| |                                                                                  |
| 00:SCAN HDFS [incremental_stats.no_stats]                                          |
|    partitions=1/1 files=0 size=0B                                                  |
+------------------------------------------------------------------------------------+
</codeblock>

      <p>
        Because Impala uses the <term>partition pruning</term> technique when possible to only
        evaluate certain partitions, if you have a partitioned table with statistics for some
        partitions and not others, whether or not the <codeph>EXPLAIN</codeph> statement shows
        the warning depends on the actual partitions used by the query. For example, you might
        see warnings or not for different queries against the same table:
      </p>

<codeblock>-- No warning because all the partitions for the year 2012 have stats.
EXPLAIN SELECT ... FROM t1 WHERE year = 2012;

-- Missing stats warning because one or more partitions in this range
-- do not have stats.
EXPLAIN SELECT ... FROM t1 WHERE year BETWEEN 2006 AND 2009;
</codeblock>

      <p>
        To confirm if any partitions at all in the table are missing statistics, you might
        explain a query that scans the entire table, such as <codeph>SELECT COUNT(*) FROM
        <varname>table_name</varname></codeph>.
      </p>

    </conbody>

  </concept>

  <concept id="concept_s3c_4gl_mdb">

    <title>Manually Setting Table and Column Statistics with ALTER TABLE</title>

    <concept id="concept_wpt_pgl_mdb">

      <title>Setting Table Statistics</title>

      <conbody>

        <p>
          The most crucial piece of data in all the statistics is the number of rows in the
          table (for an unpartitioned or partitioned table) and for each partition (for a
          partitioned table). The <codeph>COMPUTE STATS</codeph> statement always gathers
          statistics about all columns, as well as overall table statistics. If it is not
          practical to do a full <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL
          STATS</codeph> operation after adding a partition or inserting data, or if you can see
          that Impala would produce a more efficient plan if the number of rows was different,
          you can manually set the number of rows through an <codeph>ALTER TABLE</codeph>
          statement:
        </p>

<codeblock>
-- Set total number of rows. Applies to both unpartitioned and partitioned tables.
alter table <varname>table_name</varname> set tblproperties('numRows'='<varname>new_value</varname>', 'STATS_GENERATED_VIA_STATS_TASK'='true');

-- Set total number of rows for a specific partition. Applies to partitioned tables only.
-- You must specify all the partition key columns in the PARTITION clause.
alter table <varname>table_name</varname> partition (<varname>keycol1</varname>=<varname>val1</varname>,<varname>keycol2</varname>=<varname>val2</varname>...) set tblproperties('numRows'='<varname>new_value</varname>', 'STATS_GENERATED_VIA_STATS_TASK'='true');
</codeblock>

        <p>
          This statement avoids re-scanning any data files. (The requirement to include the
          <codeph>STATS_GENERATED_VIA_STATS_TASK</codeph> property is relatively new, as a
          result of the issue
          <xref
            href="https://issues.apache.org/jira/browse/HIVE-8648"
            scope="external" format="html">HIVE-8648</xref>
          for the Hive metastore.)
        </p>

<codeblock conref="../shared/impala_common.xml#common/set_numrows_example"/>

        <p>
          For a partitioned table, update both the per-partition number of rows and the number
          of rows for the whole table:
        </p>

<codeblock conref="../shared/impala_common.xml#common/set_numrows_partitioned_example"/>

        <p>
          In practice, the <codeph>COMPUTE STATS</codeph> statement, or <codeph>COMPUTE
          INCREMENTAL STATS</codeph> for a partitioned table, should be fast and convenient
          enough that this technique is only useful for the very largest partitioned tables.
<!--
        It is most useful as a workaround for in case of performance issues where you might adjust the <codeph>numRows</codeph> value higher
        or lower to produce the ideal join order.
        -->
<!-- Following wording is duplicated from earlier. Consider conref'ing. -->
          Because the column statistics might be left in a stale state, do not use this
          technique as a replacement for <codeph>COMPUTE STATS</codeph>. Only use this technique
          if all other means of collecting statistics are impractical, or as a low-overhead
          operation that you run in between periodic <codeph>COMPUTE STATS</codeph> or
          <codeph>COMPUTE INCREMENTAL STATS</codeph> operations.
        </p>

      </conbody>

    </concept>

    <concept id="concept_asb_vgl_mdb">

      <title>Setting Column Statistics</title>

      <conbody>

        <p>
          In <keyword keyref="impala26_full"/> and higher, you can also use the <codeph>SET
          COLUMN STATS</codeph> clause of <codeph>ALTER TABLE</codeph> to manually set or change
          column statistics. Only use this technique in cases where it is impractical to run
          <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph>
          frequently enough to keep up with data changes for a huge table.
        </p>

        <p conref="../shared/impala_common.xml#common/set_column_stats_example"/>

      </conbody>

    </concept>

  </concept>

  <concept rev="1.2.2" id="perf_stats_examples">

    <title>Examples of Using Table and Column Statistics with Impala</title>

    <conbody>

      <p>
        The following examples walk through a sequence of <codeph>SHOW TABLE STATS</codeph>,
        <codeph>SHOW COLUMN STATS</codeph>, <codeph>ALTER TABLE</codeph>, and
        <codeph>SELECT</codeph> and <codeph>INSERT</codeph> statements to illustrate various
        aspects of how Impala uses statistics to help optimize queries.
      </p>

      <p>
        This example shows table and column statistics for the <codeph>STORE</codeph> column
        used in the <xref href="http://www.tpc.org/tpcds/" scope="external" format="html">TPC-DS
        benchmarks for decision support</xref> systems. It is a tiny table holding data for 12
        stores. Initially, before any statistics are gathered by a <codeph>COMPUTE
        STATS</codeph> statement, most of the numeric fields show placeholder values of -1,
        indicating that the figures are unknown. The figures that are filled in are values that
        are easily countable or deducible at the physical level, such as the number of files,
        total data size of the files, and the maximum and average sizes for data types that have
        a constant size such as <codeph>INT</codeph>, <codeph>FLOAT</codeph>, and
        <codeph>TIMESTAMP</codeph>.
      </p>

<codeblock>[localhost:21000] &gt; show table stats store;
+-------+--------+--------+--------+
| #Rows | #Files | Size   | Format |
+-------+--------+--------+--------+
| -1    | 1      | 3.08KB | TEXT   |
+-------+--------+--------+--------+
Returned 1 row(s) in 0.03s
[localhost:21000] &gt; show column stats store;
+--------------------+-----------+------------------+--------+----------+----------+
| Column             | Type      | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------------------+-----------+------------------+--------+----------+----------+
| s_store_sk         | INT       | -1               | -1     | 4        | 4        |
| s_store_id         | STRING    | -1               | -1     | -1       | -1       |
| s_rec_start_date   | TIMESTAMP | -1               | -1     | 16       | 16       |
| s_rec_end_date     | TIMESTAMP | -1               | -1     | 16       | 16       |
| s_closed_date_sk   | INT       | -1               | -1     | 4        | 4        |
| s_store_name       | STRING    | -1               | -1     | -1       | -1       |
| s_number_employees | INT       | -1               | -1     | 4        | 4        |
| s_floor_space      | INT       | -1               | -1     | 4        | 4        |
| s_hours            | STRING    | -1               | -1     | -1       | -1       |
| s_manager          | STRING    | -1               | -1     | -1       | -1       |
| s_market_id        | INT       | -1               | -1     | 4        | 4        |
| s_geography_class  | STRING    | -1               | -1     | -1       | -1       |
| s_market_desc      | STRING    | -1               | -1     | -1       | -1       |
| s_market_manager   | STRING    | -1               | -1     | -1       | -1       |
| s_division_id      | INT       | -1               | -1     | 4        | 4        |
| s_division_name    | STRING    | -1               | -1     | -1       | -1       |
| s_company_id       | INT       | -1               | -1     | 4        | 4        |
| s_company_name     | STRING    | -1               | -1     | -1       | -1       |
| s_street_number    | STRING    | -1               | -1     | -1       | -1       |
| s_street_name      | STRING    | -1               | -1     | -1       | -1       |
| s_street_type      | STRING    | -1               | -1     | -1       | -1       |
| s_suite_number     | STRING    | -1               | -1     | -1       | -1       |
| s_city             | STRING    | -1               | -1     | -1       | -1       |
| s_county           | STRING    | -1               | -1     | -1       | -1       |
| s_state            | STRING    | -1               | -1     | -1       | -1       |
| s_zip              | STRING    | -1               | -1     | -1       | -1       |
| s_country          | STRING    | -1               | -1     | -1       | -1       |
| s_gmt_offset       | FLOAT     | -1               | -1     | 4        | 4        |
| s_tax_percentage   | FLOAT     | -1               | -1     | 4        | 4        |
+--------------------+-----------+------------------+--------+----------+----------+
Returned 29 row(s) in 0.04s</codeblock>

      <p>
        With the Hive <codeph>ANALYZE TABLE</codeph> statement for column statistics, you had to
        specify each column for which to gather statistics. The Impala <codeph>COMPUTE
        STATS</codeph> statement automatically gathers statistics for all columns, because it
        reads through the entire table relatively quickly and can efficiently compute the values
        for all the columns. This example shows how after running the <codeph>COMPUTE
        STATS</codeph> statement, statistics are filled in for both the table and all its
        columns:
      </p>

<codeblock>[localhost:21000] &gt; compute stats store;
+------------------------------------------+
| summary                                  |
+------------------------------------------+
| Updated 1 partition(s) and 29 column(s). |
+------------------------------------------+
Returned 1 row(s) in 1.88s
[localhost:21000] &gt; show table stats store;
+-------+--------+--------+--------+
| #Rows | #Files | Size   | Format |
+-------+--------+--------+--------+
| 12    | 1      | 3.08KB | TEXT   |
+-------+--------+--------+--------+
Returned 1 row(s) in 0.02s
[localhost:21000] &gt; show column stats store;
+--------------------+-----------+------------------+--------+----------+-------------------+
| Column             | Type      | #Distinct Values | #Nulls | Max Size | Avg Size          |
+--------------------+-----------+------------------+--------+----------+-------------------+
| s_store_sk         | INT       | 12               | -1     | 4        | 4                 |
| s_store_id         | STRING    | 6                | -1     | 16       | 16                |
| s_rec_start_date   | TIMESTAMP | 4                | -1     | 16       | 16                |
| s_rec_end_date     | TIMESTAMP | 3                | -1     | 16       | 16                |
| s_closed_date_sk   | INT       | 3                | -1     | 4        | 4                 |
| s_store_name       | STRING    | 8                | -1     | 5        | 4.25              |
| s_number_employees | INT       | 9                | -1     | 4        | 4                 |
| s_floor_space      | INT       | 10               | -1     | 4        | 4                 |
| s_hours            | STRING    | 2                | -1     | 8        | 7.083300113677979 |
| s_manager          | STRING    | 7                | -1     | 15       | 12                |
| s_market_id        | INT       | 7                | -1     | 4        | 4                 |
| s_geography_class  | STRING    | 1                | -1     | 7        | 7                 |
| s_market_desc      | STRING    | 10               | -1     | 94       | 55.5              |
| s_market_manager   | STRING    | 7                | -1     | 16       | 14                |
| s_division_id      | INT       | 1                | -1     | 4        | 4                 |
| s_division_name    | STRING    | 1                | -1     | 7        | 7                 |
| s_company_id       | INT       | 1                | -1     | 4        | 4                 |
| s_company_name     | STRING    | 1                | -1     | 7        | 7                 |
| s_street_number    | STRING    | 9                | -1     | 3        | 2.833300113677979 |
| s_street_name      | STRING    | 12               | -1     | 11       | 6.583300113677979 |
| s_street_type      | STRING    | 8                | -1     | 9        | 4.833300113677979 |
| s_suite_number     | STRING    | 11               | -1     | 9        | 8.25              |
| s_city             | STRING    | 2                | -1     | 8        | 6.5               |
| s_county           | STRING    | 1                | -1     | 17       | 17                |
| s_state            | STRING    | 1                | -1     | 2        | 2                 |
| s_zip              | STRING    | 2                | -1     | 5        | 5                 |
| s_country          | STRING    | 1                | -1     | 13       | 13                |
| s_gmt_offset       | FLOAT     | 1                | -1     | 4        | 4                 |
| s_tax_percentage   | FLOAT     | 5                | -1     | 4        | 4                 |
+--------------------+-----------+------------------+--------+----------+-------------------+
Returned 29 row(s) in 0.04s</codeblock>

      <p>
        The following example shows how statistics are represented for a partitioned table. In
        this case, we have set up a table to hold the world's most trivial census data, a single
        <codeph>STRING</codeph> field, partitioned by a <codeph>YEAR</codeph> column. The table
        statistics include a separate entry for each partition, plus final totals for the
        numeric fields. The column statistics include some easily deducible facts for the
        partitioning column, such as the number of distinct values (the number of partition
        subdirectories).
<!-- and the number of <codeph>NULL</codeph> values (none in this case). -->
      </p>

<codeblock>localhost:21000] &gt; describe census;
+------+----------+---------+
| name | type     | comment |
+------+----------+---------+
| name | string   |         |
| year | smallint |         |
+------+----------+---------+
Returned 2 row(s) in 0.02s
[localhost:21000] &gt; show table stats census;
+-------+-------+--------+------+---------+
| year  | #Rows | #Files | Size | Format  |
+-------+-------+--------+------+---------+
| 2000  | -1    | 0      | 0B   | TEXT    |
| 2004  | -1    | 0      | 0B   | TEXT    |
| 2008  | -1    | 0      | 0B   | TEXT    |
| 2010  | -1    | 0      | 0B   | TEXT    |
| 2011  | 0     | 1      | 22B  | TEXT    |
| 2012  | -1    | 1      | 22B  | TEXT    |
| 2013  | -1    | 1      | 231B | PARQUET |
| Total | 0     | 3      | 275B |         |
+-------+-------+--------+------+---------+
Returned 8 row(s) in 0.02s
[localhost:21000] &gt; show column stats census;
+--------+----------+------------------+--------+----------+----------+
| Column | Type     | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+----------+------------------+--------+----------+----------+
| name   | STRING   | -1               | -1     | -1       | -1       |
| year   | SMALLINT | 7                | -1     | 2        | 2        |
+--------+----------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.02s</codeblock>

      <p>
        The following example shows how the statistics are filled in by a <codeph>COMPUTE
        STATS</codeph> statement in Impala.
      </p>

<codeblock>[localhost:21000] &gt; compute stats census;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 3 partition(s) and 1 column(s). |
+-----------------------------------------+
Returned 1 row(s) in 2.16s
[localhost:21000] &gt; show table stats census;
+-------+-------+--------+------+---------+
| year  | #Rows | #Files | Size | Format  |
+-------+-------+--------+------+---------+
| 2000  | -1    | 0      | 0B   | TEXT    |
| 2004  | -1    | 0      | 0B   | TEXT    |
| 2008  | -1    | 0      | 0B   | TEXT    |
| 2010  | -1    | 0      | 0B   | TEXT    |
| 2011  | 4     | 1      | 22B  | TEXT    |
| 2012  | 4     | 1      | 22B  | TEXT    |
| 2013  | 1     | 1      | 231B | PARQUET |
| Total | 9     | 3      | 275B |         |
+-------+-------+--------+------+---------+
Returned 8 row(s) in 0.02s
[localhost:21000] &gt; show column stats census;
+--------+----------+------------------+--------+----------+----------+
| Column | Type     | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+----------+------------------+--------+----------+----------+
| name   | STRING   | 4                | -1     | 5        | 4.5      |
| year   | SMALLINT | 7                | -1     | 2        | 2        |
+--------+----------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.02s</codeblock>

      <p rev="1.4.0">
        For examples showing how some queries work differently when statistics are available,
        see <xref href="impala_perf_joins.xml#perf_joins_examples"/>. You can see how Impala
        executes a query differently in each case by observing the <codeph>EXPLAIN</codeph>
        output before and after collecting statistics. Measure the before and after query times,
        and examine the throughput numbers in before and after <codeph>SUMMARY</codeph> or
        <codeph>PROFILE</codeph> output, to verify how much the improved plan speeds up
        performance.
      </p>

    </conbody>

  </concept>

</concept>