mirror of
https://github.com/apache/impala.git
synced 2025-12-19 09:58:28 -05:00
- Replaced ellipses in example columns with sample output - Fixed table formatting problems - Exhumed varname styles - Reverted table formatting at line 292 to published version formatting - Fixed table formatting at 831 Change-Id: I83fd30b87730c82c87f6f7aee26d8cceb77b6308 Reviewed-on: http://gerrit.cloudera.org:8080/15476 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
1213 lines
65 KiB
XML
1213 lines
65 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
|
|
<!--
|
|
Licensed to the Apache Software Foundation (ASF) under one
|
|
or more contributor license agreements. See the NOTICE file
|
|
distributed with this work for additional information
|
|
regarding copyright ownership. The ASF licenses this file
|
|
to you under the Apache License, Version 2.0 (the
|
|
"License"); you may not use this file except in compliance
|
|
with the License. You may obtain a copy of the License at
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
Unless required by applicable law or agreed to in writing,
|
|
software distributed under the License is distributed on an
|
|
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
KIND, either express or implied. See the License for the
|
|
specific language governing permissions and limitations
|
|
under the License.
|
|
-->
|
|
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
|
<concept id="perf_stats">
|
|
|
|
<title>Table and Column Statistics</title>
|
|
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="Impala"/>
|
|
<data name="Category" value="Performance"/>
|
|
<data name="Category" value="Querying"/>
|
|
<data name="Category" value="Concepts"/>
|
|
<data name="Category" value="Developers"/>
|
|
<data name="Category" value="Data Analysts"/>
|
|
</metadata>
|
|
</prolog>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Impala can do better optimization for complex or multi-table queries when it has access to
|
|
statistics about the volume of data and how the values are distributed. Impala uses this
|
|
information to help parallelize and distribute the work for a query. For example,
|
|
optimizing join queries requires a way of determining if one table is <q>bigger</q> than
|
|
another, which is a function of the number of rows and the average row size for each
|
|
table. The following sections describe the categories of statistics Impala can work with,
|
|
and how to produce them and keep them up to date.
|
|
</p>
|
|
|
|
<p outputclass="toc inpage all"/>
|
|
|
|
</conbody>
|
|
|
|
<concept id="perf_table_stats">
|
|
|
|
<title id="table_stats">Overview of Table Statistics</title>
|
|
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="Concepts"/>
|
|
</metadata>
|
|
</prolog>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
The Impala query planner can make use of statistics about entire tables and partitions.
|
|
This information includes physical characteristics such as the number of rows, number of
|
|
data files, the total size of the data files, and the file format. For partitioned
|
|
tables, the numbers are calculated per partition, and as totals for the whole table.
|
|
This metadata is stored in the metastore database, and can be updated by either Impala
|
|
or Hive. If a number is not available, the value -1 is used as a placeholder. Some
|
|
numbers, such as number and total sizes of data files, are always kept up to date
|
|
because they can be calculated cheaply, as part of gathering HDFS block metadata.
|
|
</p>
|
|
|
|
<p>
|
|
The following example shows table stats for an unpartitioned Parquet table. The values
|
|
for the number and sizes of files are always available. Initially, the number of rows is
|
|
not known, because it requires a potentially expensive scan through the entire table,
|
|
and so that value is displayed as -1. The <codeph>COMPUTE STATS</codeph> statement fills
|
|
in any unknown table stats values.
|
|
</p>
|
|
|
|
<codeblock>
|
|
show table stats parquet_snappy;
|
|
+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
|
|
| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
|
|
+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
|
|
| -1 | 96 | 23.35GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
|
|
+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
|
|
|
|
compute stats parquet_snappy;
|
|
+-----------------------------------------+
|
|
| summary |
|
|
+-----------------------------------------+
|
|
| Updated 1 partition(s) and 6 column(s). |
|
|
+-----------------------------------------+
|
|
|
|
|
|
show table stats parquet_snappy;
|
|
+------------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
|
|
| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
|
|
+------------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
|
|
| 1000000000 | 96 | 23.35GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
|
|
+------------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
Impala performs some optimizations using this metadata on its own, and other
|
|
optimizations by using a combination of table and column statistics.
|
|
</p>
|
|
|
|
<p rev="1.2.1">
|
|
To check that table statistics are available for a table, and see the details of those
|
|
statistics, use the statement <codeph>SHOW TABLE STATS
|
|
<varname>table_name</varname></codeph>. See <xref href="impala_show.xml#show"/> for
|
|
details.
|
|
</p>
|
|
|
|
<p>
|
|
If you use the Hive-based methods of gathering statistics, see
|
|
<xref href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" scope="external" format="html">the
|
|
Hive wiki</xref> for information about the required configuration on the Hive side.
|
|
Where practical, use the Impala <codeph>COMPUTE STATS</codeph> statement to avoid
|
|
potential configuration and scalability issues with the statistics-gathering process.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="perf_column_stats">
|
|
|
|
<title id="column_stats">Overview of Column Statistics</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
The Impala query planner can make use of statistics about individual columns when that
|
|
metadata is available in the metastore database. This technique is most valuable for
|
|
columns compared across tables in <xref href="impala_perf_joins.xml#perf_joins">join
|
|
queries</xref>, to help estimate how many rows the query will retrieve from each table.
|
|
<ph rev="2.0.0"> These statistics are also important for correlated subqueries using the
|
|
<codeph>EXISTS()</codeph> or <codeph>IN()</codeph> operators, which are processed
|
|
internally the same way as join queries.</ph>
|
|
</p>
|
|
|
|
<p>
|
|
The following example shows column stats for an unpartitioned Parquet table. The values
|
|
for the maximum and average sizes of some types are always available, because those
|
|
figures are constant for numeric and other fixed-size types. Initially, the number of
|
|
distinct values is not known, because it requires a potentially expensive scan through
|
|
the entire table, and so that value is displayed as -1. The same applies to maximum and
|
|
average sizes of variable-sized types, such as <codeph>STRING</codeph>. The
|
|
<codeph>COMPUTE STATS</codeph> statement fills in most unknown column stats values. (It
|
|
does not record the number of <codeph>NULL</codeph> values, because currently Impala
|
|
does not use that figure for query optimization.)
|
|
</p>
|
|
|
|
<codeblock>
|
|
show column stats parquet_snappy;
|
|
+-------------+----------+------------------+--------+----------+----------+
|
|
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
|
|
+-------------+----------+------------------+--------+----------+----------+
|
|
| id | BIGINT | -1 | -1 | 8 | 8 |
|
|
| val | INT | -1 | -1 | 4 | 4 |
|
|
| zerofill | STRING | -1 | -1 | -1 | -1 |
|
|
| name | STRING | -1 | -1 | -1 | -1 |
|
|
| assertion | BOOLEAN | -1 | -1 | 1 | 1 |
|
|
| location_id | SMALLINT | -1 | -1 | 2 | 2 |
|
|
+-------------+----------+------------------+--------+----------+----------+
|
|
|
|
compute stats parquet_snappy;
|
|
+-----------------------------------------+
|
|
| summary |
|
|
+-----------------------------------------+
|
|
| Updated 1 partition(s) and 6 column(s). |
|
|
+-----------------------------------------+
|
|
|
|
show column stats parquet_snappy;
|
|
+-------------+----------+------------------+--------+----------+-------------------+
|
|
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
|
|
+-------------+----------+------------------+--------+----------+-------------------+
|
|
| id | BIGINT | 183861280 | -1 | 8 | 8 |
|
|
| val | INT | 139017 | -1 | 4 | 4 |
|
|
| zerofill | STRING | 101761 | -1 | 6 | 6 |
|
|
| name | STRING | 145636240 | -1 | 22 | 13.00020027160645 |
|
|
| assertion | BOOLEAN | 2 | -1 | 1 | 1 |
|
|
| location_id | SMALLINT | 339 | -1 | 2 | 2 |
|
|
+-------------+----------+------------------+--------+----------+-------------------+
|
|
</codeblock>
|
|
|
|
<note>
|
|
<p>
|
|
For column statistics to be effective in Impala, you also need to have table
|
|
statistics for the applicable tables, as described in
|
|
<xref href="impala_perf_stats.xml#perf_table_stats"/>. When you use the Impala
|
|
<codeph>COMPUTE STATS</codeph> statement, both table and column statistics are
|
|
automatically gathered at the same time, for all columns in the table.
|
|
</p>
|
|
</note>
|
|
|
|
<note conref="../shared/impala_common.xml#common/compute_stats_nulls"/>
|
|
|
|
<p rev="1.2.1">
|
|
To check whether column statistics are available for a particular set of columns, use
|
|
the <codeph>SHOW COLUMN STATS <varname>table_name</varname></codeph> statement, or check
|
|
the extended <codeph>EXPLAIN</codeph> output for a query against that table that refers
|
|
to those columns. See <xref href="impala_show.xml#show"/> and
|
|
<xref href="impala_explain.xml#explain"/> for details.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="perf_stats_partitions">
|
|
|
|
<title id="stats_partitions">How Table and Column Statistics Work for Partitioned Tables</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
When you use Impala for <q>big data</q>, you are highly likely to use partitioning for
|
|
your biggest tables, the ones representing data that can be logically divided based on
|
|
dates, geographic regions, or similar criteria. The table and column statistics are
|
|
especially useful for optimizing queries on such tables. For example, a query involving
|
|
one year might involve substantially more or less data than a query involving a
|
|
different year, or a range of several years. Each query might be optimized differently
|
|
as a result.
|
|
</p>
|
|
|
|
<p>
|
|
The following examples show how table and column stats work with a partitioned table.
|
|
The table for this example is partitioned by year, month, and day. For simplicity, the
|
|
sample data consists of 5 partitions, all from the same year and month. Table stats are
|
|
collected independently for each partition. (In fact, the <codeph>SHOW
|
|
PARTITIONS</codeph> statement displays exactly the same information as <codeph>SHOW
|
|
TABLE STATS</codeph> for a partitioned table.) Column stats apply to the entire table,
|
|
not to individual partitions. Because the partition key column values are represented as
|
|
HDFS directories, their characteristics are typically known in advance, even when the
|
|
values for non-key columns are shown as -1.
|
|
</p>
|
|
|
|
<codeblock>
|
|
show partitions year_month_day;
|
|
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
|
|
| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
|
|
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
|
|
| 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
|
|
| 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
|
|
| 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
|
|
| 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
|
|
| 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
|
|
| Total | | | -1 | 5 | 12.58MB | 0B | | | | |
|
|
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
|
|
|
|
show table stats year_month_day;
|
|
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
|
|
| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
|
|
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
|
|
| 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
|
|
| 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
|
|
| 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
|
|
| 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
|
|
| 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
|
|
| Total | | | -1 | 5 | 12.58MB | 0B | | | | |
|
|
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
|
|
|
|
show column stats year_month_day;
|
|
+-----------+---------+------------------+--------+----------+----------+
|
|
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
|
|
+-----------+---------+------------------+--------+----------+----------+
|
|
| id | INT | -1 | -1 | 4 | 4 |
|
|
| val | INT | -1 | -1 | 4 | 4 |
|
|
| zfill | STRING | -1 | -1 | -1 | -1 |
|
|
| name | STRING | -1 | -1 | -1 | -1 |
|
|
| assertion | BOOLEAN | -1 | -1 | 1 | 1 |
|
|
| year | INT | 1 | 0 | 4 | 4 |
|
|
| month | INT | 1 | 0 | 4 | 4 |
|
|
| day | INT | 5 | 0 | 4 | 4 |
|
|
+-----------+---------+------------------+--------+----------+----------+
|
|
|
|
compute stats year_month_day;
|
|
+-----------------------------------------+
|
|
| summary |
|
|
+-----------------------------------------+
|
|
| Updated 5 partition(s) and 5 column(s). |
|
|
+-----------------------------------------+
|
|
|
|
show table stats year_month_day;
|
|
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
|
|
| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location |
|
|
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
|
|
| 2013 | 12 | 1 | 93606 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
|
|
| 2013 | 12 | 2 | 94158 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
|
|
| 2013 | 12 | 3 | 94122 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
|
|
| 2013 | 12 | 4 | 93559 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
|
|
| 2013 | 12 | 5 | 93845 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://<server>:8020/<path>/parquet_snappy |
|
|
| Total | | | 469290 | 5 | 12.58MB | 0B | | | | |
|
|
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+-------------------+--------------------------------------------+
|
|
|
|
show column stats year_month_day;
|
|
+-----------+---------+------------------+--------+----------+-------------------+
|
|
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
|
|
+-----------+---------+------------------+--------+----------+-------------------+
|
|
| id | INT | 511129 | -1 | 4 | 4 |
|
|
| val | INT | 364853 | -1 | 4 | 4 |
|
|
| zfill | STRING | 311430 | -1 | 6 | 6 |
|
|
| name | STRING | 471975 | -1 | 22 | 13.00160026550293 |
|
|
| assertion | BOOLEAN | 2 | -1 | 1 | 1 |
|
|
| year | INT | 1 | 0 | 4 | 4 |
|
|
| month | INT | 1 | 0 | 4 | 4 |
|
|
| day | INT | 5 | 0 | 4 | 4 |
|
|
+-----------+---------+------------------+--------+----------+-------------------+
|
|
</codeblock>
|
|
|
|
<p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="perf_generating_stats">
|
|
|
|
<title>Generating Table and Column Statistics</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Use the <codeph>COMPUTE STATS</codeph> family of commands to collect table and
|
|
column statistics. The <codeph>COMPUTE STATS</codeph> variants offer
|
|
different tradeoffs between computation cost, staleness, and maintenance
|
|
workflows which are explained below.
|
|
</p>
|
|
|
|
<note type="important">
|
|
<p conref="../shared/impala_common.xml#common/cs_or_cis"/>
|
|
</note>
|
|
|
|
<!-- TODO: Commented out because it is inaccurate and confusing. Leaving this
|
|
material for future refactoring into a Hive-compatibility section.
|
|
<p>
|
|
If you use Hive as part of your ETL workflow, you can also use Hive to generate table
|
|
and column statistics. You might need to do extra configuration within Hive itself, the
|
|
metastore, or even set up a separate database to hold Hive-generated statistics. You
|
|
might need to run multiple statements to generate all the necessary statistics.
|
|
Therefore, prefer the Impala <codeph>COMPUTE STATS</codeph> statement where that
|
|
technique is practical. For details about collecting statistics through Hive, see
|
|
<xref href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" scope="external" format="html">the
|
|
Hive wiki</xref>.
|
|
</p>
|
|
-->
|
|
|
|
</conbody>
|
|
|
|
<concept id="concept_y2f_nfl_mdb">
|
|
|
|
<title>COMPUTE STATS</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
The <codeph>COMPUTE STATS</codeph> command collects and sets the table-level
|
|
and partition-level row counts as well as all column statistics for a given
|
|
table. The collection process is CPU-intensive and can take a long time to
|
|
complete for very large tables.
|
|
</p>
|
|
<p>
|
|
To speed up <codeph>COMPUTE STATS</codeph> consider the following options
|
|
which can be combined.
|
|
<ul>
|
|
<li>
|
|
<p>
|
|
Limit the number of columns for which statistics are collected to increase
|
|
the efficiency of COMPUTE STATS. Queries benefit from statistics for those
|
|
columns involved in filters, join conditions, group by or partition by
|
|
clauses. Other columns are good candidates to exclude from COMPUTE STATS.
|
|
This feature is available since Impala 2.12.
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
Set the MT_DOP query option to use more threads within each participating
|
|
impalad to compute the statistics faster - but not more efficiently. Note
|
|
that computing stats on a large table with a high MT_DOP value can
|
|
negatively affect other queries running at the same time if the
|
|
COMPUTE STATS claims most CPU cycles.
|
|
This feature is available since Impala 2.8.
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
Consider the experimental extrapolation and sampling features (see below)
|
|
to further increase the efficiency of computing stats.
|
|
</p>
|
|
</li>
|
|
</ul>
|
|
</p>
|
|
|
|
<p>
|
|
<codeph>COMPUTE STATS</codeph> is intended to be run periodically,
|
|
e.g. weekly, or on-demand when the contents of a table have changed
|
|
significantly. Due to the high resource utilization and long repsonse
|
|
time of t<codeph>COMPUTE STATS</codeph>, it is most practical to run it
|
|
in a scheduled maintnance window where the Impala cluster is idle
|
|
enough to accommodate the expensive operation. The degree of change that
|
|
qualifies as <q>significant</q> depends on the query workload, but typically,
|
|
if 30% of the rows have changed then it is recommended to recompute
|
|
statistics.
|
|
</p>
|
|
|
|
<p>
|
|
If you reload a complete new set of data for a table, but the number of rows and
|
|
number of distinct values for each column is relatively unchanged from before, you
|
|
do not need to recompute stats for the table.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
<concept id="experimental_stats_features">
|
|
<title>Experimental: Extrapolation and Sampling</title>
|
|
<conbody>
|
|
<p>
|
|
Impala 2.12 and higher includes two experimental features to alleviate
|
|
common issues for computing and maintaining statistics on very large tables.
|
|
The following shortcomings are improved upon:
|
|
<ul>
|
|
<li>
|
|
<p>
|
|
Newly added partitions do not have row count statistics. Table scans
|
|
that only access those new partitions are treated as not having stats.
|
|
Similarly, table scans that access both new and old partitions estimate
|
|
the scan cardinality based on those old partitions that have stats, and
|
|
the new partitions without stats are treated as having 0 rows.
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
The row counts of existing partitions become stale when data is added
|
|
or dropped.
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
Computing stats for tables with a 100,000 or more partitions might fail
|
|
or be very slow due to the high cost of updating the partition metadata
|
|
in the Hive Metastore.
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
With transient compute resources it is important to minimize the time
|
|
from starting a new cluster to successfully running queries.
|
|
Since the cluster might be relatively short-lived, users might prefer to
|
|
quickly collect stats that are "good enough" as opposed to spending
|
|
a lot of time and resouces on computing full-fidelity stats.
|
|
</p>
|
|
</li>
|
|
</ul>
|
|
For very large tables, it is often wasteful or impractical to run a full
|
|
COMPUTE STATS to address the scenarios above on a frequent basis.
|
|
</p>
|
|
<p>
|
|
The sampling feature makes COMPUTE STATS more efficient by processing a
|
|
fraction of the table data, and the extrapolation feature aims to reduce
|
|
the frequency at which COMPUTE STATS needs to be re-run by estimating
|
|
the row count of new and modified partitions.
|
|
</p>
|
|
<p>
|
|
The sampling and extrapolation features are disabled by default.
|
|
They can be enabled globally or for specific tables, as follows.
|
|
Set the impalad start-up configuration "--enable_stats_extrapolation" to
|
|
enable the features globally. To enable them only for a specific table, set
|
|
the "impala.enable.stats.extrapolation" table property to "true" for the
|
|
desired table. The table-level property overrides the global setting, so
|
|
it is also possible to enable sampling and extrapolation globally, but
|
|
disable it for specific tables by setting the table property to "false".
|
|
Example:
|
|
ALTER TABLE mytable test_table SET TBLPROPERTIES("impala.enable.stats.extrapolation"="true")
|
|
</p>
|
|
<note>
|
|
Why are these features experimental? Due to their probabilistic nature
|
|
it is possible that these features perform pathologically poorly on tables
|
|
with extreme data/file/size distributions. Since it is not feasible for us
|
|
to test all possible scenarios we only cautiously advertise these new
|
|
capabilities. That said, the features have been thoroughly tested and
|
|
are considered functionally stable. If you decide to give these features
|
|
a try, please tell us about your experience at user@impala.apache.org!
|
|
We rely on user feedback to guide future inprovements in statistics
|
|
collection.
|
|
</note>
|
|
</conbody>
|
|
|
|
<concept id="experimental_stats_extrapolation">
|
|
<title>Stats Extrapolation</title>
|
|
<conbody>
|
|
<p>
|
|
The main idea of stats extrapolation is to estimate the row count of new
|
|
and modified partitions based on the result of the last COMPUTE STATS.
|
|
Enabling stats extrapolation changes the behavior of COMPUTE STATS,
|
|
as well as the cardinality estimation of table scans. COMPUTE STATS no
|
|
longer computes and stores per-partition row counts, and instead, only
|
|
computes a table-level row count together with the total number of file
|
|
bytes in the table at that time. No partition metadata is modified. The
|
|
input cardinality of a table scan is estimated by converting the data
|
|
volume of relevant partitions to a row count, based on the table-level
|
|
row count and file bytes statistics. It is assumed that within the same
|
|
table, different sets of files with the same data volume correspond
|
|
to the similar number of rows on average. With extrapolation enabled,
|
|
the scan cardinality estimation ignores per-partition row counts. It
|
|
only relies on the table-level statistics and the scanned data volume.
|
|
</p>
|
|
<p>
|
|
The SHOW TABLE STATS and EXPLAIN commands distinguish between row counts
|
|
stored in the Hive Metastore, and the row counts extrapolated based on the
|
|
above process. Consult the SHOW TABLE STATS and EXPLAIN documentation
|
|
for more details.
|
|
</p>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept id="experimental_stats_sampling">
|
|
<title>Sampling</title>
|
|
<conbody>
|
|
<p>
|
|
A TABLESAMPLE clause may be added to COMPUTE STATS to limit the
|
|
percentage of data to be processed. The final statistics are obtained
|
|
by extrapolating the statistics from the data sample over the entire table.
|
|
The extrapolated statistics are stored in the Hive Metastore, just as if no
|
|
sampling was used. The following example runs COMPUTE STATS over a 10 percent
|
|
data sample: COMPUTE STATS test_table TABLESAMPLE SYSTEM(10)
|
|
</p>
|
|
<p>
|
|
We have found that a 10 percent sampling rate typically offers a good
|
|
tradeoff between statistics accuracy and execution cost. A sampling rate
|
|
well below 10 percent has shown poor results and is not recommended.
|
|
</p>
|
|
<note type="important">
|
|
Sampling-based techniques sacrifice result accuracy for execution
|
|
efficiency, so your mileage may vary for different tables and columns
|
|
depending on their data distribution. The extrapolation procedure Impala
|
|
uses for estimating the number of distinct values per column is inherently
|
|
non-detetministic, so your results may even vary between runs of
|
|
COMPUTE STATS TABLESAMPLE, even if no data has changed.
|
|
</note>
|
|
</conbody>
|
|
</concept>
|
|
</concept>
|
|
</concept>
|
|
|
|
<concept id="concept_bmk_pfl_mdb">
|
|
|
|
<title>COMPUTE INCREMENTAL STATS</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
In Impala 2.1.0 and higher, you can use the
|
|
<codeph>COMPUTE INCREMENTAL STATS</codeph> and
|
|
<codeph>DROP INCREMENTAL STATS</codeph> commands.
|
|
The <codeph>INCREMENTAL</codeph> clauses work with incremental statistics,
|
|
a specialized feature for partitioned tables.
|
|
</p>
|
|
|
|
<p>
|
|
When you compute incremental statistics for a partitioned table, by default Impala only
|
|
processes those partitions that do not yet have incremental statistics. By processing
|
|
only newly added partitions, you can keep statistics up to date without incurring the
|
|
overhead of reprocessing the entire table each time.
|
|
</p>
|
|
|
|
<p>
|
|
You can also compute or drop statistics for a specified subset of partitions by
|
|
including a <codeph>PARTITION</codeph> clause in the
|
|
<codeph>COMPUTE INCREMENTAL STATS</codeph> or <codeph>DROP INCREMENTAL STATS</codeph>
|
|
statement.
|
|
</p>
|
|
|
|
<note type="important">
|
|
<p
|
|
conref="../shared/impala_common.xml#common/incremental_stats_caveats"/>
|
|
<p
|
|
conref="../shared/impala_common.xml#common/incremental_stats_after_full"/>
|
|
</note>
|
|
|
|
<p>
|
|
The metadata for incremental statistics is handled differently from the original style
|
|
of statistics:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
<p>
|
|
Issuing a <codeph>COMPUTE INCREMENTAL STATS</codeph> without a partition
|
|
clause causes Impala to compute incremental stats for all partitions that
|
|
do not already have incremental stats. This might be the entire table when
|
|
running the command for the first time, but subsequent runs should only
|
|
update new partitions. You can force updating a partition that already has
|
|
incremental stats by issuing a <codeph>DROP INCREMENTAL STATS</codeph>
|
|
before running <codeph>COMPUTE INCREMENTAL STATS</codeph>.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
The <codeph>SHOW TABLE STATS</codeph> and <codeph>SHOW PARTITIONS</codeph>
|
|
statements now include an additional column showing whether incremental statistics
|
|
are available for each column. A partition could already be covered by the original
|
|
type of statistics based on a prior <codeph>COMPUTE STATS</codeph> statement, as
|
|
indicated by a value other than <codeph>-1</codeph> under the <codeph>#Rows</codeph>
|
|
column. Impala query planning uses either kind of statistics when available.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
<codeph>COMPUTE INCREMENTAL STATS</codeph> takes more time than <codeph>COMPUTE
|
|
STATS</codeph> for the same volume of data. Therefore it is most suitable for tables
|
|
with large data volume where new partitions are added frequently, making it
|
|
impractical to run a full <codeph>COMPUTE STATS</codeph> operation for each new
|
|
partition. For unpartitioned tables, or partitioned tables that are loaded once and
|
|
not updated with new partitions, use the original <codeph>COMPUTE STATS</codeph>
|
|
syntax.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
<codeph>COMPUTE INCREMENTAL STATS</codeph> uses some memory in the
|
|
<cmdname>catalogd</cmdname> process, proportional to the number
|
|
of partitions and number of columns in the applicable table. The
|
|
memory overhead is approximately 400 bytes for each column in each
|
|
partition. This memory is reserved in the
|
|
<cmdname>catalogd</cmdname> daemon, the
|
|
<cmdname>statestored</cmdname> daemon, and in each instance of
|
|
the impalad daemon. </p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
In cases where new files are added to an existing partition, issue a
|
|
<codeph>REFRESH</codeph> statement for the table, followed by a <codeph>DROP
|
|
INCREMENTAL STATS</codeph> and <codeph>COMPUTE INCREMENTAL STATS</codeph> sequence
|
|
for the changed partition.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
The <codeph>DROP INCREMENTAL STATS</codeph> statement operates only on a single
|
|
partition at a time. To remove statistics (whether incremental or not) from all
|
|
partitions of a table, issue a <codeph>DROP STATS</codeph> statement with no
|
|
<codeph>INCREMENTAL</codeph> or <codeph>PARTITION</codeph> clauses.
|
|
</p>
|
|
</li>
|
|
</ul>
|
|
|
|
<p>
|
|
The following considerations apply to incremental statistics when the structure of an
|
|
existing table is changed (known as <term>schema evolution</term>):
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
<p>
|
|
If you use an <codeph>ALTER TABLE</codeph> statement to drop a column, the existing
|
|
statistics remain valid and <codeph>COMPUTE INCREMENTAL STATS</codeph> does not
|
|
rescan any partitions.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
If you use an <codeph>ALTER TABLE</codeph> statement to add a column, Impala rescans
|
|
all partitions and fills in the appropriate column-level values the next time you
|
|
run <codeph>COMPUTE INCREMENTAL STATS</codeph>.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
If you use an <codeph>ALTER TABLE</codeph> statement to change the data type of a
|
|
column, Impala rescans all partitions and fills in the appropriate column-level
|
|
values the next time you run <codeph>COMPUTE INCREMENTAL STATS</codeph>.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
If you use an <codeph>ALTER TABLE</codeph> statement to change the file format of a
|
|
table, the existing statistics remain valid and a subsequent <codeph>COMPUTE
|
|
INCREMENTAL STATS</codeph> does not rescan any partitions.
|
|
</p>
|
|
</li>
|
|
</ul>
|
|
|
|
<p>
|
|
See <xref href="impala_compute_stats.xml#compute_stats"/> and
|
|
<xref
|
|
href="impala_drop_stats.xml#drop_stats"/> for syntax details.
|
|
</p>
|
|
|
|
</conbody>
|
|
<concept id="inc_stats_size_limit_bytes">
|
|
<title>Maximum Serialized Stats Size</title>
|
|
<conbody>
|
|
<p>In Impala 3.0 and lower, when executing <codeph>COMPUTE INCREMENTAL
|
|
STATS</codeph> on very large tables, use the configuration setting
|
|
<codeph>--inc_stats_size_limit_bytes</codeph> to prevent Impala
|
|
from running out of memory while updating table metadata. If this
|
|
limit is reached, Impala will stop loading the table and return an
|
|
error. The error serves as an indication that <codeph>COMPUTE
|
|
INCREMENTAL STATS</codeph> should not be used on the particular
|
|
table. Consider spitting the table and using regular <codeph>COMPUTE
|
|
STATS</codeph> ]if possible. </p>
|
|
|
|
<p> The <codeph>--inc_stats_size_limit_bytes</codeph> limit is set as
|
|
a safety check, to prevent Impala from hitting the maximum limit for
|
|
the table metadata. Note that this limit is only one part of the
|
|
entire table's metadata all of which together must be below 2 GB. </p>
|
|
|
|
<p> The default value for
|
|
<codeph>--inc_stats_size_limit_bytes</codeph> is 209715200, 200
|
|
MB. </p>
|
|
|
|
<p> To change the <codeph>--inc_stats_size_limit_bytes</codeph> value,
|
|
restart impalad and catalogd with the new value specified in bytes,
|
|
for example, 1048576000 for 1 GB. See <xref
|
|
href="impala_config_options.xml#config_options"/> for the steps to
|
|
change the option and restart Impala daemons. </p>
|
|
|
|
<note type="attention"> The
|
|
<codeph>--inc_stats_size_limit_bytes</codeph> setting should be
|
|
increased with care. A big value for the setting, such as 1 GB or
|
|
more, can result in a spike in heap usage as well as a crash of
|
|
Impala. </note>
|
|
<p>In Impala 3.1 and higher, Impala improved how metadata is updated
|
|
when executing <codeph>COMPUTE INCREMENTAL STATS</codeph>,
|
|
significantly reducing the need for
|
|
<codeph>--inc_stats_size_limit_bytes</codeph>. </p>
|
|
</conbody>
|
|
</concept>
|
|
|
|
</concept>
|
|
|
|
</concept>
|
|
|
|
<concept rev="2.1.0" id="perf_stats_checking">
|
|
|
|
<title>Detecting Missing Statistics</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
You can check whether a specific table has statistics using the <codeph>SHOW TABLE
|
|
STATS</codeph> statement (for any table) or the <codeph>SHOW PARTITIONS</codeph>
|
|
statement (for a partitioned table). Both statements display the same information. If a
|
|
table or a partition does not have any statistics, the <codeph>#Rows</codeph> field
|
|
contains <codeph>-1</codeph>. Once you compute statistics for the table or partition,
|
|
the <codeph>#Rows</codeph> field changes to an accurate value.
|
|
</p>
|
|
|
|
<p>
|
|
The following example shows a table that initially does not have any statistics. The
|
|
<codeph>SHOW TABLE STATS</codeph> statement displays different values for
|
|
<codeph>#Rows</codeph> before and after the <codeph>COMPUTE STATS</codeph> operation.
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > create table no_stats (x int);
|
|
[localhost:21000] > show table stats no_stats;
|
|
+-------+--------+------+--------------+--------+-------------------+
|
|
| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
|
|
+-------+--------+------+--------------+--------+-------------------+
|
|
| -1 | 0 | 0B | NOT CACHED | TEXT | false |
|
|
+-------+--------+------+--------------+--------+-------------------+
|
|
[localhost:21000] > compute stats no_stats;
|
|
+-----------------------------------------+
|
|
| summary |
|
|
+-----------------------------------------+
|
|
| Updated 1 partition(s) and 1 column(s). |
|
|
+-----------------------------------------+
|
|
[localhost:21000] > show table stats no_stats;
|
|
+-------+--------+------+--------------+--------+-------------------+
|
|
| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
|
|
+-------+--------+------+--------------+--------+-------------------+
|
|
| 0 | 0 | 0B | NOT CACHED | TEXT | false |
|
|
+-------+--------+------+--------------+--------+-------------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
The following example shows a similar progression with a partitioned table. Initially,
|
|
<codeph>#Rows</codeph> is <codeph>-1</codeph>. After a <codeph>COMPUTE STATS</codeph>
|
|
operation, <codeph>#Rows</codeph> changes to an accurate value. Any newly added
|
|
partition starts with no statistics, meaning that you must collect statistics after
|
|
adding a new partition.
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > create table no_stats_partitioned (x int) partitioned by (year smallint);
|
|
[localhost:21000] > show table stats no_stats_partitioned;
|
|
+-------+-------+--------+------+--------------+--------+-------------------+
|
|
| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
|
|
+-------+-------+--------+------+--------------+--------+-------------------+
|
|
| Total | -1 | 0 | 0B | 0B | | |
|
|
+-------+-------+--------+------+--------------+--------+-------------------+
|
|
[localhost:21000] > show partitions no_stats_partitioned;
|
|
+-------+-------+--------+------+--------------+--------+-------------------+
|
|
| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
|
|
+-------+-------+--------+------+--------------+--------+-------------------+
|
|
| Total | -1 | 0 | 0B | 0B | | |
|
|
+-------+-------+--------+------+--------------+--------+-------------------+
|
|
[localhost:21000] > alter table no_stats_partitioned add partition (year=2013);
|
|
[localhost:21000] > compute stats no_stats_partitioned;
|
|
+-----------------------------------------+
|
|
| summary |
|
|
+-----------------------------------------+
|
|
| Updated 1 partition(s) and 1 column(s). |
|
|
+-----------------------------------------+
|
|
[localhost:21000] > alter table no_stats_partitioned add partition (year=2014);
|
|
[localhost:21000] > show partitions no_stats_partitioned;
|
|
+-------+-------+--------+------+--------------+--------+-------------------+
|
|
| year | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
|
|
+-------+-------+--------+------+--------------+--------+-------------------+
|
|
| 2013 | 0 | 0 | 0B | NOT CACHED | TEXT | false |
|
|
| 2014 | -1 | 0 | 0B | NOT CACHED | TEXT | false |
|
|
| Total | 0 | 0 | 0B | 0B | | |
|
|
+-------+-------+--------+------+--------------+--------+-------------------+
|
|
</codeblock>
|
|
|
|
<note>
|
|
Because the default <codeph>COMPUTE STATS</codeph> statement creates and updates
|
|
statistics for all partitions in a table, if you expect to frequently add new
|
|
partitions, use the <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax instead, which
|
|
lets you compute stats for a single specified partition, or only for those partitions
|
|
that do not already have incremental stats.
|
|
</note>
|
|
|
|
<p>
|
|
If checking each individual table is impractical, due to a large number of tables or
|
|
views that hide the underlying base tables, you can also check for missing statistics
|
|
for a particular query. Use the <codeph>EXPLAIN</codeph> statement to preview query
|
|
efficiency before actually running the query. Use the query profile output available
|
|
through the <codeph>PROFILE</codeph> command in <cmdname>impala-shell</cmdname> or the
|
|
web UI to verify query execution and timing after running the query. Both the
|
|
<codeph>EXPLAIN</codeph> plan and the <codeph>PROFILE</codeph> output display a warning
|
|
if any tables or partitions involved in the query do not have statistics.
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > create table no_stats (x int);
|
|
[localhost:21000] > explain select count(*) from no_stats;
|
|
+------------------------------------------------------------------------------------+
|
|
| Explain String |
|
|
+------------------------------------------------------------------------------------+
|
|
| Estimated Per-Host Requirements: Memory=10.00MB VCores=1 |
|
|
| WARNING: The following tables are missing relevant table and/or column statistics. |
|
|
| incremental_stats.no_stats |
|
|
| |
|
|
| 03:AGGREGATE [FINALIZE] |
|
|
| | output: count:merge(*) |
|
|
| | |
|
|
| 02:EXCHANGE [UNPARTITIONED] |
|
|
| | |
|
|
| 01:AGGREGATE |
|
|
| | output: count(*) |
|
|
| | |
|
|
| 00:SCAN HDFS [incremental_stats.no_stats] |
|
|
| partitions=1/1 files=0 size=0B |
|
|
+------------------------------------------------------------------------------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
Because Impala uses the <term>partition pruning</term> technique when possible to only
|
|
evaluate certain partitions, if you have a partitioned table with statistics for some
|
|
partitions and not others, whether or not the <codeph>EXPLAIN</codeph> statement shows
|
|
the warning depends on the actual partitions used by the query. For example, you might
|
|
see warnings or not for different queries against the same table:
|
|
</p>
|
|
|
|
<codeblock>-- No warning because all the partitions for the year 2012 have stats.
|
|
EXPLAIN SELECT ... FROM t1 WHERE year = 2012;
|
|
|
|
-- Missing stats warning because one or more partitions in this range
|
|
-- do not have stats.
|
|
EXPLAIN SELECT ... FROM t1 WHERE year BETWEEN 2006 AND 2009;
|
|
</codeblock>
|
|
|
|
<p>
|
|
To confirm if any partitions at all in the table are missing statistics, you might
|
|
explain a query that scans the entire table, such as <codeph>SELECT COUNT(*) FROM
|
|
<varname>table_name</varname></codeph>.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="concept_s3c_4gl_mdb">
|
|
|
|
<title>Manually Setting Table and Column Statistics with ALTER TABLE</title>
|
|
|
|
<concept id="concept_wpt_pgl_mdb">
|
|
|
|
<title>Setting Table Statistics</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
The most crucial piece of data in all the statistics is the number of rows in the
|
|
table (for an unpartitioned or partitioned table) and for each partition (for a
|
|
partitioned table). The <codeph>COMPUTE STATS</codeph> statement always gathers
|
|
statistics about all columns, as well as overall table statistics. If it is not
|
|
practical to do a full <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL
|
|
STATS</codeph> operation after adding a partition or inserting data, or if you can see
|
|
that Impala would produce a more efficient plan if the number of rows was different,
|
|
you can manually set the number of rows through an <codeph>ALTER TABLE</codeph>
|
|
statement:
|
|
</p>
|
|
|
|
<codeblock>
|
|
-- Set total number of rows. Applies to both unpartitioned and partitioned tables.
|
|
alter table <varname>table_name</varname> set tblproperties('numRows'='<varname>new_value</varname>', 'STATS_GENERATED_VIA_STATS_TASK'='true');
|
|
|
|
-- Set total number of rows for a specific partition. Applies to partitioned tables only.
|
|
-- You must specify all the partition key columns in the PARTITION clause.
|
|
alter table <varname>table_name</varname> partition (<varname>keycol1</varname>=<varname>val1</varname>,<varname>keycol2</varname>=<varname>val2</varname>...) set tblproperties('numRows'='<varname>new_value</varname>', 'STATS_GENERATED_VIA_STATS_TASK'='true');
|
|
</codeblock>
|
|
|
|
<p>
|
|
This statement avoids re-scanning any data files. (The requirement to include the
|
|
<codeph>STATS_GENERATED_VIA_STATS_TASK</codeph> property is relatively new, as a
|
|
result of the issue
|
|
<xref
|
|
href="https://issues.apache.org/jira/browse/HIVE-8648"
|
|
scope="external" format="html">HIVE-8648</xref>
|
|
for the Hive metastore.)
|
|
</p>
|
|
|
|
<codeblock conref="../shared/impala_common.xml#common/set_numrows_example"/>
|
|
|
|
<p>
|
|
For a partitioned table, update both the per-partition number of rows and the number
|
|
of rows for the whole table:
|
|
</p>
|
|
|
|
<codeblock conref="../shared/impala_common.xml#common/set_numrows_partitioned_example"/>
|
|
|
|
<p>
|
|
In practice, the <codeph>COMPUTE STATS</codeph> statement, or <codeph>COMPUTE
|
|
INCREMENTAL STATS</codeph> for a partitioned table, should be fast and convenient
|
|
enough that this technique is only useful for the very largest partitioned tables.
|
|
<!--
|
|
It is most useful as a workaround for in case of performance issues where you might adjust the <codeph>numRows</codeph> value higher
|
|
or lower to produce the ideal join order.
|
|
-->
|
|
<!-- Following wording is duplicated from earlier. Consider conref'ing. -->
|
|
Because the column statistics might be left in a stale state, do not use this
|
|
technique as a replacement for <codeph>COMPUTE STATS</codeph>. Only use this technique
|
|
if all other means of collecting statistics are impractical, or as a low-overhead
|
|
operation that you run in between periodic <codeph>COMPUTE STATS</codeph> or
|
|
<codeph>COMPUTE INCREMENTAL STATS</codeph> operations.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="concept_asb_vgl_mdb">
|
|
|
|
<title>Setting Column Statistics</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
In <keyword keyref="impala26_full"/> and higher, you can also use the <codeph>SET
|
|
COLUMN STATS</codeph> clause of <codeph>ALTER TABLE</codeph> to manually set or change
|
|
column statistics. Only use this technique in cases where it is impractical to run
|
|
<codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph>
|
|
frequently enough to keep up with data changes for a huge table.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/set_column_stats_example"/>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
</concept>
|
|
|
|
<concept rev="1.2.2" id="perf_stats_examples">
|
|
|
|
<title>Examples of Using Table and Column Statistics with Impala</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
The following examples walk through a sequence of <codeph>SHOW TABLE STATS</codeph>,
|
|
<codeph>SHOW COLUMN STATS</codeph>, <codeph>ALTER TABLE</codeph>, and
|
|
<codeph>SELECT</codeph> and <codeph>INSERT</codeph> statements to illustrate various
|
|
aspects of how Impala uses statistics to help optimize queries.
|
|
</p>
|
|
|
|
<p>
|
|
This example shows table and column statistics for the <codeph>STORE</codeph> column
|
|
used in the <xref href="http://www.tpc.org/tpcds/" scope="external" format="html">TPC-DS
|
|
benchmarks for decision support</xref> systems. It is a tiny table holding data for 12
|
|
stores. Initially, before any statistics are gathered by a <codeph>COMPUTE
|
|
STATS</codeph> statement, most of the numeric fields show placeholder values of -1,
|
|
indicating that the figures are unknown. The figures that are filled in are values that
|
|
are easily countable or deducible at the physical level, such as the number of files,
|
|
total data size of the files, and the maximum and average sizes for data types that have
|
|
a constant size such as <codeph>INT</codeph>, <codeph>FLOAT</codeph>, and
|
|
<codeph>TIMESTAMP</codeph>.
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > show table stats store;
|
|
+-------+--------+--------+--------+
|
|
| #Rows | #Files | Size | Format |
|
|
+-------+--------+--------+--------+
|
|
| -1 | 1 | 3.08KB | TEXT |
|
|
+-------+--------+--------+--------+
|
|
Returned 1 row(s) in 0.03s
|
|
[localhost:21000] > show column stats store;
|
|
+--------------------+-----------+------------------+--------+----------+----------+
|
|
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
|
|
+--------------------+-----------+------------------+--------+----------+----------+
|
|
| s_store_sk | INT | -1 | -1 | 4 | 4 |
|
|
| s_store_id | STRING | -1 | -1 | -1 | -1 |
|
|
| s_rec_start_date | TIMESTAMP | -1 | -1 | 16 | 16 |
|
|
| s_rec_end_date | TIMESTAMP | -1 | -1 | 16 | 16 |
|
|
| s_closed_date_sk | INT | -1 | -1 | 4 | 4 |
|
|
| s_store_name | STRING | -1 | -1 | -1 | -1 |
|
|
| s_number_employees | INT | -1 | -1 | 4 | 4 |
|
|
| s_floor_space | INT | -1 | -1 | 4 | 4 |
|
|
| s_hours | STRING | -1 | -1 | -1 | -1 |
|
|
| s_manager | STRING | -1 | -1 | -1 | -1 |
|
|
| s_market_id | INT | -1 | -1 | 4 | 4 |
|
|
| s_geography_class | STRING | -1 | -1 | -1 | -1 |
|
|
| s_market_desc | STRING | -1 | -1 | -1 | -1 |
|
|
| s_market_manager | STRING | -1 | -1 | -1 | -1 |
|
|
| s_division_id | INT | -1 | -1 | 4 | 4 |
|
|
| s_division_name | STRING | -1 | -1 | -1 | -1 |
|
|
| s_company_id | INT | -1 | -1 | 4 | 4 |
|
|
| s_company_name | STRING | -1 | -1 | -1 | -1 |
|
|
| s_street_number | STRING | -1 | -1 | -1 | -1 |
|
|
| s_street_name | STRING | -1 | -1 | -1 | -1 |
|
|
| s_street_type | STRING | -1 | -1 | -1 | -1 |
|
|
| s_suite_number | STRING | -1 | -1 | -1 | -1 |
|
|
| s_city | STRING | -1 | -1 | -1 | -1 |
|
|
| s_county | STRING | -1 | -1 | -1 | -1 |
|
|
| s_state | STRING | -1 | -1 | -1 | -1 |
|
|
| s_zip | STRING | -1 | -1 | -1 | -1 |
|
|
| s_country | STRING | -1 | -1 | -1 | -1 |
|
|
| s_gmt_offset | FLOAT | -1 | -1 | 4 | 4 |
|
|
| s_tax_percentage | FLOAT | -1 | -1 | 4 | 4 |
|
|
+--------------------+-----------+------------------+--------+----------+----------+
|
|
Returned 29 row(s) in 0.04s</codeblock>
|
|
|
|
<p>
|
|
With the Hive <codeph>ANALYZE TABLE</codeph> statement for column statistics, you had to
|
|
specify each column for which to gather statistics. The Impala <codeph>COMPUTE
|
|
STATS</codeph> statement automatically gathers statistics for all columns, because it
|
|
reads through the entire table relatively quickly and can efficiently compute the values
|
|
for all the columns. This example shows how after running the <codeph>COMPUTE
|
|
STATS</codeph> statement, statistics are filled in for both the table and all its
|
|
columns:
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > compute stats store;
|
|
+------------------------------------------+
|
|
| summary |
|
|
+------------------------------------------+
|
|
| Updated 1 partition(s) and 29 column(s). |
|
|
+------------------------------------------+
|
|
Returned 1 row(s) in 1.88s
|
|
[localhost:21000] > show table stats store;
|
|
+-------+--------+--------+--------+
|
|
| #Rows | #Files | Size | Format |
|
|
+-------+--------+--------+--------+
|
|
| 12 | 1 | 3.08KB | TEXT |
|
|
+-------+--------+--------+--------+
|
|
Returned 1 row(s) in 0.02s
|
|
[localhost:21000] > show column stats store;
|
|
+--------------------+-----------+------------------+--------+----------+-------------------+
|
|
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
|
|
+--------------------+-----------+------------------+--------+----------+-------------------+
|
|
| s_store_sk | INT | 12 | -1 | 4 | 4 |
|
|
| s_store_id | STRING | 6 | -1 | 16 | 16 |
|
|
| s_rec_start_date | TIMESTAMP | 4 | -1 | 16 | 16 |
|
|
| s_rec_end_date | TIMESTAMP | 3 | -1 | 16 | 16 |
|
|
| s_closed_date_sk | INT | 3 | -1 | 4 | 4 |
|
|
| s_store_name | STRING | 8 | -1 | 5 | 4.25 |
|
|
| s_number_employees | INT | 9 | -1 | 4 | 4 |
|
|
| s_floor_space | INT | 10 | -1 | 4 | 4 |
|
|
| s_hours | STRING | 2 | -1 | 8 | 7.083300113677979 |
|
|
| s_manager | STRING | 7 | -1 | 15 | 12 |
|
|
| s_market_id | INT | 7 | -1 | 4 | 4 |
|
|
| s_geography_class | STRING | 1 | -1 | 7 | 7 |
|
|
| s_market_desc | STRING | 10 | -1 | 94 | 55.5 |
|
|
| s_market_manager | STRING | 7 | -1 | 16 | 14 |
|
|
| s_division_id | INT | 1 | -1 | 4 | 4 |
|
|
| s_division_name | STRING | 1 | -1 | 7 | 7 |
|
|
| s_company_id | INT | 1 | -1 | 4 | 4 |
|
|
| s_company_name | STRING | 1 | -1 | 7 | 7 |
|
|
| s_street_number | STRING | 9 | -1 | 3 | 2.833300113677979 |
|
|
| s_street_name | STRING | 12 | -1 | 11 | 6.583300113677979 |
|
|
| s_street_type | STRING | 8 | -1 | 9 | 4.833300113677979 |
|
|
| s_suite_number | STRING | 11 | -1 | 9 | 8.25 |
|
|
| s_city | STRING | 2 | -1 | 8 | 6.5 |
|
|
| s_county | STRING | 1 | -1 | 17 | 17 |
|
|
| s_state | STRING | 1 | -1 | 2 | 2 |
|
|
| s_zip | STRING | 2 | -1 | 5 | 5 |
|
|
| s_country | STRING | 1 | -1 | 13 | 13 |
|
|
| s_gmt_offset | FLOAT | 1 | -1 | 4 | 4 |
|
|
| s_tax_percentage | FLOAT | 5 | -1 | 4 | 4 |
|
|
+--------------------+-----------+------------------+--------+----------+-------------------+
|
|
Returned 29 row(s) in 0.04s</codeblock>
|
|
|
|
<p>
|
|
The following example shows how statistics are represented for a partitioned table. In
|
|
this case, we have set up a table to hold the world's most trivial census data, a single
|
|
<codeph>STRING</codeph> field, partitioned by a <codeph>YEAR</codeph> column. The table
|
|
statistics include a separate entry for each partition, plus final totals for the
|
|
numeric fields. The column statistics include some easily deducible facts for the
|
|
partitioning column, such as the number of distinct values (the number of partition
|
|
subdirectories).
|
|
<!-- and the number of <codeph>NULL</codeph> values (none in this case). -->
|
|
</p>
|
|
|
|
<codeblock>localhost:21000] > describe census;
|
|
+------+----------+---------+
|
|
| name | type | comment |
|
|
+------+----------+---------+
|
|
| name | string | |
|
|
| year | smallint | |
|
|
+------+----------+---------+
|
|
Returned 2 row(s) in 0.02s
|
|
[localhost:21000] > show table stats census;
|
|
+-------+-------+--------+------+---------+
|
|
| year | #Rows | #Files | Size | Format |
|
|
+-------+-------+--------+------+---------+
|
|
| 2000 | -1 | 0 | 0B | TEXT |
|
|
| 2004 | -1 | 0 | 0B | TEXT |
|
|
| 2008 | -1 | 0 | 0B | TEXT |
|
|
| 2010 | -1 | 0 | 0B | TEXT |
|
|
| 2011 | 0 | 1 | 22B | TEXT |
|
|
| 2012 | -1 | 1 | 22B | TEXT |
|
|
| 2013 | -1 | 1 | 231B | PARQUET |
|
|
| Total | 0 | 3 | 275B | |
|
|
+-------+-------+--------+------+---------+
|
|
Returned 8 row(s) in 0.02s
|
|
[localhost:21000] > show column stats census;
|
|
+--------+----------+------------------+--------+----------+----------+
|
|
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
|
|
+--------+----------+------------------+--------+----------+----------+
|
|
| name | STRING | -1 | -1 | -1 | -1 |
|
|
| year | SMALLINT | 7 | -1 | 2 | 2 |
|
|
+--------+----------+------------------+--------+----------+----------+
|
|
Returned 2 row(s) in 0.02s</codeblock>
|
|
|
|
<p>
|
|
The following example shows how the statistics are filled in by a <codeph>COMPUTE
|
|
STATS</codeph> statement in Impala.
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > compute stats census;
|
|
+-----------------------------------------+
|
|
| summary |
|
|
+-----------------------------------------+
|
|
| Updated 3 partition(s) and 1 column(s). |
|
|
+-----------------------------------------+
|
|
Returned 1 row(s) in 2.16s
|
|
[localhost:21000] > show table stats census;
|
|
+-------+-------+--------+------+---------+
|
|
| year | #Rows | #Files | Size | Format |
|
|
+-------+-------+--------+------+---------+
|
|
| 2000 | -1 | 0 | 0B | TEXT |
|
|
| 2004 | -1 | 0 | 0B | TEXT |
|
|
| 2008 | -1 | 0 | 0B | TEXT |
|
|
| 2010 | -1 | 0 | 0B | TEXT |
|
|
| 2011 | 4 | 1 | 22B | TEXT |
|
|
| 2012 | 4 | 1 | 22B | TEXT |
|
|
| 2013 | 1 | 1 | 231B | PARQUET |
|
|
| Total | 9 | 3 | 275B | |
|
|
+-------+-------+--------+------+---------+
|
|
Returned 8 row(s) in 0.02s
|
|
[localhost:21000] > show column stats census;
|
|
+--------+----------+------------------+--------+----------+----------+
|
|
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
|
|
+--------+----------+------------------+--------+----------+----------+
|
|
| name | STRING | 4 | -1 | 5 | 4.5 |
|
|
| year | SMALLINT | 7 | -1 | 2 | 2 |
|
|
+--------+----------+------------------+--------+----------+----------+
|
|
Returned 2 row(s) in 0.02s</codeblock>
|
|
|
|
<p rev="1.4.0">
|
|
For examples showing how some queries work differently when statistics are available,
|
|
see <xref href="impala_perf_joins.xml#perf_joins_examples"/>. You can see how Impala
|
|
executes a query differently in each case by observing the <codeph>EXPLAIN</codeph>
|
|
output before and after collecting statistics. Measure the before and after query times,
|
|
and examine the throughput numbers in before and after <codeph>SUMMARY</codeph> or
|
|
<codeph>PROFILE</codeph> output, to verify how much the improved plan speeds up
|
|
performance.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
</concept>
|