mirror of
https://github.com/apache/impala.git
synced 2025-12-30 03:01:44 -05:00
For this change to land in master, the audience="hidden" code review needs to be completed first. Otherwise, the doc build would still work but the audience="hidden" content would be visible rather than hidden as desired. Some work happening in parallel might introduce additional instances of audience="Cloudera". I suggest addressing those in a followup CR so this global change can land quickly. Since the changes apply across so many different files, but are so narrow in scope, I suggest that the way to validate (check that no extraneous changes were introduced accidentally) is to diff just the changed lines: git diff -U0 HEAD^ HEAD In patch set 2, I updated other topics marked audience="Cloudera" by CRs that were pushed in the meantime. Change-Id: Ic93d89da77e1f51bbf548a522d98d0c4e2fb31c8 Reviewed-on: http://gerrit.cloudera.org:8080/5613 Reviewed-by: John Russell <jrussell@cloudera.com> Tested-by: Impala Public Jenkins
1050 lines
58 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
|
|
<!--
|
|
Licensed to the Apache Software Foundation (ASF) under one
|
|
or more contributor license agreements. See the NOTICE file
|
|
distributed with this work for additional information
|
|
regarding copyright ownership. The ASF licenses this file
|
|
to you under the Apache License, Version 2.0 (the
|
|
"License"); you may not use this file except in compliance
|
|
with the License. You may obtain a copy of the License at
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
Unless required by applicable law or agreed to in writing,
|
|
software distributed under the License is distributed on an
|
|
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
KIND, either express or implied. See the License for the
|
|
specific language governing permissions and limitations
|
|
under the License.
|
|
-->
|
|
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
|
<concept id="perf_stats">
|
|
|
|
<title>Table and Column Statistics</title>
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="Impala"/>
|
|
<data name="Category" value="Performance"/>
|
|
<data name="Category" value="Querying"/>
|
|
<data name="Category" value="Concepts"/>
|
|
<data name="Category" value="Developers"/>
|
|
<data name="Category" value="Data Analysts"/>
|
|
</metadata>
|
|
</prolog>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Impala can optimize complex or multi-table queries more effectively when it has access to statistics about
the volume of data and how the values are distributed. Impala uses this information to help parallelize and
distribute the work for a query. For example, optimizing a join query requires a way of determining whether one
table is <q>bigger</q> than another, which is a function of the number of rows and the average row size
for each table. The following sections describe the categories of statistics that Impala can work
with, and how to produce them and keep them up to date.
|
|
</p>
|
|
|
|
<note>
|
|
<p rev="1.2.2">
|
|
Originally, Impala relied on the Hive mechanism for collecting statistics, through the Hive <codeph>ANALYZE
|
|
TABLE</codeph> statement which initiates a MapReduce job. For better user-friendliness and reliability,
|
|
Impala implements its own <codeph>COMPUTE STATS</codeph> statement in Impala 1.2.2 and higher, along with the
|
|
<codeph>DROP STATS</codeph>, <codeph>SHOW TABLE STATS</codeph>, and <codeph>SHOW COLUMN STATS</codeph>
|
|
statements.
|
|
</p>
|
|
</note>
|
|
|
|
<p outputclass="toc inpage"/>
|
|
</conbody>
|
|
|
|
<concept id="perf_table_stats">
|
|
|
|
<title id="table_stats">Overview of Table Statistics</title>
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="Concepts"/>
|
|
</metadata>
|
|
</prolog>
|
|
|
|
<conbody>
|
|
|
|
<!-- Hive background info: https://cwiki.apache.org/Hive/statsdev.html -->
|
|
|
|
<p>
|
|
The Impala query planner can make use of statistics about entire tables and partitions.
This information includes physical characteristics such as the number of rows, the number of data files,
the total size of the data files, and the file format. For partitioned tables, the numbers
are calculated per partition, and as totals for the whole table.
This metadata is stored in the metastore database, and can be updated by either Impala or Hive.
If a number is not available, the value -1 is used as a placeholder.
Some numbers, such as the number and total size of the data files, are always kept up to date because
they can be calculated cheaply, as part of gathering HDFS block metadata.
|
|
</p>
|
|
|
|
<p>
|
|
The following example shows table stats for an unpartitioned Parquet table.
|
|
The values for the number and sizes of files are always available.
|
|
Initially, the number of rows is not known, because it requires a potentially expensive
|
|
scan through the entire table, and so that value is displayed as -1.
|
|
The <codeph>COMPUTE STATS</codeph> statement fills in any unknown table stats values.
|
|
</p>
|
|
|
|
<codeblock>
show table stats parquet_snappy;
+-------+--------+---------+--------------+-------------------+---------+-------------------+...
| #Rows | #Files | Size    | Bytes Cached | Cache Replication | Format  | Incremental stats |...
+-------+--------+---------+--------------+-------------------+---------+-------------------+...
| -1    | 96     | 23.35GB | NOT CACHED   | NOT CACHED        | PARQUET | false             |...
+-------+--------+---------+--------------+-------------------+---------+-------------------+...

compute stats parquet_snappy;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 6 column(s). |
+-----------------------------------------+

show table stats parquet_snappy;
+------------+--------+---------+--------------+-------------------+---------+-------------------+...
| #Rows      | #Files | Size    | Bytes Cached | Cache Replication | Format  | Incremental stats |...
+------------+--------+---------+--------------+-------------------+---------+-------------------+...
| 1000000000 | 96     | 23.35GB | NOT CACHED   | NOT CACHED        | PARQUET | false             |...
+------------+--------+---------+--------------+-------------------+---------+-------------------+...
</codeblock>
|
|
|
|
<p>
|
|
Impala performs some optimizations using this table-level metadata alone, and other optimizations
using a combination of table and column statistics.
|
|
</p>
|
|
|
|
<p rev="1.2.1">
|
|
To check that table statistics are available for a table, and see the details of those statistics, use the
|
|
statement <codeph>SHOW TABLE STATS <varname>table_name</varname></codeph>. See
|
|
<xref href="impala_show.xml#show"/> for details.
|
|
</p>
|
|
|
|
<p>
|
|
If you use the Hive-based methods of gathering statistics, see
|
|
<xref href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" scope="external" format="html">the
|
|
Hive wiki</xref> for information about the required configuration on the Hive side. <ph rev="upstream">Cloudera</ph> recommends
|
|
using the Impala <codeph>COMPUTE STATS</codeph> statement to avoid potential configuration and scalability
|
|
issues with the statistics-gathering process.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept id="perf_column_stats">
|
|
|
|
<title id="column_stats">Overview of Column Statistics</title>
|
|
|
|
<conbody>
|
|
|
|
<!-- Cloudera+Hive background information: http://blog.cloudera.com/blog/2012/08/column-statistics-in-hive/ -->
|
|
|
|
<p>
|
|
The Impala query planner can make use of statistics about individual columns when that metadata is
|
|
available in the metastore database. This technique is most valuable for columns compared across tables in
|
|
<xref href="impala_perf_joins.xml#perf_joins">join queries</xref>, to help estimate how many rows the query
|
|
will retrieve from each table. <ph rev="2.0.0"> These statistics are also important for correlated
|
|
subqueries using the <codeph>EXISTS()</codeph> or <codeph>IN()</codeph> operators, which are processed
|
|
internally the same way as join queries.</ph>
|
|
</p>
|
|
|
|
<p>
|
|
The following example shows column stats for an unpartitioned Parquet table.
|
|
The values for the maximum and average sizes of some types are always available,
|
|
because those figures are constant for numeric and other fixed-size types.
|
|
Initially, the number of distinct values is not known, because it requires a potentially expensive
|
|
scan through the entire table, and so that value is displayed as -1.
|
|
The same applies to maximum and average sizes of variable-sized types, such as <codeph>STRING</codeph>.
|
|
The <codeph>COMPUTE STATS</codeph> statement fills in most unknown column stats values.
|
|
(It does not record the number of <codeph>NULL</codeph> values, because currently Impala
|
|
does not use that figure for query optimization.)
|
|
</p>
|
|
|
|
<codeblock>
show column stats parquet_snappy;
+-------------+----------+------------------+--------+----------+----------+
| Column      | Type     | #Distinct Values | #Nulls | Max Size | Avg Size |
+-------------+----------+------------------+--------+----------+----------+
| id          | BIGINT   | -1               | -1     | 8        | 8        |
| val         | INT      | -1               | -1     | 4        | 4        |
| zerofill    | STRING   | -1               | -1     | -1       | -1       |
| name        | STRING   | -1               | -1     | -1       | -1       |
| assertion   | BOOLEAN  | -1               | -1     | 1        | 1        |
| location_id | SMALLINT | -1               | -1     | 2        | 2        |
+-------------+----------+------------------+--------+----------+----------+

compute stats parquet_snappy;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 6 column(s). |
+-----------------------------------------+

show column stats parquet_snappy;
+-------------+----------+------------------+--------+----------+-------------------+
| Column      | Type     | #Distinct Values | #Nulls | Max Size | Avg Size          |
+-------------+----------+------------------+--------+----------+-------------------+
| id          | BIGINT   | 183861280        | -1     | 8        | 8                 |
| val         | INT      | 139017           | -1     | 4        | 4                 |
| zerofill    | STRING   | 101761           | -1     | 6        | 6                 |
| name        | STRING   | 145636240        | -1     | 22       | 13.00020027160645 |
| assertion   | BOOLEAN  | 2                | -1     | 1        | 1                 |
| location_id | SMALLINT | 339              | -1     | 2        | 2                 |
+-------------+----------+------------------+--------+----------+-------------------+
</codeblock>
|
|
|
|
<note>
|
|
<p>
|
|
For column statistics to be effective in Impala, you also need to have table statistics for the
|
|
applicable tables, as described in <xref href="impala_perf_stats.xml#perf_table_stats"/>. When you use
|
|
the Impala <codeph>COMPUTE STATS</codeph> statement, both table and column statistics are automatically
|
|
gathered at the same time, for all columns in the table.
|
|
</p>
|
|
<p conref="../shared/impala_common.xml#common/decimal_no_stats"/>
|
|
</note>
|
|
|
|
<note conref="../shared/impala_common.xml#common/compute_stats_nulls"/>
|
|
|
|
<!-- Hive-based instructions are considered obsolete since the introduction of the Impala COMPUTE STATS statement.
|
|
<p>
|
|
Add settings like the following to the <filepath>hive-site.xml</filepath>
|
|
configuration file, in the Hive configuration directory, on every node where you run
|
|
<codeph>ANALYZE TABLE</codeph> statements through the
|
|
<codeph>hive</codeph> shell. The
|
|
<codeph>hive.stats.ndv.error</codeph> setting represents the standard error when
|
|
estimating the number of distinct values for a column. The value of 5.0 is recommended as a tradeoff between the
|
|
accuracy of the gathered statistics and the resource usage of the stats-gathering process.
|
|
</p>
|
|
|
|
<codeblock><![CDATA[<property>
|
|
<name>hive.stats.ndv.error</name>
|
|
<value>5.0</value>
|
|
</property>]]></codeblock>
|
|
|
|
<p>
|
|
5.0 is a relatively low value that devotes substantial computational resources to the statistics-gathering
|
|
process. To reduce the resource usage, you could increase this value; to make the statistics even more precise,
|
|
you could lower it.
|
|
</p>
|
|
|
|
<p>
|
|
The syntax for gathering column statistics uses the <codeph>ANALYZE TABLE ...
|
|
COMPUTE STATISTICS</codeph> clause, with an additional <codeph>FOR
|
|
COLUMNS</codeph> clause. For partitioned tables, you can gather statistics for specific partitions by including
|
|
a clause <codeph>PARTITION
|
|
(<varname>col1=val1</varname>,<varname>col2=val2</varname>,
|
|
...)</codeph>; but you cannot include the partitioning columns in the
|
|
<codeph>FOR COLUMNS</codeph> clause. Also, you cannot use fully qualified table
|
|
names, so issue a <codeph>USE</codeph> command first to switch to the
|
|
appropriate database. For example:
|
|
</p>
|
|
|
|
<codeblock>USE <varname>database_name</varname>;
|
|
ANALYZE TABLE <varname>table_name</varname> COMPUTE STATISTICS FOR COLUMNS <varname>column_list</varname>;
|
|
ANALYZE TABLE <varname>table_name</varname> PARTITION (<varname>partition_specs</varname>) COMPUTE STATISTICS FOR COLUMNS <varname>column_list</varname>;</codeblock>
|
|
-->
|
|
|
|
<p rev="1.2.1">
|
|
To check whether column statistics are available for a particular set of columns, use the <codeph>SHOW
|
|
COLUMN STATS <varname>table_name</varname></codeph> statement, or check the extended
|
|
<codeph>EXPLAIN</codeph> output for a query against that table that refers to those columns. See
|
|
<xref href="impala_show.xml#show"/> and <xref href="impala_explain.xml#explain"/> for details.
|
|
</p>
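<p>
As a minimal sketch of both checks, the following statements use the <codeph>parquet_snappy</codeph> table
from the earlier example. The query itself is purely illustrative, and the output is omitted because it depends
on your data. Raising the <codeph>EXPLAIN_LEVEL</codeph> query option makes the plan more detailed; when
statistics are missing, the plan typically begins with a warning that names the affected tables.
</p>
<codeblock>
-- Column-level check: -1 under #Distinct Values means the figure is unknown.
show column stats parquet_snappy;

-- Plan-level check: request a more detailed plan, then look for a
-- missing-statistics warning at the top of the EXPLAIN output.
set explain_level=extended;
explain select count(distinct name) from parquet_snappy where location_id = 100;
</codeblock>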
|
|
|
|
<p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept id="perf_stats_partitions">
|
|
<title id="stats_partitions">How Table and Column Statistics Work for Partitioned Tables</title>
|
|
<conbody>
|
|
|
|
<p>
|
|
When you use Impala for <q>big data</q>, you are highly likely to use partitioning
|
|
for your biggest tables, the ones representing data that can be logically divided
|
|
based on dates, geographic regions, or similar criteria. The table and column statistics
|
|
are especially useful for optimizing queries on such tables. For example, a query involving
|
|
one year might involve substantially more or less data than a query involving a different year,
|
|
or a range of several years. Each query might be optimized differently as a result.
|
|
</p>
|
|
|
|
<p>
|
|
The following examples show how table and column stats work with a partitioned table.
|
|
The table for this example is partitioned by year, month, and day.
|
|
For simplicity, the sample data consists of 5 partitions, all from the same year and month.
|
|
Table stats are collected independently for each partition. (In fact, the
|
|
<codeph>SHOW PARTITIONS</codeph> statement displays exactly the same information as
|
|
<codeph>SHOW TABLE STATS</codeph> for a partitioned table.) Column stats apply to
|
|
the entire table, not to individual partitions. Because the partition key column values
|
|
are represented as HDFS directories, their characteristics are typically known in advance,
|
|
even when the values for non-key columns are shown as -1.
|
|
</p>
|
|
|
|
<codeblock>
show partitions year_month_day;
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
| year  | month | day | #Rows | #Files | Size    | Bytes Cached | Cache Replication | Format  |...
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
| 2013  | 12    | 1   | -1    | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 2   | -1    | 1      | 2.53MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 3   | -1    | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 4   | -1    | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 5   | -1    | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| Total |       |     | -1    | 5      | 12.58MB | 0B           |                   |         |...
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...

show table stats year_month_day;
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
| year  | month | day | #Rows | #Files | Size    | Bytes Cached | Cache Replication | Format  |...
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
| 2013  | 12    | 1   | -1    | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 2   | -1    | 1      | 2.53MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 3   | -1    | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 4   | -1    | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 5   | -1    | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| Total |       |     | -1    | 5      | 12.58MB | 0B           |                   |         |...
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...

show column stats year_month_day;
+-----------+---------+------------------+--------+----------+----------+
| Column    | Type    | #Distinct Values | #Nulls | Max Size | Avg Size |
+-----------+---------+------------------+--------+----------+----------+
| id        | INT     | -1               | -1     | 4        | 4        |
| val       | INT     | -1               | -1     | 4        | 4        |
| zfill     | STRING  | -1               | -1     | -1       | -1       |
| name      | STRING  | -1               | -1     | -1       | -1       |
| assertion | BOOLEAN | -1               | -1     | 1        | 1        |
| year      | INT     | 1                | 0      | 4        | 4        |
| month     | INT     | 1                | 0      | 4        | 4        |
| day       | INT     | 5                | 0      | 4        | 4        |
+-----------+---------+------------------+--------+----------+----------+

compute stats year_month_day;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 5 partition(s) and 5 column(s). |
+-----------------------------------------+

show table stats year_month_day;
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+...
| year  | month | day | #Rows  | #Files | Size    | Bytes Cached | Cache Replication | Format  |...
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+...
| 2013  | 12    | 1   | 93606  | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 2   | 94158  | 1      | 2.53MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 3   | 94122  | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 4   | 93559  | 1      | 2.51MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| 2013  | 12    | 5   | 93845  | 1      | 2.52MB  | NOT CACHED   | NOT CACHED        | PARQUET |...
| Total |       |     | 469290 | 5      | 12.58MB | 0B           |                   |         |...
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+...

show column stats year_month_day;
+-----------+---------+------------------+--------+----------+-------------------+
| Column    | Type    | #Distinct Values | #Nulls | Max Size | Avg Size          |
+-----------+---------+------------------+--------+----------+-------------------+
| id        | INT     | 511129           | -1     | 4        | 4                 |
| val       | INT     | 364853           | -1     | 4        | 4                 |
| zfill     | STRING  | 311430           | -1     | 6        | 6                 |
| name      | STRING  | 471975           | -1     | 22       | 13.00160026550293 |
| assertion | BOOLEAN | 2                | -1     | 1        | 1                 |
| year      | INT     | 1                | 0      | 4        | 4                 |
| month     | INT     | 1                | 0      | 4        | 4                 |
| day       | INT     | 5                | 0      | 4        | 4                 |
+-----------+---------+------------------+--------+----------+-------------------+
</codeblock>
|
|
|
|
<note>
|
|
Partitioned tables can grow so large that scanning the entire table, as the <codeph>COMPUTE STATS</codeph>
statement does, is impractical just to update the statistics for a new partition. The standard
<codeph>COMPUTE STATS</codeph> statement might take hours, or even days. In that situation, switch
to incremental statistics, a feature available in <keyword keyref="impala21_full"/> and higher.
See <xref href="impala_perf_stats.xml#perf_stats_incremental"/> for details about this feature
and the <codeph>COMPUTE INCREMENTAL STATS</codeph> syntax.
|
|
</note>
|
|
|
|
<p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept rev="2.1.0" id="perf_stats_incremental">
|
|
|
|
<title id="incremental_stats">Overview of Incremental Statistics</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
In Impala 2.1.0 and higher, you can use the syntax <codeph>COMPUTE INCREMENTAL STATS</codeph> and
|
|
<codeph>DROP INCREMENTAL STATS</codeph>. The <codeph>INCREMENTAL</codeph> clauses work with incremental
|
|
statistics, a specialized feature for partitioned tables that are large or frequently updated with new
|
|
partitions.
|
|
</p>
|
|
|
|
<p>
|
|
When you compute incremental statistics for a partitioned table, by default Impala only processes those
|
|
partitions that do not yet have incremental statistics. By processing only newly added partitions, you can
|
|
keep statistics up to date for large partitioned tables, without incurring the overhead of reprocessing the
|
|
entire table each time.
|
|
</p>
|
|
|
|
<p>
|
|
You can also compute or drop statistics for a single partition by including a <codeph>PARTITION</codeph>
|
|
clause in the <codeph>COMPUTE INCREMENTAL STATS</codeph> or <codeph>DROP INCREMENTAL STATS</codeph>
|
|
statement.
|
|
</p>
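<p>
For illustration, assuming the <codeph>year_month_day</codeph> table from the earlier partitioned example,
a per-partition sequence might look like the following sketch. The partition key values are examples only;
substitute the values for the partition you want to process.
</p>
<codeblock>
-- Compute incremental stats for one partition only.
compute incremental stats year_month_day partition (year=2013, month=12, day=5);

-- Discard the incremental stats for that same partition.
drop incremental stats year_month_day partition (year=2013, month=12, day=5);
</codeblock>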
|
|
|
|
<p>
|
|
The metadata for incremental statistics is handled differently from the original style of statistics:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
<p>
|
|
If you have an existing partitioned table for which you have already computed statistics, issuing
|
|
<codeph>COMPUTE INCREMENTAL STATS</codeph> without a partition clause causes Impala to rescan the
|
|
entire table. Once the incremental statistics are computed, any future <codeph>COMPUTE INCREMENTAL
|
|
STATS</codeph> statements only scan any new partitions and any partitions where you performed
|
|
<codeph>DROP INCREMENTAL STATS</codeph>.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
The <codeph>SHOW TABLE STATS</codeph> and <codeph>SHOW PARTITIONS</codeph> statements now include an
additional column showing whether incremental statistics are available for each partition. A partition
could already be covered by the original type of statistics based on a prior <codeph>COMPUTE
STATS</codeph> statement, as indicated by a value other than <codeph>-1</codeph> under the
<codeph>#Rows</codeph> column. Impala query planning uses either kind of statistics when available.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
<codeph>COMPUTE INCREMENTAL STATS</codeph> takes more time than <codeph>COMPUTE STATS</codeph> for the
|
|
same volume of data. Therefore it is most suitable for tables with large data volume where new
|
|
partitions are added frequently, making it impractical to run a full <codeph>COMPUTE STATS</codeph>
|
|
operation for each new partition. For unpartitioned tables, or partitioned tables that are loaded once
|
|
and not updated with new partitions, use the original <codeph>COMPUTE STATS</codeph> syntax.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
<codeph>COMPUTE INCREMENTAL STATS</codeph> uses some memory in the <cmdname>catalogd</cmdname> process,
proportional to the number of partitions and number of columns in the applicable table. The memory
overhead is approximately 400 bytes for each column in each partition; for example, a table with 1,000
partitions and 100 columns needs roughly 40 MB. This memory is reserved in the
<cmdname>catalogd</cmdname> daemon, the <cmdname>statestored</cmdname> daemon, and in each instance of
the <cmdname>impalad</cmdname> daemon.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
In cases where new files are added to an existing partition, issue a <codeph>REFRESH</codeph> statement
|
|
for the table, followed by a <codeph>DROP INCREMENTAL STATS</codeph> and <codeph>COMPUTE INCREMENTAL
|
|
STATS</codeph> sequence for the changed partition.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
The <codeph>DROP INCREMENTAL STATS</codeph> statement operates only on a single partition at a time. To
|
|
remove statistics (whether incremental or not) from all partitions of a table, issue a <codeph>DROP
|
|
STATS</codeph> statement with no <codeph>INCREMENTAL</codeph> or <codeph>PARTITION</codeph> clauses.
|
|
</p>
|
|
</li>
|
|
</ul>
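<p>
The sequence for a partition that received new data files can be sketched as follows, again assuming the
<codeph>year_month_day</codeph> table and an arbitrarily chosen partition. The final statement shows the
table-wide <codeph>DROP STATS</codeph> form described in the last item above.
</p>
<codeblock>
-- Make Impala aware of the newly added data files.
refresh year_month_day;

-- Throw away the stale stats for the changed partition, then recompute them.
drop incremental stats year_month_day partition (year=2013, month=12, day=3);
compute incremental stats year_month_day partition (year=2013, month=12, day=3);

-- Remove all stats, incremental or not, from every partition of the table.
drop stats year_month_day;
</codeblock>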
|
|
|
|
<p>
|
|
The following considerations apply to incremental statistics when the structure of an existing table is
|
|
changed (known as <term>schema evolution</term>):
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
<p>
|
|
If you use an <codeph>ALTER TABLE</codeph> statement to drop a column, the existing statistics remain
|
|
valid and <codeph>COMPUTE INCREMENTAL STATS</codeph> does not rescan any partitions.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
If you use an <codeph>ALTER TABLE</codeph> statement to add a column, Impala rescans all partitions and
|
|
fills in the appropriate column-level values the next time you run <codeph>COMPUTE INCREMENTAL
|
|
STATS</codeph>.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
If you use an <codeph>ALTER TABLE</codeph> statement to change the data type of a column, Impala
|
|
rescans all partitions and fills in the appropriate column-level values the next time you run
|
|
<codeph>COMPUTE INCREMENTAL STATS</codeph>.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
If you use an <codeph>ALTER TABLE</codeph> statement to change the file format of a table, the existing
|
|
statistics remain valid and a subsequent <codeph>COMPUTE INCREMENTAL STATS</codeph> does not rescan any
|
|
partitions.
|
|
</p>
|
|
</li>
|
|
</ul>
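<p>
As a hedged illustration of the add-column case, the following sketch adds a hypothetical
<codeph>notes</codeph> column to the <codeph>year_month_day</codeph> table. According to the rules above,
the next <codeph>COMPUTE INCREMENTAL STATS</codeph> rescans all partitions, so plan for a full-table scan
rather than a quick incremental update.
</p>
<codeblock>
-- The column name and type are assumptions chosen for this example.
alter table year_month_day add columns (notes string);

-- Rescans every partition to fill in column-level values for the new column.
compute incremental stats year_month_day;

-- Check the column-level values recorded for the new column.
show column stats year_month_day;
</codeblock>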
|
|
|
|
<p>
|
|
See <xref href="impala_compute_stats.xml#compute_stats"/> and
|
|
<xref href="impala_drop_stats.xml#drop_stats"/> for syntax details.
|
|
</p>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept id="perf_stats_computing">
|
|
<title>Generating Table and Column Statistics (COMPUTE STATS Statement)</title>
|
|
<conbody>
|
|
|
|
<p>
|
|
To gather table statistics after loading data into a table or partition, you typically use the
|
|
<codeph>COMPUTE STATS</codeph> statement. This statement is available in Impala 1.2.2 and higher.
|
|
It gathers both table statistics and column statistics for all columns in a single operation.
|
|
For large partitioned tables, where you frequently need to update statistics and it is impractical
|
|
to scan the entire table each time, use the syntax <codeph>COMPUTE INCREMENTAL STATS</codeph>,
|
|
which is available in <keyword keyref="impala21_full"/> and higher.
|
|
</p>
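<p>
A typical load-then-analyze sequence might look like the following sketch. The table names
<codeph>sales</codeph>, <codeph>sales_staging</codeph>, and <codeph>sales_by_day</codeph> are hypothetical
and stand in for your own tables.
</p>
<codeblock>
-- Unpartitioned or rarely reloaded table: gather full stats after each load.
insert into sales select * from sales_staging;
compute stats sales;

-- Large partitioned table that gains partitions regularly: only partitions
-- that do not yet have incremental stats are scanned.
compute incremental stats sales_by_day;
</codeblock>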
|
|
|
|
<p>
|
|
If you use Hive as part of your ETL workflow, you can also use Hive to generate table and
column statistics. You might need to do extra configuration within Hive itself or the metastore,
or even set up a separate database to hold Hive-generated statistics. You might also need to run
multiple statements to generate all the necessary statistics. Therefore, prefer the
Impala <codeph>COMPUTE STATS</codeph> statement where that technique is practical.
For details about collecting statistics through Hive, see
<xref href="https://cwiki.apache.org/confluence/display/Hive/StatsDev" scope="external" format="html">the Hive wiki</xref>.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/hive_column_stats_caveat"/>
|
|
|
|
<!-- Commenting out over-detailed Hive instructions as part of stats reorg.
|
|
<li>
|
|
Issue an <codeph>ANALYZE TABLE</codeph> statement in Hive, for the entire table or a specific partition.
|
|
<codeblock>ANALYZE TABLE <varname>tablename</varname> [PARTITION(<varname>partcol1</varname>[=<varname>val1</varname>], <varname>partcol2</varname>[=<varname>val2</varname>], ...)] COMPUTE STATISTICS [NOSCAN];</codeblock>
|
|
For example, to gather statistics for a non-partitioned table:
|
|
<codeblock>ANALYZE TABLE customer COMPUTE STATISTICS;</codeblock>
|
|
To gather statistics for a <codeph>store</codeph> table partitioned by state and city, and both of its
|
|
partitions:
|
|
<codeblock>ANALYZE TABLE store PARTITION(s_state, s_county) COMPUTE STATISTICS;</codeblock>
|
|
To gather statistics for the <codeph>store</codeph> table and only the partitions for California:
|
|
<codeblock>ANALYZE TABLE store PARTITION(s_state='CA', s_county) COMPUTE STATISTICS;</codeblock>
|
|
</li>
|
|
|
|
<li>
|
|
Load the data through the <codeph>INSERT OVERWRITE</codeph> statement in Hive, while the Hive setting
|
|
<b>hive.stats.autogather</b> is enabled.
|
|
</li>
|
|
|
|
</ul>
|
|
-->
|
|
|
|
<p rev="2.0.1">
|
|
<!-- Additional info as a result of IMPALA-1420 -->
|
|
<!-- Keep checking if https://issues.apache.org/jira/browse/HIVE-8648 ever gets fixed and when that fix makes it into a CDH release. -->
|
|
For your very largest tables, you might find that <codeph>COMPUTE STATS</codeph> or even <codeph>COMPUTE INCREMENTAL STATS</codeph>
takes so long to scan the data that it is impractical to use regularly. In such a case, after adding a partition or inserting new data,
you can update just the number-of-rows property through an <codeph>ALTER TABLE</codeph> statement.
See <xref href="impala_perf_stats.xml#perf_table_stats_manual"/> for details.
Because the column statistics might be left in a stale state, do not use this technique as a replacement
for <codeph>COMPUTE STATS</codeph>. Use it only if all other means of collecting statistics are impractical, or as a
low-overhead operation that you run in between periodic <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> operations.
|
|
</p>
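<p>
As a sketch of this manual technique, the following statements set the row count through the
<codeph>numRows</codeph> table property, first for a whole table and then for a single partition.
The table names and row counts are hypothetical. The <codeph>STATS_GENERATED_VIA_STATS_TASK</codeph>
property is included on the assumption that your Impala release needs it for the new count to take
effect; check the topic referenced above for the exact requirements of your version.
</p>
<codeblock>
-- Whole table: record an approximate row count without scanning any data.
alter table analysis_data set tblproperties('numRows'='1000000', 'STATS_GENERATED_VIA_STATS_TASK'='true');

-- Single partition: record the row count for just the newly loaded partition.
alter table partitioned_data partition (year=2009, month=4)
  set tblproperties('numRows'='30000', 'STATS_GENERATED_VIA_STATS_TASK'='true');

-- Confirm the value now shown under #Rows.
show table stats partitioned_data;
</codeblock>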
|
|
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept rev="2.1.0" id="perf_stats_checking">
|
|
|
|
<title>Detecting Missing Statistics</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
You can check whether a specific table has statistics using the <codeph>SHOW TABLE STATS</codeph> statement
|
|
(for any table) or the <codeph>SHOW PARTITIONS</codeph> statement (for a partitioned table). Both
|
|
statements display the same information. If a table or a partition does not have any statistics, the
|
|
<codeph>#Rows</codeph> field contains <codeph>-1</codeph>. Once you compute statistics for the table or
|
|
partition, the <codeph>#Rows</codeph> field changes to an accurate value.
|
|
</p>
|
|
|
|
<p>
|
|
The following example shows a table that initially does not have any statistics. The <codeph>SHOW TABLE
|
|
STATS</codeph> statement displays different values for <codeph>#Rows</codeph> before and after the
|
|
<codeph>COMPUTE STATS</codeph> operation.
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > create table no_stats (x int);
[localhost:21000] > show table stats no_stats;
+-------+--------+------+--------------+--------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+--------+------+--------------+--------+-------------------+
| -1    | 0      | 0B   | NOT CACHED   | TEXT   | false             |
+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] > compute stats no_stats;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
[localhost:21000] > show table stats no_stats;
+-------+--------+------+--------------+--------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+--------+------+--------------+--------+-------------------+
| 0     | 0      | 0B   | NOT CACHED   | TEXT   | false             |
+-------+--------+------+--------------+--------+-------------------+
</codeblock>
|
|
|
|
<p>
|
|
The following example shows a similar progression with a partitioned table. Initially,
|
|
<codeph>#Rows</codeph> is <codeph>-1</codeph>. After a <codeph>COMPUTE STATS</codeph> operation,
|
|
<codeph>#Rows</codeph> changes to an accurate value. Any newly added partition starts with no statistics,
|
|
meaning that you must collect statistics after adding a new partition.
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > create table no_stats_partitioned (x int) partitioned by (year smallint);
[localhost:21000] > show table stats no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year  | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| Total | -1    | 0      | 0B   | 0B           |        |                   |
+-------+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] > show partitions no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year  | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| Total | -1    | 0      | 0B   | 0B           |        |                   |
+-------+-------+--------+------+--------------+--------+-------------------+
[localhost:21000] > alter table no_stats_partitioned add partition (year=2013);
[localhost:21000] > compute stats no_stats_partitioned;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
[localhost:21000] > alter table no_stats_partitioned add partition (year=2014);
[localhost:21000] > show partitions no_stats_partitioned;
+-------+-------+--------+------+--------------+--------+-------------------+
| year  | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats |
+-------+-------+--------+------+--------------+--------+-------------------+
| 2013  | 0     | 0      | 0B   | NOT CACHED   | TEXT   | false             |
| 2014  | -1    | 0      | 0B   | NOT CACHED   | TEXT   | false             |
| Total | 0     | 0      | 0B   | 0B           |        |                   |
+-------+-------+--------+------+--------------+--------+-------------------+
</codeblock>
|
|
|
|
<note>
|
|
Because the default <codeph>COMPUTE STATS</codeph> statement creates and updates statistics for all
|
|
partitions in a table, if you expect to frequently add new partitions, use the <codeph>COMPUTE INCREMENTAL
|
|
STATS</codeph> syntax instead, which lets you compute stats for a single specified partition, or only for
|
|
those partitions that do not already have incremental stats.
|
|
</note>
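
<p>
  As a minimal sketch of that approach, reusing the <codeph>no_stats_partitioned</codeph> table from the
  preceding example, the following statements compute incremental stats first for one named partition and
  then for any partitions that do not yet have incremental stats. Output is omitted because it depends on
  the data in your own table.
</p>

<codeblock>-- Compute incremental stats only for the newly added partition.
compute incremental stats no_stats_partitioned partition (year=2014);

-- Without a PARTITION clause, the statement processes only those
-- partitions that do not already have incremental stats.
compute incremental stats no_stats_partitioned;
</codeblock>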
|
|
|
|
<p>
|
|
If checking each individual table is impractical, due to a large number of tables or views that hide the
|
|
underlying base tables, you can also check for missing statistics for a particular query. Use the
|
|
<codeph>EXPLAIN</codeph> statement to preview query efficiency before actually running the query. Use the
|
|
query profile output available through the <codeph>PROFILE</codeph> command in
|
|
<cmdname>impala-shell</cmdname> or the web UI to verify query execution and timing after running the query.
|
|
Both the <codeph>EXPLAIN</codeph> plan and the <codeph>PROFILE</codeph> output display a warning if any
|
|
tables or partitions involved in the query do not have statistics.
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > create table no_stats (x int);
[localhost:21000] > explain select count(*) from no_stats;
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=10.00MB VCores=1                           |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| incremental_stats.no_stats                                                         |
|                                                                                    |
| 03:AGGREGATE [FINALIZE]                                                            |
| |  output: count:merge(*)                                                          |
| |                                                                                  |
| 02:EXCHANGE [UNPARTITIONED]                                                        |
| |                                                                                  |
| 01:AGGREGATE                                                                       |
| |  output: count(*)                                                                |
| |                                                                                  |
| 00:SCAN HDFS [incremental_stats.no_stats]                                          |
|    partitions=1/1 files=0 size=0B                                                  |
+------------------------------------------------------------------------------------+
</codeblock>
|
|
|
|
<p>
  Because Impala uses <term>partition pruning</term> when possible to evaluate only the relevant
  partitions, for a partitioned table with statistics for some partitions and not others, whether
  the <codeph>EXPLAIN</codeph> statement shows the warning depends on the partitions that the query
  actually reads. For example, different queries against the same table might or might not produce
  the warning:
</p>
|
|
|
|
<codeblock>-- No warning because all the partitions for the year 2012 have stats.
EXPLAIN SELECT ... FROM t1 WHERE year = 2012;

-- Missing stats warning because one or more partitions in this range
-- do not have stats.
EXPLAIN SELECT ... FROM t1 WHERE year BETWEEN 2006 AND 2009;
</codeblock>
|
|
|
|
<p>
  To confirm whether any partitions in the table are missing statistics, you can explain a query that
  scans the entire table, such as <codeph>SELECT COUNT(*) FROM <varname>table_name</varname></codeph>.
</p>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept rev="2.1.0" id="perf_stats_collecting">
|
|
|
|
<title>Keeping Statistics Up to Date</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
When the contents of a table or partition change significantly, recompute the stats for the relevant table
|
|
or partition. The degree of change that qualifies as <q>significant</q> varies, depending on the absolute
|
|
and relative sizes of the tables. Typically, if you add more than 30% more data to a table, it is
|
|
worthwhile to recompute stats, because the differences in number of rows and number of distinct values
|
|
might cause Impala to choose a different join order when that table is used in join queries. This guideline
|
|
is most important for the largest tables. For example, adding 30% new data to a table containing 1 TB has a
|
|
greater effect on join order than adding 30% to a table containing only a few megabytes, and the larger
|
|
table has a greater effect on query performance if Impala chooses a suboptimal join order as a result of
|
|
outdated statistics.
|
|
</p>
|
|
|
|
<p>
|
|
If you reload a complete new set of data for a table, but the number of rows and number of distinct values
|
|
for each column is relatively unchanged from before, you do not need to recompute stats for the table.
|
|
</p>
|
|
|
|
<p>
|
|
If the statistics for a table are out of date, and the table's large size makes it impractical to recompute
|
|
new stats immediately, you can use the <codeph>DROP STATS</codeph> statement to remove the obsolete
|
|
statistics, making it easier to identify tables that need a new <codeph>COMPUTE STATS</codeph> operation.
|
|
</p>
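
<p>
  A minimal sketch of that workflow follows. The table name <codeph>big_table</codeph> is hypothetical and
  stands in for any large table whose statistics have gone stale.
</p>

<codeblock>-- Remove the obsolete table and column statistics.
drop stats big_table;

-- With no statistics present, #Rows should show -1 again,
-- which marks the table as needing a new COMPUTE STATS operation.
show table stats big_table;

-- Later, when the cluster has capacity, gather fresh statistics.
compute stats big_table;
</codeblock>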
|
|
|
|
<p>
|
|
For a large partitioned table, consider using the incremental stats feature available in Impala 2.1.0 and
|
|
higher, as explained in <xref href="impala_perf_stats.xml#perf_stats_incremental"/>. If you add a new
|
|
partition to a table, it is worthwhile to recompute incremental stats, because the operation only scans the
|
|
data for that one new partition.
|
|
</p>
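
<p>
  The following sketch illustrates that sequence. The table <codeph>sales_data</codeph> and its
  <codeph>year</codeph> partition key are hypothetical names chosen for illustration, and the step that
  loads data into the new partition is left as a comment because it depends on your environment.
</p>

<codeblock>-- Add the new partition, then load its data by whatever means you normally use.
alter table sales_data add partition (year=2015);
-- ... load data into the year=2015 partition ...

-- Scan only the new partition to bring the statistics up to date.
compute incremental stats sales_data partition (year=2015);

-- Check the result: the new partition should report a row count other than -1.
show table stats sales_data;
</codeblock>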
|
|
</conbody>
|
|
</concept>
|
|
|
|
<!-- Might deserve its own conceptual topic at some point. -->
|
|
|
|
<concept audience="hidden" rev="1.2.2" id="perf_stats_joins">
|
|
|
|
<title>How Statistics Are Used in Join Queries</title>
|
|
|
|
<conbody>
|
|
|
|
<p></p>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<!-- Might deserve its own conceptual topic at some point. -->
|
|
|
|
<concept audience="hidden" rev="1.2.2" id="perf_stats_inserts">
|
|
|
|
<title>How Statistics Are Used in INSERT Operations</title>
|
|
|
|
<conbody>
|
|
|
|
<p conref="../shared/impala_common.xml#common/insert_hints"/>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept rev="1.2.2" id="perf_table_stats_manual">
|
|
|
|
<title>Setting the NUMROWS Value Manually through ALTER TABLE</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
  The most crucial statistic is the number of rows: for the table as a whole (whether unpartitioned
  or partitioned) and for each partition (for a partitioned table). The <codeph>COMPUTE STATS</codeph>
  statement always gathers statistics about all columns, as well as overall table statistics. If it is not
  practical to do a full <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph>
  operation after adding a partition or inserting data, or if you can see that Impala would produce a more
  efficient plan if the number of rows were different, you can manually set the number of rows through an
  <codeph>ALTER TABLE</codeph> statement:
</p>
|
|
|
|
<codeblock>
-- Set total number of rows. Applies to both unpartitioned and partitioned tables.
alter table <varname>table_name</varname> set tblproperties('numRows'='<varname>new_value</varname>', 'STATS_GENERATED_VIA_STATS_TASK'='true');

-- Set total number of rows for a specific partition. Applies to partitioned tables only.
-- You must specify all the partition key columns in the PARTITION clause.
alter table <varname>table_name</varname> partition (<varname>keycol1</varname>=<varname>val1</varname>,<varname>keycol2</varname>=<varname>val2</varname>...) set tblproperties('numRows'='<varname>new_value</varname>', 'STATS_GENERATED_VIA_STATS_TASK'='true');
</codeblock>
|
|
|
|
<p>
|
|
This statement avoids re-scanning any data files. (The requirement to include the <codeph>STATS_GENERATED_VIA_STATS_TASK</codeph> property is relatively new, as a
|
|
result of the issue <xref href="https://issues.apache.org/jira/browse/HIVE-8648" scope="external" format="html">HIVE-8648</xref>
|
|
for the Hive metastore.)
|
|
</p>
|
|
|
|
<codeblock conref="../shared/impala_common.xml#common/set_numrows_example"/>
|
|
|
|
<p>
|
|
For a partitioned table, update both the per-partition number of rows and the number of rows for the whole
|
|
table:
|
|
</p>
|
|
|
|
<codeblock conref="../shared/impala_common.xml#common/set_numrows_partitioned_example"/>
|
|
|
|
<p>
|
|
In practice, the <codeph>COMPUTE STATS</codeph> statement, or <codeph>COMPUTE INCREMENTAL STATS</codeph>
|
|
for a partitioned table, should be fast and convenient enough that this technique is only useful for the very
|
|
largest partitioned tables.
|
|
<!--
|
|
It is most useful as a workaround for in case of performance issues where you might adjust the <codeph>numRows</codeph> value higher
|
|
or lower to produce the ideal join order.
|
|
-->
|
|
<!-- Following wording is duplicated from earlier. Consider conref'ing. -->
|
|
Because the column statistics might be left in a stale state, do not use this technique as a replacement
|
|
for <codeph>COMPUTE STATS</codeph>. Only use this technique if all other means of collecting statistics are impractical, or as a
|
|
low-overhead operation that you run in between periodic <codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph> operations.
|
|
</p>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept id="perf_column_stats_manual" rev="2.6.0 IMPALA-3369">
|
|
<title>Setting Column Stats Manually through ALTER TABLE</title>
|
|
<conbody>
|
|
<p>
|
|
In <keyword keyref="impala26_full"/> and higher, you can also use the <codeph>SET COLUMN STATS</codeph>
|
|
clause of <codeph>ALTER TABLE</codeph> to manually set or change column statistics.
|
|
Only use this technique in cases where it is impractical to run
|
|
<codeph>COMPUTE STATS</codeph> or <codeph>COMPUTE INCREMENTAL STATS</codeph>
|
|
frequently enough to keep up with data changes for a huge table.
|
|
</p>
|
|
<p conref="../shared/impala_common.xml#common/set_column_stats_example"/>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept rev="1.2.2" id="perf_stats_examples">
|
|
|
|
<title>Examples of Using Table and Column Statistics with Impala</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
The following examples walk through a sequence of <codeph>SHOW TABLE STATS</codeph>, <codeph>SHOW COLUMN
|
|
STATS</codeph>, <codeph>ALTER TABLE</codeph>, and <codeph>SELECT</codeph> and <codeph>INSERT</codeph>
|
|
statements to illustrate various aspects of how Impala uses statistics to help optimize queries.
|
|
</p>
|
|
|
|
<p>
  This example shows table and column statistics for the <codeph>STORE</codeph> table used in the
  <xref href="http://www.tpc.org/tpcds/" scope="external" format="html">TPC-DS benchmarks for decision
  support</xref> systems. It is a tiny table holding data for 12 stores. Initially, before any statistics are
  gathered by a <codeph>COMPUTE STATS</codeph> statement, most of the numeric fields show placeholder values
  of -1, indicating that the figures are unknown. The figures that are filled in are values that are easily
  countable or deducible at the physical level, such as the number of files, total data size of the files,
  and the maximum and average sizes for data types that have a constant size such as <codeph>INT</codeph>,
  <codeph>FLOAT</codeph>, and <codeph>TIMESTAMP</codeph>.
</p>
|
|
|
|
<codeblock>[localhost:21000] > show table stats store;
+-------+--------+--------+--------+
| #Rows | #Files | Size   | Format |
+-------+--------+--------+--------+
| -1    | 1      | 3.08KB | TEXT   |
+-------+--------+--------+--------+
Returned 1 row(s) in 0.03s
[localhost:21000] > show column stats store;
+--------------------+-----------+------------------+--------+----------+----------+
| Column             | Type      | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------------------+-----------+------------------+--------+----------+----------+
| s_store_sk         | INT       | -1               | -1     | 4        | 4        |
| s_store_id         | STRING    | -1               | -1     | -1       | -1       |
| s_rec_start_date   | TIMESTAMP | -1               | -1     | 16       | 16       |
| s_rec_end_date     | TIMESTAMP | -1               | -1     | 16       | 16       |
| s_closed_date_sk   | INT       | -1               | -1     | 4        | 4        |
| s_store_name       | STRING    | -1               | -1     | -1       | -1       |
| s_number_employees | INT       | -1               | -1     | 4        | 4        |
| s_floor_space      | INT       | -1               | -1     | 4        | 4        |
| s_hours            | STRING    | -1               | -1     | -1       | -1       |
| s_manager          | STRING    | -1               | -1     | -1       | -1       |
| s_market_id        | INT       | -1               | -1     | 4        | 4        |
| s_geography_class  | STRING    | -1               | -1     | -1       | -1       |
| s_market_desc      | STRING    | -1               | -1     | -1       | -1       |
| s_market_manager   | STRING    | -1               | -1     | -1       | -1       |
| s_division_id      | INT       | -1               | -1     | 4        | 4        |
| s_division_name    | STRING    | -1               | -1     | -1       | -1       |
| s_company_id       | INT       | -1               | -1     | 4        | 4        |
| s_company_name     | STRING    | -1               | -1     | -1       | -1       |
| s_street_number    | STRING    | -1               | -1     | -1       | -1       |
| s_street_name      | STRING    | -1               | -1     | -1       | -1       |
| s_street_type      | STRING    | -1               | -1     | -1       | -1       |
| s_suite_number     | STRING    | -1               | -1     | -1       | -1       |
| s_city             | STRING    | -1               | -1     | -1       | -1       |
| s_county           | STRING    | -1               | -1     | -1       | -1       |
| s_state            | STRING    | -1               | -1     | -1       | -1       |
| s_zip              | STRING    | -1               | -1     | -1       | -1       |
| s_country          | STRING    | -1               | -1     | -1       | -1       |
| s_gmt_offset       | FLOAT     | -1               | -1     | 4        | 4        |
| s_tax_percentage   | FLOAT     | -1               | -1     | 4        | 4        |
+--------------------+-----------+------------------+--------+----------+----------+
Returned 29 row(s) in 0.04s</codeblock>
|
|
|
|
<p>
|
|
With the Hive <codeph>ANALYZE TABLE</codeph> statement for column statistics, you had to specify each
|
|
column for which to gather statistics. The Impala <codeph>COMPUTE STATS</codeph> statement automatically
|
|
gathers statistics for all columns, because it reads through the entire table relatively quickly and can
|
|
efficiently compute the values for all the columns. This example shows how after running the
|
|
<codeph>COMPUTE STATS</codeph> statement, statistics are filled in for both the table and all its columns:
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > compute stats store;
+------------------------------------------+
| summary                                  |
+------------------------------------------+
| Updated 1 partition(s) and 29 column(s). |
+------------------------------------------+
Returned 1 row(s) in 1.88s
[localhost:21000] > show table stats store;
+-------+--------+--------+--------+
| #Rows | #Files | Size   | Format |
+-------+--------+--------+--------+
| 12    | 1      | 3.08KB | TEXT   |
+-------+--------+--------+--------+
Returned 1 row(s) in 0.02s
[localhost:21000] > show column stats store;
+--------------------+-----------+------------------+--------+----------+-------------------+
| Column             | Type      | #Distinct Values | #Nulls | Max Size | Avg Size          |
+--------------------+-----------+------------------+--------+----------+-------------------+
| s_store_sk         | INT       | 12               | -1     | 4        | 4                 |
| s_store_id         | STRING    | 6                | -1     | 16       | 16                |
| s_rec_start_date   | TIMESTAMP | 4                | -1     | 16       | 16                |
| s_rec_end_date     | TIMESTAMP | 3                | -1     | 16       | 16                |
| s_closed_date_sk   | INT       | 3                | -1     | 4        | 4                 |
| s_store_name       | STRING    | 8                | -1     | 5        | 4.25              |
| s_number_employees | INT       | 9                | -1     | 4        | 4                 |
| s_floor_space      | INT       | 10               | -1     | 4        | 4                 |
| s_hours            | STRING    | 2                | -1     | 8        | 7.083300113677979 |
| s_manager          | STRING    | 7                | -1     | 15       | 12                |
| s_market_id        | INT       | 7                | -1     | 4        | 4                 |
| s_geography_class  | STRING    | 1                | -1     | 7        | 7                 |
| s_market_desc      | STRING    | 10               | -1     | 94       | 55.5              |
| s_market_manager   | STRING    | 7                | -1     | 16       | 14                |
| s_division_id      | INT       | 1                | -1     | 4        | 4                 |
| s_division_name    | STRING    | 1                | -1     | 7        | 7                 |
| s_company_id       | INT       | 1                | -1     | 4        | 4                 |
| s_company_name     | STRING    | 1                | -1     | 7        | 7                 |
| s_street_number    | STRING    | 9                | -1     | 3        | 2.833300113677979 |
| s_street_name      | STRING    | 12               | -1     | 11       | 6.583300113677979 |
| s_street_type      | STRING    | 8                | -1     | 9        | 4.833300113677979 |
| s_suite_number     | STRING    | 11               | -1     | 9        | 8.25              |
| s_city             | STRING    | 2                | -1     | 8        | 6.5               |
| s_county           | STRING    | 1                | -1     | 17       | 17                |
| s_state            | STRING    | 1                | -1     | 2        | 2                 |
| s_zip              | STRING    | 2                | -1     | 5        | 5                 |
| s_country          | STRING    | 1                | -1     | 13       | 13                |
| s_gmt_offset       | FLOAT     | 1                | -1     | 4        | 4                 |
| s_tax_percentage   | FLOAT     | 5                | -1     | 4        | 4                 |
+--------------------+-----------+------------------+--------+----------+-------------------+
Returned 29 row(s) in 0.04s</codeblock>
|
|
|
|
<p>
|
|
The following example shows how statistics are represented for a partitioned table. In this case, we have
|
|
set up a table to hold the world's most trivial census data, a single <codeph>STRING</codeph> field,
|
|
partitioned by a <codeph>YEAR</codeph> column. The table statistics include a separate entry for each
|
|
partition, plus final totals for the numeric fields. The column statistics include some easily deducible
|
|
facts for the partitioning column, such as the number of distinct values (the number of partition
|
|
subdirectories).
|
|
<!-- and the number of <codeph>NULL</codeph> values (none in this case). -->
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > describe census;
+------+----------+---------+
| name | type     | comment |
+------+----------+---------+
| name | string   |         |
| year | smallint |         |
+------+----------+---------+
Returned 2 row(s) in 0.02s
[localhost:21000] > show table stats census;
+-------+-------+--------+------+---------+
| year  | #Rows | #Files | Size | Format  |
+-------+-------+--------+------+---------+
| 2000  | -1    | 0      | 0B   | TEXT    |
| 2004  | -1    | 0      | 0B   | TEXT    |
| 2008  | -1    | 0      | 0B   | TEXT    |
| 2010  | -1    | 0      | 0B   | TEXT    |
| 2011  | 0     | 1      | 22B  | TEXT    |
| 2012  | -1    | 1      | 22B  | TEXT    |
| 2013  | -1    | 1      | 231B | PARQUET |
| Total | 0     | 3      | 275B |         |
+-------+-------+--------+------+---------+
Returned 8 row(s) in 0.02s
[localhost:21000] > show column stats census;
+--------+----------+------------------+--------+----------+----------+
| Column | Type     | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+----------+------------------+--------+----------+----------+
| name   | STRING   | -1               | -1     | -1       | -1       |
| year   | SMALLINT | 7                | -1     | 2        | 2        |
+--------+----------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.02s</codeblock>
|
|
|
|
<p>
|
|
The following example shows how the statistics are filled in by a <codeph>COMPUTE STATS</codeph> statement
|
|
in Impala.
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > compute stats census;
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 3 partition(s) and 1 column(s). |
+-----------------------------------------+
Returned 1 row(s) in 2.16s
[localhost:21000] > show table stats census;
+-------+-------+--------+------+---------+
| year  | #Rows | #Files | Size | Format  |
+-------+-------+--------+------+---------+
| 2000  | -1    | 0      | 0B   | TEXT    |
| 2004  | -1    | 0      | 0B   | TEXT    |
| 2008  | -1    | 0      | 0B   | TEXT    |
| 2010  | -1    | 0      | 0B   | TEXT    |
| 2011  | 4     | 1      | 22B  | TEXT    |
| 2012  | 4     | 1      | 22B  | TEXT    |
| 2013  | 1     | 1      | 231B | PARQUET |
| Total | 9     | 3      | 275B |         |
+-------+-------+--------+------+---------+
Returned 8 row(s) in 0.02s
[localhost:21000] > show column stats census;
+--------+----------+------------------+--------+----------+----------+
| Column | Type     | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+----------+------------------+--------+----------+----------+
| name   | STRING   | 4                | -1     | 5        | 4.5      |
| year   | SMALLINT | 7                | -1     | 2        | 2        |
+--------+----------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.02s</codeblock>
|
|
|
|
<p rev="1.4.0">
|
|
For examples showing how some queries work differently when statistics are available, see
|
|
<xref href="impala_perf_joins.xml#perf_joins_examples"/>. You can see how Impala executes a query
|
|
differently in each case by observing the <codeph>EXPLAIN</codeph> output before and after collecting
|
|
statistics. Measure the before and after query times, and examine the throughput numbers in before and
|
|
after <codeph>SUMMARY</codeph> or <codeph>PROFILE</codeph> output, to verify how much the improved plan
|
|
speeds up performance.
|
|
</p>
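
<p>
  As a rough sketch of that before-and-after comparison in <cmdname>impala-shell</cmdname>, the following
  sequence uses two hypothetical tables, <codeph>t1</codeph> and <codeph>t2</codeph>, joined on a
  hypothetical <codeph>id</codeph> column. Substitute a representative join query from your own workload.
</p>

<codeblock>-- Plan before statistics exist; expect the missing-stats warning.
explain select count(*) from t1 join t2 on t1.id = t2.id;

-- Collect statistics for both tables involved in the join.
compute stats t1;
compute stats t2;

-- Plan after statistics exist; compare the join order with the earlier plan.
explain select count(*) from t1 join t2 on t1.id = t2.id;

-- Run the query, then examine the timing for each phase.
select count(*) from t1 join t2 on t1.id = t2.id;
summary;
</codeblock>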
|
|
</conbody>
|
|
</concept>
|
|
</concept>
|