IMPALA-1654: [DOCS] DDL for multiple partitions

Syntax and usage notes for ALTER TABLE,
COMPUTE STATS, and SHOW FILES.

Mixed in a little bit with new Kudu syntax for
ALTER TABLE. Didn't include all new Kudu info
in this CR, the better to minimize merge conflicts.

Added note about performance/scalability of IMPALA-1654.

Added new Known Issue item for IMPALA-4106 under Performance category.

Change-Id: I2060552d5081e5f93b1b1f398414c52fa03f215b
Reviewed-on: http://gerrit.cloudera.org:8080/5726
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This commit is contained in:
John Russell
2017-01-17 13:19:13 -08:00
committed by Impala Public Jenkins
parent cb52b2b8ae
commit eaefbb90ce
4 changed files with 186 additions and 3 deletions

View File

@@ -80,7 +80,11 @@ statsKey ::= numDVs | numNulls | avgSize | maxSize</ph>
<varname>col_spec</varname> ::= <varname>col_name</varname> <varname>type_name</varname>
<varname>partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>
<varname>partition_spec</varname> ::= <varname>simple_partition_spec</varname> | <ph rev="IMPALA-1654"><varname>complex_partition_spec</varname></ph> | <ph rev="kudu"><varname>kudu_partition_spec</varname></ph>
<varname>simple_partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>
<ph rev="IMPALA-1654"><varname>complex_partition_spec</varname> ::= <varname>comparison_expression_on_partition_col</varname></ph>
<varname>table_properties</varname> ::= '<varname>name</varname>'='<varname>value</varname>'[, '<varname>name</varname>'='<varname>value</varname>' ...]
@@ -124,6 +128,80 @@ statsKey ::= numDVs | numNulls | avgSize | maxSize</ph>
an external table, the underlying data directory is not renamed or moved.
</p>
<p>
<b>Dropping or altering multiple partitions:</b>
</p>
<p rev="IMPALA-1654">
In <keyword keyref="impala28_full"/> and higher,
the expression for the partition clause with a <codeph>DROP</codeph> or <codeph>SET</codeph>
operation can include comparison operators such as <codeph>&lt;</codeph>, <codeph>IN</codeph>,
or <codeph>BETWEEN</codeph>, and Boolean operators such as <codeph>AND</codeph>
and <codeph>OR</codeph>.
</p>
<p rev="IMPALA-1654">
For example, you might drop a group of partitions corresponding to a particular date
range after the data <q>ages out</q>:
</p>
<codeblock><![CDATA[
alter table historical_data drop partition (year < 1995);
alter table historical_data drop partition (year = 1996 and month between 1 and 6);
]]>
</codeblock>
<p rev="IMPALA-1654">
For tables with multiple partition keys columns, you can specify multiple
conditions separated by commas, and the operation only applies to the partitions
that match all the conditions (similar to using an <codeph>AND</codeph> clause):
</p>
<codeblock><![CDATA[
alter table historical_data drop partition (year < 1995, last_name like 'A%');
]]>
</codeblock>
<p rev="IMPALA-1654">
This technique can also be used to change the file format of groups of partitions,
as part of an ETL pipeline that periodically consolidates and rewrites the underlying
data files in a different file format:
</p>
<codeblock><![CDATA[
alter table fast_growing_data partition (year = 2016, month in (10,11,12)) set fileformat parquet;
]]>
</codeblock>
<note>
<p rev="IMPALA-1654">
The extended syntax involving comparison operators and multiple partitions
applies to the <codeph>SET FILEFORMAT</codeph>, <codeph>SET TBLPROPERTIES</codeph>,
<codeph>SET SERDEPROPERTIES</codeph>, and <codeph>SET [UN]CACHED</codeph> clauses.
You can also use this syntax with the <codeph>PARTITION</codeph> clause
in the <codeph>COMPUTE INCREMENTAL STATS</codeph> statement, and with the
<codeph>PARTITION</codeph> clause of the <codeph>SHOW FILES</codeph> statement.
Some forms of <codeph>ALTER TABLE</codeph> still only apply to one partition
at a time: the <codeph>SET LOCATION</codeph> and <codeph>ADD PARTITION</codeph>
clauses. The <codeph>PARTITION</codeph> clauses in the <codeph>LOAD DATA</codeph>
and <codeph>INSERT</codeph> statements also only apply to one partition at a time.
</p>
<p>
A DDL statement that applies to multiple partitions is considered successful
(resulting in no changes) even if no partitions match the conditions.
The results are the same as if the <codeph>IF EXISTS</codeph> clause was specified.
</p>
<p>
The performance and scalability of this technique is similar to
issuing a sequence of single-partition <codeph>ALTER TABLE</codeph>
statements in quick succession. To minimize bottlenecks due to
communication with the metastore database, or causing other
DDL operations on the same table to wait, test the effects of
performing <codeph>ALTER TABLE</codeph> statements that affect
large numbers of partitions.
</p>
</note>
<p conref="../shared/impala_common.xml#common/s3_blurb"/>
<p rev="2.6.0 CDH-39913 IMPALA-1878">

View File

@@ -52,7 +52,12 @@ under the License.
<codeblock rev="2.1.0">COMPUTE STATS [<varname>db_name</varname>.]<varname>table_name</varname>
COMPUTE INCREMENTAL STATS [<varname>db_name</varname>.]<varname>table_name</varname> [PARTITION (<varname>partition_spec</varname>)]
<varname>partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>
<!-- Is kudu_partition_spec applicable here? -->
<varname>partition_spec</varname> ::= <varname>simple_partition_spec</varname> | <ph rev="IMPALA-1654"><varname>complex_partition_spec</varname></ph> | <ph rev="kudu"><varname>kudu_partition_spec</varname></ph>
<varname>simple_partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>
<ph rev="IMPALA-1654"><varname>complex_partition_spec</varname> ::= <varname>comparison_expression_on_partition_col</varname></ph>
</codeblock>
<p conref="../shared/impala_common.xml#common/incremental_partition_spec"/>
@@ -108,6 +113,75 @@ COMPUTE INCREMENTAL STATS [<varname>db_name</varname>.]<varname>table_name</varn
</ul>
</note>
<p rev="IMPALA-1654">
<b>Computing stats for groups of partitions:</b>
</p>
<p rev="IMPALA-1654">
In <keyword keyref="impala28_full"/> and higher, you can run <codeph>COMPUTE INCREMENTAL STATS</codeph>
on multiple partitions, instead of the entire table or one partition at a time. You include
comparison operators other than <codeph>=</codeph> in the <codeph>PARTITION</codeph> clause,
and the <codeph>COMPUTE INCREMENTAL STATS</codeph> statement applies to all partitions that
match the comparison expression.
</p>
<p rev="IMPALA-1654">
For example, the <codeph>INT_PARTITIONS</codeph> table contains 4 partitions.
The following <codeph>COMPUTE INCREMENTAL STATS</codeph> statements affect some but not all
partitions, as indicated by the <codeph>Updated <varname>n</varname> partition(s)</codeph>
messages. The partitions that are affected depend on values in the partition key column <codeph>X</codeph>
that match the comparison expression in the <codeph>PARTITION</codeph> clause.
</p>
<codeblock rev="IMPALA-1654"><![CDATA[
show partitions int_partitions;
+-------+-------+--------+------+--------------+-------------------+---------+...
| x | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |...
+-------+-------+--------+------+--------------+-------------------+---------+...
| 99 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | PARQUET |...
| 120 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | TEXT |...
| 150 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | TEXT |...
| 200 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | TEXT |...
| Total | -1 | 0 | 0B | 0B | | |...
+-------+-------+--------+------+--------------+-------------------+---------+...
compute incremental stats int_partitions partition (x < 100);
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
compute incremental stats int_partitions partition (x in (100, 150, 200));
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 2 partition(s) and 1 column(s). |
+-----------------------------------------+
compute incremental stats int_partitions partition (x between 100 and 175);
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 2 partition(s) and 1 column(s). |
+-----------------------------------------+
compute incremental stats int_partitions partition (x in (100, 150, 200) or x < 100);
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 3 partition(s) and 1 column(s). |
+-----------------------------------------+
compute incremental stats int_partitions partition (x != 150);
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 3 partition(s) and 1 column(s). |
+-----------------------------------------+
]]>
</codeblock>
<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
<p rev="2.3.0">

View File

@@ -248,6 +248,24 @@ https://issues.cloudera.org/browse/IMPALA-2144 - Don't have
</conbody>
<concept id="IMPALA-4106" rev="IMPALA-4106">
<title>Use Hive Metastore bulk API for dropping multiple partitions.</title>
<conbody>
<p>
The bulk partition dropping and setting feature of IMPALA-1654 is not as efficient
as it could be, because it currently does not use the Hive Metastore bulk API.
</p>
<p><b>Bug:</b> <xref keyref="IMPALA-4106">IMPALA-4106</xref></p>
<p><b>Severity:</b> High</p>
<p><b>Workaround:</b> Schedule <codeph>ALTER TABLE</codeph> operations that touch
many partitions for times when the table is not undergoing any other DDL operations,
and be prepared for the table to be locked for some time while the <codeph>ALTER TABLE</codeph>
is in progress. Test the performance of large-scale partition operations in a development
environment before trying on tables in a production system.
</p>
</conbody>
</concept>
<concept id="IMPALA-1480" rev="IMPALA-1480">
<!-- Not part of Alex's spreadsheet. Spreadsheet has IMPALA-1423 which mentions it's similar to this one but not a duplicate. -->

View File

@@ -49,7 +49,7 @@ SHOW TABLES [IN <varname>database_name</varname>] [[LIKE] '<varname>pattern</var
<ph rev="1.2.1">SHOW TABLE STATS [<varname>database_name</varname>.]<varname>table_name</varname></ph>
<ph rev="1.2.1">SHOW COLUMN STATS [<varname>database_name</varname>.]<varname>table_name</varname></ph>
<ph rev="1.4.0">SHOW PARTITIONS [<varname>database_name</varname>.]<varname>table_name</varname></ph>
<ph rev="2.2.0">SHOW FILES IN [<varname>database_name</varname>.]<varname>table_name</varname> [PARTITION (<varname>key_col</varname>=<varname>value</varname> [, <varname>key_col</varname>=<varname>value</varname>]]</ph>
SHOW FILES IN [<varname>database_name</varname>.]<varname>table_name</varname> <ph rev="IMPALA-1654">[PARTITION (<varname>key_col_expression</varname> [, <varname>key_col_expression</varname>]</ph>]
<ph rev="2.0.0">SHOW ROLES
SHOW CURRENT ROLES
@@ -113,6 +113,19 @@ show tables '*dim*|*fact*';</codeblock>
<codeph>MB</codeph> for megabytes, and <codeph>GB</codeph> for gigabytes.
</p>
<p rev="IMPALA-1654">
In <keyword keyref="impala28_full"/> and higher, you can use general
expressions with operators such as <codeph>&lt;</codeph>, <codeph>IN</codeph>,
<codeph>LIKE</codeph>, and <codeph>BETWEEN</codeph> in the <codeph>PARTITION</codeph>
clause, instead of only equality operators. For example:
<codeblock><![CDATA[
show files in sample_table partition (j < 5);
show files in sample_table partition (k = 3, l between 1 and 10);
show files in sample_table partition (month like 'J%');
]]>
</codeblock>
</p>
<note>
This statement applies to tables and partitions stored on HDFS, or in the Amazon Simple Storage System (S3).
It does not apply to views.