mirror of
https://github.com/apache/impala.git
synced 2025-12-29 18:01:07 -05:00
IMPALA-1654: [DOCS] DDL for multiple partitions
Syntax and usage notes for ALTER TABLE, COMPUTE STATS, and SHOW FILES. Mixed in a little bit with new Kudu syntax for ALTER TABLE. Didn't include all new Kudu info in this CR, the better to minimize merge conflicts. Added note about performance/scalability of IMPALA-1654. Added new Known Issue item for IMPALA-4106 under Performance category. Change-Id: I2060552d5081e5f93b1b1f398414c52fa03f215b Reviewed-on: http://gerrit.cloudera.org:8080/5726 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins
This commit is contained in:
committed by
Impala Public Jenkins
parent
cb52b2b8ae
commit
eaefbb90ce
@@ -80,7 +80,11 @@ statsKey ::= numDVs | numNulls | avgSize | maxSize</ph>
|
||||
|
||||
<varname>col_spec</varname> ::= <varname>col_name</varname> <varname>type_name</varname>
|
||||
|
||||
<varname>partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>
|
||||
<varname>partition_spec</varname> ::= <varname>simple_partition_spec</varname> | <ph rev="IMPALA-1654"><varname>complex_partition_spec</varname></ph> | <ph rev="kudu"><varname>kudu_partition_spec</varname></ph>
|
||||
|
||||
<varname>simple_partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>
|
||||
|
||||
<ph rev="IMPALA-1654"><varname>complex_partition_spec</varname> ::= <varname>comparison_expression_on_partition_col</varname></ph>
|
||||
|
||||
<varname>table_properties</varname> ::= '<varname>name</varname>'='<varname>value</varname>'[, '<varname>name</varname>'='<varname>value</varname>' ...]
|
||||
|
||||
@@ -124,6 +128,80 @@ statsKey ::= numDVs | numNulls | avgSize | maxSize</ph>
|
||||
an external table, the underlying data directory is not renamed or moved.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
<b>Dropping or altering multiple partitions:</b>
|
||||
</p>
|
||||
|
||||
<p rev="IMPALA-1654">
|
||||
In <keyword keyref="impala28_full"/> and higher,
|
||||
the expression for the partition clause with a <codeph>DROP</codeph> or <codeph>SET</codeph>
|
||||
operation can include comparison operators such as <codeph><</codeph>, <codeph>IN</codeph>,
|
||||
or <codeph>BETWEEN</codeph>, and Boolean operators such as <codeph>AND</codeph>
|
||||
and <codeph>OR</codeph>.
|
||||
</p>
|
||||
|
||||
<p rev="IMPALA-1654">
|
||||
For example, you might drop a group of partitions corresponding to a particular date
|
||||
range after the data <q>ages out</q>:
|
||||
</p>
|
||||
|
||||
<codeblock><![CDATA[
|
||||
alter table historical_data drop partition (year < 1995);
|
||||
alter table historical_data drop partition (year = 1996 and month between 1 and 6);
|
||||
]]>
|
||||
</codeblock>
|
||||
|
||||
<p rev="IMPALA-1654">
|
||||
For tables with multiple partition keys columns, you can specify multiple
|
||||
conditions separated by commas, and the operation only applies to the partitions
|
||||
that match all the conditions (similar to using an <codeph>AND</codeph> clause):
|
||||
</p>
|
||||
|
||||
<codeblock><![CDATA[
|
||||
alter table historical_data drop partition (year < 1995, last_name like 'A%');
|
||||
]]>
|
||||
</codeblock>
|
||||
|
||||
<p rev="IMPALA-1654">
|
||||
This technique can also be used to change the file format of groups of partitions,
|
||||
as part of an ETL pipeline that periodically consolidates and rewrites the underlying
|
||||
data files in a different file format:
|
||||
</p>
|
||||
|
||||
<codeblock><![CDATA[
|
||||
alter table fast_growing_data partition (year = 2016, month in (10,11,12)) set fileformat parquet;
|
||||
]]>
|
||||
</codeblock>
|
||||
|
||||
<note>
|
||||
<p rev="IMPALA-1654">
|
||||
The extended syntax involving comparison operators and multiple partitions
|
||||
applies to the <codeph>SET FILEFORMAT</codeph>, <codeph>SET TBLPROPERTIES</codeph>,
|
||||
<codeph>SET SERDEPROPERTIES</codeph>, and <codeph>SET [UN]CACHED</codeph> clauses.
|
||||
You can also use this syntax with the <codeph>PARTITION</codeph> clause
|
||||
in the <codeph>COMPUTE INCREMENTAL STATS</codeph> statement, and with the
|
||||
<codeph>PARTITION</codeph> clause of the <codeph>SHOW FILES</codeph> statement.
|
||||
Some forms of <codeph>ALTER TABLE</codeph> still only apply to one partition
|
||||
at a time: the <codeph>SET LOCATION</codeph> and <codeph>ADD PARTITION</codeph>
|
||||
clauses. The <codeph>PARTITION</codeph> clauses in the <codeph>LOAD DATA</codeph>
|
||||
and <codeph>INSERT</codeph> statements also only apply to one partition at a time.
|
||||
</p>
|
||||
<p>
|
||||
A DDL statement that applies to multiple partitions is considered successful
|
||||
(resulting in no changes) even if no partitions match the conditions.
|
||||
The results are the same as if the <codeph>IF EXISTS</codeph> clause was specified.
|
||||
</p>
|
||||
<p>
|
||||
The performance and scalability of this technique is similar to
|
||||
issuing a sequence of single-partition <codeph>ALTER TABLE</codeph>
|
||||
statements in quick succession. To minimize bottlenecks due to
|
||||
communication with the metastore database, or causing other
|
||||
DDL operations on the same table to wait, test the effects of
|
||||
performing <codeph>ALTER TABLE</codeph> statements that affect
|
||||
large numbers of partitions.
|
||||
</p>
|
||||
</note>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/s3_blurb"/>
|
||||
|
||||
<p rev="2.6.0 CDH-39913 IMPALA-1878">
|
||||
|
||||
@@ -52,7 +52,12 @@ under the License.
|
||||
<codeblock rev="2.1.0">COMPUTE STATS [<varname>db_name</varname>.]<varname>table_name</varname>
|
||||
COMPUTE INCREMENTAL STATS [<varname>db_name</varname>.]<varname>table_name</varname> [PARTITION (<varname>partition_spec</varname>)]
|
||||
|
||||
<varname>partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>
|
||||
<!-- Is kudu_partition_spec applicable here? -->
|
||||
<varname>partition_spec</varname> ::= <varname>simple_partition_spec</varname> | <ph rev="IMPALA-1654"><varname>complex_partition_spec</varname></ph> | <ph rev="kudu"><varname>kudu_partition_spec</varname></ph>
|
||||
|
||||
<varname>simple_partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>
|
||||
|
||||
<ph rev="IMPALA-1654"><varname>complex_partition_spec</varname> ::= <varname>comparison_expression_on_partition_col</varname></ph>
|
||||
</codeblock>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/incremental_partition_spec"/>
|
||||
@@ -108,6 +113,75 @@ COMPUTE INCREMENTAL STATS [<varname>db_name</varname>.]<varname>table_name</varn
|
||||
</ul>
|
||||
</note>
|
||||
|
||||
<p rev="IMPALA-1654">
|
||||
<b>Computing stats for groups of partitions:</b>
|
||||
</p>
|
||||
|
||||
<p rev="IMPALA-1654">
|
||||
In <keyword keyref="impala28_full"/> and higher, you can run <codeph>COMPUTE INCREMENTAL STATS</codeph>
|
||||
on multiple partitions, instead of the entire table or one partition at a time. You include
|
||||
comparison operators other than <codeph>=</codeph> in the <codeph>PARTITION</codeph> clause,
|
||||
and the <codeph>COMPUTE INCREMENTAL STATS</codeph> statement applies to all partitions that
|
||||
match the comparison expression.
|
||||
</p>
|
||||
|
||||
<p rev="IMPALA-1654">
|
||||
For example, the <codeph>INT_PARTITIONS</codeph> table contains 4 partitions.
|
||||
The following <codeph>COMPUTE INCREMENTAL STATS</codeph> statements affect some but not all
|
||||
partitions, as indicated by the <codeph>Updated <varname>n</varname> partition(s)</codeph>
|
||||
messages. The partitions that are affected depend on values in the partition key column <codeph>X</codeph>
|
||||
that match the comparison expression in the <codeph>PARTITION</codeph> clause.
|
||||
</p>
|
||||
|
||||
<codeblock rev="IMPALA-1654"><![CDATA[
|
||||
show partitions int_partitions;
|
||||
+-------+-------+--------+------+--------------+-------------------+---------+...
|
||||
| x | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |...
|
||||
+-------+-------+--------+------+--------------+-------------------+---------+...
|
||||
| 99 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | PARQUET |...
|
||||
| 120 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | TEXT |...
|
||||
| 150 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | TEXT |...
|
||||
| 200 | -1 | 0 | 0B | NOT CACHED | NOT CACHED | TEXT |...
|
||||
| Total | -1 | 0 | 0B | 0B | | |...
|
||||
+-------+-------+--------+------+--------------+-------------------+---------+...
|
||||
|
||||
compute incremental stats int_partitions partition (x < 100);
|
||||
+-----------------------------------------+
|
||||
| summary |
|
||||
+-----------------------------------------+
|
||||
| Updated 1 partition(s) and 1 column(s). |
|
||||
+-----------------------------------------+
|
||||
|
||||
compute incremental stats int_partitions partition (x in (100, 150, 200));
|
||||
+-----------------------------------------+
|
||||
| summary |
|
||||
+-----------------------------------------+
|
||||
| Updated 2 partition(s) and 1 column(s). |
|
||||
+-----------------------------------------+
|
||||
|
||||
compute incremental stats int_partitions partition (x between 100 and 175);
|
||||
+-----------------------------------------+
|
||||
| summary |
|
||||
+-----------------------------------------+
|
||||
| Updated 2 partition(s) and 1 column(s). |
|
||||
+-----------------------------------------+
|
||||
|
||||
compute incremental stats int_partitions partition (x in (100, 150, 200) or x < 100);
|
||||
+-----------------------------------------+
|
||||
| summary |
|
||||
+-----------------------------------------+
|
||||
| Updated 3 partition(s) and 1 column(s). |
|
||||
+-----------------------------------------+
|
||||
|
||||
compute incremental stats int_partitions partition (x != 150);
|
||||
+-----------------------------------------+
|
||||
| summary |
|
||||
+-----------------------------------------+
|
||||
| Updated 3 partition(s) and 1 column(s). |
|
||||
+-----------------------------------------+
|
||||
]]>
|
||||
</codeblock>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
|
||||
|
||||
<p rev="2.3.0">
|
||||
|
||||
@@ -248,6 +248,24 @@ https://issues.cloudera.org/browse/IMPALA-2144 - Don't have
|
||||
|
||||
</conbody>
|
||||
|
||||
<concept id="IMPALA-4106" rev="IMPALA-4106">
|
||||
<title>Use Hive Metastore bulk API for dropping multiple partitions.</title>
|
||||
<conbody>
|
||||
<p>
|
||||
The bulk partition dropping and setting feature of IMPALA-1654 is not as efficient
|
||||
as it could be, because it currently does not use the Hive Metastore bulk API.
|
||||
</p>
|
||||
<p><b>Bug:</b> <xref keyref="IMPALA-4106">IMPALA-4106</xref></p>
|
||||
<p><b>Severity:</b> High</p>
|
||||
<p><b>Workaround:</b> Schedule <codeph>ALTER TABLE</codeph> operations that touch
|
||||
many partitions for times when the table is not undergoing any other DDL operations,
|
||||
and be prepared for the table to be locked for some time while the <codeph>ALTER TABLE</codeph>
|
||||
is in progress. Test the performance of large-scale partition operations in a development
|
||||
environment before trying on tables in a production system.
|
||||
</p>
|
||||
</conbody>
|
||||
</concept>
|
||||
|
||||
<concept id="IMPALA-1480" rev="IMPALA-1480">
|
||||
|
||||
<!-- Not part of Alex's spreadsheet. Spreadsheet has IMPALA-1423 which mentions it's similar to this one but not a duplicate. -->
|
||||
|
||||
@@ -49,7 +49,7 @@ SHOW TABLES [IN <varname>database_name</varname>] [[LIKE] '<varname>pattern</var
|
||||
<ph rev="1.2.1">SHOW TABLE STATS [<varname>database_name</varname>.]<varname>table_name</varname></ph>
|
||||
<ph rev="1.2.1">SHOW COLUMN STATS [<varname>database_name</varname>.]<varname>table_name</varname></ph>
|
||||
<ph rev="1.4.0">SHOW PARTITIONS [<varname>database_name</varname>.]<varname>table_name</varname></ph>
|
||||
<ph rev="2.2.0">SHOW FILES IN [<varname>database_name</varname>.]<varname>table_name</varname> [PARTITION (<varname>key_col</varname>=<varname>value</varname> [, <varname>key_col</varname>=<varname>value</varname>]]</ph>
|
||||
SHOW FILES IN [<varname>database_name</varname>.]<varname>table_name</varname> <ph rev="IMPALA-1654">[PARTITION (<varname>key_col_expression</varname> [, <varname>key_col_expression</varname>]</ph>]
|
||||
|
||||
<ph rev="2.0.0">SHOW ROLES
|
||||
SHOW CURRENT ROLES
|
||||
@@ -113,6 +113,19 @@ show tables '*dim*|*fact*';</codeblock>
|
||||
<codeph>MB</codeph> for megabytes, and <codeph>GB</codeph> for gigabytes.
|
||||
</p>
|
||||
|
||||
<p rev="IMPALA-1654">
|
||||
In <keyword keyref="impala28_full"/> and higher, you can use general
|
||||
expressions with operators such as <codeph><</codeph>, <codeph>IN</codeph>,
|
||||
<codeph>LIKE</codeph>, and <codeph>BETWEEN</codeph> in the <codeph>PARTITION</codeph>
|
||||
clause, instead of only equality operators. For example:
|
||||
<codeblock><![CDATA[
|
||||
show files in sample_table partition (j < 5);
|
||||
show files in sample_table partition (k = 3, l between 1 and 10);
|
||||
show files in sample_table partition (month like 'J%');
|
||||
]]>
|
||||
</codeblock>
|
||||
</p>
|
||||
|
||||
<note>
|
||||
This statement applies to tables and partitions stored on HDFS, or in the Amazon Simple Storage System (S3).
|
||||
It does not apply to views.
|
||||
|
||||
Reference in New Issue
Block a user