IMPALA-1654: [DOCS] DDL for multiple partitions

Syntax and usage notes for ALTER TABLE, COMPUTE STATS, and SHOW FILES. Mixed in a little bit with new Kudu syntax for ALTER TABLE. Didn't include all new Kudu info in this CR, the better to minimize merge conflicts. Added note about performance/scalability of IMPALA-1654. Added new Known Issue item for IMPALA-4106 under Performance category. Change-Id: I2060552d5081e5f93b1b1f398414c52fa03f215b Reviewed-on: http://gerrit.cloudera.org:8080/5726 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins
2025-12-29 18:01:07 -05:00 · 2017-01-17 13:19:13 -08:00
parent cb52b2b8ae
commit eaefbb90ce
4 changed files with 186 additions and 3 deletions
--- a/docs/topics/impala_alter_table.xml
+++ b/docs/topics/impala_alter_table.xml
@@ -80,7 +80,11 @@ statsKey ::= numDVs | numNulls | avgSize | maxSize</ph>

 <varname>col_spec</varname> ::= <varname>col_name</varname> <varname>type_name</varname>

-<varname>partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>
+<varname>partition_spec</varname> ::= <varname>simple_partition_spec</varname> | <ph rev="IMPALA-1654"><varname>complex_partition_spec</varname></ph> | <ph rev="kudu"><varname>kudu_partition_spec</varname></ph>
+
+<varname>simple_partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>
+
+<ph rev="IMPALA-1654"><varname>complex_partition_spec</varname> ::= <varname>comparison_expression_on_partition_col</varname></ph>

 <varname>table_properties</varname> ::= '<varname>name</varname>'='<varname>value</varname>'[, '<varname>name</varname>'='<varname>value</varname>' ...]

@@ -124,6 +128,80 @@ statsKey ::= numDVs | numNulls | avgSize | maxSize</ph>
      an external table, the underlying data directory is not renamed or moved.
    </p>

+    <p>
+      <b>Dropping or altering multiple partitions:</b>
+    </p>
+
+    <p rev="IMPALA-1654">
+      In <keyword keyref="impala28_full"/> and higher,
+      the expression for the partition clause with a <codeph>DROP</codeph> or <codeph>SET</codeph>
+      operation can include comparison operators such as <codeph>&lt;</codeph>, <codeph>IN</codeph>,
+      or <codeph>BETWEEN</codeph>, and Boolean operators such as <codeph>AND</codeph>
+      and <codeph>OR</codeph>.
+    </p>
+
+    <p rev="IMPALA-1654">
+      For example, you might drop a group of partitions corresponding to a particular date
+      range after the data <q>ages out</q>:
+    </p>
+
+<codeblock><![CDATA[
+alter table historical_data drop partition (year < 1995);
+alter table historical_data drop partition (year = 1996 and month between 1 and 6);
+]]>
+</codeblock>
+
+    <p rev="IMPALA-1654">
+      For tables with multiple partition keys columns, you can specify multiple
+      conditions separated by commas, and the operation only applies to the partitions
+      that match all the conditions (similar to using an <codeph>AND</codeph> clause):
+    </p>
+
+<codeblock><![CDATA[
+alter table historical_data drop partition (year < 1995, last_name like 'A%');
+]]>
+</codeblock>
+
+    <p rev="IMPALA-1654">
+      This technique can also be used to change the file format of groups of partitions,
+      as part of an ETL pipeline that periodically consolidates and rewrites the underlying
+      data files in a different file format:
+    </p>
+
+<codeblock><![CDATA[
+alter table fast_growing_data partition (year = 2016, month in (10,11,12)) set fileformat parquet;
+]]>
+</codeblock>
+
+    <note>
+      <p rev="IMPALA-1654">
+        The extended syntax involving comparison operators and multiple partitions
+        applies to the <codeph>SET FILEFORMAT</codeph>, <codeph>SET TBLPROPERTIES</codeph>,
+        <codeph>SET SERDEPROPERTIES</codeph>, and <codeph>SET [UN]CACHED</codeph> clauses.
+        You can also use this syntax with the <codeph>PARTITION</codeph> clause
+        in the <codeph>COMPUTE INCREMENTAL STATS</codeph> statement, and with the
+        <codeph>PARTITION</codeph> clause of the <codeph>SHOW FILES</codeph> statement.
+        Some forms of <codeph>ALTER TABLE</codeph> still only apply to one partition
+        at a time: the <codeph>SET LOCATION</codeph> and <codeph>ADD PARTITION</codeph>
+        clauses. The <codeph>PARTITION</codeph> clauses in the <codeph>LOAD DATA</codeph>
+        and <codeph>INSERT</codeph> statements also only apply to one partition at a time.
+      </p>
+      <p>
+        A DDL statement that applies to multiple partitions is considered successful
+        (resulting in no changes) even if no partitions match the conditions.
+        The results are the same as if the <codeph>IF EXISTS</codeph> clause was specified.
+      </p>
+      <p>
+        The performance and scalability of this technique is similar to
+        issuing a sequence of single-partition <codeph>ALTER TABLE</codeph>
+        statements in quick succession. To minimize bottlenecks due to
+        communication with the metastore database, or causing other
+        DDL operations on the same table to wait, test the effects of
+        performing <codeph>ALTER TABLE</codeph> statements that affect
+        large numbers of partitions.
+      </p>
+    </note>
+
    <p conref="../shared/impala_common.xml#common/s3_blurb"/>

    <p rev="2.6.0 CDH-39913 IMPALA-1878">
--- a/docs/topics/impala_compute_stats.xml
+++ b/docs/topics/impala_compute_stats.xml
@@ -52,7 +52,12 @@ under the License.
 <codeblock rev="2.1.0">COMPUTE STATS [<varname>db_name</varname>.]<varname>table_name</varname>
 COMPUTE INCREMENTAL STATS [<varname>db_name</varname>.]<varname>table_name</varname> [PARTITION (<varname>partition_spec</varname>)]

-<varname>partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>
+<!-- Is kudu_partition_spec applicable here? -->
+<varname>partition_spec</varname> ::= <varname>simple_partition_spec</varname> | <ph rev="IMPALA-1654"><varname>complex_partition_spec</varname></ph> | <ph rev="kudu"><varname>kudu_partition_spec</varname></ph>
+
+<varname>simple_partition_spec</varname> ::= <varname>partition_col</varname>=<varname>constant_value</varname>
+
+<ph rev="IMPALA-1654"><varname>complex_partition_spec</varname> ::= <varname>comparison_expression_on_partition_col</varname></ph>
 </codeblock>

    <p conref="../shared/impala_common.xml#common/incremental_partition_spec"/>
@@ -108,6 +113,75 @@ COMPUTE INCREMENTAL STATS [<varname>db_name</varname>.]<varname>table_name</varn
      </ul>
    </note>

+    <p rev="IMPALA-1654">
+      <b>Computing stats for groups of partitions:</b>
+    </p>
+
+    <p rev="IMPALA-1654">
+      In <keyword keyref="impala28_full"/> and higher, you can run <codeph>COMPUTE INCREMENTAL STATS</codeph>
+      on multiple partitions, instead of the entire table or one partition at a time. You include
+      comparison operators other than <codeph>=</codeph> in the <codeph>PARTITION</codeph> clause,
+      and the <codeph>COMPUTE INCREMENTAL STATS</codeph> statement applies to all partitions that
+      match the comparison expression.
+    </p>
+
+    <p rev="IMPALA-1654">
+      For example, the <codeph>INT_PARTITIONS</codeph> table contains 4 partitions.
+      The following <codeph>COMPUTE INCREMENTAL STATS</codeph> statements affect some but not all
+      partitions, as indicated by the <codeph>Updated <varname>n</varname> partition(s)</codeph>
+      messages. The partitions that are affected depend on values in the partition key column <codeph>X</codeph>
+      that match the comparison expression in the <codeph>PARTITION</codeph> clause.
+    </p>
+
+<codeblock rev="IMPALA-1654"><![CDATA[
+show partitions int_partitions;
+-------+-------+--------+------+--------------+-------------------+---------+...
+| x     | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format  |...
+-------+-------+--------+------+--------------+-------------------+---------+...
+| 99    | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        | PARQUET |...
+| 120   | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        | TEXT    |...
+| 150   | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        | TEXT    |...
+| 200   | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        | TEXT    |...
+| Total | -1    | 0      | 0B   | 0B           |                   |         |...
+-------+-------+--------+------+--------------+-------------------+---------+...
+
+compute incremental stats int_partitions partition (x < 100);
+-----------------------------------------+
+| summary                                 |
+-----------------------------------------+
+| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
+
+compute incremental stats int_partitions partition (x in (100, 150, 200));
+-----------------------------------------+
+| summary                                 |
+-----------------------------------------+
+| Updated 2 partition(s) and 1 column(s). |
+-----------------------------------------+
+
+compute incremental stats int_partitions partition (x between 100 and 175);
+-----------------------------------------+
+| summary                                 |
+-----------------------------------------+
+| Updated 2 partition(s) and 1 column(s). |
+-----------------------------------------+
+
+compute incremental stats int_partitions partition (x in (100, 150, 200) or x < 100);
+-----------------------------------------+
+| summary                                 |
+-----------------------------------------+
+| Updated 3 partition(s) and 1 column(s). |
+-----------------------------------------+
+
+compute incremental stats int_partitions partition (x != 150);
+-----------------------------------------+
+| summary                                 |
+-----------------------------------------+
+| Updated 3 partition(s) and 1 column(s). |
+-----------------------------------------+
+]]>
+</codeblock>
+
    <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>

    <p rev="2.3.0">
--- a/docs/topics/impala_known_issues.xml
+++ b/docs/topics/impala_known_issues.xml
@@ -248,6 +248,24 @@ https://issues.cloudera.org/browse/IMPALA-2144 - Don't have

    </conbody>

+    <concept id="IMPALA-4106" rev="IMPALA-4106">
+      <title>Use Hive Metastore bulk API for dropping multiple partitions.</title>
+      <conbody>
+        <p>
+          The bulk partition dropping and setting feature of IMPALA-1654 is not as efficient
+          as it could be, because it currently does not use the Hive Metastore bulk API.
+        </p>
+        <p><b>Bug:</b> <xref keyref="IMPALA-4106">IMPALA-4106</xref></p>
+        <p><b>Severity:</b> High</p>
+        <p><b>Workaround:</b> Schedule <codeph>ALTER TABLE</codeph> operations that touch
+          many partitions for times when the table is not undergoing any other DDL operations,
+          and be prepared for the table to be locked for some time while the <codeph>ALTER TABLE</codeph>
+          is in progress. Test the performance of large-scale partition operations in a development
+          environment before trying on tables in a production system.
+        </p>
+      </conbody>
+    </concept>
+
    <concept id="IMPALA-1480" rev="IMPALA-1480">

 <!-- Not part of Alex's spreadsheet. Spreadsheet has IMPALA-1423 which mentions it's similar to this one but not a duplicate. -->
--- a/docs/topics/impala_show.xml
+++ b/docs/topics/impala_show.xml
@@ -49,7 +49,7 @@ SHOW TABLES [IN <varname>database_name</varname>] [[LIKE] '<varname>pattern</var
 <ph rev="1.2.1">SHOW TABLE STATS [<varname>database_name</varname>.]<varname>table_name</varname></ph>
 <ph rev="1.2.1">SHOW COLUMN STATS [<varname>database_name</varname>.]<varname>table_name</varname></ph>
 <ph rev="1.4.0">SHOW PARTITIONS [<varname>database_name</varname>.]<varname>table_name</varname></ph>
-<ph rev="2.2.0">SHOW FILES IN [<varname>database_name</varname>.]<varname>table_name</varname> [PARTITION (<varname>key_col</varname>=<varname>value</varname> [, <varname>key_col</varname>=<varname>value</varname>]]</ph>
+SHOW FILES IN [<varname>database_name</varname>.]<varname>table_name</varname> <ph  rev="IMPALA-1654">[PARTITION (<varname>key_col_expression</varname> [, <varname>key_col_expression</varname>]</ph>]

 <ph rev="2.0.0">SHOW ROLES
 SHOW CURRENT ROLES
@@ -113,6 +113,19 @@ show tables '*dim*|*fact*';</codeblock>
        <codeph>MB</codeph> for megabytes, and <codeph>GB</codeph> for gigabytes.
      </p>

+      <p rev="IMPALA-1654">
+        In <keyword keyref="impala28_full"/> and higher, you can use general
+        expressions with operators such as <codeph>&lt;</codeph>, <codeph>IN</codeph>,
+        <codeph>LIKE</codeph>, and <codeph>BETWEEN</codeph> in the <codeph>PARTITION</codeph>
+        clause, instead of only equality operators. For example:
+<codeblock><![CDATA[
+show files in sample_table partition (j < 5);
+show files in sample_table partition (k = 3, l between 1 and 10);
+show files in sample_table partition (month like 'J%');
+]]>
+</codeblock>
+      </p>
+
      <note>
        This statement applies to tables and partitions stored on HDFS, or in the Amazon Simple Storage System (S3).
        It does not apply to views.