mirror of
https://github.com/apache/impala.git
synced 2025-12-25 02:03:09 -05:00
Change-Id: If786fc3d3064b26b213afb685a2f310cebe904fe Reviewed-on: http://gerrit.cloudera.org:8080/10718 Reviewed-by: Alex Rodoni <arodoni@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
671 lines
31 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="partitioning">

<title>Partitioning for Impala Tables</title>

<titlealts audience="PDF">

<navtitle>Partitioning</navtitle>

</titlealts>

<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="hidden">partitioning</indexterm>
By default, all the data files for a table are located in a single directory. Partitioning is a technique for physically dividing the
data during loading, based on values from one or more columns, to speed up queries that test those columns. For example, with a
<codeph>school_records</codeph> table partitioned on a <codeph>year</codeph> column, there is a separate data directory for each
different year value, and all the data for that year is stored in a data file in that directory. A query that includes a
<codeph>WHERE</codeph> condition such as <codeph>YEAR=1966</codeph>, <codeph>YEAR IN (1989,1999)</codeph>, or <codeph>YEAR BETWEEN
1984 AND 1989</codeph> can examine only the data files from the appropriate directory or directories, greatly reducing the amount of
data to read and test.
</p>
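<p>
The <codeph>school_records</codeph> scenario might be set up as follows (a minimal sketch; the column names and warehouse paths are
illustrative, not taken from a real deployment):
</p>

<codeblock>create table school_records (student_name string, grade int)
  partitioned by (year int);

-- Each distinct YEAR value gets its own HDFS subdirectory, for example:
--   .../school_records/year=1985/
--   .../school_records/year=1986/
-- A query such as the following only reads the year=1985 directory:
select count(*) from school_records where year = 1985;</codeblock>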
<p outputclass="toc inpage"/>

<p>
See <xref href="impala_tutorial.xml#tut_external_partition_data"/> for an example that illustrates the syntax for creating partitioned
tables, the underlying directory structure in HDFS, and how to attach a partitioned Impala external table to data files stored
elsewhere in HDFS.
</p>

<p>
Parquet is a popular format for partitioned Impala tables because it is well suited to handle huge data volumes. See
<xref href="impala_parquet.xml#parquet_performance"/> for performance considerations for partitioned Parquet tables.
</p>

<p>
See <xref href="impala_literals.xml#null"/> for details about how <codeph>NULL</codeph> values are represented in partitioned tables.
</p>

<p rev="2.2.0">
See <xref href="impala_s3.xml#s3"/> for details about setting up tables where some or all partitions reside on the Amazon Simple
Storage Service (S3).
</p>

</conbody>

<concept id="partitioning_choosing">

<title>When to Use Partitioned Tables</title>

<conbody>

<p>
Partitioning is typically appropriate for:
</p>

<ul>
<li>
Tables that are very large, where reading the entire data set takes an impractical amount of time.
</li>
<li>
Tables that are always or almost always queried with conditions on the partitioning columns. In our example of a table partitioned
by year, <codeph>SELECT COUNT(*) FROM school_records WHERE year = 1985</codeph> is efficient, only examining a small fraction of
the data; but <codeph>SELECT COUNT(*) FROM school_records</codeph> has to process a separate data file for each year, resulting in
more overall work than in an unpartitioned table. You would probably not partition this way if you frequently queried the table
based on last name, student ID, and so on without testing the year.
</li>
<li>
Columns that have reasonable cardinality (number of different values). If a column only has a small number of values, for example
<codeph>Male</codeph> or <codeph>Female</codeph>, you do not gain much efficiency by eliminating only about 50% of the data to
read for each query. If a column has only a few rows matching each value, the number of directories to process can become a
limiting factor, and the data file in each directory could be too small to take advantage of the Hadoop mechanism for transmitting
data in multi-megabyte blocks. For example, you might partition census data by year; store sales data by year and month; and web
traffic data by year, month, and day. (Some users with high volumes of incoming data might even partition down to the individual
hour and minute.)
</li>
<li>
Data that already passes through an extract, transform, and load (ETL) pipeline. The values of the partitioning columns are
stripped from the original data files and represented by directory names, so loading data into a partitioned table involves some
sort of transformation or preprocessing.
</li>
</ul>
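<p>
To make the cardinality guidance concrete, here is a hedged sketch (the table and column names are invented for illustration):
</p>

<codeblock>-- Reasonable granularity: each (year, month) combination holds many megabytes.
create table store_sales (item_id bigint, amount decimal(10,2))
  partitioned by (year int, month int);

-- Probably a poor choice: a two-value column such as gender eliminates
-- only about half the data for each query.
-- create table store_sales (...) partitioned by (gender string);</codeblock>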
</conbody>

</concept>

<concept id="partition_sql">

<title>SQL Statements for Partitioned Tables</title>

<conbody>

<p>
In terms of Impala SQL syntax, partitioning affects these statements:
</p>

<ul>
<li>
<codeph><xref href="impala_create_table.xml#create_table">CREATE TABLE</xref></codeph>: you specify a <codeph>PARTITIONED
BY</codeph> clause when creating the table to identify names and data types of the partitioning columns. These columns are not
included in the main list of columns for the table.
</li>
<li rev="2.5.0">
In <keyword keyref="impala25_full"/> and higher, you can also use the <codeph>PARTITIONED BY</codeph> clause in a <codeph>CREATE TABLE AS
SELECT</codeph> statement. This syntax lets you use a single statement to create a partitioned table, copy data into it, and
create new partitions based on the values in the inserted data.
</li>
<li>
<codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE</xref></codeph>: you can add or drop partitions to work
with different portions of a huge data set. You can designate the HDFS directory that holds the data files for a specific
partition. With data partitioned by date values, you might <q>age out</q> data that is no longer relevant.
<note conref="../shared/impala_common.xml#common/add_partition_set_location"/>
</li>
<li>
<codeph><xref href="impala_insert.xml#insert">INSERT</xref></codeph>: When you insert data into a partitioned table, you identify
the partitioning columns. One or more values from each inserted row are not stored in data files, but instead determine the
directory where that row is stored. You can also specify which partition to load a set of data into, with
<codeph>INSERT OVERWRITE</codeph> statements; you can replace the contents of a specific partition but you cannot append data to a
specific partition.
<p rev="1.3.1" conref="../shared/impala_common.xml#common/insert_inherit_permissions"/>
</li>
<li>
Although the syntax of the <codeph><xref href="impala_select.xml#select">SELECT</xref></codeph> statement is the same whether or
not the table is partitioned, the way queries interact with partitioned tables can have a dramatic impact on performance and
scalability. The mechanism that lets queries skip certain partitions during a query is known as partition pruning; see
<xref href="impala_partitioning.xml#partition_pruning"/> for details.
</li>
<li rev="1.4.0">
In Impala 1.4 and later, there is a <codeph>SHOW PARTITIONS</codeph> statement that displays information about each partition in a
table. See <xref href="impala_show.xml#show"/> for details.
</li>
</ul>
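<p>
Taken together, the statements above can be sketched end to end as follows (a minimal illustration; the table names and partition
values are invented):
</p>

<codeblock>create table logs (msg string) partitioned by (year int, month int);

-- Add a partition for a specific portion of the data set.
alter table logs add partition (year=2017, month=1);

-- Replace the contents of that one partition.
insert overwrite table logs partition (year=2017, month=1)
  select msg from staging_logs;

-- Display information about each partition in the table.
show partitions logs;</codeblock>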
</conbody>

</concept>

<concept id="partition_static_dynamic">

<title>Static and Dynamic Partitioning Clauses</title>

<conbody>

<p>
Specifying all the partition columns in a SQL statement is called <term>static partitioning</term>, because the statement affects a
single predictable partition. For example, you use static partitioning with an <codeph>ALTER TABLE</codeph> statement that affects
only one partition, or with an <codeph>INSERT</codeph> statement that inserts all values into the same partition:
</p>

<codeblock>insert into t1 <b>partition(x=10, y='a')</b> select c1 from some_other_table;
</codeblock>

<p>
When you specify some partition key columns in an <codeph>INSERT</codeph> statement, but leave out the values, Impala determines
which partition to insert into. This technique is called <term>dynamic partitioning</term>:
</p>

<codeblock>insert into t1 <b>partition(x, y='b')</b> select c1, c2 from some_other_table;

-- Create new partition if necessary based on variable year, month, and day; insert a single value.
insert into weather <b>partition (year, month, day)</b> select 'cloudy',2014,4,21;

-- Create new partition if necessary for specified year and month but variable day; insert a single value.
insert into weather <b>partition (year=2014, month=04, day)</b> select 'sunny',22;
</codeblock>

<p>
The more key columns you specify in the <codeph>PARTITION</codeph> clause, the fewer columns you need in the <codeph>SELECT</codeph>
list. The trailing columns in the <codeph>SELECT</codeph> list are substituted in order for the partition key columns with no
specified value.
</p>
</conbody>

</concept>

<concept id="partition_refresh" rev="2.7.0 IMPALA-1683">

<title>Refreshing a Single Partition</title>

<conbody>

<p>
The <codeph>REFRESH</codeph> statement is typically used with partitioned tables when new data files are loaded into a partition by
some non-Impala mechanism, such as a Hive or Spark job. The <codeph>REFRESH</codeph> statement makes Impala aware of the new data
files so that they can be used in Impala queries. Because partitioned tables typically contain a high volume of data, the
<codeph>REFRESH</codeph> operation for a full partitioned table can take significant time.
</p>

<p>
In <keyword keyref="impala27_full"/> and higher, you can include a <codeph>PARTITION (<varname>partition_spec</varname>)</codeph> clause in the
<codeph>REFRESH</codeph> statement so that only a single partition is refreshed. For example, <codeph>REFRESH big_table PARTITION
(year=2017, month=9, day=30)</codeph>. The partition spec must include all the partition key columns. See
<xref href="impala_refresh.xml#refresh"/> for more details and examples of <codeph>REFRESH</codeph> syntax and usage.
</p>
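<p>
For example, using the same hypothetical <codeph>big_table</codeph>:
</p>

<codeblock>-- Refresh every partition: potentially slow for a large partitioned table.
refresh big_table;

-- Refresh one partition only; the spec names all the partition key columns.
refresh big_table partition (year=2017, month=9, day=30);</codeblock>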
</conbody>

</concept>

<concept id="partition_permissions">

<title>Permissions for Partition Subdirectories</title>

<conbody>

<p rev="1.3.1" conref="../shared/impala_common.xml#common/insert_inherit_permissions"/>

</conbody>

</concept>

<concept id="partition_pruning">

<title>Partition Pruning for Queries</title>

<conbody>

<p>
Partition pruning refers to the mechanism where a query can skip reading the data files corresponding to one or more partitions. If
you can arrange for queries to prune large numbers of unnecessary partitions from the query execution plan, the queries use fewer
resources and are thus proportionally faster and more scalable.
</p>

<p>
For example, if a table is partitioned by columns <codeph>YEAR</codeph>, <codeph>MONTH</codeph>, and <codeph>DAY</codeph>, then
<codeph>WHERE</codeph> clauses such as <codeph>WHERE year = 2013</codeph>, <codeph>WHERE year &lt; 2010</codeph>, or <codeph>WHERE
year BETWEEN 1995 AND 1998</codeph> allow Impala to skip the data files in all partitions outside the specified range. Likewise,
<codeph>WHERE year = 2013 AND month BETWEEN 1 AND 3</codeph> could prune even more partitions, reading the data files for only a
portion of one year.
</p>

<p outputclass="toc inpage"/>

</conbody>
<concept id="partition_pruning_checking">

<title>Checking if Partition Pruning Happens for a Query</title>

<conbody>

<p>
To check the effectiveness of partition pruning for a query, check the <codeph>EXPLAIN</codeph> output for the query before
running it. The following example shows a table with 3 partitions, where the query only reads 1 of them. The notation
<codeph>#partitions=1/3</codeph> in the <codeph>EXPLAIN</codeph> plan confirms that Impala can do the appropriate partition
pruning.
</p>

<codeblock>[localhost:21000] > insert into census partition (year=2010) values ('Smith'),('Jones');
[localhost:21000] > insert into census partition (year=2011) values ('Smith'),('Jones'),('Doe');
[localhost:21000] > insert into census partition (year=2012) values ('Smith'),('Doe');
[localhost:21000] > select name from census where year=2010;
+-------+
| name  |
+-------+
| Smith |
| Jones |
+-------+
[localhost:21000] > explain select name from census <b>where year=2010</b>;
+------------------------------------------------------------------+
| Explain String                                                   |
+------------------------------------------------------------------+
| PLAN FRAGMENT 0                                                  |
|   PARTITION: UNPARTITIONED                                       |
|                                                                  |
|   1:EXCHANGE                                                     |
|                                                                  |
| PLAN FRAGMENT 1                                                  |
|   PARTITION: RANDOM                                              |
|                                                                  |
|   STREAM DATA SINK                                               |
|     EXCHANGE ID: 1                                               |
|     UNPARTITIONED                                                |
|                                                                  |
|   0:SCAN HDFS                                                    |
|     table=predicate_propagation.census <b>#partitions=1/3</b> size=12B |
+------------------------------------------------------------------+</codeblock>

<p rev="1.4.0">
For a report of the volume of data that was actually read and processed at each stage of the query, check the output of the
<codeph>SUMMARY</codeph> command immediately after running the query. For a more detailed analysis, look at the output of the
<codeph>PROFILE</codeph> command; it includes this same summary report near the start of the profile output.
</p>

</conbody>
</concept>

<concept id="partition_pruning_sql">

<title>What SQL Constructs Work with Partition Pruning</title>

<conbody>

<p rev="1.2.2">
<indexterm audience="hidden">predicate propagation</indexterm>
Impala can even do partition pruning in cases where the partition key column is not directly compared to a constant, by applying
the transitive property to other parts of the <codeph>WHERE</codeph> clause. This technique is known as predicate propagation, and
is available in Impala 1.2.2 and later. In this example, the census table includes another column indicating when the data was
collected, which happens in 10-year intervals. Even though the query does not compare the partition key column
(<codeph>YEAR</codeph>) to a constant value, Impala can deduce that only the partition <codeph>YEAR=2010</codeph> is required, and
again only reads 1 out of 3 partitions.
</p>

<codeblock rev="1.2.2">[localhost:21000] > drop table census;
[localhost:21000] > create table census (name string, census_year int) partitioned by (year int);
[localhost:21000] > insert into census partition (year=2010) values ('Smith',2010),('Jones',2010);
[localhost:21000] > insert into census partition (year=2011) values ('Smith',2020),('Jones',2020),('Doe',2020);
[localhost:21000] > insert into census partition (year=2012) values ('Smith',2020),('Doe',2020);
[localhost:21000] > select name from census where year = census_year and census_year=2010;
+-------+
| name  |
+-------+
| Smith |
| Jones |
+-------+
[localhost:21000] > explain select name from census <b>where year = census_year and census_year=2010</b>;
+------------------------------------------------------------------+
| Explain String                                                   |
+------------------------------------------------------------------+
| PLAN FRAGMENT 0                                                  |
|   PARTITION: UNPARTITIONED                                       |
|                                                                  |
|   1:EXCHANGE                                                     |
|                                                                  |
| PLAN FRAGMENT 1                                                  |
|   PARTITION: RANDOM                                              |
|                                                                  |
|   STREAM DATA SINK                                               |
|     EXCHANGE ID: 1                                               |
|     UNPARTITIONED                                                |
|                                                                  |
|   0:SCAN HDFS                                                    |
|     table=predicate_propagation.census <b>#partitions=1/3</b> size=22B |
|     predicates: census_year = 2010, year = census_year           |
+------------------------------------------------------------------+
</codeblock>

<p conref="../shared/impala_common.xml#common/partitions_and_views"/>

<p conref="../shared/impala_common.xml#common/analytic_partition_pruning_caveat"/>

</conbody>

</concept>
<concept id="dynamic_partition_pruning">

<title>Dynamic Partition Pruning</title>

<conbody>

<p>
The original mechanism used to prune partitions is <term>static partition pruning</term>, in which the conditions in the
<codeph>WHERE</codeph> clause are analyzed to determine in advance which partitions can be safely skipped. In <keyword keyref="impala25_full"/>
and higher, Impala can perform <term>dynamic partition pruning</term>, where information about the partitions is collected during
the query, and Impala prunes unnecessary partitions in ways that were impractical to predict in advance.
</p>

<p>
For example, if partition key columns are compared to literal values in a <codeph>WHERE</codeph> clause, Impala can perform static
partition pruning during the planning phase to only read the relevant partitions:
</p>

<codeblock>-- The query only needs to read 3 partitions whose key values are known ahead of time.
-- That's static partition pruning.
SELECT COUNT(*) FROM sales_table WHERE year IN (2005, 2010, 2015);
</codeblock>

<p>
Dynamic partition pruning involves using information only available at run time, such as the result of a subquery. The following
example shows a simple case of dynamic partition pruning.
</p>

<codeblock conref="../shared/impala_common.xml#common/simple_dpp_example"/>

<p>
In the above example, Impala evaluates the subquery, sends the subquery results to all Impala nodes participating in the query, and
then each <cmdname>impalad</cmdname> daemon uses the dynamic partition pruning optimization to read only the partitions with the
relevant key values.
</p>

<p>
The output query plan from the <codeph>EXPLAIN</codeph> statement shows that runtime filters are enabled. The plan also shows that
it expects to read all 5 partitions of the <codeph>yy</codeph> table, indicating that static partition pruning will not happen.
</p>

<p>
The Filter summary in the <codeph>PROFILE</codeph> output shows that the scan node filtered out files based on a runtime filter
produced by dynamic partition pruning.
</p>

<codeblock>Filter 0 (1.00 MB):
 - Files processed: 3
 - <b>Files rejected: 1 (1)</b>
 - Files total: 3 (3)
</codeblock>

<p>
Dynamic partition pruning is especially effective for queries involving joins of several large partitioned tables. Evaluating the
<codeph>ON</codeph> clauses of the join predicates might normally require reading data from all partitions of certain tables. If
the <codeph>WHERE</codeph> clauses of the query refer to the partition key columns, Impala can now often skip reading many of the
partitions while evaluating the <codeph>ON</codeph> clauses. The dynamic partition pruning optimization reduces the amount of I/O
and the amount of intermediate data stored and transmitted across the network during the query.
</p>
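<p>
As a hedged sketch of such a join query (the table and column names are invented; the actual pruning achieved depends on the
runtime filtering settings):
</p>

<codeblock>-- Only the partitions of sales whose year values match rows from dim_year
-- need to be scanned; the relevant years are discovered at run time.
select count(*)
  from sales s join dim_year d on s.year = d.year
  where d.decade = '1990s';</codeblock>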
<p conref="../shared/impala_common.xml#common/spill_to_disk_vs_dynamic_partition_pruning"/>

<p>
Dynamic partition pruning is part of the runtime filtering feature, which applies to other kinds of queries in addition to queries
against partitioned tables. See <xref href="impala_runtime_filtering.xml#runtime_filtering"/> for full details about this feature.
</p>

</conbody>

</concept>

</concept>

<concept id="partition_key_columns">

<title>Partition Key Columns</title>

<conbody>

<p>
The columns you choose as the partition keys should be ones that are frequently used to filter query results in important,
large-scale queries. Popular examples are some combination of year, month, and day when the data has associated time values, and
geographic region when the data is associated with some place.
</p>

<ul>
<li>
<p>
For time-based data, split out the separate parts into their own columns, because Impala cannot partition based on a
<codeph>TIMESTAMP</codeph> column.
</p>
</li>
<li>
<p>
The data type of the partition columns does not have a significant effect on the storage required, because the values from those
columns are not stored in the data files; instead, they are represented as strings inside HDFS directory names.
</p>
</li>
<li rev="IMPALA-2499">
<p>
In <keyword keyref="impala25_full"/> and higher, you can enable the <codeph>OPTIMIZE_PARTITION_KEY_SCANS</codeph> query option to speed up
queries that only refer to partition key columns, such as <codeph>SELECT MAX(year)</codeph>. This setting is not enabled by
default because the query behavior is slightly different if the table contains partition directories without actual data inside.
See <xref href="impala_optimize_partition_key_scans.xml#optimize_partition_key_scans"/> for details.
</p>
</li>
<li>
<p conref="../shared/impala_common.xml#common/complex_types_partitioning"/>
</li>
<li>
<p>
Remember that when Impala queries data stored in HDFS, it is most efficient to use multi-megabyte files to take advantage of the
HDFS block size. For Parquet tables, the block size (and ideal size of the data files) is <ph rev="parquet_block_size">256 MB in
Impala 2.0 and later</ph>. Therefore, avoid specifying too many partition key columns, which could result in individual
partitions containing only small amounts of data. For example, if you receive 1 GB of data per day, you might partition by year,
month, and day; while if you receive 5 GB of data per minute, you might partition by year, month, day, hour, and minute. If you
have data with a geographic component, you might partition based on postal code if you have many megabytes of data for each
postal code, but if not, you might partition by some larger region such as city, state, or country.
</p>
</li>
</ul>
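<p>
For example, to split a <codeph>TIMESTAMP</codeph> value into separate year, month, and day partition key columns during loading,
you might use the <codeph>year()</codeph>, <codeph>month()</codeph>, and <codeph>day()</codeph> date functions (a sketch; the table
and column names are invented):
</p>

<codeblock>insert into events partition (year, month, day)
  select details, year(event_ts), month(event_ts), day(event_ts)
  from staging_events;</codeblock>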
<p conref="../shared/impala_common.xml#common/partition_key_optimization"/>

</conbody>

</concept>

<concept id="mixed_format_partitions">

<title>Setting Different File Formats for Partitions</title>

<conbody>

<p>
Partitioned tables have the flexibility to use different file formats for different partitions. (For background information about
the different file formats Impala supports, see <xref href="impala_file_formats.xml#file_formats"/>.) For example, if you originally
received data in text format, then received new data in RCFile format, and eventually began receiving data in Parquet format, all
that data could reside in the same table for queries. You just need to ensure that the table is structured so that the data files
that use different file formats reside in separate partitions.
</p>

<p>
For example, here is how you might switch from text to Parquet data as you receive data for different years:
</p>

<codeblock>[localhost:21000] > create table census (name string) partitioned by (year smallint);
[localhost:21000] > alter table census add partition (year=2012); -- Text format;

[localhost:21000] > alter table census add partition (year=2013); -- Text format switches to Parquet before data loaded;
[localhost:21000] > alter table census partition (year=2013) set fileformat parquet;

[localhost:21000] > insert into census partition (year=2012) values ('Smith'),('Jones'),('Lee'),('Singh');
[localhost:21000] > insert into census partition (year=2013) values ('Flores'),('Bogomolov'),('Cooper'),('Appiah');</codeblock>

<p>
At this point, the HDFS directory for <codeph>year=2012</codeph> contains a text-format data file, while the HDFS directory for
<codeph>year=2013</codeph> contains a Parquet data file. As always, when loading non-trivial data, you would use <codeph>INSERT ...
SELECT</codeph> or <codeph>LOAD DATA</codeph> to import data in large batches, rather than <codeph>INSERT ... VALUES</codeph>, which
produces small files that are inefficient for real-world queries.
</p>

<p>
For other file types that Impala cannot create natively, you can switch into Hive and issue the <codeph>ALTER TABLE ... SET
FILEFORMAT</codeph> statements and <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> statements there. After switching back to
Impala, issue a <codeph>REFRESH <varname>table_name</varname></codeph> statement so that Impala recognizes any partitions or new
data added through Hive.
</p>

</conbody>

</concept>

<concept id="partition_management">

<title>Managing Partitions</title>

<conbody>

<p>
You can add, drop, set the expected file format, or set the HDFS location of the data files for individual partitions within an
Impala table. See <xref href="impala_alter_table.xml#alter_table"/> for syntax details, and
<xref href="impala_partitioning.xml#mixed_format_partitions"/> for tips on managing tables containing partitions with different file
formats.
</p>

<note conref="../shared/impala_common.xml#common/add_partition_set_location"/>

<p>
What happens to the data files when a partition is dropped depends on whether the partitioned table is designated as internal or
external. For an internal (managed) table, the data files are deleted. For example, if data in the partitioned table is a copy of
raw data files stored elsewhere, you might save disk space by dropping older partitions that are no longer required for reporting,
knowing that the original data is still available if needed later. For an external table, the data files are left alone. For
example, dropping a partition without deleting the associated files lets Impala consider a smaller set of partitions, improving
query efficiency and reducing overhead for DDL operations on the table; if the data is needed again later, you can add the partition
again. See <xref href="impala_tables.xml#tables"/> for details and examples.
</p>
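<p>
For example, dropping and relocating individual partitions might look like this (a sketch; the table name and HDFS path are
invented):
</p>

<codeblock>-- For an internal table, this also deletes the partition's data files.
alter table logs drop partition (year=2010, month=1);

-- Point an existing partition at a specific HDFS directory.
alter table logs partition (year=2017, month=1)
  set location '/warehouse/archive/logs/2017/01';</codeblock>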
</conbody>

</concept>

<concept rev="kudu 2.8.0" id="partition_kudu">

<title>Using Partitioning with Kudu Tables</title>

<prolog>
<metadata>
<data name="Category" value="Kudu"/>
</metadata>
</prolog>

<conbody>

<p>
Kudu tables use a more fine-grained partitioning scheme than tables containing HDFS data files. You specify a <codeph>PARTITION
BY</codeph> clause with the <codeph>CREATE TABLE</codeph> statement to identify how to divide the values from the partition key
columns.
</p>
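<p>
For example, a hash-partitioned Kudu table might be declared as follows (a sketch; the table name, columns, and partition count are
illustrative):
</p>

<codeblock>create table kudu_metrics (
  host string,
  metric string,
  ts bigint,
  value double,
  primary key (host, metric, ts)
)
partition by hash (host) partitions 4
stored as kudu;</codeblock>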
<p>
See <xref href="impala_kudu.xml#kudu_partitioning"/> for details and examples of the partitioning techniques for Kudu tables.
</p>

</conbody>

</concept>

<concept id="partition_stats">

<title>Keeping Statistics Up to Date for Partitioned Tables</title>

<conbody>

<p>
Because the <codeph>COMPUTE STATS</codeph> statement can be resource-intensive to run on a partitioned table as new partitions are
added, Impala includes a variation of this statement that allows computing statistics on a per-partition basis, so that stats can
be incrementally updated when new partitions are added.
</p>

<note type="important">
<p conref="../shared/impala_common.xml#common/cs_or_cis"/>
<p conref="../shared/impala_common.xml#common/incremental_stats_after_full"/>
<p conref="../shared/impala_common.xml#common/incremental_stats_caveats"/>
</note>

<p rev="2.1.0">
The <codeph>COMPUTE INCREMENTAL STATS</codeph> variation computes statistics only for partitions that were
added or changed since the last <codeph>COMPUTE INCREMENTAL STATS</codeph> statement, rather than the entire
table. It is typically used for tables where a full <codeph>COMPUTE STATS</codeph>
operation takes too long to be practical each time a partition is added or dropped. See
<xref href="impala_perf_stats.xml#perf_stats_incremental"/> for full usage details.
</p>

<codeblock conref="../shared/impala_common.xml#common/compute_stats_walkthrough"/>

</conbody>

</concept>

</concept>