mirror of
https://github.com/apache/impala.git
synced 2026-01-05 21:00:54 -05:00
A number of 'CDH' by itself were turned into substitution variables resolving to 'Apache Hadoop'. Also fixed some stray instances of CDH version numbers. In some cases, 'CDH' or 'CDH 5' by itself was superfluous and was just removed. Change-Id: I979ea73ccaa5873d4108545f18f598072fb5e05f Reviewed-on: http://gerrit.cloudera.org:8080/6352 Reviewed-by: John Russell <jrussell@cloudera.com> Tested-by: Impala Public Jenkins
607 lines
29 KiB
XML
607 lines
29 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
|
|
<!--
|
|
Licensed to the Apache Software Foundation (ASF) under one
|
|
or more contributor license agreements. See the NOTICE file
|
|
distributed with this work for additional information
|
|
regarding copyright ownership. The ASF licenses this file
|
|
to you under the Apache License, Version 2.0 (the
|
|
"License"); you may not use this file except in compliance
|
|
with the License. You may obtain a copy of the License at
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
Unless required by applicable law or agreed to in writing,
|
|
software distributed under the License is distributed on an
|
|
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
KIND, either express or implied. See the License for the
|
|
specific language governing permissions and limitations
|
|
under the License.
|
|
-->
|
|
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
|
<concept id="partitioning">
|
|
|
|
<title>Partitioning for Impala Tables</title>
|
|
|
|
<titlealts audience="PDF">
|
|
|
|
<navtitle>Partitioning</navtitle>
|
|
|
|
</titlealts>
|
|
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="Impala"/>
|
|
<data name="Category" value="SQL"/>
|
|
<data name="Category" value="Performance"/>
|
|
<data name="Category" value="Developers"/>
|
|
<data name="Category" value="Data Analysts"/>
|
|
</metadata>
|
|
</prolog>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
<indexterm audience="hidden">partitioning</indexterm>
|
|
By default, all the data files for a table are located in a single directory. Partitioning is a technique for physically dividing the
|
|
data during loading, based on values from one or more columns, to speed up queries that test those columns. For example, with a
|
|
<codeph>school_records</codeph> table partitioned on a <codeph>year</codeph> column, there is a separate data directory for each
|
|
different year value, and all the data for that year is stored in a data file in that directory. A query that includes a
|
|
<codeph>WHERE</codeph> condition such as <codeph>YEAR=1966</codeph>, <codeph>YEAR IN (1989,1999)</codeph>, or <codeph>YEAR BETWEEN
|
|
1984 AND 1989</codeph> can examine only the data files from the appropriate directory or directories, greatly reducing the amount of
|
|
data to read and test.
|
|
</p>
|
|
|
|
<p outputclass="toc inpage"/>
|
|
|
|
<p>
|
|
See <xref href="impala_tutorial.xml#tut_external_partition_data"/> for an example that illustrates the syntax for creating partitioned
|
|
tables, the underlying directory structure in HDFS, and how to attach a partitioned Impala external table to data files stored
|
|
elsewhere in HDFS.
|
|
</p>
|
|
|
|
<p>
|
|
Parquet is a popular format for partitioned Impala tables because it is well suited to handle huge data volumes. See
|
|
<xref href="impala_parquet.xml#parquet_performance"/> for performance considerations for partitioned Parquet tables.
|
|
</p>
|
|
|
|
<p>
|
|
See <xref href="impala_literals.xml#null"/> for details about how <codeph>NULL</codeph> values are represented in partitioned tables.
|
|
</p>
|
|
|
|
<p rev="2.2.0">
|
|
See <xref href="impala_s3.xml#s3"/> for details about setting up tables where some or all partitions reside on the Amazon Simple
|
|
Storage Service (S3).
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
<concept id="partitioning_choosing">
|
|
|
|
<title>When to Use Partitioned Tables</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Partitioning is typically appropriate for:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
Tables that are very large, where reading the entire data set takes an impractical amount of time.
|
|
</li>
|
|
|
|
<li>
|
|
Tables that are always or almost always queried with conditions on the partitioning columns. In our example of a table partitioned
|
|
by year, <codeph>SELECT COUNT(*) FROM school_records WHERE year = 1985</codeph> is efficient, only examining a small fraction of
|
|
the data; but <codeph>SELECT COUNT(*) FROM school_records</codeph> has to process a separate data file for each year, resulting in
|
|
more overall work than in an unpartitioned table. You would probably not partition this way if you frequently queried the table
|
|
based on last name, student ID, and so on without testing the year.
|
|
</li>
|
|
|
|
<li>
|
|
Columns that have reasonable cardinality (number of different values). If a column only has a small number of values, for example
|
|
<codeph>Male</codeph> or <codeph>Female</codeph>, you do not gain much efficiency by eliminating only about 50% of the data to
|
|
read for each query. If a column has only a few rows matching each value, the number of directories to process can become a
|
|
limiting factor, and the data file in each directory could be too small to take advantage of the Hadoop mechanism for transmitting
|
|
data in multi-megabyte blocks. For example, you might partition census data by year, store sales data by year and month, and web
|
|
traffic data by year, month, and day. (Some users with high volumes of incoming data might even partition down to the individual
|
|
hour and minute.)
|
|
</li>
|
|
|
|
<li>
|
|
Data that already passes through an extract, transform, and load (ETL) pipeline. The values of the partitioning columns are
|
|
stripped from the original data files and represented by directory names, so loading data into a partitioned table involves some
|
|
sort of transformation or preprocessing.
|
|
</li>
|
|
</ul>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="partition_sql">
|
|
|
|
<title>SQL Statements for Partitioned Tables</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
In terms of Impala SQL syntax, partitioning affects these statements:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
<codeph><xref href="impala_create_table.xml#create_table">CREATE TABLE</xref></codeph>: you specify a <codeph>PARTITIONED
|
|
BY</codeph> clause when creating the table to identify names and data types of the partitioning columns. These columns are not
|
|
included in the main list of columns for the table.
|
|
</li>
|
|
|
|
<li rev="2.5.0">
|
|
In <keyword keyref="impala25_full"/> and higher, you can also use the <codeph>PARTITIONED BY</codeph> clause in a <codeph>CREATE TABLE AS
|
|
SELECT</codeph> statement. This syntax lets you use a single statement to create a partitioned table, copy data into it, and
|
|
create new partitions based on the values in the inserted data.
|
|
</li>
|
|
|
|
<li>
|
|
<codeph><xref href="impala_alter_table.xml#alter_table">ALTER TABLE</xref></codeph>: you can add or drop partitions, to work with
|
|
different portions of a huge data set. You can designate the HDFS directory that holds the data files for a specific partition.
|
|
With data partitioned by date values, you might <q>age out</q> data that is no longer relevant.
|
|
<note conref="../shared/impala_common.xml#common/add_partition_set_location"/>
|
|
</li>
|
|
|
|
<li>
|
|
<codeph><xref href="impala_insert.xml#insert">INSERT</xref></codeph>: When you insert data into a partitioned table, you identify
|
|
the partitioning columns. One or more values from each inserted row are not stored in data files, but instead determine the
|
|
directory where that row value is stored. You can also specify which partition to load a set of data into, with <codeph>INSERT
|
|
OVERWRITE</codeph> statements; you can replace the contents of a specific partition but you cannot append data to a specific
|
|
partition.
|
|
<p rev="1.3.1" conref="../shared/impala_common.xml#common/insert_inherit_permissions"/>
|
|
</li>
|
|
|
|
<li>
|
|
Although the syntax of the <codeph><xref href="impala_select.xml#select">SELECT</xref></codeph> statement is the same whether or
|
|
not the table is partitioned, the way queries interact with partitioned tables can have a dramatic impact on performance and
|
|
scalability. The mechanism that lets queries skip certain partitions during a query is known as partition pruning; see
|
|
<xref href="impala_partitioning.xml#partition_pruning"/> for details.
|
|
</li>
|
|
|
|
<li rev="1.4.0">
|
|
In Impala 1.4 and later, there is a <codeph>SHOW PARTITIONS</codeph> statement that displays information about each partition in a
|
|
table. See <xref href="impala_show.xml#show"/> for details.
|
|
</li>
|
|
</ul>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="partition_static_dynamic">
|
|
|
|
<title>Static and Dynamic Partitioning Clauses</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Specifying all the partition columns in a SQL statement is called <term>static partitioning</term>, because the statement affects a
|
|
single predictable partition. For example, you use static partitioning with an <codeph>ALTER TABLE</codeph> statement that affects
|
|
only one partition, or with an <codeph>INSERT</codeph> statement that inserts all values into the same partition:
|
|
</p>
|
|
|
|
<codeblock>insert into t1 <b>partition(x=10, y='a')</b> select c1 from some_other_table;
|
|
</codeblock>
|
|
|
|
<p>
|
|
When you specify some partition key columns in an <codeph>INSERT</codeph> statement, but leave out the values, Impala determines
|
|
which partition to insert. This technique is called <term>dynamic partitioning</term>:
|
|
</p>
|
|
|
|
<codeblock>insert into t1 <b>partition(x, y='b')</b> select c1, c2 from some_other_table;
|
|
-- Create new partition if necessary based on variable year, month, and day; insert a single value.
|
|
insert into weather <b>partition (year, month, day)</b> select 'cloudy',2014,4,21;
|
|
-- Create new partition if necessary for specified year and month but variable day; insert a single value.
|
|
insert into weather <b>partition (year=2014, month=04, day)</b> select 'sunny',22;
|
|
</codeblock>
|
|
|
|
<p>
|
|
The more key columns you specify in the <codeph>PARTITION</codeph> clause, the fewer columns you need in the <codeph>SELECT</codeph>
|
|
list. The trailing columns in the <codeph>SELECT</codeph> list are substituted in order for the partition key columns with no
|
|
specified value.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="partition_refresh" rev="2.7.0 IMPALA-1683">
|
|
|
|
<title>Refreshing a Single Partition</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
The <codeph>REFRESH</codeph> statement is typically used with partitioned tables when new data files are loaded into a partition by
|
|
some non-Impala mechanism, such as a Hive or Spark job. The <codeph>REFRESH</codeph> statement makes Impala aware of the new data
|
|
files so that they can be used in Impala queries. Because partitioned tables typically contain a high volume of data, the
|
|
<codeph>REFRESH</codeph> operation for a full partitioned table can take significant time.
|
|
</p>
|
|
|
|
<p>
|
|
In <keyword keyref="impala27_full"/> and higher, you can include a <codeph>PARTITION (<varname>partition_spec</varname>)</codeph> clause in the
|
|
<codeph>REFRESH</codeph> statement so that only a single partition is refreshed. For example, <codeph>REFRESH big_table PARTITION
|
|
(year=2017, month=9, day=30)</codeph>. The partition spec must include all the partition key columns. See
|
|
<xref href="impala_refresh.xml#refresh"/> for more details and examples of <codeph>REFRESH</codeph> syntax and usage.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="partition_permissions">
|
|
|
|
<title>Permissions for Partition Subdirectories</title>
|
|
|
|
<conbody>
|
|
|
|
<p rev="1.3.1" conref="../shared/impala_common.xml#common/insert_inherit_permissions"/>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="partition_pruning">
|
|
|
|
<title>Partition Pruning for Queries</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Partition pruning refers to the mechanism where a query can skip reading the data files corresponding to one or more partitions. If
|
|
you can arrange for queries to prune large numbers of unnecessary partitions from the query execution plan, the queries use fewer
|
|
resources and are thus proportionally faster and more scalable.
|
|
</p>
|
|
|
|
<p>
|
|
For example, if a table is partitioned by columns <codeph>YEAR</codeph>, <codeph>MONTH</codeph>, and <codeph>DAY</codeph>, then
|
|
<codeph>WHERE</codeph> clauses such as <codeph>WHERE year = 2013</codeph>, <codeph>WHERE year < 2010</codeph>, or <codeph>WHERE
|
|
year BETWEEN 1995 AND 1998</codeph> allow Impala to skip the data files in all partitions outside the specified range. Likewise,
|
|
<codeph>WHERE year = 2013 AND month BETWEEN 1 AND 3</codeph> could prune even more partitions, reading the data files for only a
|
|
portion of one year.
|
|
</p>
|
|
|
|
<p outputclass="toc inpage"/>
|
|
|
|
</conbody>
|
|
|
|
<concept id="partition_pruning_checking">
|
|
|
|
<title>Checking if Partition Pruning Happens for a Query</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
To check the effectiveness of partition pruning for a query, check the <codeph>EXPLAIN</codeph> output for the query before
|
|
running it. For example, this example shows a table with 3 partitions, where the query only reads 1 of them. The notation
|
|
<codeph>#partitions=1/3</codeph> in the <codeph>EXPLAIN</codeph> plan confirms that Impala can do the appropriate partition
|
|
pruning.
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > insert into census partition (year=2010) values ('Smith'),('Jones');
|
|
[localhost:21000] > insert into census partition (year=2011) values ('Smith'),('Jones'),('Doe');
|
|
[localhost:21000] > insert into census partition (year=2012) values ('Smith'),('Doe');
|
|
[localhost:21000] > select name from census where year=2010;
|
|
+-------+
|
|
| name |
|
|
+-------+
|
|
| Smith |
|
|
| Jones |
|
|
+-------+
|
|
[localhost:21000] > explain select name from census <b>where year=2010</b>;
|
|
+------------------------------------------------------------------+
|
|
| Explain String |
|
|
+------------------------------------------------------------------+
|
|
| PLAN FRAGMENT 0 |
|
|
| PARTITION: UNPARTITIONED |
|
|
| |
|
|
| 1:EXCHANGE |
|
|
| |
|
|
| PLAN FRAGMENT 1 |
|
|
| PARTITION: RANDOM |
|
|
| |
|
|
| STREAM DATA SINK |
|
|
| EXCHANGE ID: 1 |
|
|
| UNPARTITIONED |
|
|
| |
|
|
| 0:SCAN HDFS |
|
|
| table=predicate_propagation.census <b>#partitions=1/3</b> size=12B |
|
|
+------------------------------------------------------------------+</codeblock>
|
|
|
|
<p rev="1.4.0">
|
|
For a report of the volume of data that was actually read and processed at each stage of the query, check the output of the
|
|
<codeph>SUMMARY</codeph> command immediately after running the query. For a more detailed analysis, look at the output of the
|
|
<codeph>PROFILE</codeph> command; it includes this same summary report near the start of the profile output.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="partition_pruning_sql">
|
|
|
|
<title>What SQL Constructs Work with Partition Pruning</title>
|
|
|
|
<conbody>
|
|
|
|
<p rev="1.2.2">
|
|
<indexterm audience="hidden">predicate propagation</indexterm>
|
|
Impala can even do partition pruning in cases where the partition key column is not directly compared to a constant, by applying
|
|
the transitive property to other parts of the <codeph>WHERE</codeph> clause. This technique is known as predicate propagation, and
|
|
is available in Impala 1.2.2 and later. In this example, the census table includes another column indicating when the data was
|
|
collected, which happens in 10-year intervals. Even though the query does not compare the partition key column
|
|
(<codeph>YEAR</codeph>) to a constant value, Impala can deduce that only the partition <codeph>YEAR=2010</codeph> is required, and
|
|
again only reads 1 out of 3 partitions.
|
|
</p>
|
|
|
|
<codeblock rev="1.2.2">[localhost:21000] > drop table census;
|
|
[localhost:21000] > create table census (name string, census_year int) partitioned by (year int);
|
|
[localhost:21000] > insert into census partition (year=2010) values ('Smith',2010),('Jones',2010);
|
|
[localhost:21000] > insert into census partition (year=2011) values ('Smith',2020),('Jones',2020),('Doe',2020);
|
|
[localhost:21000] > insert into census partition (year=2012) values ('Smith',2020),('Doe',2020);
|
|
[localhost:21000] > select name from census where year = census_year and census_year=2010;
|
|
+-------+
|
|
| name |
|
|
+-------+
|
|
| Smith |
|
|
| Jones |
|
|
+-------+
|
|
[localhost:21000] > explain select name from census <b>where year = census_year and census_year=2010</b>;
|
|
+------------------------------------------------------------------+
|
|
| Explain String |
|
|
+------------------------------------------------------------------+
|
|
| PLAN FRAGMENT 0 |
|
|
| PARTITION: UNPARTITIONED |
|
|
| |
|
|
| 1:EXCHANGE |
|
|
| |
|
|
| PLAN FRAGMENT 1 |
|
|
| PARTITION: RANDOM |
|
|
| |
|
|
| STREAM DATA SINK |
|
|
| EXCHANGE ID: 1 |
|
|
| UNPARTITIONED |
|
|
| |
|
|
| 0:SCAN HDFS |
|
|
| table=predicate_propagation.census <b>#partitions=1/3</b> size=22B |
|
|
| predicates: census_year = 2010, year = census_year |
|
|
+------------------------------------------------------------------+
|
|
</codeblock>
|
|
|
|
<p conref="../shared/impala_common.xml#common/partitions_and_views"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/analytic_partition_pruning_caveat"/>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="dynamic_partition_pruning">
|
|
|
|
<title>Dynamic Partition Pruning</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
The original mechanism uses to prune partitions is <term>static partition pruning</term>, in which the conditions in the
|
|
<codeph>WHERE</codeph> clause are analyzed to determine in advance which partitions can be safely skipped. In <keyword keyref="impala25_full"/>
|
|
and higher, Impala can perform <term>dynamic partition pruning</term>, where information about the partitions is collected during
|
|
the query, and Impala prunes unnecessary partitions in ways that were impractical to predict in advance.
|
|
</p>
|
|
|
|
<p>
|
|
For example, if partition key columns are compared to literal values in a <codeph>WHERE</codeph> clause, Impala can perform static
|
|
partition pruning during the planning phase to only read the relevant partitions:
|
|
</p>
|
|
|
|
<codeblock>
|
|
-- The query only needs to read 3 partitions whose key values are known ahead of time.
|
|
-- That's static partition pruning.
|
|
SELECT COUNT(*) FROM sales_table WHERE year IN (2005, 2010, 2015);
|
|
</codeblock>
|
|
|
|
<p>
|
|
Dynamic partition pruning involves using information only available at run time, such as the result of a subquery:
|
|
</p>
|
|
|
|
<codeblock conref="../shared/impala_common.xml#common/simple_dpp_example"/>
|
|
|
|
<!-- Former example. Not sure it really would trigger DPP. SELECT COUNT(*) FROM sales_table WHERE year = (SELECT MAX(year) FROM some_other_table); -->
|
|
|
|
<p>
|
|
In this case, Impala evaluates the subquery, sends the subquery results to all Impala nodes participating in the query, and then
|
|
each <cmdname>impalad</cmdname> daemon uses the dynamic partition pruning optimization to read only the partitions with the
|
|
relevant key values.
|
|
</p>
|
|
|
|
<p>
|
|
Dynamic partition pruning is especially effective for queries involving joins of several large partitioned tables. Evaluating the
|
|
<codeph>ON</codeph> clauses of the join predicates might normally require reading data from all partitions of certain tables. If
|
|
the <codeph>WHERE</codeph> clauses of the query refer to the partition key columns, Impala can now often skip reading many of the
|
|
partitions while evaluating the <codeph>ON</codeph> clauses. The dynamic partition pruning optimization reduces the amount of I/O
|
|
and the amount of intermediate data stored and transmitted across the network during the query.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/spill_to_disk_vs_dynamic_partition_pruning"/>
|
|
|
|
<p>
|
|
Dynamic partition pruning is part of the runtime filtering feature, which applies to other kinds of queries in addition to queries
|
|
against partitioned tables. See <xref href="impala_runtime_filtering.xml#runtime_filtering"/> for full details about this feature.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
</concept>
|
|
|
|
<concept id="partition_key_columns">
|
|
|
|
<title>Partition Key Columns</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
The columns you choose as the partition keys should be ones that are frequently used to filter query results in important,
|
|
large-scale queries. Popular examples are some combination of year, month, and day when the data has associated time values, and
|
|
geographic region when the data is associated with some place.
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
<p>
|
|
For time-based data, split out the separate parts into their own columns, because Impala cannot partition based on a
|
|
<codeph>TIMESTAMP</codeph> column.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
The data type of the partition columns does not have a significant effect on the storage required, because the values from those
|
|
columns are not stored in the data files, rather they are represented as strings inside HDFS directory names.
|
|
</p>
|
|
</li>
|
|
|
|
<li rev="IMPALA-2499">
|
|
<p>
|
|
In <keyword keyref="impala25_full"/> and higher, you can enable the <codeph>OPTIMIZE_PARTITION_KEY_SCANS</codeph> query option to speed up
|
|
queries that only refer to partition key columns, such as <codeph>SELECT MAX(year)</codeph>. This setting is not enabled by
|
|
default because the query behavior is slightly different if the table contains partition directories without actual data inside.
|
|
See <xref href="impala_optimize_partition_key_scans.xml#optimize_partition_key_scans"/> for details.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p conref="../shared/impala_common.xml#common/complex_types_partitioning"/>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
Remember that when Impala queries data stored in HDFS, it is most efficient to use multi-megabyte files to take advantage of the
|
|
HDFS block size. For Parquet tables, the block size (and ideal size of the data files) is <ph rev="parquet_block_size">256 MB in
|
|
Impala 2.0 and later</ph>. Therefore, avoid specifying too many partition key columns, which could result in individual
|
|
partitions containing only small amounts of data. For example, if you receive 1 GB of data per day, you might partition by year,
|
|
month, and day; while if you receive 5 GB of data per minute, you might partition by year, month, day, hour, and minute. If you
|
|
have data with a geographic component, you might partition based on postal code if you have many megabytes of data for each
|
|
postal code, but if not, you might partition by some larger region such as city, state, or country. state
|
|
</p>
|
|
</li>
|
|
</ul>
|
|
|
|
<p conref="../shared/impala_common.xml#common/partition_key_optimization"/>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="mixed_format_partitions">
|
|
|
|
<title>Setting Different File Formats for Partitions</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Partitioned tables have the flexibility to use different file formats for different partitions. (For background information about
|
|
the different file formats Impala supports, see <xref href="impala_file_formats.xml#file_formats"/>.) For example, if you originally
|
|
received data in text format, then received new data in RCFile format, and eventually began receiving data in Parquet format, all
|
|
that data could reside in the same table for queries. You just need to ensure that the table is structured so that the data files
|
|
that use different file formats reside in separate partitions.
|
|
</p>
|
|
|
|
<p>
|
|
For example, here is how you might switch from text to Parquet data as you receive data for different years:
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > create table census (name string) partitioned by (year smallint);
|
|
[localhost:21000] > alter table census add partition (year=2012); -- Text format;
|
|
|
|
[localhost:21000] > alter table census add partition (year=2013); -- Text format switches to Parquet before data loaded;
|
|
[localhost:21000] > alter table census partition (year=2013) set fileformat parquet;
|
|
|
|
[localhost:21000] > insert into census partition (year=2012) values ('Smith'),('Jones'),('Lee'),('Singh');
|
|
[localhost:21000] > insert into census partition (year=2013) values ('Flores'),('Bogomolov'),('Cooper'),('Appiah');</codeblock>
|
|
|
|
<p>
|
|
At this point, the HDFS directory for <codeph>year=2012</codeph> contains a text-format data file, while the HDFS directory for
|
|
<codeph>year=2013</codeph> contains a Parquet data file. As always, when loading non-trivial data, you would use <codeph>INSERT ...
|
|
SELECT</codeph> or <codeph>LOAD DATA</codeph> to import data in large batches, rather than <codeph>INSERT ... VALUES</codeph> which
|
|
produces small files that are inefficient for real-world queries.
|
|
</p>
|
|
|
|
<p>
|
|
For other file types that Impala cannot create natively, you can switch into Hive and issue the <codeph>ALTER TABLE ... SET
|
|
FILEFORMAT</codeph> statements and <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> statements there. After switching back to
|
|
Impala, issue a <codeph>REFRESH <varname>table_name</varname></codeph> statement so that Impala recognizes any partitions or new
|
|
data added through Hive.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="partition_management">
|
|
|
|
<title>Managing Partitions</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
You can add, drop, set the expected file format, or set the HDFS location of the data files for individual partitions within an
|
|
Impala table. See <xref href="impala_alter_table.xml#alter_table"/> for syntax details, and
|
|
<xref href="impala_partitioning.xml#mixed_format_partitions"/> for tips on managing tables containing partitions with different file
|
|
formats.
|
|
</p>
|
|
|
|
<note conref="../shared/impala_common.xml#common/add_partition_set_location"/>
|
|
|
|
<p>
|
|
What happens to the data files when a partition is dropped depends on whether the partitioned table is designated as internal or
|
|
external. For an internal (managed) table, the data files are deleted. For example, if data in the partitioned table is a copy of
|
|
raw data files stored elsewhere, you might save disk space by dropping older partitions that are no longer required for reporting,
|
|
knowing that the original data is still available if needed later. For an external table, the data files are left alone. For
|
|
example, dropping a partition without deleting the associated files lets Impala consider a smaller set of partitions, improving
|
|
query efficiency and reducing overhead for DDL operations on the table; if the data is needed again later, you can add the partition
|
|
again. See <xref href="impala_tables.xml#tables" /> for details and examples.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept rev="kudu 2.8.0" id="partition_kudu">
|
|
|
|
<title>Using Partitioning with Kudu Tables</title>
|
|
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="Kudu"/>
|
|
</metadata>
|
|
</prolog>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Kudu tables use a more fine-grained partitioning scheme than tables containing HDFS data files. You specify a <codeph>PARTITION
|
|
BY</codeph> clause with the <codeph>CREATE TABLE</codeph> statement to identify how to divide the values from the partition key
|
|
columns.
|
|
</p>
|
|
|
|
<p>
|
|
See <xref href="impala_kudu.xml#kudu_partitioning"/> for
|
|
details and examples of the partitioning techniques
|
|
for Kudu tables.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
</concept>
|