<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="impala_iceberg">

<title id="iceberg">Using Impala with Iceberg Tables</title>
<titlealts audience="PDF"><navtitle>Iceberg Tables</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Iceberg"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Tables"/>
</metadata>
</prolog>

<conbody>
<p>
<indexterm audience="hidden">Iceberg</indexterm>
Impala supports Apache Iceberg, an open table format for huge analytic datasets.
With this functionality, you can access any existing Iceberg table using SQL and perform
analytics over it. With Impala you can create and write Iceberg tables in different
Iceberg catalogs (e.g. HiveCatalog, HadoopCatalog). Impala also supports location-based
tables (HadoopTables).
</p>

<p>
For more information on Iceberg, see <xref keyref="upstream_iceberg_site"/>.
</p>

<p outputclass="toc inpage"/>
</conbody>
<concept id="iceberg_features">
<title>Overview of Iceberg features</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
</metadata>
</prolog>
<conbody>
<ul>
<li>
ACID compliance: DML operations are atomic, and queries always read a consistent snapshot.
</li>
<li>
Hidden partitioning: Iceberg produces partition values by taking a column value and
optionally transforming it. Partition information is stored in the Iceberg metadata
files. Iceberg can apply the TRUNCATE transform to column values or calculate
a hash of them and use the result for partitioning. Readers don't need to be aware of the
partitioning of the table.
</li>
<li>
Partition layout evolution: When the data volume or the query patterns change, you
can update the layout of a table. Since hidden partitioning is used, you don't need to
rewrite the data files during partition layout evolution.
</li>
<li>
Schema evolution: add, drop, update, or rename schema elements with no side effects.
</li>
<li>
Time travel: enables reproducible queries that use exactly the same table
snapshot, or lets users easily examine changes.
</li>
<li>
Cloning Iceberg tables: create an empty Iceberg table based on the definition of
another Iceberg table.
</li>
</ul>
</conbody>
</concept>
<concept id="iceberg_create">

<title>Creating Iceberg tables with Impala</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
</metadata>
</prolog>

<conbody>
<p>
When you have an existing Iceberg table that is not yet present in the Hive Metastore,
you can use the <codeph>CREATE EXTERNAL TABLE</codeph> command in Impala to add the table to the Hive
Metastore and make Impala able to interact with it. Currently Impala supports
HadoopTables, HadoopCatalog, and HiveCatalog. If you have an existing table in HiveCatalog
and you are using the same Hive Metastore, no further action is needed.
</p>
<ul>
<li>
<b>HadoopTables</b>. When the table already exists as a HadoopTable, there is
a location on the file system that contains your table. Use the following command
to add this table to Impala's catalog:
<codeblock>
CREATE EXTERNAL TABLE ice_hadoop_tbl
STORED AS ICEBERG
LOCATION '/path/to/table'
TBLPROPERTIES('iceberg.catalog'='hadoop.tables');
</codeblock>
</li>
<li>
<b>HadoopCatalog</b>. A table in HadoopCatalog means that there is a catalog location
in the file system under which Iceberg tables are stored. Use the following command
to add a table in a HadoopCatalog to Impala:
<codeblock>
CREATE EXTERNAL TABLE ice_hadoop_cat
STORED AS ICEBERG
TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
'iceberg.catalog_location'='/path/to/catalog',
'iceberg.table_identifier'='namespace.table');
</codeblock>
</li>
<li>
Alternatively, you can use custom catalogs to register existing tables. This means you need to define
your catalog in hive-site.xml.
The advantage of this method is that other engines are more likely to be able to interact with this table.
Please note that automatic metadata updates will not work for these tables; you will have to manually
call REFRESH on the table when it changes outside Impala.
To globally register different catalogs, set the following Hadoop configurations:
<table rowsep="1" colsep="1" id="iceberg_custom_catalogs">
<tgroup cols="2">
<colspec colname="c1" colnum="1"/>
<colspec colname="c2" colnum="2"/>
<thead>
<row>
<entry>Config Key</entry>
<entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry>iceberg.catalog.<catalog_name>.type</entry>
<entry>type of catalog: hive, hadoop, or left unset if using a custom catalog</entry>
</row>
<row>
<entry>iceberg.catalog.<catalog_name>.catalog-impl</entry>
<entry>catalog implementation; must not be null if type is empty</entry>
</row>
<row>
<entry>iceberg.catalog.<catalog_name>.<key></entry>
<entry>any config key and value pairs for the catalog</entry>
</row>
</tbody>
</tgroup>
</table>
<p>
For example, to register a HadoopCatalog called 'hadoop', set the following properties in hive-site.xml:
<codeblock>
iceberg.catalog.hadoop.type=hadoop;
iceberg.catalog.hadoop.warehouse=hdfs://example.com:8020/warehouse;
</codeblock>
</p>
<p>
Then in the CREATE TABLE statement you can just refer to the catalog name:
<codeblock>
CREATE EXTERNAL TABLE ice_catalogs STORED AS ICEBERG TBLPROPERTIES('iceberg.catalog'='<CATALOG-NAME>');
</codeblock>
</p>
</li>
<li>
If the table already exists in HiveCatalog then Impala should be able to see it without any additional
commands.
</li>
</ul>

<p>
You can also create new Iceberg tables with Impala. Use the same commands as above, just
omit the <codeph>EXTERNAL</codeph> keyword. To create an Iceberg table in HiveCatalog, the following
CREATE TABLE statement can be used:
<codeblock>
CREATE TABLE ice_t (i INT) STORED AS ICEBERG;
</codeblock>
</p>
<p>
By default, Impala assumes that an Iceberg table uses Parquet data files. ORC and AVRO are also supported,
but you need to tell Impala by setting the table property 'write.format.default', e.g. to 'ORC'.
</p>
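<p>
For example, a table whose data files should default to ORC could be declared as follows
(a minimal sketch; the table name is illustrative, and note that Impala can read but not
write ORC files, so such a table would be written by other engines):
<codeblock>
CREATE TABLE ice_orc_tbl (i INT)
STORED AS ICEBERG
TBLPROPERTIES('write.format.default'='ORC');
</codeblock>
</p>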
<p>
You can also use <codeph>CREATE TABLE AS SELECT</codeph> to create new Iceberg tables, e.g.:
<codeblock>
CREATE TABLE ice_ctas STORED AS ICEBERG AS SELECT i, b FROM value_tbl;

CREATE TABLE ice_ctas_part PARTITIONED BY(d) STORED AS ICEBERG AS SELECT s, ts, d FROM value_tbl;

CREATE TABLE ice_ctas_part_spec PARTITIONED BY SPEC (truncate(3, s)) STORED AS ICEBERG AS SELECT cast(t as INT), s, d FROM value_tbl;
</codeblock>
</p>
</conbody>
</concept>
<concept id="iceberg_scan_metrics">
<title>Iceberg Scan Metrics</title>
<conbody>
<p>
When Impala runs queries on Iceberg tables, sometimes it uses Iceberg's
'planFiles()' API during planning. As it is an expensive call, Impala avoids it
when possible, but it is necessary in the following cases:
<ul>
<li>if one or more predicates are pushed down to Iceberg</li>
<li>if there is time travel.</li>
</ul>
</p>
<p>
The call to 'planFiles()', on the other hand, also collects metrics, e.g. the
total Iceberg planning time, the number of data/delete files and manifests and how
many of these can be skipped.
</p>
<p>
These metrics are integrated into the query profile under the "Frontend" section.
As they are per-table, if multiple tables are scanned for the query, there will be
multiple sections in the profile.
</p>
<p>
Note that for Iceberg tables where Iceberg's 'planFiles()' API was not used in
planning, the metrics are not available and the profile will contain a short note
describing this.
</p>
<p>
To facilitate pairing the metrics with scans, the metrics header
references the plan node responsible for the scan. This will always be
the top level node for the scan, so it can be a SCAN node, a JOIN node
or a UNION node depending on whether the table has delete files.
</p>
</conbody>
</concept>
<concept id="iceberg_v2">
<title>Iceberg V2 tables</title>
<conbody>
<p>
Iceberg V2 tables support row-level modifications (DELETE, UPDATE) via "merge-on-read", which means that instead
of rewriting existing data files, separate so-called delete files are written that store information
about the deleted records. There are two kinds of delete files in Iceberg:
<ul>
<li>position deletes</li>
<li>equality deletes</li>
</ul>
Impala only supports position delete files. These files contain the file path and file position of the deleted
rows.
</p>
<p>
One can create Iceberg V2 tables via the <codeph>CREATE TABLE</codeph> statement; you just need to specify
the 'format-version' table property:
<codeblock>
CREATE TABLE ice_v2 (i int) STORED BY ICEBERG TBLPROPERTIES('format-version'='2');
</codeblock>
</p>
<p>
It is also possible to upgrade existing Iceberg V1 tables to Iceberg V2 tables. Use the following
<codeph>ALTER TABLE</codeph> statement to do so:
<codeblock>
ALTER TABLE ice_v1_to_v2 SET TBLPROPERTIES('format-version'='2');
</codeblock>
</p>
</conbody>
</concept>
<concept id="iceberg_drop">
<title>Dropping Iceberg tables</title>
<conbody>
<p>
One can use the <codeph>DROP TABLE</codeph> statement to remove an Iceberg table:
<codeblock>
DROP TABLE ice_t;
</codeblock>
</p>
<p>
When the <codeph>external.table.purge</codeph> table property is set to true, the
<codeph>DROP TABLE</codeph> statement also deletes the data files. This property
is set to true when Impala creates the Iceberg table via <codeph>CREATE TABLE</codeph>.
When <codeph>CREATE EXTERNAL TABLE</codeph> is used (the table already exists in some
catalog), <codeph>external.table.purge</codeph> is set to false, i.e.
<codeph>DROP TABLE</codeph> doesn't remove any files, only the table definition
in HMS.
</p>
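<p>
The property can be changed like any other table property. The following sketch
(table name illustrative) shows how a table that was registered with <codeph>CREATE EXTERNAL TABLE</codeph>
could be made to purge its data files on drop:
<codeblock>
ALTER TABLE ice_ext_tbl SET TBLPROPERTIES('external.table.purge'='true');
DROP TABLE ice_ext_tbl;   -- also removes the data files
</codeblock>
</p>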
</conbody>
</concept>
<concept id="iceberg_types">
<title>Supported Data Types for Iceberg Columns</title>
<conbody>

<p>
You can get information about the supported Iceberg data types in
<xref href="https://iceberg.apache.org/docs/latest/schemas/" scope="external" format="html">
the Iceberg spec</xref>.
</p>

<p>
The Iceberg data types can be mapped to the following SQL types in Impala:
<table rowsep="1" colsep="1" id="iceberg_types_sql_types">
<tgroup cols="2">
<colspec colname="c1" colnum="1"/>
<colspec colname="c2" colnum="2"/>
<thead>
<row>
<entry>Iceberg type</entry>
<entry>SQL type in Impala</entry>
</row>
</thead>
<tbody>
<row>
<entry>boolean</entry>
<entry>BOOLEAN</entry>
</row>
<row>
<entry>int</entry>
<entry>INTEGER</entry>
</row>
<row>
<entry>long</entry>
<entry>BIGINT</entry>
</row>
<row>
<entry>float</entry>
<entry>FLOAT</entry>
</row>
<row>
<entry>double</entry>
<entry>DOUBLE</entry>
</row>
<row>
<entry>decimal(P, S)</entry>
<entry>DECIMAL(P, S)</entry>
</row>
<row>
<entry>date</entry>
<entry>DATE</entry>
</row>
<row>
<entry>time</entry>
<entry>Not supported</entry>
</row>
<row>
<entry>timestamp</entry>
<entry>TIMESTAMP</entry>
</row>
<row>
<entry>timestamptz</entry>
<entry>Only read support via TIMESTAMP</entry>
</row>
<row>
<entry>string</entry>
<entry>STRING</entry>
</row>
<row>
<entry>uuid</entry>
<entry>Not supported</entry>
</row>
<row>
<entry>fixed(L)</entry>
<entry>Not supported</entry>
</row>
<row>
<entry>binary</entry>
<entry>Not supported</entry>
</row>
<row>
<entry>struct</entry>
<entry>STRUCT (read only)</entry>
</row>
<row>
<entry>list</entry>
<entry>ARRAY (read only)</entry>
</row>
<row>
<entry>map</entry>
<entry>MAP (read only)</entry>
</row>
</tbody>
</tgroup>
</table>
</p>
</conbody>
</concept>

<concept id="iceberg_schema_evolution">
<title>Schema evolution of Iceberg tables</title>
<conbody>
<p>
Iceberg assigns unique field ids to schema elements, which means it is possible
to reorder/delete/change columns and still be able to correctly read current and
old data files. Impala supports the following statements to modify a table's schema
(example statements follow the lists below):
<ul>
<li><codeph>ALTER TABLE ... RENAME TO ...</codeph> (renames the table if the Iceberg catalog supports it)</li>
<li><codeph>ALTER TABLE ... CHANGE COLUMN ...</codeph> (change the name and type of a column if and only if the new type is compatible with the old type)</li>
<li><codeph>ALTER TABLE ... ADD COLUMNS ...</codeph> (adds columns to the end of the table)</li>
<li><codeph>ALTER TABLE ... DROP COLUMN ...</codeph></li>
</ul>
</p>
<p>
Valid type promotions are:
<ul>
<li>int to long</li>
<li>float to double</li>
<li>decimal(P, S) to decimal(P', S) if P' > P – widen the precision of decimal types.</li>
</ul>
</p>
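<p>
For example, the following statements sketch these operations on a hypothetical table
(table and column names are illustrative; the column type change relies on the int-to-long promotion):
<codeblock>
ALTER TABLE ice_t ADD COLUMNS (note STRING);
ALTER TABLE ice_t CHANGE COLUMN i i BIGINT;
ALTER TABLE ice_t DROP COLUMN note;
ALTER TABLE ice_t RENAME TO ice_t_renamed;
</codeblock>
</p>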
<p>
Impala currently does not support schema evolution for tables with AVRO file format.
</p>
<p>
See
<xref href="https://iceberg.apache.org/docs/latest/evolution/#schema-evolution" scope="external" format="html">
schema evolution </xref> for more details.
</p>
</conbody>
</concept>
<concept id="iceberg_partitioning">
<title>Partitioning Iceberg tables</title>
<conbody>
<p>
<xref href="https://iceberg.apache.org/docs/latest/partitioning/" scope="external" format="html">
The Iceberg spec </xref> has information about partitioning Iceberg tables. With Iceberg,
you are not limited to value-based partitioning; you can also partition your tables via
several partition transforms.
</p>
<p>
Partition transforms are IDENTITY, BUCKET, TRUNCATE, YEAR, MONTH, DAY, HOUR, and VOID.
Impala supports all of these transforms. To create a partitioned Iceberg table, one
needs to add a <codeph>PARTITIONED BY SPEC</codeph> clause to the CREATE TABLE statement, e.g.:
<codeblock>
CREATE TABLE ice_p (i INT, d DATE, s STRING, t TIMESTAMP)
PARTITIONED BY SPEC (BUCKET(5, i), MONTH(d), TRUNCATE(3, s), HOUR(t))
STORED AS ICEBERG;
</codeblock>
</p>
<p>
Iceberg also supports
<xref href="https://iceberg.apache.org/docs/latest/evolution/#partition-evolution" scope="external" format="html">
partition evolution</xref>, which means that the partitioning of a table can be changed, even
without the need of rewriting existing data files. You can change an existing table's
partitioning via an <codeph>ALTER TABLE SET PARTITION SPEC</codeph> statement, e.g.:
<codeblock>
ALTER TABLE ice_p SET PARTITION SPEC (VOID(i), VOID(d), TRUNCATE(3, s), HOUR(t), i);
</codeblock>
</p>
<p>
Please keep in mind that for Iceberg V1 tables:
<ul>
<li>Do not reorder partition fields</li>
<li>Do not drop partition fields; instead replace the field’s transform with the void transform</li>
<li>Only add partition fields at the end of the previous partition spec</li>
</ul>
</p>
<p>
You can also use the legacy syntax to create identity-partitioned Iceberg tables:
<codeblock>
CREATE TABLE ice_p (i INT, b INT) PARTITIONED BY (p1 INT, p2 STRING) STORED AS ICEBERG;
</codeblock>
</p>
<p>
One can inspect a table's partition spec with the <codeph>SHOW PARTITIONS</codeph> or
<codeph>SHOW CREATE TABLE</codeph> statements.
</p>
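<p>
For example, using the table created above:
<codeblock>
SHOW PARTITIONS ice_p;
SHOW CREATE TABLE ice_p;
</codeblock>
</p>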
</conbody>
</concept>
<concept id="iceberg_inserts">
<title>Inserting data into Iceberg tables</title>
<conbody>
<p>
Impala is also able to insert new data into Iceberg tables. Currently the <codeph>INSERT INTO</codeph>
and <codeph>INSERT OVERWRITE</codeph> DML statements are supported. One can also remove the
contents of an Iceberg table via the <codeph>TRUNCATE</codeph> command.
</p>
<p>
Since Iceberg uses hidden partitioning, you don't need a partition clause in your INSERT
statements. For example, insertion into a partitioned table looks like:
<codeblock>
CREATE TABLE ice_p (i INT, b INT) PARTITIONED BY SPEC (bucket(17, i)) STORED AS ICEBERG;
INSERT INTO ice_p VALUES (1, 2);
</codeblock>
</p>
<p>
<codeph>INSERT OVERWRITE</codeph> statements can replace data in the table with the result of a query.
For partitioned tables Impala does a dynamic overwrite, which means that partitions that have rows produced
by the SELECT query are replaced, while partitions that have no rows produced by the SELECT query
remain untouched. INSERT OVERWRITE is not allowed for tables that use the BUCKET partition transform,
because dynamic overwrite behavior would be too random in this case. If one needs to replace all
contents of a table, they can still use <codeph>TRUNCATE</codeph> and <codeph>INSERT INTO</codeph>.
</p>
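<p>
The following sketch illustrates dynamic overwrite behavior on a day-partitioned table
(table names are illustrative): only the partition that the SELECT produces rows for is
replaced, all other partitions are left untouched.
<codeblock>
CREATE TABLE ice_ovw (i INT, d DATE) PARTITIONED BY SPEC (DAY(d)) STORED AS ICEBERG;
INSERT OVERWRITE ice_ovw SELECT i, d FROM staging_tbl WHERE d = DATE '2023-06-01';
</codeblock>
</p>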
<p>
Impala can only write Iceberg tables with Parquet data files.
</p>
</conbody>
</concept>
<concept id="iceberg_delete">
<title>Deleting data from Iceberg tables</title>
<conbody>
<p>
Since <keyword keyref="impala43"/> Impala is able to run <codeph>DELETE</codeph> statements against
Iceberg V2 tables. E.g.:
<codeblock>
DELETE FROM ice_t WHERE i = 3;
</codeblock>
</p>
<p>
More information about the <codeph>DELETE</codeph> statement can be found at <xref href="impala_delete.xml#delete"/>.
</p>
</conbody>
</concept>
<concept id="iceberg_drop_partition">
<title>Dropping partitions from Iceberg tables</title>
<conbody>
<p>
Since <keyword keyref="impala44"/> Impala is able to run <codeph>ALTER TABLE DROP PARTITION</codeph> statements. E.g.:
<codeblock>
ALTER TABLE ice_t DROP PARTITION (i = 3);
ALTER TABLE ice_t DROP PARTITION (day(date_col) < '2024-10-01');
ALTER TABLE ice_t DROP PARTITION (year(timestamp_col) = '2024');
</codeblock>
</p>
<p>
Any non-identity transforms must be included in the partition selector, like <codeph>(day(date_col))</codeph>. Operands for filtering date and
timestamp-based columns with transforms must be provided as strings, for example: <codeph>(day(date_col) = '2024-10-01')</codeph>.
This is a metadata-only operation: the data files targeted by the dropped partitions are not purged or removed from the file system;
only a new snapshot is created with the remaining partitions.
</p>
<p>
Limitations:
<ul>
<li>Binary filter predicates must consist of one partition selector and one constant expression;
e.g.: <codeph>(day(date_col) = '2024-10-01')</codeph> is allowed, but <codeph>(another_date_col = date_col)</codeph> is not allowed.</li>
<li>Filtering expressions must target the latest partition spec of the table.</li>
</ul>
</p>
<p>
More information about the <codeph>ALTER TABLE DROP PARTITION</codeph> statement can be found at
<xref href="impala_alter_table.xml"/>.
</p>
</conbody>
</concept>
<concept id="iceberg_update">
<title>Updating data in Iceberg tables</title>
<conbody>
<p>
Since <keyword keyref="impala44"/> Impala is able to run <codeph>UPDATE</codeph> statements against
Iceberg V2 tables. E.g.:
<codeblock>
UPDATE ice_t SET val = val + 1;
UPDATE ice_t SET k = 4 WHERE i = 5;
UPDATE ice_t SET ice_t.k = o.k, ice_t.j = o.j FROM ice_t, other_table o WHERE ice_t.id = o.id;
</codeblock>
</p>
<p>
The UPDATE FROM statement can be used to update a target Iceberg table based on a source table (or view) that doesn't need
to be an Iceberg table. If there are multiple matches on the JOIN condition, Impala will raise an error.
</p>
<p>
Limitations:
<ul>
<li>Only the merge-on-read update mode is supported.</li>
<li>Impala only writes position delete files, i.e. there is no support for writing equality deletes.</li>
<li>Cannot update tables with complex types.</li>
<li>
Can only write data and delete files in Parquet format. This means that if the table properties 'write.format.default'
and 'write.delete.format.default' are set, their values must be PARQUET.
</li>
<li>
Updating a partitioning column with a non-constant expression via the UPDATE FROM statement is not allowed.
This limitation can be avoided by using a <codeph>MERGE</codeph> statement.
</li>
</ul>
</p>
<p>
More information about the <codeph>UPDATE</codeph> statement can be found at <xref href="impala_update.xml#update"/>.
</p>
</conbody>
</concept>
<concept id="iceberg_merge">
<title>Merging data into Iceberg tables</title>
<conbody>
<p>
Impala can execute MERGE statements against Iceberg tables, e.g.:
<codeblock>
MERGE INTO ice_t USING source ON ice_t.a = source.id WHEN NOT MATCHED THEN INSERT VALUES(id, source.column1);
MERGE INTO ice_t USING source ON ice_t.a = source.id WHEN MATCHED THEN DELETE;
MERGE INTO ice_t USING source ON ice_t.a = source.id WHEN MATCHED THEN UPDATE SET b = source.b;
MERGE INTO ice_t USING source ON ice_t.a = source.id
WHEN MATCHED AND ice_t.a < 100 THEN UPDATE SET b = source.b
WHEN MATCHED THEN DELETE
WHEN NOT MATCHED THEN INSERT VALUES(id, source.column1);
</codeblock>
</p>
<p>
The limitations of the <codeph>UPDATE</codeph> statement also apply to the <codeph>MERGE</codeph> statement.
</p>
<p>
More information about the <codeph>MERGE</codeph> statement can be found at <xref href="impala_merge.xml"/>.
</p>
</conbody>
</concept>
<concept id="iceberg_load">
<title>Loading data into Iceberg tables</title>
<conbody>
<p>
The <codeph>LOAD DATA</codeph> statement can be used to load a single file or directory into
an existing Iceberg table. This operation is executed differently than for HMS tables: the
data is inserted into the table via sequentially executed statements, which has
some limitations (a usage sketch follows the list):
<ul>
<li>Only Parquet or ORC files can be loaded.</li>
<li>The <codeph>PARTITION</codeph> clause is not supported, but the partition transformations
are respected.</li>
<li>The loaded files will be re-written as Parquet files.</li>
</ul>
</p>
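<p>
For example, a directory of Parquet files could be loaded as follows (a minimal sketch;
the path and table name are illustrative):
<codeblock>
LOAD DATA INPATH '/tmp/staged_parquet_dir' INTO TABLE ice_tbl;
</codeblock>
</p>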
</conbody>
</concept>
<concept id="iceberg_optimize_table">
<title>Optimizing (Compacting) Iceberg tables</title>
<conbody>
<p>
Frequent updates and row-level modifications on Iceberg tables can write many small
data files and delete files, which have to be merged on read.
This causes read performance to degrade over time.
The following statement can be used to compact the table and optimize it for reading.
<codeblock>
OPTIMIZE TABLE [<varname>db_name</varname>.]<varname>table_name</varname> [FILE_SIZE_THRESHOLD_MB=<varname>value</varname>];
</codeblock>
</p>

<p>
The <codeph>OPTIMIZE TABLE</codeph> statement rewrites the table, executing the
following tasks:
<ul>
<li>Merges delete files with the corresponding data files.</li>
<li>Compacts data files that are smaller than the specified file size threshold in megabytes.</li>
</ul>
If no <codeph>FILE_SIZE_THRESHOLD_MB</codeph> is specified, the command compacts
all files and also
<ul>
<li>Converts data files to the latest table schema.</li>
<li>Rewrites all partitions according to the latest partition spec.</li>
</ul>
</p>
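<p>
For example, a recurring maintenance job might compact only small files, while a full rewrite
is reserved for occasional use (the threshold value below is illustrative):
<codeblock>
-- Compact only data files smaller than 128 MB and merge their delete files.
OPTIMIZE TABLE ice_t FILE_SIZE_THRESHOLD_MB=128;

-- Rewrite the whole table to the latest schema and partition spec.
OPTIMIZE TABLE ice_t;
</codeblock>
</p>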
<p>
To execute table optimization:
<ul>
<li>The user needs ALL privileges on the table.</li>
<li>The table can contain any file formats that Impala can read, but <codeph>write.format.default</codeph>
has to be <codeph>parquet</codeph>.</li>
<li>General write limitations apply, e.g. the table cannot contain complex types.</li>
</ul>
</p>

<p>
When a table is optimized, a new snapshot is created. The old table state is still
accessible by time travel to previous snapshots, because the rewritten data and
delete files are not removed physically.
Issue the <codeph>ALTER TABLE ... EXECUTE expire_snapshots(...)</codeph> command
to remove the old files from the file system.
</p>
<p>
Note that <codeph>OPTIMIZE TABLE</codeph> without a specified <codeph>FILE_SIZE_THRESHOLD_MB</codeph>
rewrites the entire table; therefore, the operation can take a long time to complete
depending on the size of the table.
It is recommended to specify a file size threshold for recurring table maintenance
jobs to save resources.
</p>
</conbody>
</concept>
<concept id="iceberg_time_travel">
<title>Time travel for Iceberg tables</title>
<conbody>

<p>
Iceberg stores the table states in a chain of snapshots. By default, Impala uses the current
snapshot of the table. But for Iceberg tables, it is also possible to query an earlier state of
the table.
</p>

<p>
We can use the clauses <codeph>FOR SYSTEM_TIME AS OF</codeph> with a timestamp and
<codeph>FOR SYSTEM_VERSION AS OF</codeph> with a snapshot id in <codeph>SELECT</codeph> queries, e.g.:
<codeblock>
SELECT * FROM ice_t FOR SYSTEM_TIME AS OF '2022-01-04 10:00:00';
SELECT * FROM ice_t FOR SYSTEM_TIME AS OF now() - interval 5 days;
SELECT * FROM ice_t FOR SYSTEM_VERSION AS OF 123456;
</codeblock>
</p>

<p>
If one needs to check the available snapshots of a table, they can use the <codeph>DESCRIBE HISTORY</codeph>
statement with the following syntax:
<codeblock>
DESCRIBE HISTORY [<varname>db_name</varname>.]<varname>table_name</varname>
[FROM <varname>timestamp</varname>];

DESCRIBE HISTORY [<varname>db_name</varname>.]<varname>table_name</varname>
[BETWEEN <varname>timestamp</varname> AND <varname>timestamp</varname>]
</codeblock>
For example:
<codeblock>
DESCRIBE HISTORY ice_t FROM '2022-01-04 10:00:00';
DESCRIBE HISTORY ice_t FROM now() - interval 5 days;
DESCRIBE HISTORY ice_t BETWEEN '2022-01-04 10:00:00' AND '2022-01-05 10:00:00';
</codeblock>
</p>
<p>
The output of the <codeph>DESCRIBE HISTORY</codeph> statement consists
of the following columns:
<ul>
<li><codeph>creation_time</codeph>: the snapshot's creation timestamp.</li>
<li><codeph>snapshot_id</codeph>: the snapshot's ID or null.</li>
<li><codeph>parent_id</codeph>: the snapshot's parent ID or null.</li>
<li><codeph>is_current_ancestor</codeph>: TRUE if the snapshot is a current ancestor of the table.</li>
</ul>
</p>

<p rev="4.3.0 IMPALA-10893">
Please note that time travel queries are executed using the old schema of the table
from the point specified by the time travel parameters.
Prior to Impala 4.3.0, the current table schema was used to query an older
snapshot of the table, which might have had a different schema in the past.
</p>

</conbody>
</concept>
<concept id="iceberg_execute_rollback">
<title>Rolling Iceberg tables back to a previous state</title>
<conbody>
<p>
Iceberg table modifications cause new table snapshots to be created; the earlier
snapshots represent previous versions of the table.
The <codeph>ALTER TABLE [<varname>db_name</varname>.]<varname>table_name</varname> EXECUTE ROLLBACK</codeph>
statement can be used to roll back the table to a previous snapshot.
</p>

<p>
For example, to roll the table back to the snapshot id <codeph>123456</codeph> use:
<codeblock>
ALTER TABLE ice_tbl EXECUTE ROLLBACK(123456);
</codeblock>
To roll the table back to the most recent (newest) snapshot
that has a creation timestamp older than '2022-01-04 10:00:00' use:
<codeblock>
ALTER TABLE ice_tbl EXECUTE ROLLBACK('2022-01-04 10:00:00');
</codeblock>
The timestamp is evaluated using the timezone of the current session.
</p>

<p>
It is only possible to roll back to a snapshot that is a current ancestor of the table.
</p>
<p>
When a table is rolled back to a snapshot, a new snapshot is
created with the same snapshot id, but with a new creation timestamp.
</p>
</conbody>
</concept>
<concept id="iceberg_expire_snapshots">
<title>Expiring snapshots</title>
<conbody>
<p>
Iceberg snapshots accumulate until they are deleted by a user action. Snapshots
can be deleted with the <codeph>ALTER TABLE ... EXECUTE expire_snapshots(...)</codeph>
statement, which will expire snapshots that are older than the specified
timestamp. For example:
<codeblock>
ALTER TABLE ice_tbl EXECUTE expire_snapshots('2022-01-04 10:00:00');
ALTER TABLE ice_tbl EXECUTE expire_snapshots(now() - interval 5 days);
</codeblock>
</p>
<p>
Expiring snapshots:
<ul>
<li>removes data files that are no longer referenced by non-expired snapshots.</li>
<li>does not remove orphaned data files.</li>
<li>does not remove old metadata files by default.</li>
<li>respects the minimum number of snapshots to keep, set with the
<codeph>history.expire.min-snapshots-to-keep</codeph> table property.</li>
</ul>
</p>
<p>
Old metadata file clean-up can be configured with the
<codeph>write.metadata.delete-after-commit.enabled=true</codeph> and
<codeph>write.metadata.previous-versions-max</codeph> table properties. This
allows automatic metadata file removal after operations that modify metadata,
such as expiring snapshots or inserting data.
</p>
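<p>
These table properties can be set like any other table property; the values below are
illustrative:
<codeblock>
ALTER TABLE ice_tbl SET TBLPROPERTIES(
'history.expire.min-snapshots-to-keep'='5',
'write.metadata.delete-after-commit.enabled'='true',
'write.metadata.previous-versions-max'='50');
</codeblock>
</p>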
</conbody>
</concept>
<concept id="iceberg_remove_orphan_files">
<title>Removing orphan files</title>
<conbody>
<p>
Failures can leave files that are not referenced by table metadata. These are
called orphan files. In some cases normal snapshot expiration may not be able
to determine that a file is no longer needed and delete it. Impala can remove these
orphan files with the
<codeph>ALTER TABLE ... EXECUTE remove_orphan_files(...)</codeph>
statement, which will remove all orphan files that have a modification time older
than the specified timestamp. For example:
<codeblock>
-- Remove orphan files older than '2022-01-04 10:00:00'.
ALTER TABLE ice_tbl EXECUTE remove_orphan_files('2022-01-04 10:00:00');

-- Remove orphan files older than 5 days from now.
ALTER TABLE ice_tbl EXECUTE remove_orphan_files(now() - interval 5 days);
</codeblock>
</p>
<p>
Note that this is a destructive query that will wipe out any files within the Iceberg
table's 'data' and 'metadata' directories that are not addressable by any valid
snapshot. It is dangerous to remove orphan files with a retention interval
shorter than the time expected for any write to complete, because it might corrupt
the table if in-progress files are considered orphaned and are deleted. It is
recommended to set the timestamp to a day ago or older for this query.
</p>
</conbody>
</concept>
<concept id="iceberg_repair_metadata">
<title>Repair table metadata</title>
<conbody>
<p>
Users should always use the engine/Iceberg API to interact with Iceberg tables;
e.g. to remove a partition, use Impala and issue the DROP PARTITION statement
instead of deleting the partition directory.
Deleting files directly from storage without going through the Iceberg API
corrupts the table, and makes queries that try to read the missing files fail
with the following error message:
<codeph>Iceberg table [...] cannot be fully loaded due to unavailable
files</codeph>.
</p>
<p>
This happens because the metadata files are still referencing the missing data
files. This erroneous state can be fixed by restoring the deleted files on the
file system.
If this is not intended or not possible, the dangling references can be removed
from the Iceberg metadata with the
<codeph>ALTER TABLE ... EXECUTE repair_metadata()</codeph>
statement, so that the table becomes functional again.
<codeblock>
-- Use the statement simply without parameters:
ALTER TABLE ice_tbl EXECUTE repair_metadata();
</codeblock>
</p>
<note>
This operation does not restore the deleted content. Execute only if
there is no intention to restore the missing data.
<p>
Impala can repair the table only if the missing files are data files,
but it cannot repair the table if there are missing delete files.
</p>
</note>
</conbody>
</concept>
<concept id="iceberg_metadata_tables">
<title>Iceberg metadata tables</title>
<conbody>
<p>
Iceberg stores extensive metadata for each table (e.g. snapshots, manifests, data
and delete files etc.), which is accessible in Impala in the form of virtual
tables called metadata tables.
</p>
<p>
Metadata tables can be queried just like regular tables, including filtering,
aggregation and joining with other metadata and regular tables. On the other hand,
they are read-only, so it is not possible to change, add or remove records from
them, they cannot be dropped, and new metadata tables cannot be created. Metadata
changes made in other ways (not through metadata tables) are reflected in the
tables.
</p>
<p>
To list the metadata tables available for an Iceberg table, use the <codeph>SHOW
METADATA TABLES</codeph> command:

<codeblock>
SHOW METADATA TABLES IN [db.]tbl [[LIKE] "pattern"]
</codeblock>

It is possible to filter the result using <codeph>pattern</codeph>. All Iceberg
tables have the same metadata tables, so this command is mostly for convenience.
Using <codeph>SHOW METADATA TABLES</codeph> on a non-Iceberg table results in an
error.
</p>
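<p>
For example (the database and table names match the examples below; the pattern is
illustrative):
<codeblock>
SHOW METADATA TABLES IN functional_parquet.iceberg_alltypes_part;
SHOW METADATA TABLES IN functional_parquet.iceberg_alltypes_part LIKE '*file*';
</codeblock>
</p>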
<p>
Just like regular tables, metadata tables have schemas that can be queried with
the <codeph>DESCRIBE</codeph> command. Note, however, that <codeph>DESCRIBE
FORMATTED|EXTENDED</codeph> are not available for metadata tables.
</p>
<p>
Example:
<codeblock>
DESCRIBE functional_parquet.iceberg_alltypes_part.history;
</codeblock>
</p>
<p>
To retrieve information from metadata tables, use the usual
<codeph>SELECT</codeph> statement. You can select any subset of the columns or all
of them using '*'. Note that in contrast to regular tables, <codeph>SELECT
*</codeph> on metadata tables always includes complex-typed columns in the result.
Therefore, the query option <codeph>EXPAND_COMPLEX_TYPES</codeph> only applies to
regular tables. This holds also in queries that mix metadata tables and regular
tables: for <codeph>SELECT *</codeph> expressions from metadata tables, complex
types will always be included, and for <codeph>SELECT *</codeph> expressions from
regular tables, complex types will be included if and only if
<codeph>EXPAND_COMPLEX_TYPES</codeph> is true.
</p>
<p>
Note that unnesting collections from metadata tables is not supported.
</p>
<p>
Example:
<codeblock>
SELECT
s.operation,
h.is_current_ancestor,
s.summary
FROM functional_parquet.iceberg_alltypes_part.history h
JOIN functional_parquet.iceberg_alltypes_part.snapshots s
ON h.snapshot_id = s.snapshot_id
WHERE s.operation = 'append'
ORDER BY made_current_at;
</codeblock>
</p>
</conbody>
</concept>
<concept id="iceberg_puffin_stats">
<title>Iceberg Puffin statistics</title>
<conbody>
<p>
Impala supports reading NDV (Number of Distinct Values) statistics from Puffin files.
For the Puffin specification, see <xref keyref="upstream_iceberg_puffin_site"/>.
</p>
<p>
If there are Puffin stats for multiple snapshots, Impala chooses the most recent
one for each column. Note that this means that the stats for different columns may
come from different snapshots.
</p>
<p>
In case there are both HMS and Puffin NDV stats for a column, the more recent one
will be used. For HMS stats, Impala uses the 'impala.computeStatsSnapshotId' table
property, which stores, for each column, the snapshot for which HMS stats were
calculated. This is compared with the snapshot of the Puffin stats to decide which
is more recent.
</p>
<p>
Reading Puffin stats is disabled by default; set the "--enable_reading_puffin_stats"
startup flag to true to enable it.
</p>
<p>
Some engines, e.g. Trino, also write the NDV as a property (with key "ndv") in the
"statistics" section of the metadata.json file for each blob, in addition to the
Puffin file. If such a property is present for a blob, Impala will read the value
from the metadata.json file instead of the Puffin file to reduce file I/O.
</p>
<p>
Note that it is currently not possible to drop Puffin stats from Impala.
For this reason, it is possible to disable reading Puffin stats in two ways
(see the example after this list):
<ul>
<li>Globally, with the aforementioned
<codeph>enable_reading_puffin_stats</codeph> startup flag - when it is set
to false, Impala will never read Puffin stats.</li>
<li>For specific tables, by setting the
<codeph>impala.iceberg_read_puffin_stats</codeph> table property to
"false".</li>
</ul>
</p>
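<p>
For example, to disable reading Puffin stats for a single table (table name illustrative):
<codeblock>
ALTER TABLE ice_tbl SET TBLPROPERTIES('impala.iceberg_read_puffin_stats'='false');
</codeblock>
</p>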
<p>
Note that Impala does not yet support writing Puffin statistics files.
</p>
</conbody>
</concept>
<concept id="iceberg_table_cloning">
<title>Cloning Iceberg tables (LIKE clause)</title>
<conbody>
<p>
Use <codeph>CREATE TABLE ... LIKE ...</codeph> to create an empty Iceberg table
based on the definition of another Iceberg table, including any column attributes in
the original table:
<codeblock>
CREATE TABLE new_ice_tbl LIKE orig_ice_tbl;
</codeblock>
</p>
<p>
Because the data types of Iceberg and Impala do not correspond one-to-one, Impala
can only clone between Iceberg tables.
</p>
</conbody>
</concept>
<concept id="iceberg_table_properties">
<title>Iceberg table properties</title>
<conbody>
<p>
We can set the following table properties for Iceberg tables (an example follows the list):
<ul>
<li>
<codeph>iceberg.catalog</codeph>: controls which catalog is used for this Iceberg table.
It can be 'hive.catalog' (default), 'hadoop.catalog', 'hadoop.tables', or a name that
identifies a catalog defined in the Hadoop configurations, e.g. hive-site.xml
</li>
<li><codeph>iceberg.catalog_location</codeph>: Iceberg table catalog location when <codeph>iceberg.catalog</codeph> is <codeph>'hadoop.catalog'</codeph></li>
<li><codeph>iceberg.table_identifier</codeph>: Iceberg table identifier. Impala uses <database>.<table> instead if this property is not set</li>
<li><codeph>write.format.default</codeph>: data file format of the table. Impala can read AVRO, ORC and PARQUET data files in Iceberg tables, and can write PARQUET data files only.</li>
<li><codeph>write.parquet.compression-codec</codeph>:
Parquet compression codec. Supported values are: NONE, GZIP, SNAPPY
(default value), LZ4, ZSTD. The table property will be ignored if the
<codeph>COMPRESSION_CODEC</codeph> query option is set.
</li>
<li><codeph>write.parquet.compression-level</codeph>:
Parquet compression level. Used with ZSTD compression only.
Supported range is [1, 22]. Default value is 3. The table property
will be ignored if the <codeph>COMPRESSION_CODEC</codeph> query option is set.
</li>
<li><codeph>write.parquet.row-group-size-bytes</codeph>:
Parquet row group size in bytes. Supported range is [8388608,
2146435072] (8MB - 2047MB). The table property will be ignored if the
<codeph>PARQUET_FILE_SIZE</codeph> query option is set.
If neither the table property nor the <codeph>PARQUET_FILE_SIZE</codeph> query option
is set, the way Impala calculates row group size will remain
unchanged.
</li>
<li><codeph>write.parquet.page-size-bytes</codeph>:
Parquet page size in bytes. Used for PLAIN encoding. Supported range
is [65536, 1073741824] (64KB - 1GB).
If the table property is unset, the way Impala calculates page size
will remain unchanged.
</li>
<li><codeph>write.parquet.dict-size-bytes</codeph>:
Parquet dictionary page size in bytes. Used for dictionary encoding.
Supported range is [65536, 1073741824] (64KB - 1GB).
If the table property is unset, the way Impala calculates dictionary
page size will remain unchanged.
</li>
</ul>
</p>
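<p>
For example, a table could be created with a few of these properties set (the values are
illustrative and fall within the documented ranges):
<codeblock>
CREATE TABLE ice_tuned (i INT, s STRING)
STORED AS ICEBERG
TBLPROPERTIES('write.parquet.compression-codec'='ZSTD',
'write.parquet.compression-level'='7',
'write.parquet.row-group-size-bytes'='134217728');
</codeblock>
</p>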
</conbody>
</concept>
<concept id="iceberg_manifest_caching">
<title>Iceberg manifest caching</title>
<conbody>
<p>
Starting from version 1.1.0, Apache Iceberg provides a mechanism to cache the
contents of Iceberg manifest files in memory. This manifest caching feature helps
to reduce repeated reads of small Iceberg manifest files from remote storage by
Coordinators and Catalogd. This feature can be enabled for Impala Coordinators and
Catalogd by setting properties in Hadoop's core-site.xml as in the following:
<codeblock>
iceberg.io-impl=org.apache.iceberg.hadoop.HadoopFileIO;
iceberg.io.manifest.cache-enabled=true;
iceberg.io.manifest.cache.max-total-bytes=104857600;
iceberg.io.manifest.cache.expiration-interval-ms=3600000;
iceberg.io.manifest.cache.max-content-length=8388608;
</codeblock>
</p>
<p>
The description of each property is as follows:
<ul>
<li>
<codeph>iceberg.io-impl</codeph>: custom FileIO implementation to use in a
catalog. Must be set to enable manifest caching. Impala defaults to
HadoopFileIO. It is recommended not to change this to anything other than HadoopFileIO.
</li>
<li>
<codeph>iceberg.io.manifest.cache-enabled</codeph>: enable/disable the
manifest caching feature.
</li>
<li>
<codeph>iceberg.io.manifest.cache.max-total-bytes</codeph>: maximum total
amount of bytes to cache in the manifest cache. Must be a positive value.
</li>
<li>
<codeph>iceberg.io.manifest.cache.expiration-interval-ms</codeph>: maximum
duration for which an entry stays in the manifest cache. Must be a
non-negative value. Setting zero means cache entries expire only when they get
evicted due to memory pressure from
<codeph>iceberg.io.manifest.cache.max-total-bytes</codeph>.
</li>
<li>
<codeph>iceberg.io.manifest.cache.max-content-length</codeph>: maximum length
of a manifest file to be considered for caching in bytes. Manifest files with
a length exceeding this property value will not be cached. Must be set with a
positive value and lower than
<codeph>iceberg.io.manifest.cache.max-total-bytes</codeph>.
</li>
</ul>
</p>
<p>
Manifest caching only works for tables that are loaded with either HadoopCatalog
or HiveCatalog. Individual HadoopCatalog and HiveCatalog instances will have
separate manifest caches with the same configuration. By default, only 8 catalogs
can have their manifest cache active in memory. This number can be raised by
setting a higher value in the Java system property
<codeph>iceberg.io.manifest.cache.fileio-max</codeph>.
</p>
</conbody>
</concept>
</concept>