This patch implements the migration from legacy Hive tables to Iceberg
tables. The target Iceberg tables inherit the location of the original
Hive tables. The source Hive table must be non-transactional.
To migrate a Hive-format table stored in a distributed file system or
object store to an Iceberg table, use the command:
ALTER TABLE [dbname.]table_name CONVERT TO ICEBERG [TBLPROPERTIES(...)];
Currently, only 'iceberg.catalog' is allowed as a table property.
For example:
- ALTER TABLE hive_table CONVERT TO ICEBERG;
- ALTER TABLE hive_table CONVERT TO ICEBERG TBLPROPERTIES(
'iceberg.catalog' = 'hadoop.tables');
The HDFS table to be converted must meet the following requirements:
- the table is not transactional
- the InputFormat is PARQUET, ORC, or AVRO
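As a minimal illustration, a table like the following would be eligible
for conversion (database, table, and column names are hypothetical):

  CREATE EXTERNAL TABLE my_db.hive_parquet_tbl (id INT, name STRING)
    STORED AS PARQUET;
  -- Non-transactional and Parquet-backed, so it can be converted:
  ALTER TABLE my_db.hive_parquet_tbl CONVERT TO ICEBERG;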
This is an in-place migration so the original data files of the legacy
Hive table are re-used and not moved, copied or re-created by this
operation. The new Iceberg table will have the 'external.table.purge'
property set to true after the migration.
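For example, the property can be checked after the conversion
(hypothetical table name):

  -- 'Table Parameters' in the output should now contain
  -- 'external.table.purge'='true':
  DESCRIBE FORMATTED hive_table;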
The NUM_THREADS_FOR_TABLE_MIGRATION query option controls the maximum
number of threads used for the table conversion. The default value is
one, meaning that the conversion runs on a single thread. The option
can be set in the range [0, 1024]: zero means the number of CPU cores
is used as the degree of parallelism, while a value greater than zero
sets the number of threads, capped at the number of CPU cores.
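For example, to run a conversion with at most four threads
(hypothetical table name):

  SET NUM_THREADS_FOR_TABLE_MIGRATION=4;
  ALTER TABLE hive_table CONVERT TO ICEBERG;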
Process of migration:
- Step 1: Set table properties,
  e.g. 'external.table.purge'='false', on the HDFS table.
- Step 2: Rename the HDFS table to a temporary table name using a name
format of "<original_table_name>_tmp_<random_ID>".
- Step 3: Refresh the renamed HDFS table.
- Step 4: Create an external Iceberg table via the Iceberg API using
  the data of the HDFS table.
- Step 5 (Optional): For an Iceberg table in Hadoop Tables, run a
CREATE TABLE query to add the Iceberg table to HMS as well.
- Step 6 (Optional): For an Iceberg table in Hive catalog, run an
  INVALIDATE METADATA to make the new table available on all
  coordinators right after the conversion finishes.
- Step 7 (Optional): For an Iceberg table in Hadoop Tables, set the
'external.table.purge' property to true in an ALTER TABLE
query.
- Step 8: Drop the temporary HDFS table.
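As an illustrative sketch only, the user-visible effect of steps 1-3
roughly corresponds to the following statements; the conversion
performs these internally and generates the random ID itself:

  ALTER TABLE hive_table
    SET TBLPROPERTIES('external.table.purge'='false');
  ALTER TABLE hive_table RENAME TO hive_table_tmp_<random_ID>;
  REFRESH hive_table_tmp_<random_ID>;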
Testing:
- Add e2e tests
- Add FE UTs
- Manually tested the runtime for an unpartitioned table with 10k data
  files; the conversion takes around 10-13s.
Co-authored-by: lipenglin <lipenglin@apache.org>
Change-Id: Iacdad996d680fe545cc9a45e6bc64a348a64cd80
Reviewed-on: http://gerrit.cloudera.org:8080/20077
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tamas Mate <tmater@apache.org>