Schema resolution doesn't work correctly for migrated partitioned Iceberg
tables that have complex types.

When we face a Parquet/ORC file in an Iceberg table that doesn't have field
IDs in the file metadata, we assume that it is an old data file from before
the migration and that its schema is the very first one, hence we can mimic
Iceberg's field ID generation to assign field IDs to the file schema
elements.

This process didn't take the partition columns into account. Partition
columns are not part of the data file, but they still get field IDs. This
only matters when there are complex types in the table, as partition columns
are always the last columns in legacy Hive tables, and field IDs are assigned
via a "BFS-like" traversal. I.e. if there are only primitive types in the
table we don't have any problems, but the field IDs of the children of
complex-type columns are assigned incorrectly.

This patch fixes field ID generation by taking the number of partitions into
account. If none of the partition columns are included in the data file
(common case) we adjust the file-level field IDs accordingly. It is also OK
to have all the partition columns in the data files (it is not common, but
we've seen such data files). We raise an error in other cases (some partition
columns are in the data file, while others aren't).

Testing:
 * e2e tests added
 * added negative tests

Change-Id: Ie32952021b63d6b55b8820489e434bfc2a91580b
Reviewed-on: http://gerrit.cloudera.org:8080/21761
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
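
The ID shift described above can be illustrated with a small Python sketch. It is not Impala's implementation: the type model and the names assign_ids and file_field_ids are hypothetical, and the numbering order (all fields of one level first, then the children of each field) is an assumption based on how Iceberg assigns fresh IDs to a schema. The example uses the table from the test data below, with result_date as the partition column.

```python
from itertools import count

# Hypothetical type model: a primitive is a string such as "int"; complex types
# are ("struct", [(name, type), ...]), ("list", element_type) or
# ("map", key_type, value_type).

def assign_ids(fields, counter, prefix="", out=None):
    """Number every field of one level first, then descend into each child."""
    out = {} if out is None else out
    for name, _ in fields:
        out[prefix + name] = next(counter)
    for name, typ in fields:
        if isinstance(typ, str):
            continue  # primitives have no children
        if typ[0] == "struct":
            children = typ[1]
        elif typ[0] == "list":
            children = [("element", typ[1])]
        else:  # "map"
            children = [("key", typ[1]), ("value", typ[2])]
        assign_ids(children, counter, prefix + name + ".", out)
    return out


def file_field_ids(file_fields, partition_cols):
    """Field IDs for a data file that carries no IDs in its metadata."""
    present = [name for name, _ in file_fields if name in partition_cols]
    if present and len(present) != len(partition_cols):
        raise ValueError("only some partition columns are in the data file")
    if present:
        # All partition columns are stored in the file: number it as-is.
        return assign_ids(file_fields, count(1))
    # Common case: append the partition columns so they consume their IDs,
    # then drop them again because the file does not contain them.
    padded = list(file_fields) + [(name, "string") for name in partition_cols]
    ids = assign_ids(padded, count(1))
    return {path: i for path, i in ids.items() if path not in partition_cols}


steps = ("list", ("struct", [("step_number", "int"),
                             ("step_description", "string")]))
data_cols = [("id", "int"), ("name", "string"), ("teststeps", steps)]

naive = assign_ids(data_cols, count(1))             # ignores the partition column
fixed = file_field_ids(data_cols, ["result_date"])  # reserves an ID for it
print(naive["teststeps.element"], fixed["teststeps.element"])  # 4 5
```

In this sketch the top-level columns get the same IDs either way (1 to 3). Only the children of teststeps move, because result_date takes ID 4 before the traversal reaches them.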
iceberg_migrated_complextypes_and_partition_columns_in_data_files.parquet:
iceberg_migrated_complextypes_and_partition_columns_in_data_files.orc:
The data files in result_date=2024-08-26 were originally part of a non-partitioned
legacy table, i.e. they include the column result_date and do not have Iceberg
field IDs.
They were written via Hive using the following commands:
CREATE TABLE array_struct_table_test_negative (id INT, name STRING, teststeps
ARRAY<STRUCT<step_number:INT,step_description:STRING>>, result_date STRING)
STORED AS PARQUET;
INSERT INTO array_struct_table_test_negative VALUES
(1, 'Test 1', `ARRAY`(NAMED_STRUCT('step_number', 1, 'step_description', 'Step 1 description'), NAMED_STRUCT('step_number', 2, 'step_description', 'Step 2 description')), '2024-08-26'),
(2, 'Test 2', `ARRAY`(NAMED_STRUCT('step_number', 1, 'step_description', 'Step 1 description'), NAMED_STRUCT('step_number', 2, 'step_description', 'Step 2 description'), NAMED_STRUCT('step_number', 3, 'step_description', 'Step 3 description')), '2024-08-26');
The ORC file was written the same way, with STORED AS ORC in place of
STORED AS PARQUET.