bad_parquet_data.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
"parquet"
"is"
"fun"

bad_compressed_dict_page_size.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single string column 'col' with one row ("a"). The compressed_page_size field
in the dict page header is modified to 0 to test that it is correctly handled.

bad_rle_literal_count.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single bigint column 'c' with the values 1, 3, 7 stored
in a single data chunk as dictionary plain. The RLE encoded dictionary
indexes are all literals (and not repeated), but the literal count
is incorrectly 0 in the file to test that such data corruption is
properly handled.

bad_rle_repeat_count.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single bigint column 'c' with the value 7 repeated 7 times
stored in a single data chunk as dictionary plain. The RLE encoded dictionary
indexes are a single repeated run (and not literals), but the repeat count
is incorrectly 0 in the file to test that such data corruption is properly
handled.

zero_rows_zero_row_groups.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates zero rows and no row groups.

zero_rows_one_row_group.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates zero rows but one row group.

huge_num_rows.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates 2 * MAX_INT32 rows.
The single row group also has the same number of rows in the metadata.

repeated_values.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
"parquet"
"parquet"
"parquet"

multiple_rowgroups.parquet:
Generated with parquet-mr 1.2.5
Populated with:
hive> set parquet.block.size=500;
hive> INSERT INTO TABLE tbl
      SELECT l_comment FROM tpch.lineitem LIMIT 1000;

alltypesagg_hive_13_1.parquet:
Generated with parquet-mr version 1.5.0-cdh5.4.0-SNAPSHOT
hive> create table alltypesagg_hive_13_1 stored as parquet as select * from alltypesagg;

bad_column_metadata.parquet:
Generated with a hacked version of parquet-mr 1.8.2-SNAPSHOT
Schema:
{"type": "record",
 "namespace": "org.apache.impala",
 "name": "bad_column_metadata",
 "fields": [
   {"name": "id", "type": ["null", "long"]},
   {"name": "int_array", "type": ["null", {"type": "array", "items": ["null", "int"]}]}
 ]
}
Contains 3 row groups, each with ten rows and each array containing ten elements. The
first rowgroup column metadata for 'int_array' incorrectly states there are 50 values
(instead of 100), and the second rowgroup column metadata for 'id' incorrectly states
there are 11 values (instead of 10). The third rowgroup has the correct metadata.

data-bzip2.bz2:
Generated with bzip2, contains a single bzip2 stream.
Contains 1 column, uncompressed data size < 8M

large_bzip2.bz2:
Generated with bzip2, contains a single bzip2 stream.
Contains 1 column, uncompressed data size > 8M

data-pbzip2.bz2:
Generated with pbzip2, contains multiple bzip2 streams.
Contains 1 column, uncompressed data size < 8M

large_pbzip2.bz2:
Generated with pbzip2, contains multiple bzip2 streams.
Contains 1 column, uncompressed data size > 8M

out_of_range_timestamp.parquet:
Generated with a hacked version of Impala's Parquet writer.
Contains a single timestamp column with 4 values, 2 of which are out of range
and should be read as NULL by Impala:
1399-12-31 00:00:00 (invalid - date too small)
1400-01-01 00:00:00
9999-12-31 00:00:00
10000-01-01 00:00:00 (invalid - date too large)

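A quick way to sanity-check the file is to count the NULLs that come back. The query
below is only a sketch: it assumes the file has been loaded into a single-column table
named out_of_range_timestamp with a TIMESTAMP column 'ts' (both names are hypothetical).
  -- The two out-of-range values should surface as NULLs when scanned by Impala.
  SELECT count(*) AS null_timestamps
  FROM out_of_range_timestamp
  WHERE ts IS NULL;
  -- expected result: 2
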
table_with_header.csv:
Created with a text editor, contains a header line before the data rows.

table_with_header_2.csv:
Created with a text editor, contains two header lines before the data rows.

table_with_header.gz, table_with_header_2.gz:
Generated by gzip'ing table_with_header.csv and table_with_header_2.csv.

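These files are typically queried through a text table that tells the scanner how many
header lines to skip. The DDL below is a minimal sketch; the column list, delimiter and
location are assumptions, only the 'skip.header.line.count' property is the point.
  create external table table_with_header (i int, s string)  -- assumed columns
  row format delimited fields terminated by ','              -- assumed delimiter
  stored as textfile
  location '/test-warehouse/table_with_header'               -- assumed path
  tblproperties ('skip.header.line.count'='1');
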
deprecated_statistics.parquet:
Generated with the Hive shell, which uses parquet-mr version 1.5.0-cdh5.12.0-SNAPSHOT
Contains a copy of the data in functional.alltypessmall with statistics that use the old
'min'/'max' fields.

repeated_root_schema.parquet:
Generated by hacking Impala's Parquet writer.
Created to reproduce IMPALA-4826. Contains a table of 300 rows where the
repetition level of the root schema is set to REPEATED.
Reproduction steps:
1: Extend HdfsParquetTableWriter::CreateSchema with the following line:
   file_metadata_.schema[0].__set_repetition_type(FieldRepetitionType::REPEATED);
2: Run test_compute_stats and grab the created Parquet file for the
   alltypes_parquet table.

binary_decimal_dictionary.parquet,
binary_decimal_no_dictionary.parquet:
Generated using parquet-mr and contents verified using parquet-tools-1.9.1.
Contains decimals stored as variable sized BYTE_ARRAY with dictionary
and non-dictionary encoding respectively.

alltypes_agg_bitpacked_def_levels.parquet:
Generated by hacking Impala's Parquet writer to write out bitpacked def levels instead
of the standard RLE-encoded levels. See
https://github.com/timarmstrong/incubator-impala/tree/hack-bit-packed-levels. This
is a single file containing all of the alltypesagg data, which includes a mix of
null and non-null values. This is not actually a valid Parquet file because the
bit-packed levels are written in the reverse order specified in the Parquet spec
for BIT_PACKED. However, this is the order that Impala attempts to read the levels
in - see IMPALA-3006.

signed_integer_logical_types.parquet:
Generated using a utility that uses the java Parquet API.
The file has the following schema:
schema {
  optional int32 id;
  optional int32 tinyint_col (INT_8);
  optional int32 smallint_col (INT_16);
  optional int32 int_col;
  optional int64 bigint_col;
}

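A table over this file would normally map the annotated int32 columns to Impala's
narrower integer types. The DDL below is a sketch derived from the schema above; the
table name and location are assumptions.
  create external table signed_integer_logical_types (
    id int,
    tinyint_col tinyint,    -- stored as int32 annotated with INT_8
    smallint_col smallint,  -- stored as int32 annotated with INT_16
    int_col int,
    bigint_col bigint)
  stored as parquet
  location '/test-warehouse/signed_integer_logical_types';  -- assumed path
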
min_max_is_nan.parquet:
Generated by Impala's Parquet writer before the fix for IMPALA-6527. Git hash: 3a049a53
Created to test the read path for a Parquet file with invalid metadata, namely when
'max_value' and 'min_value' are both NaN. Contains 2 single-column rows:
NaN
42

bad_codec.parquet:
Generated by Impala's Parquet writer, hacked to use the invalid enum value 5000 for the
compression codec. The data in the file is the whole of the "alltypestiny" data set, with
the same columns: id int, bool_col boolean, tinyint_col tinyint, smallint_col smallint,
int_col int, bigint_col bigint, float_col float, double_col double,
date_string_col string, string_col string, timestamp_col timestamp, year int, month int

num_values_def_levels_mismatch.parquet:
A file with a single boolean column with page metadata reporting 2 values but only def
levels for a single literal value. Generated by hacking Impala's Parquet writer to
increment page.header.data_page_header.num_values. This caused Impala to hit a DCHECK
(IMPALA-6589).

rle_encoded_bool.parquet:
Parquet v1 file with RLE encoded boolean column "b" and int column "i".
Created for IMPALA-6324, generated with a modified parquet-mr. Contains 279 rows,
139 with value false, and 140 with value true. "i" is always 1 if "b" is true
and always 0 if "b" is false.

dict_encoding_with_large_bit_width.parquet:
Parquet file with a single TINYINT column "i" with 33 rows. Created by an Impala
modified to use 9-bit dictionary indices for encoding. Reading this file used to lead
to DCHECK errors (IMPALA-7147).

decimal_stored_as_int32.parquet:
Parquet file generated by Spark 2.3.1 that contains decimals stored as int32.
Impala needs to be able to read such values (IMPALA-5542).

decimal_stored_as_int64.parquet:
Parquet file generated by Spark 2.3.1 that contains decimals stored as int64.
Impala needs to be able to read such values (IMPALA-5542).

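For context, Spark by default writes decimals with small enough precision using the
INT32/INT64 physical types; a reader table just declares ordinary DECIMAL columns. The
DDL below is a sketch only: the column name, precision/scale and location are assumed,
not the actual schema of these files.
  create external table decimal_stored_as_int32 (d decimal(9,0))  -- assumed column
  stored as parquet
  location '/test-warehouse/decimal_stored_as_int32';             -- assumed path
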
decimal_padded_fixed_len_byte_array.parquet:
Parquet file generated by a hacked Impala where decimals are encoded as
FIXED_LEN_BYTE_ARRAY with an extra byte of padding (IMPALA-2515). The
data is the same as functional.decimal_tbl.

decimal_padded_fixed_len_byte_array2.parquet:
Parquet file generated by a hacked Impala where decimals are encoded as
FIXED_LEN_BYTE_ARRAY with an extra byte of padding (IMPALA-2515).
Impala was hacked to limit dictionaries to 2000 entries, and
PARQUET_PAGE_ROW_COUNT_LIMIT was set to 1234. This resulted in
a file with two dictionary encoded pages and multiple plain
encoded pages. The values are distributed across the full
range of DECIMAL(10, 0). The file was created as follows:

create table d(d decimal(10, 0)) stored as parquet;
set num_nodes=1;
set PARQUET_PAGE_ROW_COUNT_LIMIT=1234;
insert into d
select distinct cast((1000000000 - (o_orderkey * 110503)) % 1000000000 as decimal(9, 0))
from tpch_parquet.orders where o_orderkey < 50000;

primitive_type_widening.parquet:
Parquet file that contains two rows with the following schema:
- int32 tinyint_col1
- int32 tinyint_col2
- int32 tinyint_col3
- int32 tinyint_col4
- int32 smallint_col1
- int32 smallint_col2
- int32 smallint_col3
- int32 int_col1
- int32 int_col2
- float float_col
It is used to test primitive type widening (IMPALA-6373).

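One way to exercise the widening path is a table whose column types are wider than the
physical types above. The DDL below is only a sketch: the chosen target types and the
location are assumptions, not the schema used by the actual tests.
  create external table primitive_type_widening (
    tinyint_col1 smallint,
    tinyint_col2 int,
    tinyint_col3 bigint,
    tinyint_col4 bigint,
    smallint_col1 int,
    smallint_col2 bigint,
    smallint_col3 bigint,
    int_col1 bigint,
    int_col2 bigint,
    float_col double)
  stored as parquet
  location '/test-warehouse/primitive_type_widening';  -- assumed path
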
corrupt_footer_len_decr.parquet:
Parquet file that contains one row of the following schema:
- bigint c
The footer size is manually modified (using hexedit) to be the original file size minus
1, to cause metadata deserialization in footer parsing to fail, thus triggering the
printing of an error message with an incorrect file offset, to verify that it's fixed by
IMPALA-6442.

corrupt_footer_len_incr.parquet:
Parquet file that contains one row of the following schema:
- bigint c
The footer size is manually modified (using hexedit) to be larger than the original file
size and cause footer parsing to fail. It's used to test an error message related to
IMPALA-6442.

hive_single_value_timestamp.parq:
Parquet file written by Hive with the following schema:
i int, d timestamp
Contains a single row. It is used to test IMPALA-7559 which only occurs when all values
in a column chunk are the same timestamp and the file is written with parquet-mr (which
is used by Hive).

out_of_range_time_of_day.parquet:
IMPALA-7595: Parquet file that contains timestamps where the time part is out of the
valid range [0..24H). Before the fix, select * returned these values:
1970-01-01 -00:00:00.000000001 (invalid - negative time of day)
1970-01-01 00:00:00
1970-01-01 23:59:59.999999999
1970-01-01 24:00:00 (invalid - time of day should be less than a whole day)

strings_with_quotes.csv:
Various strings with quotes in them to reproduce bugs like IMPALA-7586.

int64_timestamps_plain.parq:
Parquet file generated with Parquet-mr that contains plain encoded int64 columns with
Timestamp logical types. Has the following columns:
new_logical_milli_utc, new_logical_milli_local,
new_logical_micro_utc, new_logical_micro_local

int64_timestamps_dict.parq:
Parquet file generated with Parquet-mr that contains dictionary encoded int64 columns
with Timestamp logical types. Has the following columns:
id,
new_logical_milli_utc, new_logical_milli_local,
new_logical_micro_utc, new_logical_micro_local

int64_timestamps_at_dst_changes.parquet:
Parquet file generated with Parquet-mr that contains plain encoded int64 columns with
Timestamp logical types. The file contains 3 row groups, and all row groups contain
3 distinct values, so there is a "min", a "max", and a "middle" value. The values were
selected in such a way that the UTC->CET conversion changes the order of the values (this
is possible during the Summer->Winter DST change) and "middle" falls outside the
"min".."max" range after conversion. This means that a naive stat filtering
implementation could drop "middle" incorrectly.
Example (all dates are 2017-10-29):
UTC: 00:45:00, 01:00:00, 01:10:00 =>
CET: 02:45:00, 02:00:00, 02:10:00
Columns: rawvalue bigint, rowgroup int, millisutc timestamp, microsutc timestamp

int64_timestamps_nano.parquet:
Parquet file generated with Parquet-mr that contains int64 columns with nanosecond
precision. Tested separately from the micro/millisecond columns because of the different
valid range.
Columns: rawvalue bigint, nanoutc timestamp, nanononutc timestamp

out_of_range_timestamp_hive_211.parquet:
Hive-generated file with an out-of-range timestamp. Generated with Hive 2.1.1 using
the following query:
create table alltypes_hive stored as parquet as
select * from functional.alltypes
union all
select -1, false, 0, 0, 0, 0, 0, 0, '', '', cast('1399-01-01 00:00:00' as timestamp), 0, 0

out_of_range_timestamp2_hive_211.parquet:
Hive-generated file with out-of-range timestamps every second value, to exercise code
paths in the Parquet scanner for non-repeated runs. Generated with Hive 2.1.1 using
the following query:
create table hive_invalid_timestamps stored as parquet as
select id,
  case id % 3
    when 0 then timestamp_col
    when 1 then NULL
    when 2 then cast('1300-01-01 9:9:9' as timestamp)
  end timestamp_col
from functional.alltypes
sort by id

decimal_rtf_tbl.txt:
This was generated using formulas in Google Sheets. The goal was to create various
decimal values that cover the 3 storage formats with various precision and scale.
This is a reasonably large table that is used for testing min-max filters
with decimal types on Kudu.

decimal_rtf_tiny_tbl.txt:
Small table with specific decimal values picked from decimal_rtf_tbl.txt so that
min-max filter based pruning can be tested with decimal types on Kudu.

date_tbl.orc:
Small ORC table with one DATE column, created by Hive.

date_tbl.avro:
Small Avro table with one DATE column, created by Hive.

date_tbl.parquet:
Small Parquet table with one DATE column, created by Parquet MR.

out_of_range_date.parquet:
Generated with a hacked version of Impala's Parquet writer.
Contains a single DATE column with 9 values, 4 of which are out of range
and should be read as NULL by Impala:
-0001-12-31 (invalid - date too small)
0000-01-01 (invalid - date too small)
0000-01-02 (invalid - date too small)
1969-12-31
1970-01-01
1970-01-02
9999-12-30
9999-12-31
10000-01-01 (invalid - date too large)

out_of_range_date.orc:
Created using a pre-3.1 Hive version (2.1.1) to contain an out-of-range date value.
This takes advantage of an incompatibility between Hive and Impala: a date before
1582-10-15 written by Hive (before 3.1) is interpreted incorrectly by Impala. The values
written with Hive:
2019-10-04, 1582-10-15, 0001-01-01, 9999-12-31
These are interpreted by Impala as:
2019-10-04, 1582-10-15, 0000-12-30 (invalid - date too small), 9999-12-31

hive2_pre_gregorian.parquet:
Small Parquet table with one DATE column, created by Hive 2.1.1.
Used to demonstrate Parquet interoperability issues between Hive and Impala for dates
before the introduction of the Gregorian calendar on 1582-10-15.

hive2_pre_gregorian.orc:
Same as the above but in ORC format instead of Parquet.

decimals_1_10.parquet:
Contains two decimal columns, one with precision 1, the other with precision 10.
I used Hive 2.1.1 with a modified version of Parquet-MR (6901a20) to create tiny,
misaligned pages in order to test the value-skipping logic in the Parquet column readers.
The modification in Parquet-MR was to set MIN_SLAB_SIZE to 1. You can find the change
here: https://github.com/boroknagyz/parquet-mr/tree/tiny_pages
hive --hiveconf parquet.page.row.count.limit=5 --hiveconf parquet.page.size=5
  --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=1
create table decimals_1_10 (d_1 DECIMAL(1, 0), d_10 DECIMAL(10, 0)) stored as PARQUET;
insert into decimals_1_10 values (1, 1), (2, 2), (3, 3), (4, 4), (5, 5),
  (NULL, 1), (2, 2), (3, 3), (4, 4), (5, 5),
  (1, 1), (NULL, 2), (3, 3), (4, 4), (5, 5),
  (1, 1), (2, 2), (NULL, 3), (4, 4), (5, 5),
  (1, 1), (2, 2), (3, 3), (NULL, 4), (5, 5),
  (1, 1), (2, 2), (3, 3), (4, 4), (NULL, 5),
  (NULL, 1), (NULL, 2), (3, 3), (4, 4), (5, 5),
  (1, 1), (NULL, 2), (3, 3), (NULL, 4), (5, 5),
  (1, 1), (2, 2), (3, 3), (NULL, 4), (NULL, 5),
  (NULL, 1), (2, 2), (NULL, 3), (NULL, 4), (5, 5),
  (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL);

nested_decimals.parquet:
Contains two columns, one is a decimal column, the other is an array of decimals.
I used Hive 2.1.1 with a modified Parquet-MR, see description at decimals_1_10.parquet.
hive --hiveconf parquet.page.row.count.limit=5 --hiveconf parquet.page.size=16
  --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=1
create table nested_decimals (d_38 Decimal(38, 0), arr array<Decimal(1, 0)>) stored as parquet;
insert into nested_decimals select 1, array(cast (1 as decimal(1,0)), cast (1 as decimal(1,0)), cast (1 as decimal(1,0)) ) union all
select 2, array(cast (2 as decimal(1,0)), cast (2 as decimal(1,0)), cast (2 as decimal(1,0)) ) union all
select 3, array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0)), cast (3 as decimal(1,0)) ) union all
select 4, array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)) ) union all
select 5, array(cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)) ) union all

select 1, array(cast (1 as decimal(1,0)) ) union all
select 2, array(cast (2 as decimal(1,0)), cast (2 as decimal(1,0)) ) union all
select 3, array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0)), cast (3 as decimal(1,0)) ) union all
select 4, array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)) ) union all
select 5, array(cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)) ) union all

select 1, array(cast (NULL as decimal(1, 0)), NULL, NULL) union all
select 2, array(cast (2 as decimal(1,0)), NULL, NULL) union all
select 3, array(cast (3 as decimal(1,0)), NULL, cast (3 as decimal(1,0))) union all
select 4, array(NULL, cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), NULL) union all
select 5, array(NULL, cast (5 as decimal(1,0)), NULL, NULL, cast (5 as decimal(1,0)) ) union all

select 6, array(cast (6 as decimal(1,0)), NULL, cast (6 as decimal(1,0)) ) union all
select 7, array(cast (7 as decimal(1,0)), cast (7 as decimal(1,0)), cast (7 as decimal(1,0)), NULL ) union all
select 8, array(NULL, NULL, cast (8 as decimal(1,0)) ) union all
select 7, array(cast (7 as decimal(1,0)), cast (7 as decimal(1,0)), cast (7 as decimal(1,0)) ) union all
select 6, array(NULL, NULL, NULL, cast (6 as decimal(1,0)) );

double_nested_decimals.parquet:
Contains two columns, one is a decimal column, the other is an array of arrays of
decimals. I used Hive 2.1.1 with a modified Parquet-MR, see description
at decimals_1_10.parquet.
hive --hiveconf parquet.page.row.count.limit=5 --hiveconf parquet.page.size=16
  --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=1
create table double_nested_decimals (d_38 Decimal(38, 0), arr array<array<Decimal(1, 0)>>) stored as parquet;
insert into double_nested_decimals select 1, array(array(cast (1 as decimal(1,0)), cast (1 as decimal(1,0)) )) union all
select 2, array(array(cast (2 as decimal(1,0)), cast (2 as decimal(1,0)) )) union all
select 3, array(array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0)), cast (3 as decimal(1,0)) )) union all
select 4, array(array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)) )) union all
select 5, array(array(cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)) )) union all

select 1, array(array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))) ) union all
select 2, array(array(cast (2 as decimal(1,0))), array(cast (2 as decimal(1,0))) ) union all
select 3, array(array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))) ) union all
select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all
select 5, array(array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))) ) union all

select 1, array(array(cast (1 as decimal(1,0))) ) union all
select 2, array(array(cast (2 as decimal(1,0))), array(cast (2 as decimal(1,0))) ) union all
select 3, array(array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))) ) union all
select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all
select 5, array(array(cast (5 as decimal(1,0))) ) union all

select 1, array(array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))) ) union all
select 2, array(array(cast (2 as decimal(1,0))), array(cast (2 as decimal(1,0))) ) union all
select 3, array(array(cast (3 as decimal(1,0))) ) union all
select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all
select 5, array(array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))) ) union all

select 1, array(array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0)), cast (1 as decimal(1,0))) ) union all
select 2, array(array(cast (2 as decimal(1,0))) ) union all
select 3, array(array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))) ) union all
select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all
select 5, array(array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))) ) union all

select 1, array(array(cast (NULL as decimal(1,0))), array(cast (NULL as decimal(1,0))), array(cast (1 as decimal(1,0))) ) union all
select 2, array(array(cast (NULL as decimal(1,0))), array(cast (NULL as decimal(1,0))), array(cast (NULL as decimal(1,0))) ) union all
select 3, array(array(cast (NULL as decimal(1,0))), array(cast (3 as decimal(1,0))), NULL ) union all
select 4, array(NULL, NULL, array(cast (NULL as decimal(1,0)), NULL, NULL, NULL, NULL) ) union all
select 5, array(array(NULL, cast (5 as decimal(1,0)), NULL, NULL, NULL) ) union all

select 6, array(array(cast (6 as decimal(1,0)), NULL), array(cast (6 as decimal(1,0))) ) union all
select 7, array(array(cast (7 as decimal(1,0)), cast (7 as decimal(1,0))), NULL ) union all
select 8, array(array(NULL, NULL, cast (8 as decimal(1,0))) ) union all
select 7, array(array(cast (7 as decimal(1,0)), cast (NULL as decimal(1,0))), array(cast (7 as decimal(1,0))) ) union all
select 6, array(array(NULL, NULL, cast (6 as decimal(1,0))), array(NULL, cast (6 as decimal(1,0))) );

alltypes_tiny_pages.parquet:
Created from 'functional.alltypes' with small page sizes.
I used Hive 2.1.1 with a modified Parquet-MR, see description at decimals_1_10.parquet.
I used the following commands to create the file:
hive --hiveconf parquet.page.row.count.limit=90 --hiveconf parquet.page.size=90 --hiveconf parquet.page.size.row.check.min=7
create table alltypes_tiny_pages stored as parquet as select * from functional_parquet.alltypes

alltypes_tiny_pages_plain.parquet:
Created from 'functional.alltypes' with small page sizes without dictionary encoding.
I used Hive 2.1.1 with a modified Parquet-MR, see description at decimals_1_10.parquet.
I used the following commands to create the file:
hive --hiveconf parquet.page.row.count.limit=90 --hiveconf parquet.page.size=90 --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=7
create table alltypes_tiny_pages_plain stored as parquet as select * from functional_parquet.alltypes

parent_table:
Created manually. Contains two columns, an INT and a STRING column, which together form
the primary key of the table. This table is used to test primary key and foreign key
relationships along with parent_table_2 and child_table.

parent_table_2:
Created manually. Contains just one INT column, which is also the table's primary key.
This table is used to test primary key and foreign key relationships along with
parent_table and child_table.

child_table:
Created manually. Contains four columns. The 'seq' column is the primary key of this
table, ('id', 'year') form a foreign key referring to parent_table('id', 'year'), and 'a'
is a foreign key referring to parent_table_2's primary key column 'a'.

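For reference, tables like these can be declared with Hive-style informational
constraints (DISABLE NOVALIDATE RELY). The DDL below is only a sketch with assumed
column types, not the exact definitions used by the tests.
  create table parent_table (id int, year string,
    primary key (id, year) disable novalidate rely);
  create table parent_table_2 (a int,
    primary key (a) disable novalidate rely);
  create table child_table (seq int, id int, year string, a int,
    primary key (seq) disable novalidate rely,
    constraint fk1 foreign key (id, year) references parent_table (id, year) disable novalidate rely,
    constraint fk2 foreign key (a) references parent_table_2 (a) disable novalidate rely);
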
out_of_range_timestamp.orc:
Created with Hive. ORC file with a single timestamp column 'ts'.
Contains one row (1300-01-01 00:00:00) which is outside Impala's valid time range.

corrupt_schema.orc:
ORC file from IMPALA-9277, generated by a fuzz test. The file contains malformed metadata.

corrupt_root_type.orc:
ORC file for IMPALA-9249, generated by a fuzz test. The root type of the schema is not
a struct, which used to hit a DCHECK.

hll_sketches_from_hive.parquet:
This file contains a table that has some string columns to store serialized Apache
DataSketches HLL sketches created by Hive. Each column contains a sketch for a
specific data type. Covers the following types: TINYINT, INT, BIGINT, FLOAT, DOUBLE,
STRING, CHAR and VARCHAR. Has an additional column for NULL values.

hll_sketches_from_impala.parquet:
This holds the same sketches as hll_sketches_from_hive.parquet but these sketches were
created by Impala instead of Hive.

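Impala's DataSketches integration can both produce and consume such serialized sketches.
The queries below are a minimal sketch of that flow; the table and column names are
assumed, not the actual ones in these files.
  -- Producing serialized HLL sketches (column names assumed):
  create table hll_sketches_demo stored as parquet as
  select ds_hll_sketch(int_col) as sketch_int,
         ds_hll_sketch(string_col) as sketch_string
  from functional.alltypes;
  -- Estimating the distinct count from a stored sketch:
  select ds_hll_estimate(sketch_int) from hll_sketches_demo;
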
cpc_sketches_from_hive.parquet:
This file contains a table that has some string columns to store serialized Apache
DataSketches CPC sketches created by Hive. Each column contains a sketch for a
specific data type. Covers the following types: TINYINT, INT, BIGINT, FLOAT, DOUBLE,
STRING, CHAR and VARCHAR. Has an additional column for NULL values.

cpc_sketches_from_impala.parquet:
This holds the same sketches as cpc_sketches_from_hive.parquet but these sketches were
created by Impala instead of Hive.

theta_sketches_from_hive.parquet:
This file contains a table that has some string columns to store serialized Apache
DataSketches Theta sketches created by Hive. Each column contains a sketch for a
specific data type. Covers the following types: TINYINT, INT, BIGINT, FLOAT, DOUBLE,
STRING, CHAR and VARCHAR. Has an additional column for NULL values.

theta_sketches_from_impala.parquet:
This holds the same sketches as theta_sketches_from_hive.parquet but these sketches were
created by Impala instead of Hive.

kll_sketches_from_hive.parquet:
This file contains a table that has some string columns to store serialized Apache
DataSketches KLL sketches created by Hive. Each column is for a different purpose:
- 'f': Float with distinct values.
- 'repetitions': Float with some repetition in the values.
- 'some_nulls': Float values and some NULLs.
- 'all_nulls': All values are NULLs.
- 'some_nans': Floats with some NaN values.
- 'all_nans': All values are NaNs.

kll_sketches_from_impala.parquet:
This holds the same sketches as kll_sketches_from_hive.parquet but these sketches were
created by Impala instead of Hive.

iceberg:
IMPALA-9741: Support querying Iceberg tables by Impala.
The data was generated with spark-shell (version 2.4.x); there are two tables' data in
testdata/data/iceberg_test. An Iceberg table location contains two directories:
'metadata', which contains the table metadata managed by Iceberg, and 'data', which
contains the data files.

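Such HadoopCatalog test tables are typically exposed to Impala with DDL along these
lines. The statement below is only a sketch; the table name, catalog location and table
identifier are assumptions.
  create external table iceberg_test_tbl
  stored as iceberg
  tblproperties ('iceberg.catalog' = 'hadoop.catalog',
                 'iceberg.catalog_location' = '/test-warehouse/iceberg_test/hadoop_catalog/ice',  -- assumed
                 'iceberg.table_identifier' = 'ice.iceberg_test_tbl');                            -- assumed
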
iceberg_test/hadoop_catalog/ice/complextypestbl_iceberg_orc:
Iceberg table generated by Hive 3.1 + Iceberg 0.11. Originally it was a HiveCatalog
table, so I've renamed the metadata JSON files and added a version-hint.text file.
I've also edited the metadata JSON and AVRO files to remove 'hdfs://localhost:20500',
and updated the file paths. Now it can be used as a HadoopCatalog table.

hudi_parquet:
IMPALA-8778: Support reading Apache Hudi tables.
Hudi Parquet is a special layout of Parquet files managed by Apache Hudi
(hudi.incubator.apache.org) to provide ACID transactions.
In order to provide snapshot isolation between writers and queries,
Hudi writes a newer version of an existing Parquet file
whenever an update comes into that file.
Hudi stores the indexing information and version information in the file name.
For example:
`ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet`
`ca51fa17-681b-4497-85b7-4f68e7a63ee7-0` is the bloom index hash of this file
`20200112194517` is the timestamp of this version
If a record in this file is updated, Hudi writes a new file with
the same indexing hash but a newer version depending on the time of writing:
`ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-38-282_20200112194529.parquet`
If the Impala table is refreshed after this file is written, Impala will
only query the file with the latest version.

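Impala exposes such data through the HUDIPARQUET file format. The DDL below is only a
sketch; the table name, column list and location are assumptions, not the definition of
the actual test tables.
  create external table hudi_parquet_tbl (id int, name string)  -- assumed schema
  stored as hudiparquet
  location '/test-warehouse/hudi_parquet';                      -- assumed path
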
streaming.orc:
ORC file generated by Hive Streaming Ingestion. I used a slightly altered version of
TestStreaming.testNoBuckets() from Hive 3.1 to generate this file. It contains
values coming from two transactions. The file has two stripes (one per transaction).

alltypes_non_acid.orc:
Non-ACID ORC file generated by Hive 3.1 with the following command:
CREATE TABLE alltypes_clone STORED AS ORC AS SELECT * FROM functional.alltypes;
It's used as an original file in ACID tests.

dateless_timestamps.parq:
Parquet file generated before the removal of dateless timestamp support.
Created as: CREATE TABLE timestamps_pq(t TIMESTAMP) STORED AS parquet;
It contains the following lines:
1996-04-22 10:00:00.432100000
1996-04-22 10:00:00.432100000
1996-04-22 10:00:00
1996-04-22 10:00:00
1996-04-22 00:00:00
10:00:00.432100000
10:00:00

full_acid_schema_but_no_acid_version.orc:
IMPALA-10115: Impala should check the file schema as well to detect full ACIDv2 files.
Generated by query-based compaction in Hive 3.1. The file has the full ACIDv2 schema, but
doesn't have the file user metadata 'hive.acid.version'.

alltypes_empty_pages.parquet:
Parquet file that contains empty data pages. Needed to test IMPALA-9952.
Generated by a modified Impala (git hash e038db44 (between 3.4 and 4.0)). I modified
HdfsParquetTableWriter::ShouldStartNewPage() to randomly start a new page:
int64_t r = random(); if (r % 2 + r % 3 + r % 5 == 0) return true;
Also modified HdfsParquetTableWriter::NewPage() to randomly insert empty pages:
if (r ... ) pages_.push_back(DataPage());

alltypes_invalid_pages.parquet:
Parquet file that contains invalid data pages. Needed to test IMPALA-9952.
Generated by a modified Impala (git hash e038db44 (between 3.4 and 4.0)). I modified
HdfsParquetTableWriter::ShouldStartNewPage() to randomly start a new page:
int64_t r = random(); if (r % 2 + r % 3 + r % 5 == 0) return true;
Also modified HdfsParquetTableWriter::BaseColumnWriter::Flush to randomly invalidate
the offset index:
if (r ... ) location.offset = -1;

customer_multiblock_page_index.parquet:
Parquet file that contains multiple blocks in a single file. Needed to test IMPALA-10310.
The file was generated with the following steps:
// use parquet-hadoop 1.11.0 to generate the page index
1. export HIVE_AUX_JARS_PATH=/path/parquet-hadoop-1.11.0.jar
// in the hive shell
2. SET parquet.block.size=8192; // use a small block size
3. SET parquet.page.row.count.limit=10; // a small page row count generates multiple pages
4. CREATE TABLE customer_multiblock_page_index_6
   STORED AS PARQUET
   TBLPROPERTIES('parquet.compression'='SNAPPY')
   AS SELECT * FROM tpcds.customer
   WHERE c_current_cdemo_sk IS NOT NULL ORDER BY c_current_cdemo_sk LIMIT 2000;
The generated file contains multiple blocks, with multiple pages per block.

IMPALA-10361: Use field id to resolve columns for Iceberg tables
The data was generated with spark-shell (version 2.4.x); the table data is in
testdata/data/iceberg_test/hadoop_catalog/iceberg_resolution_test. This table was
generated with HadoopCatalog and the Parquet file format. We use this table to test
complex types for field id resolution.

alltypes_tiny_rle_dictionary.parquet:
Tiny file using the RLE_DICTIONARY encoding.
Started Impala with -write_new_parquet_dictionary_encodings=true, then:
set max_fs_writers=1;
create table att stored as parquet as
select * from functional_parquet.alltypestiny;

timestamp_with_local_timezone.orc:
ORC file that contains a column with type 'timestamp with local timezone'.
Generated by Spark/Iceberg.

parquet-bloom-filtering.parquet:
Generated by hacking
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestBloomFiltering.java
(65b95fb72be8f5a8a193a6f7bc4560fdcd742fc7).
The schema was completely changed to allow us to test types supported in Parquet Bloom
filters.

ComplexTypesTbl/arrays.orc and arrays.parq:
These tables hold 3 columns: an int ID and two arrays, one with int and the other with
string elements. The purpose of introducing these tables is to give more test coverage
for zipping unnests. There are rows where the 2 arrays are of the same length, or one of
them is longer than the other, plus there are NULL and empty arrays as well.

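A zipping unnest walks both arrays in lockstep, producing one row per array position.
The query below is a sketch of that usage; the table and column names (arrays_tbl, id,
arr1, arr2) are assumed.
  select id, unnest(arr1), unnest(arr2)
  from arrays_tbl;
  -- Each output row pairs arr1[i] with arr2[i]; positions missing from the shorter
  -- array come back as NULL.
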
binary_decimal_precision_and_scale_widening.parquet:
Parquet file written with schema (decimal(9,2), decimal(18,2), decimal(38,2)). The rows
inside the file are carefully chosen so that they don't cause an overflow when being read
by an Impala table with a higher precision/scale.

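For illustration, a reader table simply declares wider DECIMAL types than the file
schema. The DDL below is a sketch; the column names, the exact widened types and the
location are assumptions.
  create external table decimal_widening (
    d1 decimal(18,4),   -- file column written as decimal(9,2)
    d2 decimal(28,4),   -- file column written as decimal(18,2)
    d3 decimal(38,4))   -- file column written as decimal(38,2)
  stored as parquet
  location '/test-warehouse/decimal_widening';  -- assumed path
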
iceberg_test/hadoop_catalog/ice/airports_parquet:
Regular Parquet table converted to Iceberg, which means that the data file doesn't contain
field ids.

iceberg_test/hadoop_catalog/ice/airports_orc:
Regular ORC table converted to Iceberg, which means that the data file doesn't contain
field ids.

too_many_def_levels.parquet:
Written by Impala 2.5. The Parquet pages have more encoded def levels than num_values.

partition_col_in_parquet.parquet:
Written by Impala 4.0. Parquet file with an INT and a DATE column. Values in the DATE
column are identical. There's only a single value per page in the Parquet file (written
by setting query option 'parquet_page_row_count_limit' to 1).

no_scale.parquet:
Generated by code injection that removed the scale from the written Parquet file:
Status HdfsParquetTableWriter::WriteFileFooter() {
+  file_metadata_.schema[1].__set_scale(1);
+  file_metadata_.schema[1].__isset.scale = false;
+  file_metadata_.schema[1].logicalType.DECIMAL.scale = 1;
+  file_metadata_.schema[1].logicalType.__isset.DECIMAL = false;
+  file_metadata_.schema[1].__isset.logicalType = false;
create table my_decimal_tbl (d1 decimal(4,2)) stored as parquet;
insert into my_decimal_tbl values (cast(0 as decimal(4,2)));

iceberg_test/hadoop_catalog/ice/alltypes_part:
iceberg_test/hadoop_catalog/ice/alltypes_part_orc:
Generated by Hive 3.1 + Iceberg 0.11. Then the JSON and AVRO files were manually edited
to make these tables correspond to an Iceberg table in a HadoopCatalog instead of
HiveCatalog.
alltypes_part has PARQUET data files, alltypes_part_orc has ORC data files. They have
identity partitions with all the supported data types.

iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution:
iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution_orc:
Generated by Hive 3.1 + Iceberg 0.11. Then the JSON and AVRO files were manually edited
to make these tables correspond to an Iceberg table in a HadoopCatalog instead of
HiveCatalog.
iceberg_legacy_partition_schema_evolution has PARQUET data files,
iceberg_legacy_partition_schema_evolution_orc has ORC data files.
The tables have had the following schema changes since table migration (illustrated by
the sketch after this list):
* Partition INT column promoted to BIGINT
* Partition FLOAT column promoted to DOUBLE
* Partition DECIMAL(5,3) column promoted to DECIMAL(8,3)
* A non-partition column has been moved to the end of the schema

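Such type promotions can be expressed with ALTER TABLE ... CHANGE COLUMN statements on
the Iceberg table. The statements below are only a sketch with assumed column names, not
the exact commands used to prepare these test tables.
  alter table iceberg_legacy_partition_schema_evolution change column int_col int_col bigint;
  alter table iceberg_legacy_partition_schema_evolution change column float_col float_col double;
  alter table iceberg_legacy_partition_schema_evolution change column dec_col dec_col decimal(8,3);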