mirror of
https://github.com/apache/impala.git
synced 2026-01-07 00:02:28 -05:00
IMPALA-6592 revealed a gap in test coverage for files with invalid/unsupported Parquet codecs. This adds a test that reproduces the bug that was present in my IMPALA-4835 patch. master is unaffected by this bug. I also hid the conversion tables and made the conversion go through functions that validate the enum values, to make it easier to track down problems like this in the future. Testing: Ran exhaustive tests. Change-Id: I1502ea7b7f39aa09f0ed2677e84219b37c64c416 Reviewed-on: http://gerrit.cloudera.org:8080/9500 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins
165 lines
6.3 KiB
Plaintext
165 lines
6.3 KiB
Plaintext
bad_parquet_data.parquet:
|
|
Generated with parquet-mr 1.2.5
|
|
Contains 3 single-column rows:
|
|
"parquet"
|
|
"is"
|
|
"fun"
|
|
|
|
bad_compressed_dict_page_size.parquet:
|
|
Generated by hacking Impala's Parquet writer.
|
|
Contains a single string column 'col' with one row ("a"). The compressed_page_size field
|
|
in dict page header is modifed to 0 to test if it is correctly handled.
|
|
|
|
bad_rle_literal_count.parquet:
|
|
Generated by hacking Impala's Parquet writer.
|
|
Contains a single bigint column 'c' with the values 1, 3, 7 stored
|
|
in a single data chunk as dictionary plain. The RLE encoded dictionary
|
|
indexes are all literals (and not repeated), but the literal count
|
|
is incorrectly 0 in the file to test that such data corruption is
|
|
proprly handled.
|
|
|
|
bad_rle_repeat_count.parquet:
|
|
Generated by hacking Impala's Parquet writer.
|
|
Contains a single bigint column 'c' with the value 7 repeated 7 times
|
|
stored in a single data chunk as dictionary plain. The RLE encoded dictionary
|
|
indexes are a single repeated run (and not literals), but the repeat count
|
|
is incorrectly 0 in the file to test that such data corruption is proprly
|
|
handled.
|
|
|
|
zero_rows_zero_row_groups.parquet:
|
|
Generated by hacking Impala's Parquet writer.
|
|
The file metadata indicates zero rows and no row groups.
|
|
|
|
zero_rows_one_row_group.parquet:
|
|
Generated by hacking Impala's Parquet writer.
|
|
The file metadata indicates zero rows but one row group.
|
|
|
|
huge_num_rows.parquet
|
|
Generated by hacking Impala's Parquet writer.
|
|
The file metadata indicates 2 * MAX_INT32 rows.
|
|
The single row group also has the same number of rows in the metadata.
|
|
|
|
repeated_values.parquet:
|
|
Generated with parquet-mr 1.2.5
|
|
Contains 3 single-column rows:
|
|
"parquet"
|
|
"parquet"
|
|
"parquet"
|
|
|
|
multiple_rowgroups.parquet:
|
|
Generated with parquet-mr 1.2.5
|
|
Populated with:
|
|
hive> set parquet.block.size=500;
|
|
hive> INSERT INTO TABLE tbl
|
|
SELECT l_comment FROM tpch.lineitem LIMIT 1000;
|
|
|
|
alltypesagg_hive_13_1.parquet:
|
|
Generated with parquet-mr version 1.5.0-cdh5.4.0-SNAPSHOT
|
|
hive> create table alltypesagg_hive_13_1 stored as parquet as select * from alltypesagg;
|
|
|
|
bad_column_metadata.parquet:
|
|
Generated with hacked version of parquet-mr 1.8.2-SNAPSHOT
|
|
Schema:
|
|
{"type": "record",
|
|
"namespace": "org.apache.impala",
|
|
"name": "bad_column_metadata",
|
|
"fields": [
|
|
{"name": "id", "type": ["null", "long"]},
|
|
{"name": "int_array", "type": ["null", {"type": "array", "items": ["null", "int"]}]}
|
|
]
|
|
}
|
|
Contains 3 row groups, each with ten rows and each array containing ten elements. The
|
|
first rowgroup column metadata for 'int_array' incorrectly states there are 50 values
|
|
(instead of 100), and the second rowgroup column metadata for 'id' incorrectly states
|
|
there are 11 values (instead of 10). The third rowgroup has the correct metadata.
|
|
|
|
data-bzip2.bz2:
|
|
Generated with bzip2, contains single bzip2 stream
|
|
Contains 1 column, uncompressed data size < 8M
|
|
|
|
large_bzip2.bz2:
|
|
Generated with bzip2, contains single bzip2 stream
|
|
Contains 1 column, uncompressed data size > 8M
|
|
|
|
data-pbzip2.bz2:
|
|
Generated with pbzip2, contains multiple bzip2 streams
|
|
Contains 1 column, uncompressed data size < 8M
|
|
|
|
large_pbzip2.bz2:
|
|
Generated with pbzip2, contains multiple bzip2 stream
|
|
Contains 1 column, uncompressed data size > 8M
|
|
|
|
out_of_range_timestamp.parquet:
|
|
Generated with a hacked version of Impala parquet writer.
|
|
Contains a single timestamp column with 4 values, 2 of which are out of range
|
|
and should be read as NULL by Impala:
|
|
1399-12-31 00:00:00 (invalid - date too small)
|
|
1400-01-01 00:00:00
|
|
9999-12-31 00:00:00
|
|
10000-01-01 00:00:00 (invalid - date too large)
|
|
|
|
table_with_header.csv:
|
|
Created with a text editor, contains a header line before the data rows.
|
|
|
|
table_with_header_2.csv:
|
|
Created with a text editor, contains two header lines before the data rows.
|
|
|
|
table_with_header.gz, table_with_header_2.gz:
|
|
Generated by gzip'ing table_with_header.csv and table_with_header_2.csv.
|
|
|
|
deprecated_statistics.parquet:
|
|
Generated with with hive shell, which uses parquet-mr version 1.5.0-cdh5.12.0-SNAPSHOT
|
|
Contains a copy of the data in functional.alltypessmall with statistics that use the old
|
|
'min'/'max' fields.
|
|
|
|
repeated_root_schema.parquet:
|
|
Generated by hacking Impala's Parquet writer.
|
|
Created to reproduce IMPALA-4826. Contains a table of 300 rows where the
|
|
repetition level of the root schema is set to REPEATED.
|
|
Reproduction steps:
|
|
1: Extend HdfsParquetTableWriter::CreateSchema with the following line:
|
|
file_metadata_.schema[0].__set_repetition_type(FieldRepetitionType::REQUIRED);
|
|
2: Run test_compute_stats and grab the created Parquet file for
|
|
alltypes_parquet table.
|
|
|
|
binary_decimal_dictionary.parquet,
|
|
binary_decimal_no_dictionary.parquet:
|
|
Generated using parquet-mr and contents verified using parquet-tools-1.9.1.
|
|
Contains decimals stored as variable sized BYTE_ARRAY with both dictionary
|
|
and non-dictionary encoding respectively.
|
|
|
|
alltypes_agg_bitpacked_def_levels.parquet:
|
|
Generated by hacking Impala's Parquet writer to write out bitpacked def levels instead
|
|
of the standard RLE-encoded levels. See
|
|
https://github.com/timarmstrong/incubator-impala/tree/hack-bit-packed-levels. This
|
|
is a single file containing all of the alltypesagg data, which includes a mix of
|
|
null and non-null values. This is not actually a valid Parquet file because the
|
|
bit-packed levels are written in the reverse order specified in the Parquet spec
|
|
for BIT_PACKED. However, this is the order that Impala attempts to read the levels
|
|
in - see IMPALA-3006.
|
|
|
|
signed_integer_logical_types.parquet:
|
|
Generated using a utility that uses the java Parquet API.
|
|
The file has the following schema:
|
|
schema {
|
|
optional int32 id;
|
|
optional int32 tinyint_col (INT_8);
|
|
optional int32 smallint_col (INT_16);
|
|
optional int32 int_col;
|
|
optional int64 bigint_col;
|
|
}
|
|
|
|
min_max_is_nan.parquet:
|
|
Generated by Impala's Parquet writer before the fix for IMPALA-6527. Git hash: 3a049a53
|
|
Created to test the read path for a Parquet file with invalid metadata, namely when
|
|
'max_value' and 'min_value' are both NaN. Contains 2 single-column rows:
|
|
NaN
|
|
42
|
|
|
|
bad_codec.parquet:
|
|
Generated by Impala's Parquet writer, hacked to use the invalid enum value 5000 for the
|
|
compression codec. The data in the file is the whole of the "alltypestiny" data set, with
|
|
the same columns: id int, bool_col boolean, tinyint_col tinyint, smallint_col smallint,
|
|
int_col int, bigint_col bigint, float_col float, double_col double,
|
|
date_string_col string, string_col string, timestamp_col timestamp, year int, month int
|