impala/testdata/data/README
Alex Behm 025fd3bd7f IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.
Adds handling and testing for a specific Parquet data corruption
scenario with plain dictionary encoded values.

The problematic scenario is when the repeat or literal count of
the RLE-encoded dictionary indexes is decoded as 0 - an invalid value.

There are several other cases of data corruption that are not yet
handled gracefully. This patch only handles one specific case.

Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63
Reviewed-on: http://gerrit.cloudera.org:8080/3299
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2016-06-07 17:29:59 -07:00


bad_parquet_data.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
"parquet"
"is"
"fun"
bad_rle_literal_count.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single bigint column 'c' with the values 1, 3, 7 stored
in a single data chunk using plain dictionary encoding. The RLE-encoded dictionary
indexes are all literals (and not repeated), but the literal count
is incorrectly 0 in the file to test that such data corruption is
properly handled.
bad_rle_repeat_count.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single bigint column 'c' with the value 7 repeated 7 times
stored in a single data chunk using plain dictionary encoding. The RLE-encoded
dictionary indexes are a single repeated run (and not literals), but the repeat
count is incorrectly 0 in the file to test that such data corruption is properly
handled.
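The two bad_rle_* files above target the run header of the RLE/bit-packed
hybrid encoding. The following is a minimal sketch of how a reader might
reject a zero count. It follows the header layout in the Parquet format
specification, not Impala's actual decoder; the function names are
hypothetical and the exact bytes stored in the test files are not shown here.

```python
def read_uleb128(buf, pos=0):
    """Decode an unsigned LEB128 varint; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def read_run_header(buf, pos=0):
    """Return (kind, count, next_pos) for one run of the hybrid encoding.

    Hypothetical helper: the low bit of the header selects the run type.
    A bit-packed (literal) run stores its length in groups of 8 values,
    a repeated run stores the repeat count directly.
    """
    header, pos = read_uleb128(buf, pos)
    if header & 1:
        kind, count = "literal", (header >> 1) * 8
    else:
        kind, count = "repeat", header >> 1
    if count == 0:
        # A run of zero values is never valid; treat it as corruption
        # instead of looping forever or reading garbage.
        raise ValueError(f"corrupt {kind} count of 0")
    return kind, count, pos
```

For example, read_run_header(bytes([0x0E])) describes a repeated run of 7
values, while bytes([0x00]) and bytes([0x01]) both raise ValueError.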
repeated_values.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
"parquet"
"parquet"
"parquet"
multiple_rowgroups.parquet:
Generated with parquet-mr 1.2.5
Populated with:
hive> set parquet.block.size=500;
hive> INSERT INTO TABLE tbl
SELECT l_comment FROM tpch.lineitem LIMIT 1000;
alltypesagg_hive_13_1.parquet:
Generated with parquet-mr version 1.5.0-cdh5.4.0-SNAPSHOT
hive> create table alltypesagg_hive_13_1 stored as parquet as select * from alltypesagg;
bad_column_metadata.parquet:
Generated with hacked version of parquet-mr 1.8.2-SNAPSHOT
Schema:
{"type": "record",
"namespace": "com.cloudera.impala",
"name": "bad_column_metadata",
"fields": [
{"name": "id", "type": ["null", "long"]},
{"name": "int_array", "type": ["null", {"type": "array", "items": ["null", "int"]}]}
]
}
Contains 3 row groups, each with ten rows and each array containing ten elements. The
first rowgroup column metadata for 'int_array' incorrectly states there are 50 values
(instead of 100), and the second rowgroup column metadata for 'id' incorrectly states
there are 11 values (instead of 10). The third rowgroup has the correct metadata.
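The mismatch described for bad_column_metadata.parquet can be sketched as a
simple consistency check. This is an illustration only: the dictionaries
below are hand-written from the description above, assuming every count not
called out as wrong is correct, and find_mismatches is a hypothetical helper,
not part of Impala or parquet-mr.

```python
ROWS_PER_GROUP = 10
ELEMENTS_PER_ARRAY = 10

# Value counts each column chunk should report per row group.
expected = {
    "id": ROWS_PER_GROUP,
    "int_array": ROWS_PER_GROUP * ELEMENTS_PER_ARRAY,
}

# Value counts as stated in the file's column metadata (assumed layout).
stated = [
    {"id": 10, "int_array": 50},   # row group 0: 'int_array' is wrong
    {"id": 11, "int_array": 100},  # row group 1: 'id' is wrong
    {"id": 10, "int_array": 100},  # row group 2: correct
]


def find_mismatches(stated, expected):
    """Return (row_group, column, stated_count, expected_count) tuples."""
    problems = []
    for index, row_group in enumerate(stated):
        for column, want in expected.items():
            got = row_group[column]
            if got != want:
                problems.append((index, column, got, want))
    return problems


for problem in find_mismatches(stated, expected):
    print("row group %d: column '%s' states %d values, expected %d" % problem)
```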
data-bzip2.bz2
Generated with bzip2, contains single bzip2 stream
Contains 1 column, uncompressed data size < 8M
large_bzip2.bz2
Generated with bzip2, contains single bzip2 stream
Contains 1 column, uncompressed data size > 8M
data-pbzip2.bz2
Generated with pbzip2, contains multiple bzip2 streams
Contains 1 column, uncompressed data size < 8M
large_pbzip2.bz2
Generated with pbzip2, contains multiple bzip2 streams
Contains 1 column, uncompressed data size > 8M
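The difference between the single-stream and multi-stream files above can be
illustrated with Python's standard bz2 module. This is a sketch, not how the
test files were produced: it builds a multi-stream payload by concatenating
two independently compressed streams, and count_bzip2_streams is a
hypothetical helper.

```python
import bz2


def count_bzip2_streams(data):
    """Count the concatenated bzip2 streams in a bytes object."""
    count = 0
    while data:
        decompressor = bz2.BZ2Decompressor()
        decompressor.decompress(data)
        if not decompressor.eof:
            raise ValueError("truncated bzip2 stream")
        count += 1
        # Whatever follows the end of this stream belongs to the next one.
        data = decompressor.unused_data
    return count


single = bz2.compress(b"a\n")
multi = bz2.compress(b"a\n") + bz2.compress(b"b\n")

print(count_bzip2_streams(single))
print(count_bzip2_streams(multi))
# bz2.decompress() reads all concatenated streams, whereas a single
# BZ2Decompressor stops at the end of the first stream.
print(bz2.decompress(multi))
```

A reader that stops after the first stream would silently drop the remaining
rows of a multi-stream file.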