mirror of
https://github.com/apache/impala.git
synced 2025-12-30 03:01:44 -05:00
Switch the decoders to using more batch-oriented interfaces. As an intermediate step this doesn't make the interfaces of LevelDecoder or DictDecoder batch-oriented, only the lower-level utility classes. The next step would be to change those interfaces to be batch-oriented and make according optimisations in parquet. This could deliver much larger perf improvements than the current patch. The high-level changes are. * BitReader -> BatchedBitReader, which is built to unpack runs of 32 bit-packed values efficiently. * RleDecoder -> RleBatchDecoder, which exposes the repeated and literal runs to the caller and uses BatchedBitReader to unpack literal runs efficiently. * Dict decoding uses RleBatchDecoder to decode repeated runs efficiently and uses the BitPacking utilities to unpack and encode in a single step. Also removes an older benchmark that isn't too interesting (since the batch-oriented approach to encoding and decoding is so much faster than the value-by-value approach). Testing: * Ran core tests. * Updated unit tests to exercise new code. * Added test coverage for the deprecated bit-packed level encoding to that it still works (there was no coverage previously). Perf: Single-node benchmarks showed a few % performance gain. 16 node cluster benchmarks only showed a gain for TPC-H nested. Change-Id: I35de0cf80c86f501c4a39270afc8fb8111552ac6 Reviewed-on: http://gerrit.cloudera.org:8080/8267 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins
bad_parquet_data.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
"parquet"
"is"
"fun"
bad_rle_literal_count.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single bigint column 'c' with the values 1, 3, 7 stored
in a single data chunk as dictionary plain. The RLE encoded dictionary
indexes are all literals (and not repeated), but the literal count
is incorrectly 0 in the file to test that such data corruption is
proprly handled.
bad_rle_repeat_count.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single bigint column 'c' with the value 7 repeated 7 times
stored in a single data chunk as dictionary plain. The RLE encoded dictionary
indexes are a single repeated run (and not literals), but the repeat count
is incorrectly 0 in the file to test that such data corruption is proprly
handled.
zero_rows_zero_row_groups.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates zero rows and no row groups.
zero_rows_one_row_group.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates zero rows but one row group.
huge_num_rows.parquet
Generated by hacking Impala's Parquet writer.
The file metadata indicates 2 * MAX_INT32 rows.
The single row group also has the same number of rows in the metadata.
repeated_values.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
"parquet"
"parquet"
"parquet"
multiple_rowgroups.parquet:
Generated with parquet-mr 1.2.5
Populated with:
hive> set parquet.block.size=500;
hive> INSERT INTO TABLE tbl
SELECT l_comment FROM tpch.lineitem LIMIT 1000;
alltypesagg_hive_13_1.parquet:
Generated with parquet-mr version 1.5.0-cdh5.4.0-SNAPSHOT
hive> create table alltypesagg_hive_13_1 stored as parquet as select * from alltypesagg;
bad_column_metadata.parquet:
Generated with hacked version of parquet-mr 1.8.2-SNAPSHOT
Schema:
{"type": "record",
"namespace": "org.apache.impala",
"name": "bad_column_metadata",
"fields": [
{"name": "id", "type": ["null", "long"]},
{"name": "int_array", "type": ["null", {"type": "array", "items": ["null", "int"]}]}
]
}
Contains 3 row groups, each with ten rows and each array containing ten elements. The
first rowgroup column metadata for 'int_array' incorrectly states there are 50 values
(instead of 100), and the second rowgroup column metadata for 'id' incorrectly states
there are 11 values (instead of 10). The third rowgroup has the correct metadata.
data-bzip2.bz2:
Generated with bzip2, contains single bzip2 stream
Contains 1 column, uncompressed data size < 8M
large_bzip2.bz2:
Generated with bzip2, contains single bzip2 stream
Contains 1 column, uncompressed data size > 8M
data-pbzip2.bz2:
Generated with pbzip2, contains multiple bzip2 streams
Contains 1 column, uncompressed data size < 8M
large_pbzip2.bz2:
Generated with pbzip2, contains multiple bzip2 stream
Contains 1 column, uncompressed data size > 8M
out_of_range_timestamp.parquet:
Generated with a hacked version of Impala parquet writer.
Contains a single timestamp column with 4 values, 2 of which are out of range
and should be read as NULL by Impala:
1399-12-31 00:00:00 (invalid - date too small)
1400-01-01 00:00:00
9999-12-31 00:00:00
10000-01-01 00:00:00 (invalid - date too large)
table_with_header.csv:
Created with a text editor, contains a header line before the data rows.
table_with_header_2.csv:
Created with a text editor, contains two header lines before the data rows.
table_with_header.gz, table_with_header_2.gz:
Generated by gzip'ing table_with_header.csv and table_with_header_2.csv.
deprecated_statistics.parquet:
Generated with with hive shell, which uses parquet-mr version 1.5.0-cdh5.12.0-SNAPSHOT
Contains a copy of the data in functional.alltypessmall with statistics that use the old
'min'/'max' fields.
repeated_root_schema.parquet:
Generated by hacking Impala's Parquet writer.
Created to reproduce IMPALA-4826. Contains a table of 300 rows where the
repetition level of the root schema is set to REPEATED.
Reproduction steps:
1: Extend HdfsParquetTableWriter::CreateSchema with the following line:
file_metadata_.schema[0].__set_repetition_type(FieldRepetitionType::REQUIRED);
2: Run test_compute_stats and grab the created Parquet file for
alltypes_parquet table.
binary_decimal_dictionary.parquet,
binary_decimal_no_dictionary.parquet:
Generated using parquet-mr and contents verified using parquet-tools-1.9.1.
Contains decimals stored as variable sized BYTE_ARRAY with both dictionary
and non-dictionary encoding respectively.
alltypes_agg_bitpacked_def_levels.parquet:
Generated by hacking Impala's Parquet writer to write out bitpacked def levels instead
of the standard RLE-encoded levels. See
https://github.com/timarmstrong/incubator-impala/tree/hack-bit-packed-levels. This
is a single file containing all of the alltypesagg data, which includes a mix of
null and non-null values. This is not actually a valid Parquet file because the
bit-packed levels are written in the reverse order specified in the Parquet spec
for BIT_PACKED. However, this is the order that Impala attempts to read the levels
in - see IMPALA-3006.