impala

mirror of https://github.com/apache/impala.git synced 2025-12-26 14:02:53 -05:00

Zoltan Borok-Nagy d423979866 IMPALA-5843: Use page index in Parquet files to skip pages

This commit implements page filtering based on the Parquet page index.

The read and evaluation of the page index is done by the
HdfsParquetScanner. At first, we determine the row ranges we are
interested in, and based on the row ranges we determine the candidate
pages for each column that we are reading.

We still issue one ScanRange per column chunk, but we specify
sub-ranges that store the candidate pages, i.e. we don't read
the whole column chunk, but only fractions of it.

Pages are not aligned across column chunks, i.e. page #2 of column A
might store completely different rows than page #2 of column B.
It means we need to implement some kind of row-skipping logic
when we read the data pages. This logic is implemented in
BaseScalarColumnReader and ScalarColumnReader. Collection column
readers know nothing about page filtering.
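
The page selection described above can be sketched as follows. This is a minimal illustration and not Impala's implementation: the function name `candidate_pages`, the `first_rows` lists (the first row index of each page, as a page index could supply it), and the sample layouts are all hypothetical.

```python
def candidate_pages(first_rows, num_rows, row_ranges):
    """Return the indexes of pages that overlap any wanted row range.

    first_rows: sorted first row index of each page in one column chunk.
    num_rows:   total number of rows in the row group.
    row_ranges: list of (first, last) inclusive row ranges to keep.
    """
    selected = []
    for page, start in enumerate(first_rows):
        # A page ends right before the next page starts (or at the last row).
        if page + 1 < len(first_rows):
            end = first_rows[page + 1] - 1
        else:
            end = num_rows - 1
        if any(start <= last and first <= end for first, last in row_ranges):
            selected.append(page)
    return selected


# Hypothetical layout: column A has 4 pages, column B has only 2.
pages_a = [0, 10, 20, 30]
pages_b = [0, 25]
wanted = [(12, 14), (26, 27)]

print(candidate_pages(pages_a, 40, wanted))
print(candidate_pages(pages_b, 40, wanted))
```

Because the pages are not aligned, the selected pages of column B in this example cover many rows outside the wanted ranges, which is why the scalar column readers still need row-skipping logic after the pages are chosen.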

Page filtering can be turned off by setting the query option
'read_parquet_page_index' to false.

Testing:
 * added some unit tests for the row range and
   page selection logic
 * generated various Parquet files with Parquet-MR
 * enabled Page index writing and wrote selective queries against
   tables written by Impala. Current tests are likely to use page
   filtering transparently.

Performance:
 * Measured locally, observed 3x to 20x speedup for selective queries.
   The speedup was proportional to the IO operations that needed to be done.

 * The TPCH benchmark didn't show a significant performance change. This
   is not a surprise, since the data is not sorted in any useful
   way. So the main goal was to not introduce a perf regression.

TODO:
   * measure performance for remote reads

Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Reviewed-on: http://gerrit.cloudera.org:8080/12065
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

2019-05-10 11:46:38 +00:00

date_tbl

IMPALA-7368: Add initial support for DATE type

2019-04-23 13:33:57 +00:00

date_tbl_error

IMPALA-7368: Add initial support for DATE type

2019-04-23 13:33:57 +00:00

local_tbl

Enable local filesystem tables

2015-02-27 18:48:56 +00:00

schemas

IMPALA-2525: Treat parquet ENUMs as STRINGs when creating impala tables.

2017-06-07 02:51:54 +00:00

alltypes_agg_bitpacked_def_levels.parquet

IMPALA-4177,IMPALA-6039: batched bit reading and rle decoding

2017-11-16 21:23:09 +00:00

alltypes_tiny_pages_plain.parquet

IMPALA-5843: Use page index in Parquet files to skip pages

2019-05-10 11:46:38 +00:00

alltypes_tiny_pages.parquet

IMPALA-5843: Use page index in Parquet files to skip pages

2019-05-10 11:46:38 +00:00

alltypesagg_hive_13_1.parquet

IMPALA-1658: Add compatibility flag for Hive-Parquet-Timestamps

2015-02-11 13:28:17 +00:00

avro_decimal_tbl.avro

Decimal: read from Avro

2014-05-16 22:26:11 -07:00

bad_codec.parquet

IMPALA-6592: add test for invalid parquet codecs

2018-03-08 04:48:36 +00:00

bad_column_metadata.parquet

IMPALA-2558: DCHECK in parquet scanner after block read error

2015-10-30 22:35:57 +00:00

bad_compressed_dict_page_size.parquet

IMPALA-6353: Fix crash in snappy decompressor

2018-01-17 04:18:24 +00:00

bad_compressed_size.parquet

S3: Don't seek/read past file end

2015-01-08 16:19:35 -08:00

bad_dict_page_offset.parquet

S3: Don't seek/read past file end

2015-01-08 16:19:35 -08:00

bad_magic_number.parquet

IMPALA-2130: Wrong verification of Parquet file version

2015-07-14 02:52:02 +00:00

bad_metadata_len.parquet

S3: Don't seek/read past file end

2015-01-08 16:19:35 -08:00

bad_parquet_data.parquet

IMPALA-694: Allow Impala to read files produced by parquet-mr version <= 1.2.8

2014-01-08 10:54:27 -08:00

bad_rle_literal_count.parquet

IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.

2016-06-07 17:29:59 -07:00

bad_rle_repeat_count.parquet

IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.

2016-06-07 17:29:59 -07:00

binary_decimal_dictionary.parquet

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-11-07 04:34:26 +00:00

binary_decimal_no_dictionary.parquet

IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

2017-11-07 04:34:26 +00:00

chars-formats.avro

Char PARQUET, AVRO, and TEXT tests

2014-09-26 12:24:07 -07:00

chars-formats.orc

IMPALA-5717: Support for reading ORC data files

2018-04-11 05:13:02 +00:00

chars-formats.parquet

Char PARQUET, AVRO, and TEXT tests

2014-09-26 12:24:07 -07:00

chars-formats.txt

Char PARQUET, AVRO, and TEXT tests

2014-09-26 12:24:07 -07:00

chars-tiny.txt

Bugfix and tests for CHAR(N) and VARCHAR(N)

2014-09-23 07:30:07 -07:00

corrupt_footer_len_decr.parquet

IMPALA-6442: Misleading file offset reporting in error messages.

2018-09-10 16:09:41 +00:00

corrupt_footer_len_incr.parquet

IMPALA-6442: Misleading file offset reporting in error messages.

2018-09-10 16:09:41 +00:00

data-bzip2.bz2

IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files.

2016-02-28 21:31:37 -08:00

data-pbzip2.bz2

IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files.

2016-02-28 21:31:37 -08:00

date_tbl.avro

IMPALA-7368: Add initial support for DATE type

2019-04-23 13:33:57 +00:00

date_tbl.parquet

IMPALA-7368: Add initial support for DATE type

2019-04-23 13:33:57 +00:00

decimal_rtf_tbl.txt

IMPALA-6533: Add min-max filter for decimal types on kudu tables.

2019-01-10 03:32:25 +00:00

decimal_rtf_tiny_tbl.txt

IMPALA-6533: Add min-max filter for decimal types on kudu tables.

2019-01-10 03:32:25 +00:00

decimal_stored_as_int32.parquet

IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t

2018-08-02 20:21:12 +00:00

decimal_stored_as_int64.parquet

IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t

2018-08-02 20:21:12 +00:00

decimal_tbl.txt

Decimal implementation.

2014-04-14 21:07:32 -07:00

decimal-tiny.txt

Decimal implementation.

2014-04-14 21:07:32 -07:00

decimals_1_10.parquet

IMPALA-5843: Use page index in Parquet files to skip pages

2019-05-10 11:46:38 +00:00

deprecated_statistics.parquet

IMPALA-4815, IMPALA-4817, IMPALA-4819: Write and Read Parquet Statistics for remaining types

2017-05-09 15:47:21 +00:00

dict_encoding_with_large_bit_width.parquet

IMPALA-7417: Remove DCHECKs with unnecessary constraint on dictionary encoding bit width

2018-06-11 23:25:46 +00:00

double_nested_decimals.parquet

IMPALA-5843: Use page index in Parquet files to skip pages

2019-05-10 11:46:38 +00:00

hive2_pre_gregorian.parquet

IMPALA-7370: DATE: Read/Write to parquet.

2019-05-07 00:36:56 +00:00

hive_single_value_timestamp.parq

IMPALA-7559: Disable stat filtering for UTC-normalized timestamp columns

2018-09-13 21:18:56 +00:00

huge_num_rows.parquet

IMPALA-5021: Fix count(*) remaining rows overflow in Parquet.

2017-03-08 02:00:30 +00:00

int64_timestamps_at_dst_changes.parquet

IMPALA-5050: Add support to read TIMESTAMP_MILLIS and TIMESTAMP_MICROS from Parquet

2018-11-14 20:16:14 +00:00

int64_timestamps_dict.parquet

IMPALA-5050: Add support to read TIMESTAMP_MILLIS and TIMESTAMP_MICROS from Parquet

2018-11-14 20:16:14 +00:00

int64_timestamps_nano.parquet

IMPALA-7853: Add support to read int64 NANO timestamps from Parquet

2018-12-10 15:39:24 +00:00

int64_timestamps_plain.parquet

IMPALA-5050: Add support to read TIMESTAMP_MILLIS and TIMESTAMP_MICROS from Parquet

2018-11-14 20:16:14 +00:00

kite_required_fields.parquet

Parquet: Fix value def level when max def level is 0

2015-05-15 06:41:02 +00:00

large_bzip2.bz2

IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files.

2016-02-28 21:31:37 -08:00

large_pbzip2.bz2

IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files.

2016-02-28 21:31:37 -08:00

lazy_timestamp.csv

IMPALA-5315: Cast to timestamp fails for YYYY-M-D format

2018-03-13 22:10:18 +00:00

long_page_header.parquet

IMPALA-1401: raise MAX_PAGE_HEADER_SIZE and use scanner context to

2014-10-27 16:30:56 -07:00

min_max_is_nan.parquet

IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN

2018-02-22 00:57:46 +00:00

multiple_rowgroups.parquet

IMPALA-729: fix resource management in Parquet scanner for multiple row groups

2014-01-08 10:56:26 -08:00

nested_decimals.parquet

IMPALA-5843: Use page index in Parquet files to skip pages

2019-05-10 11:46:38 +00:00

num_values_def_levels_mismatch.parquet

IMPALA-6589: remove invalid DCHECK in parquet reader

2018-03-17 02:52:19 +00:00

oldrcfile.rc

Fix pre-hive 9 rc file scanner.

2014-01-08 10:48:41 -08:00

out_of_range_date.parquet

IMPALA-7370: DATE: Read/Write to parquet.

2019-05-07 00:36:56 +00:00

out_of_range_time_of_day.parquet

IMPALA-7595: Check the validity of the time part of Parquet timestamps

2018-10-01 13:20:40 +00:00

out_of_range_timestamp2_hive_211.parquet

IMPALA-4123: Columnar decoding in Parquet

2018-11-17 01:48:05 +00:00

out_of_range_timestamp_hive_211.parquet

IMPALA-4123: Columnar decoding in Parquet

2018-11-17 01:48:05 +00:00

out_of_range_timestamp.parquet

IMPALA-4363: Add Parquet timestamp validation

2016-12-03 06:41:07 +00:00

overflow.txt

IMPALA-4810: add DECIMAL test case to strict_mode tests

2017-03-03 01:43:42 +00:00

primitive_type_widening.parquet

IMPALA-6373: Allow primitive type widening on parquet tables

2018-08-23 15:55:53 +00:00

README

IMPALA-5843: Use page index in Parquet files to skip pages

2019-05-10 11:46:38 +00:00

repeated_root_schema.parquet

IMPALA-4826: Fix error during a scan on repeated root schema in Parquet.

2017-09-06 20:07:56 +00:00

repeated_values.parquet

Allow zero bit width dict/RLE decoders.

2014-01-08 10:54:27 -08:00

rle_encoded_bool.parquet

IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-22 02:47:33 +00:00

signed_integer_logical_types.parquet

IMPALA-5052: Read and write signed integer logical types in Parquet

2018-01-09 04:55:59 +00:00

strings_with_quotes.csv

IMPALA-7586: fix predicate pushdown of escaped strings

2018-11-01 21:27:13 +00:00

table_missing_columns.csv

IMPALA-1973: Fixing crash when uninitialized, empty row is added in HdfsTextScanner

2015-05-05 00:19:12 +00:00

table_no_newline.csv

IMPALA-1476: Impala incorrectly handles text data missing a newline on the last line.

2015-03-20 19:58:50 -07:00

table_with_header_2.csv

IMPALA-1740: Add support for skip.header.line.count.

2016-05-12 14:17:46 -07:00

table_with_header_2.gz

IMPALA-5287: Test skip.header.line.count on gzip

2017-05-09 01:36:46 +00:00

table_with_header.csv

IMPALA-1740: Add support for skip.header.line.count.

2016-05-12 14:17:46 -07:00

table_with_header.gz

IMPALA-5287: Test skip.header.line.count on gzip

2017-05-09 01:36:46 +00:00

text-comma-backslash-newline.txt

IMPALA-496: Fix escaping of field delimiter and escape character in inserts

2014-01-08 10:52:09 -08:00

text-dollar-hash-pipe.txt

IMPALA-496: Fix escaping of field delimiter and escape character in inserts

2014-01-08 10:52:09 -08:00

text-thorn-ecirc-newline.txt

IMP-1291: Support "extended" ASCII characters as delimiters in text files

2014-03-13 13:00:15 -07:00

timezoneverification.csv

IMPALA-8043: Fix BE test failures related to SystemV timezones.

2019-01-15 17:04:55 +00:00

widerow.txt

IMPALA-525: Adjust IO buffer size based on read length and other memory fixes

2014-01-08 10:54:01 -08:00

zero_rows_one_row_group.parquet

IMPALA-3943: Do not throw scan errors for empty Parquet files.

2016-10-12 09:22:57 +00:00

zero_rows_zero_row_groups.parquet

IMPALA-3943: Do not throw scan errors for empty Parquet files.

2016-10-12 09:22:57 +00:00

README

bad_parquet_data.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
"parquet"
"is"
"fun"

bad_compressed_dict_page_size.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single string column 'col' with one row ("a"). The compressed_page_size field
in the dict page header is modified to 0 to test that it is handled correctly.

bad_rle_literal_count.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single bigint column 'c' with the values 1, 3, 7 stored
in a single data chunk as dictionary plain. The RLE encoded dictionary
indexes are all literals (and not repeated), but the literal count
is incorrectly 0 in the file to test that such data corruption is
properly handled.

bad_rle_repeat_count.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single bigint column 'c' with the value 7 repeated 7 times
stored in a single data chunk as dictionary plain. The RLE encoded dictionary
indexes are a single repeated run (and not literals), but the repeat count
is incorrectly 0 in the file to test that such data corruption is properly
handled.
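
For reference, here is a rough sketch of how a run header in Parquet's RLE/bit-packing hybrid encoding can be classified, and where a count of 0 would be detected. The header layout follows the Parquet encoding specification (a varint whose lowest bit selects a bit-packed "literal" run or a repeated run). The function names and the error handling are hypothetical and do not mirror Impala's decoder.

```python
def read_varint(buf, pos):
    """Decode an unsigned LEB128 varint; return (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_run_header(buf, pos=0):
    """Classify one run header and reject runs that hold no values."""
    header, pos = read_varint(buf, pos)
    count = header >> 1
    if header & 1:
        kind = "literal"      # bit-packed run: count is the number of 8-value groups
        num_values = count * 8
    else:
        kind = "repeated"     # RLE run: count is the repeat count
        num_values = count
    if num_values == 0:
        raise ValueError(f"corrupt {kind} run with a count of 0")
    return kind, num_values, pos


print(parse_run_header(bytes([0x03])))  # one group of 8 literal values
print(parse_run_header(bytes([0x0E])))  # one value repeated 7 times
```

Header bytes 0x01 and 0x00 correspond to the two corrupt cases above (literal count 0 and repeat count 0); in this sketch both raise ValueError.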

zero_rows_zero_row_groups.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates zero rows and no row groups.

zero_rows_one_row_group.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates zero rows but one row group.

huge_num_rows.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates 2 * MAX_INT32 rows.
The single row group also has the same number of rows in the metadata.
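
A small arithmetic sketch of why this row count is interesting: 2 * MAX_INT32 does not fit into a signed 32-bit integer. The helper `wrap_to_int32` is hypothetical and only mimics what a 32-bit two's complement counter would hold; it is not taken from Impala.

```python
INT32_MAX = 2**31 - 1


def wrap_to_int32(n):
    """Reinterpret an integer as a two's complement 32-bit value."""
    return (n + 2**31) % 2**32 - 2**31


num_rows = 2 * INT32_MAX            # row count stored in the file metadata
print(num_rows > INT32_MAX)         # the count needs more than 32 bits
print(wrap_to_int32(num_rows))      # what a 32-bit counter would hold instead
```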

repeated_values.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
"parquet"
"parquet"
"parquet"

multiple_rowgroups.parquet:
Generated with parquet-mr 1.2.5
Populated with:
hive> set parquet.block.size=500;
hive> INSERT INTO TABLE tbl
      SELECT l_comment FROM tpch.lineitem LIMIT 1000;

alltypesagg_hive_13_1.parquet:
Generated with parquet-mr version 1.5.0-cdh5.4.0-SNAPSHOT
hive> create table alltypesagg_hive_13_1 stored as parquet as select * from alltypesagg;

bad_column_metadata.parquet:
Generated with hacked version of parquet-mr 1.8.2-SNAPSHOT
Schema:
 {"type": "record",
  "namespace": "org.apache.impala",
  "name": "bad_column_metadata",
  "fields": [
      {"name": "id", "type": ["null", "long"]},
      {"name": "int_array", "type": ["null", {"type": "array", "items": ["null", "int"]}]}
  ]
 }
Contains 3 row groups, each with ten rows and each array containing ten elements. The
first rowgroup column metadata for 'int_array' incorrectly states there are 50 values
(instead of 100), and the second rowgroup column metadata for 'id' incorrectly states
there are 11 values (instead of 10). The third rowgroup has the correct metadata.

data-bzip2.bz2:
Generated with bzip2, contains single bzip2 stream
Contains 1 column, uncompressed data size < 8M

large_bzip2.bz2:
Generated with bzip2, contains single bzip2 stream
Contains 1 column, uncompressed data size > 8M

data-pbzip2.bz2:
Generated with pbzip2, contains multiple bzip2 streams
Contains 1 column, uncompressed data size < 8M

large_pbzip2.bz2:
Generated with pbzip2, contains multiple bzip2 streams
Contains 1 column, uncompressed data size > 8M
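
The difference between single-stream and multi-stream files can be reproduced with Python's standard bz2 module. This is only an illustration of the file shape: the sample data is made up, and concatenating two compressed buffers stands in for what pbzip2 produces.

```python
import bz2

part1 = bz2.compress(b"row1\nrow2\n")
part2 = bz2.compress(b"row3\n")
multi_stream = part1 + part2  # back-to-back bzip2 streams in one file

# A single decompressor object stops at the end of the first stream.
first = bz2.BZ2Decompressor()
head = first.decompress(multi_stream)
print(head, first.eof, len(first.unused_data))


def decompress_all(data):
    """Decompress every bzip2 stream in data, one decompressor per stream."""
    out = []
    while data:
        dec = bz2.BZ2Decompressor()
        out.append(dec.decompress(data))
        data = dec.unused_data
    return b"".join(out)


print(decompress_all(multi_stream))
```

A reader that handles only one stream would silently return just the rows of the first stream, which is the kind of bug these multi-stream files are meant to expose.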

out_of_range_timestamp.parquet:
Generated with a hacked version of Impala parquet writer.
Contains a single timestamp column with 4 values, 2 of which are out of range
and should be read as NULL by Impala:
   1399-12-31 00:00:00 (invalid - date too small)
   1400-01-01 00:00:00
   9999-12-31 00:00:00
  10000-01-01 00:00:00 (invalid - date too large)

table_with_header.csv:
Created with a text editor, contains a header line before the data rows.

table_with_header_2.csv:
Created with a text editor, contains two header lines before the data rows.

table_with_header.gz, table_with_header_2.gz:
Generated by gzip'ing table_with_header.csv and table_with_header_2.csv.
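
A comparable pair of files can be produced and checked with Python's gzip module. The CSV content below is invented for the example, and `read_rows` is a hypothetical helper that imitates the effect of skip.header.line.count; it is not how Impala reads the files.

```python
import gzip
import io

one_header = b"id,name\n1,a\n2,b\n"
two_headers = b"# exported table\nid,name\n1,a\n2,b\n"


def read_rows(gz_data, skip_header_lines):
    """Decompress gzip data and drop the given number of header lines."""
    with gzip.open(io.BytesIO(gz_data), "rt", encoding="utf-8") as f:
        lines = f.read().splitlines()
    return lines[skip_header_lines:]


print(read_rows(gzip.compress(one_header), 1))
print(read_rows(gzip.compress(two_headers), 2))
```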

deprecated_statistics.parquet:
Generated with hive shell, which uses parquet-mr version 1.5.0-cdh5.12.0-SNAPSHOT
Contains a copy of the data in functional.alltypessmall with statistics that use the old
'min'/'max' fields.

repeated_root_schema.parquet:
Generated by hacking Impala's Parquet writer.
Created to reproduce IMPALA-4826. Contains a table of 300 rows where the
repetition type of the root schema is set to REPEATED.
Reproduction steps:
1: Extend HdfsParquetTableWriter::CreateSchema with the following line:
   file_metadata_.schema[0].__set_repetition_type(FieldRepetitionType::REPEATED);
2: Run test_compute_stats and grab the created Parquet file for
   alltypes_parquet table.

binary_decimal_dictionary.parquet,
binary_decimal_no_dictionary.parquet:
Generated using parquet-mr and contents verified using parquet-tools-1.9.1.
Contains decimals stored as variable sized BYTE_ARRAY with both dictionary
and non-dictionary encoding respectively.

alltypes_agg_bitpacked_def_levels.parquet:
Generated by hacking Impala's Parquet writer to write out bitpacked def levels instead
of the standard RLE-encoded levels. See
https://github.com/timarmstrong/incubator-impala/tree/hack-bit-packed-levels. This
is a single file containing all of the alltypesagg data, which includes a mix of
null and non-null values. This is not actually a valid Parquet file because the
bit-packed levels are written in the reverse order specified in the Parquet spec
for BIT_PACKED. However, this is the order that Impala attempts to read the levels
in - see IMPALA-3006.
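
The two bit orders can be compared with a short sketch. According to the Parquet encoding specification, the deprecated BIT_PACKED encoding packs values starting from the most significant bit of each byte, while bit-packed runs of the RLE hybrid encoding start from the least significant bit. The function names are hypothetical, and the assumption that this file uses the LSB-first order is inferred from the description above, not verified against the file.

```python
def pack_lsb_first(values, width):
    """Pack values starting at the least significant bit of each byte."""
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * width)
    nbytes = (len(values) * width + 7) // 8
    return acc.to_bytes(nbytes, "little")


def pack_msb_first(values, width):
    """Pack values starting at the most significant bit of each byte."""
    bits = "".join(format(v, f"0{width}b") for v in values)
    bits = bits.ljust((len(bits) + 7) // 8 * 8, "0")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


levels = list(range(8))  # the values 0..7 with a bit width of 3
print(pack_lsb_first(levels, 3).hex())
print(pack_msb_first(levels, 3).hex())
```

The same eight values produce different bytes in the two orders, so a reader that assumes the wrong order decodes different levels from the same page.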

signed_integer_logical_types.parquet:
Generated using a utility that uses the Java Parquet API.
The file has the following schema:
  schema {
    optional int32 id;
    optional int32 tinyint_col (INT_8);
    optional int32 smallint_col (INT_16);
    optional int32 int_col;
    optional int64 bigint_col;
  }

min_max_is_nan.parquet:
Generated by Impala's Parquet writer before the fix for IMPALA-6527. Git hash: 3a049a53
Created to test the read path for a Parquet file with invalid metadata, namely when
'max_value' and 'min_value' are both NaN. Contains 2 single-column rows:
NaN
42
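
A short sketch of why NaN statistics are dangerous for min/max filtering: every ordered comparison with NaN is false, so a naive range check wrongly concludes that the value 42 cannot be in the file. Both function names are hypothetical, and the NaN-aware variant is only one possible way to stay safe; it is not a description of Impala's fix.

```python
import math


def can_skip_naive(min_v, max_v, x):
    """Naive check for the predicate col = x: skip unless min <= x <= max."""
    return not (min_v <= x <= max_v)


def can_skip_nan_aware(min_v, max_v, x):
    """Same check, but treat NaN statistics as unusable."""
    if math.isnan(min_v) or math.isnan(max_v):
        return False  # cannot trust the statistics, so the data must be read
    return not (min_v <= x <= max_v)


nan = float("nan")
print(can_skip_naive(nan, nan, 42.0))      # skips data that contains 42
print(can_skip_nan_aware(nan, nan, 42.0))  # reads the data instead
```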

bad_codec.parquet:
Generated by Impala's Parquet writer, hacked to use the invalid enum value 5000 for the
compression codec. The data in the file is the whole of the "alltypestiny" data set, with
the same columns: id int, bool_col boolean, tinyint_col tinyint, smallint_col smallint,
int_col int, bigint_col bigint, float_col float, double_col double,
date_string_col string, string_col string, timestamp_col timestamp, year int, month int

num_values_def_levels_mismatch.parquet:
A file with a single boolean column with page metadata reporting 2 values but only def
levels for a single literal value. Generated by hacking Impala's parquet writer to
increment page.header.data_page_header.num_values. This caused Impala to hit a DCHECK
(IMPALA-6589).

rle_encoded_bool.parquet:
Parquet v1 file with RLE encoded boolean column "b" and int column "i".
Created for IMPALA-6324, generated with modified parquet-mr. Contains 279 rows,
139 with value false, and 140 with value true. "i" is always 1 if "b" is true
and always 0 if "b" is false.

dict_encoding_with_large_bit_width.parquet:
Parquet file with a single TINYINT column "i" with 33 rows. Created by a modified
Impala to use 9 bit dictionary indices for encoding. Reading this file used to lead
to DCHECK errors (IMPALA-7147).

decimal_stored_as_int32.parquet:
Parquet file generated by Spark 2.3.1 that contains decimals stored as int32.
Impala needs to be able to read such values (IMPALA-5542)

decimal_stored_as_int64.parquet:
Parquet file generated by Spark 2.3.1 that contains decimals stored as int64.
Impala needs to be able to read such values (IMPALA-5542)

primitive_type_widening.parquet:
Parquet file that contains two rows with the following schema:
- int32 tinyint_col1
- int32 tinyint_col2
- int32 tinyint_col3
- int32 tinyint_col4
- int32 smallint_col1
- int32 smallint_col2
- int32 smallint_col3
- int32 int_col1
- int32 int_col2
- float float_col
It is used to test primitive type widening (IMPALA-6373).

corrupt_footer_len_decr.parquet:
Parquet file that contains one row of the following schema:
- bigint c
The footer size is manually modified (using hexedit) to be the original file size minus
1, which causes metadata deserialization in footer parsing to fail and triggers the
printing of an error message that used to report an incorrect file offset. Used to verify
the fix for IMPALA-6442.

corrupt_footer_len_incr.parquet:
Parquet file that contains one row of the following schema:
- bigint c
The footer size is manually modified (using hexedit) to be larger than the original file
size and cause footer parsing to fail. It's used to test an error message related to
IMPALA-6442.
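
For context, a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes "PAR1". The sketch below locates the footer from those last 8 bytes and rejects a length that cannot fit in the file. The function name `footer_range` and the error messages are hypothetical, and the fake file contains placeholder bytes instead of real Thrift metadata.

```python
import struct

MAGIC = b"PAR1"


def footer_range(file_bytes):
    """Return (start, end) offsets of the footer metadata or raise ValueError."""
    size = len(file_bytes)
    if size < 12 or file_bytes[-4:] != MAGIC:
        raise ValueError("not a Parquet file: bad or missing magic number")
    (footer_len,) = struct.unpack("<I", file_bytes[-8:-4])
    end = size - 8
    start = end - footer_len
    if start < len(MAGIC):
        raise ValueError(
            f"footer length {footer_len} does not fit in a {size}-byte file")
    return start, end


# Fake file: leading magic, 4 data bytes, 4 placeholder metadata bytes, tail.
fake = MAGIC + b"data" + b"META" + struct.pack("<I", 4) + MAGIC
start, end = footer_range(fake)
print(start, end, fake[start:end])
```

A bounds check of this kind only catches lengths that exceed the file. A wrong length that still fits has to be detected later, when the metadata itself is deserialized.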

hive_single_value_timestamp.parq:
Parquet file written by Hive with the following schema:
i int, timestamp d
Contains a single row. It is used to test IMPALA-7559 which only occurs when all values
in a column chunk are the same timestamp and the file is written with parquet-mr (which
is used by Hive).

out_of_range_time_of_day.parquet:
IMPALA-7595: Parquet file that contains timestamps where the time part is out of the
valid range [0..24H). Before the fix, select * returned these values:
1970-01-01 -00:00:00.000000001  (invalid - negative time of day)
1970-01-01 00:00:00
1970-01-01 23:59:59.999999999
1970-01-01 24:00:00 (invalid - time of day should be less than a whole day)
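
The validity rule for the time part can be written as a one-line range check. This sketch assumes the time of day is available as a count of nanoseconds since midnight; the constant and function names are hypothetical.

```python
NANOS_PER_DAY = 24 * 60 * 60 * 10**9


def is_valid_time_of_day(nanos):
    """True if nanos is inside the half-open range [0, 24h)."""
    return 0 <= nanos < NANOS_PER_DAY


# The four values listed above, as nanoseconds since midnight.
samples = [-1, 0, NANOS_PER_DAY - 1, NANOS_PER_DAY]
print([is_valid_time_of_day(n) for n in samples])
```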

strings_with_quotes.csv:
Various strings with quotes in them to reproduce bugs like IMPALA-7586.

int64_timestamps_plain.parquet:
Parquet file generated with Parquet-mr that contains plain encoded int64 columns with
Timestamp logical types. Has the following columns:
new_logical_milli_utc, new_logical_milli_local,
new_logical_micro_utc, new_logical_micro_local

int64_timestamps_dict.parquet:
Parquet file generated with Parquet-mr that contains dictionary encoded int64 columns
with Timestamp logical types. Has the following columns:
id,
new_logical_milli_utc, new_logical_milli_local,
new_logical_micro_utc, new_logical_micro_local

int64_timestamps_at_dst_changes.parquet:
Parquet file generated with Parquet-mr that contains plain encoded int64 columns with
Timestamp logical types. The file contains 3 row groups, and all row groups contain
3 distinct values, so there is a "min", a "max", and a "middle" value. The values were
selected in such a way that the UTC->CET conversion changes the order of the values (this
is possible during Summer->Winter DST change) and "middle" falls outside the "min".."max"
range after conversion. This means that a naive stat filtering implementation could drop
"middle" incorrectly.
Example (all dates are 2017-10-29):
UTC: 00:45:00, 01:00:00, 01:10:00 =>
CET: 02:45:00, 02:00:00, 02:10:00
Columns: rawvalue bigint, rowgroup int, millisutc timestamp, microsutc timestamp
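
The example above can be replayed in a few lines. To keep the sketch independent of a time zone database, the DST rule is hard-coded as an assumption: on 2017-10-29 the offset is UTC+2 before 01:00 UTC and UTC+1 from then on. The names `SWITCH` and `to_local` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the summer-to-winter switch happens at 01:00 UTC on 2017-10-29.
SWITCH = datetime(2017, 10, 29, 1, 0, tzinfo=timezone.utc)


def to_local(utc_dt):
    """Convert a UTC datetime to local wall-clock time around the switch."""
    offset = timedelta(hours=2) if utc_dt < SWITCH else timedelta(hours=1)
    return (utc_dt + offset).replace(tzinfo=None)


utc_values = [datetime(2017, 10, 29, h, m, tzinfo=timezone.utc)
              for h, m in [(0, 45), (1, 0), (1, 10)]]
local_values = [to_local(v) for v in utc_values]
local_min, local_middle, local_max = local_values

print([v.strftime("%H:%M:%S") for v in local_values])
# The converted "middle" is smaller than both converted statistics.
print(local_middle < min(local_min, local_max))
```

A filter that converts only the stored min and max and then tests a predicate against that range would never consider the converted "middle" value, which is how a row could be dropped incorrectly.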

int64_timestamps_nano.parquet:
Parquet file generated with Parquet-mr that contains int64 columns with nanosecond
precision. Tested separately from the micro/millisecond columns because of the different
valid range.
Columns: rawvalue bigint, nanoutc timestamp, nanononutc timestamp
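
The narrower valid range follows from the arithmetic: a signed 64-bit count of nanoseconds since the Unix epoch covers only about 292 years in each direction. The sketch below computes the bounds. The helper name is hypothetical, and the result is floored to microseconds because Python's datetime has no nanosecond precision.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def nanos_to_datetime(nanos):
    """Convert nanoseconds since the epoch, floored to whole microseconds."""
    return EPOCH + timedelta(microseconds=nanos // 1000)


print(nanos_to_datetime(INT64_MIN))
print(nanos_to_datetime(INT64_MAX))
```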

out_of_range_timestamp_hive_211.parquet:
Hive-generated file with an out-of-range timestamp. Generated with Hive 2.1.1 using
the following query:
create table alltypes_hive stored as parquet as
select * from functional.alltypes
union all
select -1, false, 0, 0, 0, 0, 0, 0, '', '', cast('1399-01-01 00:00:00' as timestamp), 0, 0

out_of_range_timestamp2_hive_211.parquet:
Hive-generated file with out-of-range timestamps every second value, to exercise code
paths in Parquet scanner for non-repeated runs. Generated with Hive 2.1.1 using
the following query:
create table hive_invalid_timestamps stored as parquet as
select id,
  case id % 3
    when 0 then timestamp_col
    when 1 then NULL
    when 2 then cast('1300-01-01 9:9:9' as timestamp)
  end timestamp_col
from functional.alltypes
sort by id

decimal_rtf_tbl.txt:
This was generated using formulas in Google Sheets. The goal was to create various
decimal values that cover the 3 storage formats with various precisions and scales.
This is a reasonably large table that is used for testing min-max filters
with decimal types on Kudu.
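
The sketch below assumes that "the 3 storage formats" refers to the 4-byte, 8-byte, and 16-byte decimal representations selected by precision (up to 9, up to 18, and up to 38 digits). That reading is an assumption, not something this README states, and the function name is hypothetical.

```python
def decimal_storage_bytes(precision):
    """Smallest of the 4, 8, and 16 byte integers that holds the precision."""
    if not 1 <= precision <= 38:
        raise ValueError("precision must be between 1 and 38")
    if precision <= 9:
        return 4
    if precision <= 18:
        return 8
    return 16


for p in (1, 9, 10, 18, 19, 38):
    print(p, decimal_storage_bytes(p))
```

The thresholds come from the largest unscaled value of each precision: 10^9 - 1 fits in a signed 32-bit integer, 10^18 - 1 fits in 64 bits, and 10^38 - 1 fits in 128 bits.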

decimal_rtf_tiny_tbl.txt:
Small table with specific decimal values picked from decimal_rtf_tbl.txt so that
min-max filter based pruning can be tested with decimal types on Kudu.

date_tbl.avro:
Small table with one DATE column, created by Hive.

date_tbl.parquet:
Small table with one DATE column, created by Parquet MR.

out_of_range_date.parquet:
Generated with a hacked version of Impala parquet writer.
Contains a single DATE column with 9 values, 2 of which are out of range
and should be read as NULL by Impala:
  -0001-12-31 (invalid - date too small)
   0000-01-01
   0000-01-02
   1969-12-31
   1970-01-01
   1970-01-02
   9999-12-30
   9999-12-31
  10000-01-01 (invalid - date too large)

hive2_pre_gregorian.parquet:
Small table with one DATE column, created by Hive 2.1.1.
Used to demonstrate Parquet interoperability issues between Hive and Impala for dates
before the introduction of the Gregorian calendar on 1582-10-15.
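
The size of the mismatch can be illustrated by converting a Julian calendar date to a day number and reading that day number back in the proleptic Gregorian calendar. The sketch assumes the writer interprets pre-1582 dates in the Julian calendar while the reader uses the proleptic Gregorian calendar. That assumption is the usual explanation for this kind of issue but is not stated in this README, and the function names are hypothetical.

```python
from datetime import date


def julian_calendar_to_jdn(year, month, day):
    """Julian Day Number of a date given in the Julian calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def jdn_to_proleptic_gregorian(jdn):
    """Proleptic Gregorian date for a Julian Day Number."""
    # date.toordinal() counts from 0001-01-01 = 1, which is JDN 1721426.
    return date.fromordinal(jdn - 1721425)


# The day before the switch: Julian 1582-10-04 was followed by Gregorian 1582-10-15.
print(jdn_to_proleptic_gregorian(julian_calendar_to_jdn(1582, 10, 5)))
# Earlier dates are shifted by a different number of days.
print(jdn_to_proleptic_gregorian(julian_calendar_to_jdn(1500, 1, 1)))
```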

decimals_1_10.parquet:
Contains two decimal columns, one with precision 1, the other with precision 10.
I used Hive 2.1.1 with a modified version of Parquet-MR (6901a20) to create tiny,
misaligned pages in order to test the value-skipping logic in the Parquet column readers.
The modification in Parquet-MR was to set MIN_SLAB_SIZE to 1. You can find the change
here: https://github.com/boroknagyz/parquet-mr/tree/tiny_pages
hive  --hiveconf parquet.page.row.count.limit=5 --hiveconf parquet.page.size=5
 --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=1
create table decimals_1_10 (d_1 DECIMAL(1, 0), d_10 DECIMAL(10, 0)) stored as PARQUET
insert into decimals_1_10 values (1, 1), (2, 2), (3, 3), (4, 4), (5, 5),
                            (NULL, 1), (2, 2), (3, 3), (4, 4), (5, 5),
                            (1, 1), (NULL, 2), (3, 3), (4, 4), (5, 5),
                            (1, 1), (2, 2), (NULL, 3), (4, 4), (5, 5),
                            (1, 1), (2, 2), (3, 3), (NULL, 4), (5, 5),
                            (1, 1), (2, 2), (3, 3), (4, 4), (NULL, 5),
                            (NULL, 1), (NULL, 2), (3, 3), (4, 4), (5, 5),
                            (1, 1), (NULL, 2), (3, 3), (NULL, 4), (5, 5),
                            (1, 1), (2, 2), (3, 3), (NULL, 4), (NULL, 5),
                            (NULL, 1), (2, 2), (NULL, 3), (NULL, 4), (5, 5),
                            (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL);

nested_decimals.parquet:
Contains two columns, one is a decimal column, the other is an array of decimals.
I used Hive 2.1.1 with a modified Parquet-MR, see description at decimals_1_10.parquet.
hive  --hiveconf parquet.page.row.count.limit=5 --hiveconf parquet.page.size=16
 --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=1
create table nested_decimals (d_38 Decimal(38, 0), arr array<Decimal(1, 0)>) stored as parquet;
insert into nested_decimals select 1, array(cast (1 as decimal(1,0)), cast (1 as decimal(1,0)), cast (1 as decimal(1,0)) ) union all
                            select 2, array(cast (2 as decimal(1,0)), cast (2 as decimal(1,0)), cast (2 as decimal(1,0)) ) union all
                            select 3, array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0)), cast (3 as decimal(1,0)) ) union all
                            select 4, array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)) ) union all
                            select 5, array(cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)) ) union all

                            select 1, array(cast (1 as decimal(1,0)) ) union all
                            select 2, array(cast (2 as decimal(1,0)), cast (2 as decimal(1,0)) ) union all
                            select 3, array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0)), cast (3 as decimal(1,0)) ) union all
                            select 4, array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)) ) union all
                            select 5, array(cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)) ) union all

                            select 1, array(cast (NULL as decimal(1, 0)), NULL, NULL) union all
                            select 2, array(cast (2 as decimal(1,0)), NULL, NULL) union all
                            select 3, array(cast (3 as decimal(1,0)), NULL, cast (3 as decimal(1,0))) union all
                            select 4, array(NULL, cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), NULL) union all
                            select 5, array(NULL, cast (5 as decimal(1,0)), NULL, NULL, cast (5 as decimal(1,0)) ) union all

                            select 6, array(cast (6 as decimal(1,0)), NULL, cast (6 as decimal(1,0)) ) union all
                            select 7, array(cast (7 as decimal(1,0)), cast (7 as decimal(1,0)), cast (7 as decimal(1,0)), NULL ) union all
                            select 8, array(NULL, NULL, cast (8 as decimal(1,0)) ) union all
                            select 7, array(cast (7 as decimal(1,0)), cast (7 as decimal(1,0)), cast (7 as decimal(1,0)) ) union all
                            select 6, array(NULL, NULL, NULL, cast (6 as decimal(1,0)) );

double_nested_decimals.parquet:
Contains two columns, one is a decimal column, the other is an array of arrays of
decimals. I used Hive 2.1.1 with a modified Parquet-MR, see description
at decimals_1_10.parquet.
hive  --hiveconf parquet.page.row.count.limit=5 --hiveconf parquet.page.size=16
  --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=1
create table double_nested_decimals (d_38 Decimal(38, 0), arr array<array<Decimal(1, 0)>>) stored as parquet;
insert into double_nested_decimals select 1, array(array(cast (1 as decimal(1,0)), cast (1 as decimal(1,0)) )) union all
                                   select 2, array(array(cast (2 as decimal(1,0)), cast (2 as decimal(1,0)) )) union all
                                   select 3, array(array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0)), cast (3 as decimal(1,0)) )) union all
                                   select 4, array(array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)) )) union all
                                   select 5, array(array(cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)) )) union all

                                   select 1, array(array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))) ) union all
                                   select 2, array(array(cast (2 as decimal(1,0))), array(cast (2 as decimal(1,0))) ) union all
                                   select 3, array(array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))) ) union all
                                   select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all
                                   select 5, array(array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))) ) union all

                                   select 1, array(array(cast (1 as decimal(1,0))) ) union all
                                   select 2, array(array(cast (2 as decimal(1,0))), array(cast (2 as decimal(1,0))) ) union all
                                   select 3, array(array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))) ) union all
                                   select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all
                                   select 5, array(array(cast (5 as decimal(1,0))) ) union all

                                   select 1, array(array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))) ) union all
                                   select 2, array(array(cast (2 as decimal(1,0))), array(cast (2 as decimal(1,0))) ) union all
                                   select 3, array(array(cast (3 as decimal(1,0))) ) union all
                                   select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all
                                   select 5, array(array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))) ) union all

                                   select 1, array(array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0)), cast (1 as decimal(1,0))) ) union all
                                   select 2, array(array(cast (2 as decimal(1,0))) ) union all
                                   select 3, array(array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))) ) union all
                                   select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all
                                   select 5, array(array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))) ) union all

                                   select 1, array(array(cast (NULL as decimal(1,0))), array(cast (NULL as decimal(1,0))), array(cast (1 as decimal(1,0))) ) union all
                                   select 2, array(array(cast (NULL as decimal(1,0))), array(cast (NULL as decimal(1,0))), array(cast (NULL as decimal(1,0))) ) union all
                                   select 3, array(array(cast (NULL as decimal(1,0))), array(cast (3 as decimal(1,0))), NULL ) union all
                                   select 4, array(NULL, NULL, array(cast (NULL as decimal(1,0)), NULL, NULL, NULL, NULL) ) union all
                                   select 5, array(array(NULL, cast (5 as decimal(1,0)), NULL, NULL, NULL) ) union all

                                   select 6, array(array(cast (6 as decimal(1,0)), NULL), array(cast (6 as decimal(1,0))) ) union all
                                   select 7, array(array(cast (7 as decimal(1,0)), cast (7 as decimal(1,0))), NULL ) union all
                                   select 8, array(array(NULL, NULL, cast (8 as decimal(1,0))) ) union all
                                   select 7, array(array(cast (7 as decimal(1,0)), cast (NULL as decimal(1,0))), array(cast (7 as decimal(1,0))) ) union all
                                   select 6, array(array(NULL, NULL, cast (6 as decimal(1,0))), array(NULL, cast (6 as decimal(1,0))) );

alltypes_tiny_pages.parquet:
Created from 'functional.alltypes' with small page sizes.
I used Hive 2.1.1 with a modified Parquet-MR, see description at decimals_1_10.parquet.
I used the following commands to create the file:
hive  --hiveconf parquet.page.row.count.limit=90 --hiveconf parquet.page.size=90 --hiveconf parquet.page.size.row.check.min=7
create table alltypes_tiny_pages stored as parquet as select * from functional_parquet.alltypes

alltypes_tiny_pages_plain.parquet:
Created from 'functional.alltypes' with small page sizes without dictionary encoding.
I used Hive 2.1.1 with a modified Parquet-MR, see description at decimals_1_10.parquet.
I used the following commands to create the file:
hive  --hiveconf parquet.page.row.count.limit=90 --hiveconf parquet.page.size=90 --hiveconf parquet.enable.dictionary=false  --hiveconf parquet.page.size.row.check.min=7
create table alltypes_tiny_pages_plain stored as parquet as select * from functional_parquet.alltypes