bad_parquet_data.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
"parquet"
"is"
"fun"

bad_compressed_dict_page_size.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single string column 'col' with one row ("a"). The compressed_page_size field
in the dict page header is modified to 0 to test that it is handled correctly.

bad_rle_literal_count.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single bigint column 'c' with the values 1, 3, 7 stored
in a single data chunk as dictionary plain. The RLE encoded dictionary
indexes are all literals (and not repeated), but the literal count
is incorrectly 0 in the file to test that such data corruption is
properly handled.

bad_rle_repeat_count.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single bigint column 'c' with the value 7 repeated 7 times
stored in a single data chunk as dictionary plain. The RLE encoded dictionary
indexes are a single repeated run (and not literals), but the repeat count
is incorrectly 0 in the file to test that such data corruption is properly
handled.

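For context on what the two bad_rle_* files above corrupt: in Parquet's RLE/bit-packed
hybrid encoding each run starts with a varint header (a sketch of the spec, not a dump
of these files):
  header & 1 == 1  ->  bit-packed run holding (header >> 1) * 8 literal values
  header & 1 == 0  ->  RLE run repeating a single value (header >> 1) times
Forcing the literal or repeat count to 0 therefore makes the run claim to hold no values.
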
zero_rows_zero_row_groups.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates zero rows and no row groups.

zero_rows_one_row_group.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates zero rows but one row group.

huge_num_rows.parquet
Generated by hacking Impala's Parquet writer.
The file metadata indicates 2 * MAX_INT32 rows.
The single row group also has the same number of rows in the metadata.

repeated_values.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
"parquet"
"parquet"
"parquet"

multiple_rowgroups.parquet:
Generated with parquet-mr 1.2.5
Populated with:
hive> set parquet.block.size=500;
hive> INSERT INTO TABLE tbl
      SELECT l_comment FROM tpch.lineitem LIMIT 1000;

alltypesagg_hive_13_1.parquet:
Generated with parquet-mr version 1.5.0-cdh5.4.0-SNAPSHOT
hive> create table alltypesagg_hive_13_1 stored as parquet as select * from alltypesagg;

bad_column_metadata.parquet:
Generated with a hacked version of parquet-mr 1.8.2-SNAPSHOT
Schema:
{"type": "record",
 "namespace": "org.apache.impala",
 "name": "bad_column_metadata",
 "fields": [
   {"name": "id", "type": ["null", "long"]},
   {"name": "int_array", "type": ["null", {"type": "array", "items": ["null", "int"]}]}
 ]
}
Contains 3 row groups, each with ten rows and each array containing ten elements. The
column metadata for 'int_array' in the first row group incorrectly states there are 50
values (instead of 100), and the column metadata for 'id' in the second row group
incorrectly states there are 11 values (instead of 10). The third row group has the
correct metadata.

data-bzip2.bz2:
Generated with bzip2, contains a single bzip2 stream
Contains 1 column, uncompressed data size < 8M

large_bzip2.bz2:
Generated with bzip2, contains a single bzip2 stream
Contains 1 column, uncompressed data size > 8M

data-pbzip2.bz2:
Generated with pbzip2, contains multiple bzip2 streams
Contains 1 column, uncompressed data size < 8M

large_pbzip2.bz2:
Generated with pbzip2, contains multiple bzip2 streams
Contains 1 column, uncompressed data size > 8M

out_of_range_timestamp.parquet:
Generated with a hacked version of Impala's Parquet writer.
Contains a single timestamp column with 4 values, 2 of which are out of range
and should be read as NULL by Impala:
1399-12-31 00:00:00 (invalid - date too small)
1400-01-01 00:00:00
9999-12-31 00:00:00
10000-01-01 00:00:00 (invalid - date too large)

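A quick sanity check for the file above (table and column names are assumed here, not
taken from the test setup):
  select count(*) from out_of_range_timestamp where ts is null;
  -- expected: 2, i.e. the two invalid values are read back as NULL
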
table_with_header.csv:
Created with a text editor, contains a header line before the data rows.

table_with_header_2.csv:
Created with a text editor, contains two header lines before the data rows.

table_with_header.gz, table_with_header_2.gz:
Generated by gzip'ing table_with_header.csv and table_with_header_2.csv.

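These header files are meant to be read through a table that skips the leading lines
via the 'skip.header.line.count' property Impala supports for text tables; an
illustrative declaration (column list and delimiter are assumptions, not the actual
test setup):
  create external table table_with_header (c1 int, c2 string)
  row format delimited fields terminated by ','
  stored as textfile
  tblproperties ('skip.header.line.count'='1');
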
deprecated_statistics.parquet:
Generated with the Hive shell, which uses parquet-mr version 1.5.0-cdh5.12.0-SNAPSHOT
Contains a copy of the data in functional.alltypessmall with statistics that use the old
'min'/'max' fields.

repeated_root_schema.parquet:
Generated by hacking Impala's Parquet writer.
Created to reproduce IMPALA-4826. Contains a table of 300 rows where the
repetition level of the root schema is set to REPEATED.
Reproduction steps:
1: Extend HdfsParquetTableWriter::CreateSchema with the following line:
   file_metadata_.schema[0].__set_repetition_type(FieldRepetitionType::REPEATED);
2: Run test_compute_stats and grab the created Parquet file for the
   alltypes_parquet table.

binary_decimal_dictionary.parquet,
binary_decimal_no_dictionary.parquet:
Generated using parquet-mr and contents verified using parquet-tools-1.9.1.
Contain decimals stored as variable-sized BYTE_ARRAY, with dictionary and
non-dictionary encoding respectively.

alltypes_agg_bitpacked_def_levels.parquet:
Generated by hacking Impala's Parquet writer to write out bitpacked def levels instead
of the standard RLE-encoded levels. See
https://github.com/timarmstrong/incubator-impala/tree/hack-bit-packed-levels. This
is a single file containing all of the alltypesagg data, which includes a mix of
null and non-null values. This is not actually a valid Parquet file because the
bit-packed levels are written in the reverse of the order specified in the Parquet spec
for BIT_PACKED. However, this is the order in which Impala attempts to read the levels
- see IMPALA-3006.

signed_integer_logical_types.parquet:
Generated using a utility that uses the java Parquet API.
The file has the following schema:
schema {
  optional int32 id;
  optional int32 tinyint_col (INT_8);
  optional int32 smallint_col (INT_16);
  optional int32 int_col;
  optional int64 bigint_col;
}

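When the file above is read by Impala, the annotated columns are expected to map to the
narrower SQL types (an illustrative mapping, not a dump of the test table definition):
  id INT, tinyint_col TINYINT, smallint_col SMALLINT, int_col INT, bigint_col BIGINT
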
min_max_is_nan.parquet:
Generated by Impala's Parquet writer before the fix for IMPALA-6527. Git hash: 3a049a53
Created to test the read path for a Parquet file with invalid metadata, namely when
'max_value' and 'min_value' are both NaN. Contains 2 single-column rows:
NaN
42

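The point of the file above is that the NaN statistics must not be used for row-group
pruning; a check along these lines (table and column names assumed) should still find
the non-NaN row:
  select count(*) from min_max_is_nan where val > 0;
  -- expected: 1 (the row containing 42)
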
bad_codec.parquet:
Generated by Impala's Parquet writer, hacked to use the invalid enum value 5000 for the
compression codec. The data in the file is the whole of the "alltypestiny" data set, with
the same columns: id int, bool_col boolean, tinyint_col tinyint, smallint_col smallint,
int_col int, bigint_col bigint, float_col float, double_col double,
date_string_col string, string_col string, timestamp_col timestamp, year int, month int

num_values_def_levels_mismatch.parquet:
A file with a single boolean column with page metadata reporting 2 values but only def
levels for a single literal value. Generated by hacking Impala's Parquet writer to
increment page.header.data_page_header.num_values. This caused Impala to hit a DCHECK
(IMPALA-6589).

rle_encoded_bool.parquet:
Parquet v1 file with RLE encoded boolean column "b" and int column "i".
Created for IMPALA-6324, generated with modified parquet-mr. Contains 279 rows,
139 with value false, and 140 with value true. "i" is always 1 if "b" is true
and always 0 if "b" is false.

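The stated relationship between the two columns can be spot-checked with a query along
these lines (table name assumed):
  select b, i, count(*) from rle_encoded_bool group by b, i;
  -- expected: two groups only, (false, 0) with 139 rows and (true, 1) with 140 rows
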
dict_encoding_with_large_bit_width.parquet:
Parquet file with a single TINYINT column "i" with 33 rows. Created by a modified
Impala to use 9 bit dictionary indices for encoding. Reading this file used to lead
to DCHECK errors (IMPALA-7147).

decimal_stored_as_int32.parquet:
Parquet file generated by Spark 2.3.1 that contains decimals stored as int32.
Impala needs to be able to read such values (IMPALA-5542).

decimal_stored_as_int64.parquet:
Parquet file generated by Spark 2.3.1 that contains decimals stored as int64.
Impala needs to be able to read such values (IMPALA-5542).

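For reference, the Parquet format allows the DECIMAL annotation on narrower physical
types when the precision fits, which is what the two Spark-written files above exercise
(the schema lines below are a sketch of the spec, not dumps of these files):
  optional int32 d (DECIMAL(9,2));     precision <= 9 can be stored as int32
  optional int64 d (DECIMAL(18,2));    precision <= 18 can be stored as int64
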
primitive_type_widening.parquet:
Parquet file that contains two rows with the following schema:
- int32 tinyint_col1
- int32 tinyint_col2
- int32 tinyint_col3
- int32 tinyint_col4
- int32 smallint_col1
- int32 smallint_col2
- int32 smallint_col3
- int32 int_col1
- int32 int_col2
- float float_col
It is used to test primitive type widening (IMPALA-6373).

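Widening here means the table columns are declared wider than the physical types listed
above; an illustrative declaration (not the actual test schema) that would exercise it
maps every int32 column to BIGINT and the float column to DOUBLE:
  create external table widening_demo (
    tinyint_col1 bigint, tinyint_col2 bigint, tinyint_col3 bigint, tinyint_col4 bigint,
    smallint_col1 bigint, smallint_col2 bigint, smallint_col3 bigint,
    int_col1 bigint, int_col2 bigint, float_col double)
  stored as parquet;
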
corrupt_footer_len_decr.parquet:
Parquet file that contains one row of the following schema:
- bigint c
The footer size is manually modified (using hexedit) to be the original file size minus
1, so that metadata deserialization during footer parsing fails. The resulting error
message used to report an incorrect file offset; this file verifies the fix made in
IMPALA-6442.

corrupt_footer_len_incr.parquet:
Parquet file that contains one row of the following schema:
- bigint c
The footer size is manually modified (using hexedit) to be larger than the original file
size and cause footer parsing to fail. It's used to test an error message related to
IMPALA-6442.

hive_single_value_timestamp.parq:
Parquet file written by Hive with the following schema:
i int, d timestamp
Contains a single row. It is used to test IMPALA-7559 which only occurs when all values
in a column chunk are the same timestamp and the file is written with parquet-mr (which
is used by Hive).

out_of_range_time_of_day.parquet:
IMPALA-7595: Parquet file that contains timestamps where the time part is out of the
valid range [0..24H). Before the fix, select * returned these values:
1970-01-01 -00:00:00.000000001 (invalid - negative time of day)
1970-01-01 00:00:00
1970-01-01 23:59:59.999999999
1970-01-01 24:00:00 (invalid - time of day should be less than a whole day)

strings_with_quotes.csv:
Various strings with quotes in them to reproduce bugs like IMPALA-7586.

int64_timestamps_plain.parq:
Parquet file generated with Parquet-mr that contains plain encoded int64 columns with
Timestamp logical types. Has the following columns:
new_logical_milli_utc, new_logical_milli_local,
new_logical_micro_utc, new_logical_micro_local

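The column names mirror the logical-type annotations; schematically they are declared
along these lines (a sketch, not a metadata dump of this file):
  optional int64 new_logical_milli_utc (TIMESTAMP(MILLIS,true));
  optional int64 new_logical_micro_local (TIMESTAMP(MICROS,false));
where the boolean is the isAdjustedToUTC flag of the TIMESTAMP logical type.
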
int64_timestamps_dict.parq:
Parquet file generated with Parquet-mr that contains dictionary encoded int64 columns
with Timestamp logical types. Has the following columns:
id,
new_logical_milli_utc, new_logical_milli_local,
new_logical_micro_utc, new_logical_micro_local

int64_timestamps_at_dst_changes.parquet:
Parquet file generated with Parquet-mr that contains plain encoded int64 columns with
Timestamp logical types. The file contains 3 row groups, and all row groups contain
3 distinct values, so there is a "min", a "max", and a "middle" value. The values were
selected in such a way that the UTC->CET conversion changes the order of the values (this
is possible during Summer->Winter DST change) and "middle" falls outside the "min".."max"
range after conversion. This means that a naive stat filtering implementation could drop
"middle" incorrectly.
Example (all dates are 2017-10-29):
UTC: 00:45:00, 01:00:00, 01:10:00 =>
CET: 02:45:00, 02:00:00, 02:10:00
Columns: rawvalue bigint, rowgroup int, millisutc timestamp, microsutc timestamp

int64_timestamps_nano.parquet:
Parquet file generated with Parquet-mr that contains int64 columns with nanosecond
precision. Tested separately from the micro/millisecond columns because of the different
valid range.
Columns: rawvalue bigint, nanoutc timestamp, nanononutc timestamp

out_of_range_timestamp_hive_211.parquet
Hive-generated file with an out-of-range timestamp. Generated with Hive 2.1.1 using
the following query:
create table alltypes_hive stored as parquet as
select * from functional.alltypes
union all
select -1, false, 0, 0, 0, 0, 0, 0, '', '', cast('1399-01-01 00:00:00' as timestamp), 0, 0

out_of_range_timestamp2_hive_211.parquet
Hive-generated file where out-of-range timestamps are interleaved with valid and NULL
values, to exercise code paths in the Parquet scanner for non-repeated runs. Generated
with Hive 2.1.1 using the following query:
create table hive_invalid_timestamps stored as parquet as
select id,
  case id % 3
    when 0 then timestamp_col
    when 1 then NULL
    when 2 then cast('1300-01-01 9:9:9' as timestamp)
  end timestamp_col
from functional.alltypes
sort by id

decimal_rtf_tbl.txt
This was generated using formulas in Google Sheets. The goal was to create various
decimal values that cover the 3 storage formats with various precision and scale.
This is a reasonably large table that is used for testing min-max filters
with decimal types on Kudu.

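The "3 storage formats" are the byte widths used for DECIMAL values, which are
determined by the column's precision:
  precision  1..9  -> 4-byte storage
  precision 10..18 -> 8-byte storage
  precision 19..38 -> 16-byte storage
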
decimal_rtf_tiny_tbl.txt
Small table with specific decimal values picked from decimal_rtf_tbl.txt so that
min-max filter based pruning can be tested with decimal types on Kudu.