bad_parquet_data.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
  "parquet"
  "is"
  "fun"

bad_compressed_dict_page_size.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single string column 'col' with one row ("a"). The compressed_page_size field
in the dict page header is modified to 0 to test if it is correctly handled.

bad_rle_literal_count.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single bigint column 'c' with the values 1, 3, 7 stored
in a single data chunk as dictionary plain. The RLE encoded dictionary
indexes are all literals (and not repeated), but the literal count
is incorrectly 0 in the file to test that such data corruption is
properly handled.

bad_rle_repeat_count.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single bigint column 'c' with the value 7 repeated 7 times
stored in a single data chunk as dictionary plain. The RLE encoded dictionary
indexes are a single repeated run (and not literals), but the repeat count
is incorrectly 0 in the file to test that such data corruption is properly
handled.
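
What "literal count" and "repeat count" mean here: in Parquet's RLE/bit-packing hybrid
encoding every run starts with a ULEB128 varint header whose low bit selects the run type
and whose remaining bits carry the count. The sketch below (Python, written against the
Parquet format description rather than Impala's scanner code) shows how a header of 0
produces the bogus zero counts that the two files above exercise:

  def decode_run_header(header):
      # 'header' is the ULEB128-decoded varint that starts a run.
      if header & 1:
          # Bit-packed ("literal") run: the count is the number of 8-value groups.
          return "bit-packed", (header >> 1) * 8
      # RLE run: the count is how many times the following value repeats.
      return "rle", header >> 1

  # bad_rle_literal_count.parquet carries a bit-packed run with a group count of 0,
  # bad_rle_repeat_count.parquet carries an RLE run with a repeat count of 0.
  print(decode_run_header(0b1))   # ('bit-packed', 0)
  print(decode_run_header(0b0))   # ('rle', 0)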

zero_rows_zero_row_groups.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates zero rows and no row groups.

zero_rows_one_row_group.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates zero rows but one row group.

huge_num_rows.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates 2 * MAX_INT32 rows.
The single row group also has the same number of rows in the metadata.

repeated_values.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
  "parquet"
  "parquet"
  "parquet"

multiple_rowgroups.parquet:
Generated with parquet-mr 1.2.5
Populated with:
  hive> set parquet.block.size=500;
  hive> INSERT INTO TABLE tbl
        SELECT l_comment FROM tpch.lineitem LIMIT 1000;

alltypesagg_hive_13_1.parquet:
Generated with parquet-mr version 1.5.0-cdh5.4.0-SNAPSHOT
  hive> create table alltypesagg_hive_13_1 stored as parquet as select * from alltypesagg;

bad_column_metadata.parquet:
Generated with a hacked version of parquet-mr 1.8.2-SNAPSHOT
Schema:
  {"type": "record",
   "namespace": "org.apache.impala",
   "name": "bad_column_metadata",
   "fields": [
     {"name": "id", "type": ["null", "long"]},
     {"name": "int_array", "type": ["null", {"type": "array", "items": ["null", "int"]}]}
   ]
  }
Contains 3 row groups, each with ten rows and each array containing ten elements. The
first rowgroup column metadata for 'int_array' incorrectly states there are 50 values
(instead of 100), and the second rowgroup column metadata for 'id' incorrectly states
there are 11 values (instead of 10). The third rowgroup has the correct metadata.

data-bzip2.bz2:
Generated with bzip2, contains a single bzip2 stream.
Contains 1 column, uncompressed data size < 8M

large_bzip2.bz2:
Generated with bzip2, contains a single bzip2 stream.
Contains 1 column, uncompressed data size > 8M

data-pbzip2.bz2:
Generated with pbzip2, contains multiple bzip2 streams.
Contains 1 column, uncompressed data size < 8M

large_pbzip2.bz2:
Generated with pbzip2, contains multiple bzip2 streams.
Contains 1 column, uncompressed data size > 8M

out_of_range_timestamp.parquet:
Generated with a hacked version of Impala's Parquet writer.
Contains a single timestamp column with 4 values, 2 of which are out of range
and should be read as NULL by Impala:
  1399-12-31 00:00:00 (invalid - date too small)
  1400-01-01 00:00:00
  9999-12-31 00:00:00
  10000-01-01 00:00:00 (invalid - date too large)

table_with_header.csv:
Created with a text editor, contains a header line before the data rows.

table_with_header_2.csv:
Created with a text editor, contains two header lines before the data rows.

table_with_header.gz, table_with_header_2.gz:
Generated by gzip'ing table_with_header.csv and table_with_header_2.csv.

deprecated_statistics.parquet:
Generated with the Hive shell, which uses parquet-mr version 1.5.0-cdh5.12.0-SNAPSHOT
Contains a copy of the data in functional.alltypessmall with statistics that use the old
'min'/'max' fields.

repeated_root_schema.parquet:
Generated by hacking Impala's Parquet writer.
Created to reproduce IMPALA-4826. Contains a table of 300 rows where the
repetition level of the root schema is set to REPEATED.
Reproduction steps:
  1: Extend HdfsParquetTableWriter::CreateSchema with the following line:
     file_metadata_.schema[0].__set_repetition_type(FieldRepetitionType::REQUIRED);
  2: Run test_compute_stats and grab the created Parquet file for the
     alltypes_parquet table.

binary_decimal_dictionary.parquet,
binary_decimal_no_dictionary.parquet:
Generated using parquet-mr and contents verified using parquet-tools-1.9.1.
Contain decimals stored as variable sized BYTE_ARRAY with dictionary
and non-dictionary encoding respectively.

alltypes_agg_bitpacked_def_levels.parquet:
Generated by hacking Impala's Parquet writer to write out bitpacked def levels instead
of the standard RLE-encoded levels. See
https://github.com/timarmstrong/incubator-impala/tree/hack-bit-packed-levels. This
is a single file containing all of the alltypesagg data, which includes a mix of
null and non-null values. This is not actually a valid Parquet file because the
bit-packed levels are written in the reverse of the order specified in the Parquet spec
for BIT_PACKED. However, this is the order that Impala attempts to read the levels
in - see IMPALA-3006.

signed_integer_logical_types.parquet:
Generated using a utility that uses the Java Parquet API.
The file has the following schema:
  schema {
    optional int32 id;
    optional int32 tinyint_col (INT_8);
    optional int32 smallint_col (INT_16);
    optional int32 int_col;
    optional int64 bigint_col;
  }

min_max_is_nan.parquet:
Generated by Impala's Parquet writer before the fix for IMPALA-6527. Git hash: 3a049a53
Created to test the read path for a Parquet file with invalid metadata, namely when
'max_value' and 'min_value' are both NaN. Contains 2 single-column rows:
  NaN
  42
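
Why NaN statistics matter: every comparison with NaN is false, so a naive min/max
row-group pruning check would conclude that no value can fall inside [NaN, NaN] and
incorrectly skip the row group that contains 42. A minimal illustration (Python; the
simple overlap check below is only an assumption about how such pruning might be written,
not Impala's code):

  nan = float("nan")

  def might_contain(stats_min, stats_max, value):
      # Naive pruning: keep the row group only if 'value' could lie inside the stats range.
      return stats_min <= value <= stats_max

  # Both comparisons against NaN are false, so the row group holding 42 would be skipped.
  print(might_contain(nan, nan, 42))   # False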

bad_codec.parquet:
Generated by Impala's Parquet writer, hacked to use the invalid enum value 5000 for the
compression codec. The data in the file is the whole of the "alltypestiny" data set, with
the same columns: id int, bool_col boolean, tinyint_col tinyint, smallint_col smallint,
int_col int, bigint_col bigint, float_col float, double_col double,
date_string_col string, string_col string, timestamp_col timestamp, year int, month int

num_values_def_levels_mismatch.parquet:
A file with a single boolean column with page metadata reporting 2 values but only def
levels for a single literal value. Generated by hacking Impala's Parquet writer to
increment page.header.data_page_header.num_values. This caused Impala to hit a DCHECK
(IMPALA-6589).

rle_encoded_bool.parquet:
Parquet v1 file with RLE encoded boolean column "b" and int column "i".
Created for IMPALA-6324, generated with modified parquet-mr. Contains 279 rows,
139 with value false, and 140 with value true. "i" is always 1 if "b" is true
and always 0 if "b" is false.

dict_encoding_with_large_bit_width.parquet:
Parquet file with a single TINYINT column "i" with 33 rows. Created by a modified
Impala to use 9 bit dictionary indices for encoding. Reading this file used to lead
to DCHECK errors (IMPALA-7147).

decimal_stored_as_int32.parquet:
Parquet file generated by Spark 2.3.1 that contains decimals stored as int32.
Impala needs to be able to read such values (IMPALA-5542).

decimal_stored_as_int64.parquet:
Parquet file generated by Spark 2.3.1 that contains decimals stored as int64.
Impala needs to be able to read such values (IMPALA-5542).
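
How int-backed decimals decode: per the Parquet format, a DECIMAL value stored in an
INT32 or INT64 column is the unscaled integer, and the reader rescales it by 10^-scale.
A small sketch (Python; the helper name is illustrative only, this is not Impala code):

  from decimal import Decimal

  def decode_int_backed_decimal(unscaled, scale):
      # e.g. a DECIMAL(9,2) value of 123.45 is stored as the int32 12345
      return Decimal(unscaled).scaleb(-scale)

  print(decode_int_backed_decimal(12345, 2))   # 123.45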

primitive_type_widening.parquet:
Parquet file that contains two rows with the following schema:
  - int32 tinyint_col1
  - int32 tinyint_col2
  - int32 tinyint_col3
  - int32 tinyint_col4
  - int32 smallint_col1
  - int32 smallint_col2
  - int32 smallint_col3
  - int32 int_col1
  - int32 int_col2
  - float float_col
It is used to test primitive type widening (IMPALA-6373).

corrupt_footer_len_decr.parquet:
Parquet file that contains one row of the following schema:
  - bigint c
The footer size is manually modified (using hexedit) to be the original file size minus
1, making metadata deserialization fail during footer parsing and triggering the printing
of an error message with an incorrect file offset, to verify that it is fixed by
IMPALA-6442.

corrupt_footer_len_incr.parquet:
Parquet file that contains one row of the following schema:
  - bigint c
The footer size is manually modified (using hexedit) to be larger than the original file
size and cause footer parsing to fail. It is used to test an error message related to
IMPALA-6442.
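
Where the patched field lives: in the Parquet file layout the last 8 bytes are the 4-byte
little-endian footer length followed by the "PAR1" magic, which is the field the two files
above corrupt. A sketch of reading it (Python; the helper name is illustrative only):

  import struct

  def read_footer_length(path):
      with open(path, "rb") as f:
          f.seek(-8, 2)                              # last 8 bytes: footer length + magic
          footer_len, magic = struct.unpack("<I4s", f.read(8))
          assert magic == b"PAR1", "not a Parquet file"
          return footer_len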

hive_single_value_timestamp.parq:
Parquet file written by Hive with the following schema:
  i int, timestamp d
Contains a single row. It is used to test IMPALA-7559 which only occurs when all values
in a column chunk are the same timestamp and the file is written with parquet-mr (which
is used by Hive).

out_of_range_time_of_day.parquet:
IMPALA-7595: Parquet file that contains timestamps where the time part is out of the
valid range [0..24H). Before the fix, select * returned these values:
  1970-01-01 -00:00:00.000000001 (invalid - negative time of day)
  1970-01-01 00:00:00
  1970-01-01 23:59:59.999999999
  1970-01-01 24:00:00 (invalid - time of day should be less than a whole day)
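
Background on how a "negative" or ">= 24h" time of day can appear at all: Impala's
classic Parquet timestamp representation is INT96, 8 bytes of nanoseconds-of-day followed
by a 4-byte Julian day number, so an arbitrary byte pattern can encode a time of day
outside [0..24H). A decoding sketch (Python; the function name is illustrative only, and
the INT96 layout is the general Impala/Hive convention rather than something stated for
this particular file):

  import struct

  def decode_int96_timestamp(raw12):
      # 12 raw bytes: little-endian int64 nanoseconds-of-day, then int32 Julian day.
      nanos_of_day, julian_day = struct.unpack("<qi", raw12)
      in_range = 0 <= nanos_of_day < 24 * 60 * 60 * 10**9
      return julian_day, nanos_of_day, in_range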

strings_with_quotes.csv:
Various strings with quotes in them to reproduce bugs like IMPALA-7586.

int64_timestamps_plain.parq:
Parquet file generated with Parquet-mr that contains plain encoded int64 columns with
Timestamp logical types. Has the following columns:
  new_logical_milli_utc, new_logical_milli_local,
  new_logical_micro_utc, new_logical_micro_local

int64_timestamps_dict.parq:
Parquet file generated with Parquet-mr that contains dictionary encoded int64 columns
with Timestamp logical types. Has the following columns:
  id,
  new_logical_milli_utc, new_logical_milli_local,
  new_logical_micro_utc, new_logical_micro_local

int64_timestamps_at_dst_changes.parquet:
Parquet file generated with Parquet-mr that contains plain encoded int64 columns with
Timestamp logical types. The file contains 3 row groups, and all row groups contain
3 distinct values, so there is a "min", a "max", and a "middle" value. The values were
selected in such a way that the UTC->CET conversion changes the order of the values (this
is possible during the Summer->Winter DST change) and "middle" falls outside the
"min".."max" range after conversion. This means that a naive stat filtering
implementation could drop "middle" incorrectly.
Example (all dates are 2017-10-29):
  UTC: 00:45:00, 01:00:00, 01:10:00 =>
  CET: 02:45:00, 02:00:00, 02:10:00
Columns: rawvalue bigint, rowgroup int, millisutc timestamp, microsutc timestamp
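
The reordering can be reproduced directly with a time zone library; a short check
(Python, using the tz database zone "CET", which switched from UTC+2 to UTC+1 at
01:00 UTC on 2017-10-29):

  from datetime import datetime, timezone
  from zoneinfo import ZoneInfo

  cet = ZoneInfo("CET")
  for hh, mm in [(0, 45), (1, 0), (1, 10)]:
      utc = datetime(2017, 10, 29, hh, mm, tzinfo=timezone.utc)
      print(utc.strftime("%H:%M UTC ->"), utc.astimezone(cet).strftime("%H:%M CET"))
  # 00:45 UTC -> 02:45 CET   (still UTC+2 before the switch)
  # 01:00 UTC -> 02:00 CET   (after the switch, UTC+1)
  # 01:10 UTC -> 02:10 CET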

int64_timestamps_nano.parquet:
Parquet file generated with Parquet-mr that contains int64 columns with nanosecond
precision. Tested separately from the micro/millisecond columns because of the different
valid range.
Columns: rawvalue bigint, nanoutc timestamp, nanononutc timestamp

out_of_range_timestamp_hive_211.parquet:
Hive-generated file with an out-of-range timestamp. Generated with Hive 2.1.1 using
the following query:
  create table alltypes_hive stored as parquet as
  select * from functional.alltypes
  union all
  select -1, false, 0, 0, 0, 0, 0, 0, '', '', cast('1399-01-01 00:00:00' as timestamp), 0, 0

out_of_range_timestamp2_hive_211.parquet:
Hive-generated file with out-of-range timestamps every second value, to exercise code
paths in the Parquet scanner for non-repeated runs. Generated with Hive 2.1.1 using
the following query:
  create table hive_invalid_timestamps stored as parquet as
  select id,
    case id % 3
      when 0 then timestamp_col
      when 1 then NULL
      when 2 then cast('1300-01-01 9:9:9' as timestamp)
    end timestamp_col
  from functional.alltypes
  sort by id

decimal_rtf_tbl.txt:
This was generated using formulas in Google Sheets. The goal was to create various
decimal values that cover the 3 storage formats with various precision and scale.
This is a reasonably large table that is used for testing min-max filters
with decimal types on Kudu.

decimal_rtf_tiny_tbl.txt:
Small table with specific decimal values picked from decimal_rtf_tbl.txt so that
min-max filter based pruning can be tested with decimal types on Kudu.

date_tbl.orc:
Small ORC table with one DATE column, created by Hive.

date_tbl.avro:
Small Avro table with one DATE column, created by Hive.

date_tbl.parquet:
Small Parquet table with one DATE column, created by Parquet MR.

out_of_range_date.parquet:
Generated with a hacked version of Impala's Parquet writer.
Contains a single DATE column with 9 values, 4 of which are out of range
and should be read as NULL by Impala:
  -0001-12-31 (invalid - date too small)
  0000-01-01 (invalid - date too small)
  0000-01-02 (invalid - date too small)
  1969-12-31
  1970-01-01
  1970-01-02
  9999-12-30
  9999-12-31
  10000-01-01 (invalid - date too large)

out_of_range_date.orc:
Created using a pre-3.1 Hive version (2.1.1) to contain an out-of-range date value.
Takes advantage of the incompatibility between Hive and Impala: a date before 1582-10-15
written by Hive (before 3.1) is interpreted incorrectly by Impala. The values I wrote
with Hive:
  2019-10-04, 1582-10-15, 0001-01-01, 9999-12-31
This is interpreted by Impala as:
  2019-10-04, 1582-10-15, 0000-12-30 (invalid - date too small), 9999-12-31

hive2_pre_gregorian.parquet:
Small Parquet table with one DATE column, created by Hive 2.1.1.
Used to demonstrate Parquet interoperability issues between Hive and Impala for dates
before the introduction of the Gregorian calendar on 1582-10-15.

hive2_pre_gregorian.orc:
Same as the above but in ORC format instead of Parquet.

decimals_1_10.parquet:
Contains two decimal columns, one with precision 1, the other with precision 10.
I used Hive 2.1.1 with a modified version of Parquet-MR (6901a20) to create tiny,
misaligned pages in order to test the value-skipping logic in the Parquet column readers.
The modification in Parquet-MR was to set MIN_SLAB_SIZE to 1. You can find the change
here: https://github.com/boroknagyz/parquet-mr/tree/tiny_pages
  hive --hiveconf parquet.page.row.count.limit=5 --hiveconf parquet.page.size=5
       --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=1
  create table decimals_1_10 (d_1 DECIMAL(1, 0), d_10 DECIMAL(10, 0)) stored as PARQUET
  insert into decimals_1_10 values (1, 1), (2, 2), (3, 3), (4, 4), (5, 5),
    (NULL, 1), (2, 2), (3, 3), (4, 4), (5, 5),
    (1, 1), (NULL, 2), (3, 3), (4, 4), (5, 5),
    (1, 1), (2, 2), (NULL, 3), (4, 4), (5, 5),
    (1, 1), (2, 2), (3, 3), (NULL, 4), (5, 5),
    (1, 1), (2, 2), (3, 3), (4, 4), (NULL, 5),
    (NULL, 1), (NULL, 2), (3, 3), (4, 4), (5, 5),
    (1, 1), (NULL, 2), (3, 3), (NULL, 4), (5, 5),
    (1, 1), (2, 2), (3, 3), (NULL, 4), (NULL, 5),
    (NULL, 1), (2, 2), (NULL, 3), (NULL, 4), (5, 5),
    (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL);

nested_decimals.parquet:
Contains two columns: one is a decimal column, the other is an array of decimals.
I used Hive 2.1.1 with a modified Parquet-MR, see description at decimals_1_10.parquet.
  hive --hiveconf parquet.page.row.count.limit=5 --hiveconf parquet.page.size=16
       --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=1
  create table nested_decimals (d_38 Decimal(38, 0), arr array<Decimal(1, 0)>) stored as parquet;
  insert into nested_decimals select 1, array(cast (1 as decimal(1,0)), cast (1 as decimal(1,0)), cast (1 as decimal(1,0)) ) union all
  select 2, array(cast (2 as decimal(1,0)), cast (2 as decimal(1,0)), cast (2 as decimal(1,0)) ) union all
  select 3, array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0)), cast (3 as decimal(1,0)) ) union all
  select 4, array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)) ) union all
  select 5, array(cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)) ) union all

  select 1, array(cast (1 as decimal(1,0)) ) union all
  select 2, array(cast (2 as decimal(1,0)), cast (2 as decimal(1,0)) ) union all
  select 3, array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0)), cast (3 as decimal(1,0)) ) union all
  select 4, array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)) ) union all
  select 5, array(cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)) ) union all

  select 1, array(cast (NULL as decimal(1, 0)), NULL, NULL) union all
  select 2, array(cast (2 as decimal(1,0)), NULL, NULL) union all
  select 3, array(cast (3 as decimal(1,0)), NULL, cast (3 as decimal(1,0))) union all
  select 4, array(NULL, cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), NULL) union all
  select 5, array(NULL, cast (5 as decimal(1,0)), NULL, NULL, cast (5 as decimal(1,0)) ) union all

  select 6, array(cast (6 as decimal(1,0)), NULL, cast (6 as decimal(1,0)) ) union all
  select 7, array(cast (7 as decimal(1,0)), cast (7 as decimal(1,0)), cast (7 as decimal(1,0)), NULL ) union all
  select 8, array(NULL, NULL, cast (8 as decimal(1,0)) ) union all
  select 7, array(cast (7 as decimal(1,0)), cast (7 as decimal(1,0)), cast (7 as decimal(1,0)) ) union all
  select 6, array(NULL, NULL, NULL, cast (6 as decimal(1,0)) );

double_nested_decimals.parquet:
Contains two columns: one is a decimal column, the other is an array of arrays of
decimals. I used Hive 2.1.1 with a modified Parquet-MR, see description
at decimals_1_10.parquet.
  hive --hiveconf parquet.page.row.count.limit=5 --hiveconf parquet.page.size=16
       --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=1
  create table double_nested_decimals (d_38 Decimal(38, 0), arr array<array<Decimal(1, 0)>>) stored as parquet;
  insert into double_nested_decimals select 1, array(array(cast (1 as decimal(1,0)), cast (1 as decimal(1,0)) )) union all
  select 2, array(array(cast (2 as decimal(1,0)), cast (2 as decimal(1,0)) )) union all
  select 3, array(array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0)), cast (3 as decimal(1,0)) )) union all
  select 4, array(array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)) )) union all
  select 5, array(array(cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)) )) union all

  select 1, array(array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))) ) union all
  select 2, array(array(cast (2 as decimal(1,0))), array(cast (2 as decimal(1,0))) ) union all
  select 3, array(array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))) ) union all
  select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all
  select 5, array(array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))) ) union all

  select 1, array(array(cast (1 as decimal(1,0))) ) union all
  select 2, array(array(cast (2 as decimal(1,0))), array(cast (2 as decimal(1,0))) ) union all
  select 3, array(array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))) ) union all
  select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all
  select 5, array(array(cast (5 as decimal(1,0))) ) union all

  select 1, array(array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))) ) union all
  select 2, array(array(cast (2 as decimal(1,0))), array(cast (2 as decimal(1,0))) ) union all
  select 3, array(array(cast (3 as decimal(1,0))) ) union all
  select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all
  select 5, array(array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))) ) union all

  select 1, array(array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0)), cast (1 as decimal(1,0))) ) union all
  select 2, array(array(cast (2 as decimal(1,0))) ) union all
  select 3, array(array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))) ) union all
  select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all
  select 5, array(array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))) ) union all

  select 1, array(array(cast (NULL as decimal(1,0))), array(cast (NULL as decimal(1,0))), array(cast (1 as decimal(1,0))) ) union all
  select 2, array(array(cast (NULL as decimal(1,0))), array(cast (NULL as decimal(1,0))), array(cast (NULL as decimal(1,0))) ) union all
  select 3, array(array(cast (NULL as decimal(1,0))), array(cast (3 as decimal(1,0))), NULL ) union all
  select 4, array(NULL, NULL, array(cast (NULL as decimal(1,0)), NULL, NULL, NULL, NULL) ) union all
  select 5, array(array(NULL, cast (5 as decimal(1,0)), NULL, NULL, NULL) ) union all

  select 6, array(array(cast (6 as decimal(1,0)), NULL), array(cast (6 as decimal(1,0))) ) union all
  select 7, array(array(cast (7 as decimal(1,0)), cast (7 as decimal(1,0))), NULL ) union all
  select 8, array(array(NULL, NULL, cast (8 as decimal(1,0))) ) union all
  select 7, array(array(cast (7 as decimal(1,0)), cast (NULL as decimal(1,0))), array(cast (7 as decimal(1,0))) ) union all
  select 6, array(array(NULL, NULL, cast (6 as decimal(1,0))), array(NULL, cast (6 as decimal(1,0))) );

alltypes_tiny_pages.parquet:
Created from 'functional.alltypes' with small page sizes.
I used Hive 2.1.1 with a modified Parquet-MR, see description at decimals_1_10.parquet.
I used the following commands to create the file:
  hive --hiveconf parquet.page.row.count.limit=90 --hiveconf parquet.page.size=90 --hiveconf parquet.page.size.row.check.min=7
  create table alltypes_tiny_pages stored as parquet as select * from functional_parquet.alltypes

alltypes_tiny_pages_plain.parquet:
Created from 'functional.alltypes' with small page sizes without dictionary encoding.
I used Hive 2.1.1 with a modified Parquet-MR, see description at decimals_1_10.parquet.
I used the following commands to create the file:
  hive --hiveconf parquet.page.row.count.limit=90 --hiveconf parquet.page.size=90 --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=7
  create table alltypes_tiny_pages_plain stored as parquet as select * from functional_parquet.alltypes

parent_table:
Created manually. Contains two columns, an INT and a STRING column. Together they form
the primary key for the table. This table is used to test primary key and foreign key
relationships along with parent_table_2 and child_table.

parent_table_2:
Created manually. Contains just one int column, which is also the table's primary key.
This table is used to test primary key and foreign key relationships along with
parent_table and child_table.

child_table:
Created manually. Contains four columns. The 'seq' column is the primary key of this
table. ('id', 'year') form a foreign key referring to parent_table('id', 'year') and 'a'
is a foreign key referring to parent_table_2's primary key column 'a'.

out_of_range_timestamp.orc:
Created with Hive. ORC file with a single timestamp column 'ts'.
Contains one row (1300-01-01 00:00:00) which is outside Impala's valid time range.

corrupt_schema.orc:
ORC file from IMPALA-9277, generated by a fuzz test. The file contains malformed metadata.

corrupt_root_type.orc:
ORC file for IMPALA-9249, generated by a fuzz test. The root type of the schema is not
a struct, which used to hit a DCHECK.

hudi_parquet:
IMPALA-8778: Support reading Apache Hudi tables
Hudi Parquet is a special layout of Parquet files managed by Apache Hudi
(hudi.incubator.apache.org) to provide ACID transactions.
In order to provide snapshot isolation between writers and queries,
Hudi writes a newer version of an existing Parquet file
whenever an update comes into that file.
Hudi stores the indexing and version information in the file name.
For example:
  `ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet`
`ca51fa17-681b-4497-85b7-4f68e7a63ee7-0` is the bloom index hash of this file and
`20200112194517` is the timestamp of this version.
If a record in this file is updated, Hudi writes a new file with
the same indexing hash but a newer version based on the time of writing:
  `ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-38-282_20200112194529.parquet`
If the Impala table is refreshed after this file is written, Impala will
only query the file with the latest version.
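
A simplified illustration of the "pick the latest version per file group" rule described
above (Python; this only mimics the effect of HoodieROTablePathFilter for read-optimized
queries, not its actual implementation, and the regex assumes nothing beyond the
<file group id>_<write token>_<commit timestamp>.parquet pattern shown in the examples):

  import re

  HUDI_FILE_RE = re.compile(r"(?P<group>.+)_(?P<token>[-0-9]+)_(?P<commit>\d+)\.parquet$")

  def latest_hudi_files(file_names):
      # Keep only the newest commit timestamp for each file group.
      newest = {}
      for name in file_names:
          m = HUDI_FILE_RE.match(name)
          if not m:
              continue
          group, commit = m.group("group"), int(m.group("commit"))
          if group not in newest or commit > newest[group][0]:
              newest[group] = (commit, name)
      return sorted(name for _, name in newest.values())

  files = [
      "ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet",
      "ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-38-282_20200112194529.parquet",
  ]
  print(latest_hudi_files(files))   # only the 20200112194529 version is kept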