bad_parquet_data.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
"parquet"
"is"
"fun"
bad_compressed_dict_page_size.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single string column 'col' with one row ("a"). The compressed_page_size
field in the dictionary page header is modified to 0 to test that this corruption
is handled correctly.
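A corrupted file like this should be rejected cleanly rather than crash the reader.
A minimal way to check the behaviour of a non-Impala reader, assuming pyarrow is
installed and the file is in the working directory:

  import pyarrow.parquet as pq

  # The dictionary page header advertises compressed_page_size == 0, so a
  # conforming reader should raise an error instead of crashing.
  try:
      pq.read_table("bad_compressed_dict_page_size.parquet")
  except Exception as e:
      print("read rejected:", e)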
bad_rle_literal_count.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single bigint column 'c' with the values 1, 3, 7, stored in a single
data chunk with PLAIN_DICTIONARY encoding. The RLE-encoded dictionary indexes are
all literals (not repeated), but the literal count in the file is incorrectly 0,
to test that such data corruption is properly handled (see the run header sketch
below).
bad_rle_repeat_count.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single bigint column 'c' with the value 7 repeated 7 times, stored in
a single data chunk with PLAIN_DICTIONARY encoding. The RLE-encoded dictionary
indexes are a single repeated run (not literals), but the repeat count in the file
is incorrectly 0, to test that such data corruption is properly handled.
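For context on what these two corruptions mean on disk: in Parquet's RLE/bit-packed
hybrid encoding, each run starts with a ULEB128 varint header whose low bit selects
the run type, while the remaining bits carry the count. A minimal sketch of the
header values (an illustration of the format, not Impala code):

  def rle_run_header(repeat_count):
      # RLE run: low bit is 0, remaining bits are the repeat count.
      # bad_rle_repeat_count.parquet stores repeat_count == 0 here.
      return repeat_count << 1

  def bit_packed_run_header(num_groups):
      # Bit-packed (literal) run: low bit is 1, remaining bits count groups
      # of 8 values, so the literal count is num_groups * 8.
      # bad_rle_literal_count.parquet stores num_groups == 0 here.
      return (num_groups << 1) | 1

For the small counts used in these files the varint header fits in a single byte.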
zero_rows_zero_row_groups.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates zero rows and no row groups.
zero_rows_one_row_group.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates zero rows but one row group.
huge_num_rows.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates 2 * MAX_INT32 rows.
The single row group also has the same number of rows in the metadata.
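The hacked row counts in the three files above are visible directly in the footer
metadata, e.g. with this inspection sketch (assuming pyarrow):

  import pyarrow.parquet as pq

  for name in ("zero_rows_zero_row_groups.parquet",
               "zero_rows_one_row_group.parquet",
               "huge_num_rows.parquet"):
      md = pq.ParquetFile(name).metadata
      print(name, "rows:", md.num_rows, "row groups:", md.num_row_groups)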
repeated_values.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
"parquet"
"parquet"
"parquet"
multiple_rowgroups.parquet:
Generated with parquet-mr 1.2.5
Populated with:
hive> set parquet.block.size=500;
hive> INSERT INTO TABLE tbl
SELECT l_comment FROM tpch.lineitem LIMIT 1000;
alltypesagg_hive_13_1.parquet:
Generated with parquet-mr version 1.5.0-cdh5.4.0-SNAPSHOT
hive> create table alltypesagg_hive_13_1 stored as parquet as select * from alltypesagg;
bad_column_metadata.parquet:
Generated with hacked version of parquet-mr 1.8.2-SNAPSHOT
Schema:
{"type": "record",
"namespace": "org.apache.impala",
"name": "bad_column_metadata",
"fields": [
{"name": "id", "type": ["null", "long"]},
{"name": "int_array", "type": ["null", {"type": "array", "items": ["null", "int"]}]}
]
}
Contains 3 row groups, each with ten rows and each array containing ten elements.
The first row group's column metadata for 'int_array' incorrectly states there are
50 values (instead of 100), and the second row group's column metadata for 'id'
incorrectly states there are 11 values (instead of 10). The third row group has
the correct metadata.
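The mismatched value counts can be seen per row group in the footer; a minimal
inspection sketch, assuming pyarrow:

  import pyarrow.parquet as pq

  md = pq.ParquetFile("bad_column_metadata.parquet").metadata
  for rg in range(md.num_row_groups):
      for col in range(md.num_columns):
          cc = md.row_group(rg).column(col)
          print(rg, cc.path_in_schema, cc.num_values)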
data-bzip2.bz2:
Generated with bzip2, contains single bzip2 stream
Contains 1 column, uncompressed data size < 8M
large_bzip2.bz2:
Generated with bzip2, contains single bzip2 stream
Contains 1 column, uncompressed data size > 8M
data-pbzip2.bz2:
Generated with pbzip2, contains multiple bzip2 streams
Contains 1 column, uncompressed data size < 8M
large_pbzip2.bz2:
Generated with pbzip2, contains multiple bzip2 streams
Contains 1 column, uncompressed data size > 8M
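The single-stream vs. multi-stream distinction matters because a multi-stream .bz2
file is several complete bzip2 streams concatenated, and the decoder must keep
reading past the end of the first stream. A minimal sketch of producing both
layouts with Python's bz2 module (illustrative only; the checked-in files were
produced with the bzip2 and pbzip2 command-line tools, and the file names below
are made up):

  import bz2

  data = b"some,csv,rows\n" * 100000
  chunk = len(data) // 4

  # Single stream, as produced by bzip2.
  with open("single_stream.bz2", "wb") as f:
      f.write(bz2.compress(data))

  # Multiple concatenated streams, as produced by pbzip2 (one per chunk).
  with open("multi_stream.bz2", "wb") as f:
      for i in range(0, len(data), chunk):
          f.write(bz2.compress(data[i:i + chunk]))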
out_of_range_timestamp.parquet:
Generated with a hacked version of Impala's Parquet writer.
Contains a single timestamp column with 4 values, 2 of which are out of range
and should be read as NULL by Impala:
1399-12-31 00:00:00 (invalid - date too small)
1400-01-01 00:00:00
9999-12-31 00:00:00
10000-01-01 00:00:00 (invalid - date too large)
table_with_header.csv:
Created with a text editor, contains a header line before the data rows.
table_with_header_2.csv:
Created with a text editor, contains two header lines before the data rows.
table_with_header.gz, table_with_header_2.gz:
Generated by gzip'ing table_with_header.csv and table_with_header_2.csv.
deprecated_statistics.parquet:
Generated with the Hive shell, which uses parquet-mr version 1.5.0-cdh5.12.0-SNAPSHOT
Contains a copy of the data in functional.alltypessmall with statistics that use the old
'min'/'max' fields.
repeated_root_schema.parquet:
Generated by hacking Impala's Parquet writer.
Created to reproduce IMPALA-4826. Contains a table of 300 rows where the
repetition type of the root schema is set to REPEATED.
Reproduction steps:
1: Extend HdfsParquetTableWriter::CreateSchema with the following line:
file_metadata_.schema[0].__set_repetition_type(FieldRepetitionType::REPEATED);
2: Run test_compute_stats and grab the created Parquet file for the
alltypes_parquet table.
binary_decimal_dictionary.parquet,
binary_decimal_no_dictionary.parquet:
Generated using parquet-mr; contents verified using parquet-tools-1.9.1.
Both files contain decimals stored as variable-sized BYTE_ARRAY, with dictionary
and non-dictionary encoding respectively.
alltypes_agg_bitpacked_def_levels.parquet:
Generated by hacking Impala's Parquet writer to write out bitpacked def levels instead
of the standard RLE-encoded levels. See
https://github.com/timarmstrong/incubator-impala/tree/hack-bit-packed-levels. This
is a single file containing all of the alltypesagg data, which includes a mix of
null and non-null values. This is not actually a valid Parquet file because the
bit-packed levels are written in the reverse of the order specified in the Parquet
spec for BIT_PACKED. However, this is the order in which Impala attempts to read
the levels - see IMPALA-3006.
signed_integer_logical_types.parquet:
Generated using a utility that uses the Java Parquet API.
The file has the following schema:
schema {
optional int32 id;
optional int32 tinyint_col (INT_8);
optional int32 smallint_col (INT_16);
optional int32 int_col;
optional int64 bigint_col;
}
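As a cross-check on how the INT_8/INT_16 logical type annotations surface on the
read side, a minimal sketch assuming pyarrow (which maps these annotations to
int8/int16):

  import pyarrow.parquet as pq

  # Expect tinyint_col and smallint_col to be reported as int8 and int16.
  print(pq.read_table("signed_integer_logical_types.parquet").schema)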
min_max_is_nan.parquet:
Generated by Impala's Parquet writer before the fix for IMPALA-6527. Git hash: 3a049a53
Created to test the read path for a Parquet file with invalid metadata, namely when
'max_value' and 'min_value' are both NaN. Contains 2 single-column rows:
NaN
42
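The invalid statistics are visible in the footer; a minimal inspection sketch,
assuming pyarrow:

  import pyarrow.parquet as pq

  md = pq.ParquetFile("min_max_is_nan.parquet").metadata
  stats = md.row_group(0).column(0).statistics
  if stats is not None and stats.has_min_max:
      print(stats.min, stats.max)  # both are NaN in this file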
bad_codec.parquet:
Generated by Impala's Parquet writer, hacked to use the invalid enum value 5000 for the
compression codec. The data in the file is the whole of the "alltypestiny" data set, with
the same columns: id int, bool_col boolean, tinyint_col tinyint, smallint_col smallint,
int_col int, bigint_col bigint, float_col float, double_col double,
date_string_col string, string_col string, timestamp_col timestamp, year int, month int.