impala

mirror of https://github.com/apache/impala.git synced 2026-01-06 15:01:43 -05:00

Author	SHA1	Message	Date
aphadke	38461c524f	IMPALA-5052: Read and write signed integer logical types in Parquet This patch maps a signed integer logical type in parquet to a supported Impala column type. This change introduces the following mapping - INT_8 -> TINYINT INT_16 -> SMALLINT INT_32 -> INT INT_64 -> BIGINT Also, added a parquet file with the following schema for testing - schema { optional int32 id; optional int32 tinyint_col (INT_8); optional int32 smallint_col (INT_16); optional int32 int_col; optional int64 bigint_col; } Change-Id: I47a8371858c9597c6a440808cf6f933532468927 Reviewed-on: http://gerrit.cloudera.org:8080/8548 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Reviewed-by: Tianyi Wang <twang@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-09 04:55:59 +00:00
Tim Armstrong	ae116b5bf7	IMPALA-4177,IMPALA-6039: batched bit reading and rle decoding Switch the decoders to using more batch-oriented interfaces. As an intermediate step this doesn't make the interfaces of LevelDecoder or DictDecoder batch-oriented, only the lower-level utility classes. The next step would be to change those interfaces to be batch-oriented and make according optimisations in parquet. This could deliver much larger perf improvements than the current patch. The high-level changes are. * BitReader -> BatchedBitReader, which is built to unpack runs of 32 bit-packed values efficiently. * RleDecoder -> RleBatchDecoder, which exposes the repeated and literal runs to the caller and uses BatchedBitReader to unpack literal runs efficiently. * Dict decoding uses RleBatchDecoder to decode repeated runs efficiently and uses the BitPacking utilities to unpack and encode in a single step. Also removes an older benchmark that isn't too interesting (since the batch-oriented approach to encoding and decoding is so much faster than the value-by-value approach). Testing: * Ran core tests. * Updated unit tests to exercise new code. * Added test coverage for the deprecated bit-packed level encoding to that it still works (there was no coverage previously). Perf: Single-node benchmarks showed a few % performance gain. 16 node cluster benchmarks only showed a gain for TPC-H nested. Change-Id: I35de0cf80c86f501c4a39270afc8fb8111552ac6 Reviewed-on: http://gerrit.cloudera.org:8080/8267 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-16 21:23:09 +00:00
Bikramjeet Vig	94236ff2ff	IMPALA-2494: Support for byte array encoded decimals in Parquet scanner Extendes parquet column reader and associated classes to allow for more than one possible physical type for a given logical type. This patch only adds support for variable sized byte array encoded decimals and more will be added in upcoming commits. Also, column level metadata verification which was currently being done per row group will now only be done once per column per file. Testing: Added backend test for verifying newly added decimal types are decoded correctly. Added Query test that decodes both plain and dictionary-encoded decimals using binary encoding. Performance: Initial perf testing using tpcds_1000 shows no regression. Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Reviewed-on: http://gerrit.cloudera.org:8080/7822 Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-07 04:34:26 +00:00
Gabor Kaszab	545eab6d62	IMPALA-4826: Fix error during a scan on repeated root schema in Parquet. Having the repetition level set to REPEATED on the root schema resulted a scan to fail with error when Impala tried to parse that table. As a solution, the 'REPEATED' repetition level is ignored when the root schema is processed. The reasoning behind is that the Parquet format description says that the repetition level of the root schema should not be set to REPEATED anyway, so it's safe to ignore it in case it is set to this value for some reason. Change-Id: I7ea84589e1d122ad9d43adde46893ec0ecc5f9c4 Reviewed-on: http://gerrit.cloudera.org:8080/7870 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Impala Public Jenkins	2017-09-06 20:07:56 +00:00
Lars Volker	9270346825	IMPALA-4815, IMPALA-4817, IMPALA-4819: Write and Read Parquet Statistics for remaining types This change adds functionality to write and read parquet::Statistics for Decimal, String, and Timestamp values. As an exception, we don't read statistics for CHAR columns, since CHAR support is broken in Impala (IMPALA-1652). This change also switches from using the deprecated fields 'min' and 'max' to populate the new fields 'min_value' and 'max_value' in parquet::Statistics, that were added in parquet-format pull request #46. The HdfsParquetScanner will preferably read the new fields if they are populated and if the column order 'TypeDefinedOrder' has been used to compute the statistics. For columns without a column order set or with only the deprecated fields populated, the scanner will read them only if they are of simple numeric type, i.e. boolean, integer, or floating point. This change removes the validation of the Parquet Statistics we write to Hive from the tests, since Hive does not write the new fields. Instead it adds a parquet file written by Hive that uses the deprecated fields for its statistics. It uses that file to exercise the fallback logic for supported types in a test. This change also cleans up the interface of ParquetPlainEncoder in parquet-common.h. Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312 Reviewed-on: http://gerrit.cloudera.org:8080/6563 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Lars Volker <lv@cloudera.com>	2017-05-09 15:47:21 +00:00
Lars Volker	12f3ecceab	IMPALA-5287: Test skip.header.line.count on gzip This change fixed IMPALA-4873 by adding the capability to supply a dict 'test_file_vars' to run_test_case(). Keys in this dict will be replaced with their values inside test queries before they are executed. Change-Id: Ie3f3c29a42501cfb2751f7ad0af166eb88f63b70 Reviewed-on: http://gerrit.cloudera.org:8080/6817 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Impala Public Jenkins	2017-05-09 01:36:46 +00:00
Alex Behm	d3cc23e569	IMPALA-5021: Fix count() remaining rows overflow in Parquet. Zero-slot scans of Parquet files that have num_rows > MAX_INT32 in the footer metadata used to run forever due to an overflow when calculating the remaining number of rows to process. Testing: - Added a regression test using a file with num_rows = 2MAX_INT32. - Locally ran test_scanners.py which succeeded. - Private core/hdfs run succeeded Change-Id: Ib9f8a6b83f8f621451d5977423ef81a6e4b124bd Reviewed-on: http://gerrit.cloudera.org:8080/6286 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-03-08 02:00:30 +00:00
Taras Bobrovytsky	858f5c2197	IMPALA-4363: Add Parquet timestamp validation Before this patch, we would simply read the INT96 Parquet timestamp representation and assume that it's valid. However, not all bit permutations represent a valid timestamp. One of the boost functions raised an exception (that we didn't catch) when passed an invalid boost date object, which resulted in a crash. This patch fixes problem by validating that the date falls into 1400..9999 year range as we are scanning Parquet. Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac Reviewed-on: http://gerrit.cloudera.org:8080/5343 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2016-12-03 06:41:07 +00:00
Alex Behm	0449b5beab	IMPALA-3943: Do not throw scan errors for empty Parquet files. For Parquet files with no row groups but with num_rows=0 in the file footer the Parquet scanner returns an error indicating that the file is invalid. This behavior is a regression from previous Impala versions which used to accept such files. This patch restores the previous behavior and adds tests. Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e Reviewed-on: http://gerrit.cloudera.org:8080/4693 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-12 09:22:57 +00:00
Thomas Tauber-Marshall	b2c2fe7813	IMPALA-3786: Replace "cloudera" with "apache" (part 2) As part of the ASF transition, we need to replace references to Cloudera in Impala with references to Apache. This primarily means changing Java package names from com.cloudera.impala.* to org.apache.impala.* A prior patch renamed all the files as necessary, and this patch performs the actual code changes. Most of the changes in this patch were generated with some commands of the form: find . \| grep "\.java\\|\.py\\|\.h\\|\.cc" \| \ xargs sed -i s/'com\(.\)cloudera\(\.\)impala/org\1apache\2impala/g along with some manual fixes. After this patch, the remaining references to Cloudera in the repo mostly fall into the categories: - External components that have cloudera in their own package names, eg. com.cloudera.kudu/llama - URLs, eg. https://repository.cloudera.com/ Change-Id: I0d35fa6602a7fc0c212b2ef5e2b3322b77dde7e2 Reviewed-on: http://gerrit.cloudera.org:8080/3937 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-09-29 21:14:13 +00:00
Alex Behm	025fd3bd7f	IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0. Adds handling and testing for a specific Parquet data corruption scenario with plain dictionary encoded values. The problematic scenario is when the repeat or literal count of the RLE-encoded dictionary indexes is decoded as 0 - an invalid value. There are several other cases of data corruption that are not yet handled gracefully. This patch only handles one specific case. Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63 Reviewed-on: http://gerrit.cloudera.org:8080/3299 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com>	2016-06-07 17:29:59 -07:00
Juan Yu	c9b33ddf63	IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files. Fix a bug in which Impala only reads the first stream of a multi-stream bz2/gzip file. Changes the bz2 decoder to read the file in a streaming fashion rather than reading the entire file into memory before it can be decompressed. Change-Id: Icbe617d03a69953f0bf3aa0f7c30d34bc612f9f8 (cherry picked from commit b6d0b4e059329633dc50f1f73ebe35b7ac317a8e) Reviewed-on: http://gerrit.cloudera.org:8080/2219 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Internal Jenkins	2016-02-28 21:31:37 -08:00
Skye Wanderman-Milne	dd2eb951d7	IMPALA-2558: DCHECK in parquet scanner after block read error There was an incorrect DCHECK in the parquet scanner. If abort_on_error is false, the intended behaviour is to skip to the next row group, but the DCHECK assumed that execution should have aborted if a parse error was encountered. This also: - Fixes a DCHECK after an empty row group. InitColumns() would try to create empty scan ranges for the column readers. - Uses metadata_range_->file() instead of stream_->filename() in the scanner. InitColumns() was using stream_->filename() in error messages, which used to work but now stream_ is set to NULL before calling InitColumns(). Change-Id: I8e29e4c0c268c119e1583f16bd6cf7cd59591701 Reviewed-on: http://gerrit.cloudera.org:8080/1257 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2015-10-30 22:35:57 +00:00
casey	87b9fac2ad	IMPALA-1658: Add compatibility flag for Hive-Parquet-Timestamps No changes to writing were made. No changes to reading Impala written files were made. Hive writes TIMESTAMP values to parquet files differently than Impala does. Hive converts the value from local time to UTC before writing; Impala does not. This change adds a startup flag that will convert UTC to local when reading files written by Hive. The Hive-file detection actually checks for "parquet-mr" (which is the library Hive uses) in the file metadata. A slight possibility exists that TIMESTAMP values written by something other than Hive but also using parquet-mr may become incorrect. The possibility should be very small because TIMESTAMP values are stored and encoded in a non-standard way other applications are unlikely to be aware of. Flags from be/src/exec/hdfs-parquet-scanner.cc: -convert_legacy_hive_parquet_utc_timestamps (When true, TIMESTAMPs read from files written by Parquet-MR (used by Hive) will be converted from UTC to local time. Writes are unaffected.) type: bool default: false Change-Id: I79a499fe24049b7025ee2dd76c9c3e07010d346a Reviewed-on: http://gerrit.cloudera.org:8080/35 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2015-02-11 13:28:17 +00:00
Skye Wanderman-Milne	561da008c7	IMPALA-729: fix resource management in Parquet scanner for multiple row groups We weren't attaching resources to the row batch when starting a new row group, so it was possible for string data to be overwritten. This patch removes CloseStreams() and merges its functionality with AttachCompletedResources() so it's not possible to destroy streams without transferring the resources first. It also merges and removes ScannerContext::Close(). Also adds test cases for IMPALA-720. Change-Id: Ia8f40c7d39d8702716f1d337fe797e2696bd0fcb	2014-01-08 10:56:26 -08:00

15 Commits