SnappyDecompressor::MaxOutputLen assumes the input pointer to be
non-null. It's not true when the parquet file is corrupted and the
compressed_page_size field in a page header is 0. This patch handles
this error instead of failing a DCHECK.
Testing: Added a bad parquet file whose compressed_page_size is 0. It
crashes Impala without this patch.
Change-Id: I0d42937aab92a74f8e104d2f7fcd64dc24f6a500
Reviewed-on: http://gerrit.cloudera.org:8080/8977
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This patch maps a signed integer logical type in parquet to a supported
Impala column type. This change introduces the following mapping -
INT_8 -> TINYINT
INT_16 -> SMALLINT
INT_32 -> INT
INT_64 -> BIGINT
Also, added a parquet file with the following schema for testing -
schema {
optional int32 id;
optional int32 tinyint_col (INT_8);
optional int32 smallint_col (INT_16);
optional int32 int_col;
optional int64 bigint_col;
}
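For illustration only, the mapping above could be sketched in Python as
follows. The table and function names are hypothetical and are not Impala
code; the fallback for unannotated columns is an assumption based on the
test schema, where int_col and bigint_col carry no logical type.

```python
# Hypothetical lookup table mirroring the mapping listed in this message.
LOGICAL_TO_IMPALA = {
    "INT_8": "TINYINT",
    "INT_16": "SMALLINT",
    "INT_32": "INT",
    "INT_64": "BIGINT",
}

# Assumed fallback when a column has no logical type annotation.
PHYSICAL_TO_IMPALA = {"int32": "INT", "int64": "BIGINT"}


def impala_type(physical_type, logical_type=None):
    """Return the Impala column type for a Parquet integer column."""
    if logical_type is not None:
        return LOGICAL_TO_IMPALA[logical_type]
    return PHYSICAL_TO_IMPALA[physical_type]
```

With this sketch, the tinyint_col of the test schema resolves through
impala_type("int32", "INT_8") and the unannotated bigint_col through
impala_type("int64").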
Change-Id: I47a8371858c9597c6a440808cf6f933532468927
Reviewed-on: http://gerrit.cloudera.org:8080/8548
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Tianyi Wang <twang@cloudera.com>
Tested-by: Impala Public Jenkins
Switch the decoders to using more batch-oriented interfaces. As an
intermediate step this doesn't make the interfaces of LevelDecoder
or DictDecoder batch-oriented, only the lower-level utility classes.
The next step would be to change those interfaces to be batch-oriented
and make the corresponding optimisations in parquet. This could deliver
much larger perf improvements than the current patch.
The high-level changes are:
* BitReader -> BatchedBitReader, which is built to unpack runs of 32
bit-packed values efficiently.
* RleDecoder -> RleBatchDecoder, which exposes the repeated and literal
runs to the caller and uses BatchedBitReader to unpack literal runs
efficiently.
* Dict decoding uses RleBatchDecoder to decode repeated runs efficiently
and uses the BitPacking utilities to unpack and encode in a single
step.
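As a rough illustration of the batch-oriented idea (not the actual
BatchedBitReader code), the Python sketch below unpacks a whole run of
bit-packed values in one call instead of reading them one at a time. It
assumes the least-significant-bit-first packing used by the Parquet
RLE/bit-packing hybrid encoding; the function name is hypothetical.

```python
def unpack_batch(data, bit_width, num_values):
    """Unpack num_values bit-packed values from data in a single call.

    Values are assumed to be packed starting at the least significant bit
    of the first byte.
    """
    buffer = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [(buffer >> (i * bit_width)) & mask for i in range(num_values)]
```

The values 0 through 7 packed with a bit width of 3 occupy the three
bytes 0x88 0xC6 0xFA, so unpack_batch(bytes([0x88, 0xC6, 0xFA]), 3, 8)
recovers all eight values at once.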
Also removes an older benchmark that isn't too interesting (since
the batch-oriented approach to encoding and decoding is so much
faster than the value-by-value approach).
Testing:
* Ran core tests.
* Updated unit tests to exercise new code.
* Added test coverage for the deprecated bit-packed level encoding to
verify that it still works (there was no coverage previously).
Perf:
Single-node benchmarks showed a few % performance gain. 16 node cluster
benchmarks only showed a gain for TPC-H nested.
Change-Id: I35de0cf80c86f501c4a39270afc8fb8111552ac6
Reviewed-on: http://gerrit.cloudera.org:8080/8267
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Extends parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable sized byte array encoded decimals and
more will be added in upcoming commits.
Also, column-level metadata verification, which was previously done
per row group, is now done only once per column per file.
Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added Query test that decodes both plain and dictionary-encoded
decimals using binary encoding.
Performance:
Initial perf testing using tpcds_1000 shows no regression.
Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Reviewed-on: http://gerrit.cloudera.org:8080/7822
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins
Having the repetition level set to REPEATED on the root schema
caused a scan to fail with an error when Impala tried to parse that
table.
As a solution, the 'REPEATED' repetition level is ignored when the
root schema is processed. The reasoning is that the Parquet
format description says that the repetition level of the root schema
should not be set to REPEATED anyway, so it's safe to ignore it in
case it is set to this value for some reason.
Change-Id: I7ea84589e1d122ad9d43adde46893ec0ecc5f9c4
Reviewed-on: http://gerrit.cloudera.org:8080/7870
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
This change adds functionality to write and read parquet::Statistics for
Decimal, String, and Timestamp values. As an exception, we don't read
statistics for CHAR columns, since CHAR support is broken in Impala
(IMPALA-1652).
This change also switches from using the deprecated fields 'min' and
'max' to populate the new fields 'min_value' and 'max_value' in
parquet::Statistics, that were added in parquet-format pull request #46.
The HdfsParquetScanner will preferably read the new fields if they are
populated and if the column order 'TypeDefinedOrder' has been used to
compute the statistics. For columns without a column order set or with
only the deprecated fields populated, the scanner will read them only if
they are of simple numeric type, i.e. boolean, integer, or floating
point.
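The selection rule described above might look roughly like the following
Python sketch. It is only an approximation under stated assumptions: a
plain dict stands in for parquet::Statistics, the function name is
hypothetical, and the set of "simple numeric" physical types is my
reading of "boolean, integer, or floating point".

```python
# Assumed set of physical types treated as simple numeric types.
SIMPLE_NUMERIC_TYPES = {"BOOLEAN", "INT32", "INT64", "FLOAT", "DOUBLE"}


def select_min_max(stats, column_order, physical_type):
    """Return the (min, max) pair to use for filtering, or None.

    stats is a dict standing in for parquet::Statistics.
    """
    has_new = (stats.get("min_value") is not None
               and stats.get("max_value") is not None)
    if has_new and column_order == "TypeDefinedOrder":
        # Prefer the new fields when their ordering is well defined.
        return stats["min_value"], stats["max_value"]
    has_old = stats.get("min") is not None and stats.get("max") is not None
    if has_old and physical_type in SIMPLE_NUMERIC_TYPES:
        # Fall back to the deprecated fields only for simple numeric types.
        return stats["min"], stats["max"]
    return None
```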
This change removes the validation of the Parquet Statistics we write to
Hive from the tests, since Hive does not write the new fields. Instead
it adds a parquet file written by Hive that uses the deprecated fields
for its statistics. It uses that file to exercise the fallback logic for
supported types in a test.
This change also cleans up the interface of ParquetPlainEncoder in
parquet-common.h.
Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Reviewed-on: http://gerrit.cloudera.org:8080/6563
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Lars Volker <lv@cloudera.com>
This change fixes IMPALA-4873 by adding the capability to supply a dict
'test_file_vars' to run_test_case(). Keys in this dict will be replaced
with their values inside test queries before they are executed.
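A minimal sketch of the substitution step, assuming plain string
replacement; the helper name and the '$DATABASE' key used in the test are
hypothetical, and the real run_test_case() may differ.

```python
def substitute_test_vars(query, test_file_vars=None):
    """Replace every key of test_file_vars found in query with its value."""
    for key, value in (test_file_vars or {}).items():
        query = query.replace(key, str(value))
    return query
```

For example, substitute_test_vars("select * from $DATABASE.t",
{"$DATABASE": "db1"}) yields a query against db1.t.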
Change-Id: Ie3f3c29a42501cfb2751f7ad0af166eb88f63b70
Reviewed-on: http://gerrit.cloudera.org:8080/6817
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
Zero-slot scans of Parquet files that have num_rows > MAX_INT32
in the footer metadata used to run forever due to an overflow when
calculating the remaining number of rows to process.
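The wraparound can be reproduced with the Python sketch below, which
emulates a 32-bit signed subtraction. The function names are
hypothetical and the scanner's exact expression is not shown in this
message; the point is only that a count above MAX_INT32 no longer fits.

```python
INT32_MAX = 2**31 - 1


def to_int32(value):
    """Wrap value the way a two's complement 32-bit integer would."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


def remaining_rows_int32(num_rows, rows_read):
    """Remaining rows when the result is stored in a 32-bit integer."""
    return to_int32(num_rows - rows_read)


def remaining_rows_int64(num_rows, rows_read):
    """Remaining rows when the result is kept in a wide enough integer."""
    return num_rows - rows_read
```

With num_rows = 2*MAX_INT32, as in the regression test file, the 32-bit
variant produces a negative remainder before a single row is processed.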
Testing:
- Added a regression test using a file with num_rows = 2*MAX_INT32.
- Locally ran test_scanners.py which succeeded.
- Private core/hdfs run succeeded
Change-Id: Ib9f8a6b83f8f621451d5977423ef81a6e4b124bd
Reviewed-on: http://gerrit.cloudera.org:8080/6286
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The string parsing code already errors if the decimal column either
overflows or underflows (i.e. loses scale). Let's just add a test
case.
Change-Id: Idd66c0fb5a4d201919d39f73dea08b87339d6469
Reviewed-on: http://gerrit.cloudera.org:8080/6150
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
Before this patch, we would simply read the INT96 Parquet timestamp
representation and assume that it's valid. However, not all bit
permutations represent a valid timestamp. One of the boost functions
raised an exception (that we didn't catch) when passed an invalid
boost date object, which resulted in a crash. This patch fixes the
problem by validating that the date falls into the 1400..9999 year
range as we are scanning Parquet.
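A sketch of such a range check in Python, assuming the usual INT96
layout of 8 little-endian bytes of nanoseconds within the day followed
by a 4-byte little-endian Julian day number. The function and constant
names are hypothetical, and only the date range named above is checked.

```python
import struct
from datetime import date

# Difference between a Julian day number and Python's proleptic
# Gregorian ordinal (1970-01-01 is Julian day 2440588, ordinal 719163).
JDN_OFFSET = 1721425
MIN_JDN = date(1400, 1, 1).toordinal() + JDN_OFFSET
MAX_JDN = date(9999, 12, 31).toordinal() + JDN_OFFSET


def is_valid_int96_timestamp(raw):
    """Return True if the 12-byte value holds a date in years 1400..9999."""
    if len(raw) != 12:
        return False
    _nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return MIN_JDN <= julian_day <= MAX_JDN
```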
Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac
Reviewed-on: http://gerrit.cloudera.org:8080/5343
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
For Parquet files with no row groups but with num_rows=0 in the
file footer the Parquet scanner returns an error indicating
that the file is invalid. This behavior is a regression from
previous Impala versions which used to accept such files.
This patch restores the previous behavior and adds tests.
Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Reviewed-on: http://gerrit.cloudera.org:8080/4693
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
As part of the ASF transition, we need to replace references to
Cloudera in Impala with references to Apache. This primarily means
changing Java package names from com.cloudera.impala.* to
org.apache.impala.*
A prior patch renamed all the files as necessary, and this patch
performs the actual code changes. Most of the changes in this patch
were generated with some commands of the form:
find . | grep "\.java\|\.py\|\.h\|\.cc" | \
xargs sed -i s/'com\(.\)cloudera\(\.\)impala/org\1apache\2impala/g
along with some manual fixes.
After this patch, the remaining references to Cloudera in the repo
mostly fall into the categories:
- External components that have cloudera in their own package names,
eg. com.cloudera.kudu/llama
- URLs, eg. https://repository.cloudera.com/
Change-Id: I0d35fa6602a7fc0c212b2ef5e2b3322b77dde7e2
Reviewed-on: http://gerrit.cloudera.org:8080/3937
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
Adds handling and testing for a specific Parquet data corruption
scenario with plain dictionary encoded values.
The problematic scenario is when the repeat or literal count of
the RLE-encoded dictionary indexes is decoded as 0 - an invalid value.
There are several other cases of data corruption that are not yet
handled gracefully. This patch only handles one specific case.
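To make the scenario concrete, here is a Python sketch of parsing one
run header of the Parquet RLE/bit-packing hybrid encoding and rejecting
a zero count. In that encoding the header is a varint whose lowest bit
selects a literal (bit-packed) run and whose remaining bits hold the
count. The function names and error text are hypothetical and do not
reflect the actual decoder code.

```python
def read_uleb128(data, pos=0):
    """Decode an unsigned LEB128 varint; return (value, next position)."""
    result = shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated run header")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_run_header(data, pos=0):
    """Return (is_literal, num_values, next position) for one run."""
    header, pos = read_uleb128(data, pos)
    is_literal = bool(header & 1)
    count = header >> 1
    if count == 0:
        # A repeat or literal count of 0 is invalid, so treat it as corruption.
        raise ValueError("corrupt dictionary index run: count is 0")
    # Literal runs are counted in groups of 8 values.
    num_values = count * 8 if is_literal else count
    return is_literal, num_values, pos
```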
Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63
Reviewed-on: http://gerrit.cloudera.org:8080/3299
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
HIVE-5795 introduced a parameter skip.header.line.count to skip header
lines from input files. This change introduces the capability to skip
an arbitrary number of header lines from csv input files on hdfs. The
size of the total file header must be smaller than
max_scan_range_length, otherwise an error will be reported. This is
necessary because scan ranges are not read in disk order, so there is
no way of identifying header lines except by counting from the start
of the first scan range.
[localhost:21000] > alter table t1 set
tblproperties('skip.header.line.count'='1');
Query: alter table t1 set tblproperties('skip.header.line.count'='1')
[localhost:21000] > select * from t1;
Query: select * from t1
+----+----+
| c1 | c2 |
+----+----+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
+----+----+
Fetched 3 row(s) in 0.32s
[localhost:21000] > alter table t1 set
tblproperties('skip.header.line.count'='0');
Query: alter table t1 set tblproperties('skip.header.line.count'='0')
[localhost:21000] > select * from t1;
Query: select * from t1
+------+------+
| c1 | c2 |
+------+------+
| NULL | NULL |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
+------+------+
WARNINGS: Error converting column: 0 TO INT (Data is: num1)
Error converting column: 1 TO DOUBLE (Data is: num2)
file: hdfs://localhost:20500/test-warehouse/t1/test.txt
record: num1,num2
Fetched 4 row(s) in 0.41s
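The counting rule described above can be sketched in Python as follows.
This is a simplified model that treats the whole file as one byte
string; the function name and error text are hypothetical.

```python
def strip_header(data, skip_header_line_count, max_scan_range_length):
    """Return data without its first skip_header_line_count lines.

    Header lines are counted from the start of the file, so the header
    has to be smaller than the first scan range.
    """
    pos = 0
    for _ in range(skip_header_line_count):
        newline = data.find(b"\n", pos)
        if newline == -1:
            pos = len(data)  # fewer lines than requested
            break
        pos = newline + 1
    if pos >= max_scan_range_length:
        raise ValueError("header must be smaller than max_scan_range_length")
    return data[pos:]
```

Applied to a file like the one in the session above, a count of 1 drops
the "num1,num2" line and a count of 0 leaves it in place.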
Change-Id: I595f01a165d41499ca1956fe748ba3840a6eb543
Reviewed-on: http://gerrit.cloudera.org:8080/2110
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
Fix a bug in which Impala only reads the first stream
of a multi-stream bz2/gzip file.
Changes the bz2 decoder to read the file in a streaming
fashion rather than reading the entire file into memory
before it can be decompressed.
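The same idea, streaming decompression that continues into the next
stream, can be illustrated with Python's standard bz2 module. This is
only a sketch of the technique, not the Impala decoder; the function
name is hypothetical.

```python
import bz2


def decompress_multistream(chunks):
    """Yield decompressed data from an iterable of compressed chunks.

    Each chunk is processed as it arrives, and a new decompressor is
    started whenever the current bz2 stream ends.
    """
    decompressor = bz2.BZ2Decompressor()
    for chunk in chunks:
        while chunk:
            yield decompressor.decompress(chunk)
            if decompressor.eof:
                # Bytes after the end of a stream belong to the next one.
                chunk = decompressor.unused_data
                decompressor = bz2.BZ2Decompressor()
            else:
                chunk = b""
```

Feeding the concatenation of two compressed streams in small chunks
yields the contents of both, rather than stopping after the first.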
Change-Id: Icbe617d03a69953f0bf3aa0f7c30d34bc612f9f8
(cherry picked from commit b6d0b4e059329633dc50f1f73ebe35b7ac317a8e)
Reviewed-on: http://gerrit.cloudera.org:8080/2219
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
There was an incorrect DCHECK in the parquet scanner. If abort_on_error
is false, the intended behaviour is to skip to the next row group, but
the DCHECK assumed that execution should have aborted if a parse error
was encountered.
This also:
- Fixes a DCHECK after an empty row group. InitColumns() would try to
create empty scan ranges for the column readers.
- Uses metadata_range_->file() instead of stream_->filename() in the
scanner. InitColumns() was using stream_->filename() in error
messages, which used to work but now stream_ is set to NULL before
calling InitColumns().
Change-Id: I8e29e4c0c268c119e1583f16bd6cf7cd59591701
Reviewed-on: http://gerrit.cloudera.org:8080/1257
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Add support for creating a table based on a parquet file which contains arrays,
structs and/or maps.
Change-Id: I56259d53a3d9b82f318228e864c783b48a03f9ae
Reviewed-on: http://gerrit.cloudera.org:8080/582
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
A script is added that generates two parquet files with nested data.
One file has modern nested types encoding and the other one has
legacy encoding. This data will be used for testing nested types
support for "create table like file" statement.
Change-Id: I8a4f64c9f7b3228583f3cb0af5507a9dd4d152ef
Reviewed-on: http://gerrit.cloudera.org:8080/610
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
This patch corrects a mistake in the verification of the Parquet file magic
number and adds a test for it. Note that with this patch Impala may fail to
read Parquet files with a wrong magic number that it used to read before.
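For reference, a strict check of the trailing magic number could look
like the Python sketch below. It assumes the standard "PAR1" marker at
the end of the file; the function name is hypothetical and the actual
mistake that was fixed is not described in this message.

```python
PARQUET_MAGIC = b"PAR1"


def has_valid_footer_magic(file_bytes):
    """Return True only if the last four bytes equal the magic number."""
    return (len(file_bytes) >= len(PARQUET_MAGIC)
            and file_bytes[-len(PARQUET_MAGIC):] == PARQUET_MAGIC)
```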
Change-Id: Iff31accda1e1d541946ef1f750e38886ce4cb8d5
Reviewed-on: http://gerrit.cloudera.org:8080/515
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
The hardcoded timezone information is from Java version 1.7.0_76.
Change-Id: I32c40d0036473079e5bfd4d0252a648cbb0e7c23
Reviewed-on: http://gerrit.cloudera.org:8080/393
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
When running with a release build, NULL would be returned when
reading values from required fields in parquet files (with a debug
build a DCHECK would be hit).
Previously when the max definition level for a field was 0 (which
happens if a field is required), the definition level for a value was
incorrectly set to 1. The max definition level is related to nested
data and is defined to be the number of nullable fields that will be
encountered when traversing a path to reach the desired end field.
For example, if a nested schema has a path a.b.c.d where b and d are
nullable then the max def level is 2. A def level is attached to each
value to indicate the number of optional values that are present (in
the previous example a def level of 2 means both b and d are not
null). So having a def level for a value that is greater than the max
def level for a field should never happen.
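The definition above translates into the following small Python sketch.
The helper names are hypothetical; the a.b.c.d path is the example from
this message.

```python
def max_def_level(path, nullable_fields):
    """Count the nullable fields along a schema path."""
    return sum(1 for field in path if field in nullable_fields)


def value_is_null(def_level, max_level):
    """A value is present only when its def level reaches the maximum."""
    if not 0 <= def_level <= max_level:
        raise ValueError("def level %d exceeds max %d" % (def_level, max_level))
    return def_level < max_level
```

For a required top-level field the max def level is 0, so a value
carrying a def level of 1 is rejected by this check.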
Change-Id: Ia91a97cf79e672c420d10416c6817f0930dcc920
(cherry picked from commit cdd67e4c7fd62d5b08adfaa303d7bb2382e6932c)
Reviewed-on: http://gerrit.cloudera.org:8080/386
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
This patch fixes an issue where an uninitialized, empty row is falsely
added to the rowbatch. The uninitialized data inside this row later leads
to a crash when the null byte is checked together with the offsets (which
contain garbage).
The fix is to check not only the number of materialized columns but also
the number of materialized partition key columns. Only if both are zero
and the parser has an unfinished tuple, add the empty row.
To accommodate the last row, check in FinishScanRange() if there is an
unfinished tuple with materialized slots or materialized partition keys. Write
the fields if necessary.
Change-Id: I2808cc228e62d048d917d3a6352d869d117597ab
(cherry picked from commit c1795a8b40d10fbb32d9051a0e7de5ebffc8a6bd)
Reviewed-on: http://gerrit.cloudera.org:8080/364
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
I did a local benchmark and there's minimal performance impact (<1%).
Change-Id: I8d84a145acad886c52587258b27d33cff96ea399
(cherry picked from commit 7e750ad5d90007cc85ebe493af4dce7a537ad7c0)
Reviewed-on: http://gerrit.cloudera.org:8080/189
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
The S3 work really enabled any Hadoop FileSystem to work with Impala,
but a small tweak is needed for LocalFileSystem due to how the Hadoop
Path code deals with URIs that don't have an authority component.
While we aren't claiming support for arbitrary FileSystems at this
time, it is useful to test this. Since the S3 testing is done as a
nightly test rather than pre-checkin, we can use the LocalFileSystem to
regression test that:
1) Impala can access table data living on a secondary filesystem,
i.e. not the filesystem specified by fs.defaultFS.
2) Impala does not make assumptions that the filesystem has type
DistributedFileSystem.
Change-Id: Ie9b858ea440c9b3b332602e034c8052b168c57da
Reviewed-on: http://gerrit.cloudera.org:8080/121
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
No changes to writing were made. No changes to reading Impala written
files were made.
Hive writes TIMESTAMP values to parquet files differently than Impala
does. Hive converts the value from local time to UTC before writing;
Impala does not. This change adds a startup flag that will convert UTC
to local when reading files written by Hive.
The Hive-file detection actually checks for "parquet-mr" (which is the
library Hive uses) in the file metadata. A slight possibility exists
that TIMESTAMP values written by something other than Hive but also
using parquet-mr may become incorrect. The possibility should be very
small because TIMESTAMP values are stored and encoded in a non-standard
way other applications are unlikely to be aware of.
Flags from be/src/exec/hdfs-parquet-scanner.cc:
-convert_legacy_hive_parquet_utc_timestamps (When true, TIMESTAMPs
read from files written by Parquet-MR (used by Hive) will be
converted from UTC to local time. Writes are unaffected.) type: bool
default: false
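The read-side behaviour could be sketched in Python as follows. The
function and parameter names are hypothetical, and a fixed-offset
timezone stands in for the server's local timezone in the example.

```python
from datetime import datetime, timedelta, timezone


def read_timestamp(value, created_by, convert_flag, local_tz):
    """Return the TIMESTAMP to expose for a value read from a file.

    value is a naive datetime as stored in the file. It is converted
    from UTC to local time only when the flag is set and the file
    metadata mentions parquet-mr.
    """
    if convert_flag and "parquet-mr" in created_by:
        as_utc = value.replace(tzinfo=timezone.utc)
        return as_utc.astimezone(local_tz).replace(tzinfo=None)
    return value
```

With a local offset of UTC-8, a value of 12:00 from a parquet-mr file is
read back as 04:00, while files written by other writers are untouched.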
Change-Id: I79a499fe24049b7025ee2dd76c9c3e07010d346a
Reviewed-on: http://gerrit.cloudera.org:8080/35
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
DistributedFileSystem is lenient about seeking past the end of the file.
Other FileSystem implementations, such as NativeS3FileSystem, return an
error on this condition. That leads to a scary looking message in the
query warnings.
So, when creating scan ranges, let's require that the ranges fall within
the file bounds (at least according to what the HdfsFileDesc indicates
is the length). There were a couple of kinds of AllocateScanRange()
callsites that needed to be fixed up:
1) When a stream wants to read past a scan range, be careful not to read
past the end of the file.
2) When Impala needs to "guess" at the length of a range, use the
file_length as an upper bound on the guess. We were already doing this
in some places but not everywhere.
3) When the scan range is derived from parquet metadata, validate the
metadata against file_length and issue appropriate errors. This will
give better diagnostics for corrupt files.
Note that we can't rely on this for safety (HdfsFileDesc file_length may
be stale), but it does mean that when metadata is up-to-date Impala will
no longer try to access beyond the end of files (and so we'll no longer
get false positive errors from the filesystem).
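The two kinds of bounds handling described above, clamping a guessed
range and rejecting a range taken from metadata, might be sketched in
Python like this. The function names and error messages are
hypothetical.

```python
def clamp_scan_range(offset, length, file_length):
    """Shrink a guessed range so that it ends within the file."""
    if offset < 0 or length < 0 or offset > file_length:
        raise ValueError("scan range starts outside the file")
    return offset, min(length, file_length - offset)


def validate_metadata_range(offset, length, file_length):
    """Reject a range from file metadata that exceeds the file."""
    if offset < 0 or length < 0 or offset + length > file_length:
        raise ValueError(
            "metadata range [%d, %d) exceeds file length %d"
            % (offset, offset + length, file_length))
    return offset, length
```

Because file_length may be stale, a check like this improves diagnostics
but is not a safety guarantee.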
Additionally, this change revealed a pre-existing problem with files
that have multiple row-groups. The first time through InitColumns(),
stream_ was set to NULL. But, stream_->filename could potentially be
accessed when constructing error statuses for subsequent row-groups.
Change-Id: Ia668fa8c261547f85a18a96422846edcea57043e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5424
Reviewed-by: Daniel Hecht <dhecht@cloudera.com>
Tested-by: jenkins
Adds fixes and tests for Hive CHAR & VARCHAR compatibility.
Also fixes a bug in tuple materialization for VARCHAR and non in-lined CHAR.
Change-Id: I400b089cb8ddba2e264ef9f2e37956b2ceaaf9fb
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4054
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
Fixed a bug when setting the length while reading/writing text files for CHAR(N).
Also added chars_tiny table for testing CHAR(N) and VARCHAR(N).
Change-Id: If5d5db30afa4b00cf03c68c6a845f182970329f4
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4415
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
Syntax is "CREATE TABLE name LIKE fileformat '/path/to/file'".
Supports all options that CREATE TABLE does. Currently only PARQUET is supported.
Run testdata/bin/create-load-data.sh after pulling this patch.
Change-Id: Ibb9fbb89dbde6acceb850b914c48d12f22b33f55
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2720
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3158
Allows reading decimal columns with or without codegen. Includes tests
based on a data file posted on HIVE-5823.
Change-Id: Ie541c6b98bd24543691850cb45a434af60b5a5a6
(cherry picked from commit 6983dcefdf70cce14724e17d03bc061ffb8f671c)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2596
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
This fixes how we validate delimiters to be in line with Hive. A delimiter must
fit in a single byte and can be specified in the following formats, as far as I can
tell (there isn't documentation):
- A single ASCII or unicode character (ex. '|')
- An escape character in octal format (ex. \001. Stored in the metastore as a
unicode character: \u0001).
- A signed decimal integer in the range [-128:127]. Used to support delimiters
for ASCII character values between 128-255 (-2 maps to ASCII 254).
Previously, we were not handling the "signed integer" case so there was no way
to specify a delimiter in the "extended" ASCII range of 128-255.
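The three accepted formats can be sketched in Python as follows. This is
an approximation of the rules listed above, not the actual validation
code: the function name is hypothetical, and treating every
single-character string as a literal character (including single digits)
is my assumption.

```python
import re


def parse_delimiter(spec):
    """Return the delimiter as a byte value in 0..255 or raise ValueError."""
    if len(spec) == 1:
        # Assumed: any single character is taken literally.
        value = ord(spec)
    elif re.fullmatch(r"\\[0-7]{1,3}", spec):
        value = int(spec[1:], 8)  # octal escape such as \001
    elif re.fullmatch(r"-?[0-9]+", spec):
        value = int(spec)
        if not -128 <= value <= 127:
            raise ValueError("integer delimiter out of range: " + spec)
        value &= 0xFF  # two's complement: -2 maps to 254
    else:
        raise ValueError("unsupported delimiter: " + spec)
    if value > 255:
        raise ValueError("delimiter must fit in a single byte: " + spec)
    return value
```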
To support result validation, the test infrastructure had to be updated to support
reading/writing different character encodings.
Change-Id: Ie3c4d444dc9c6e60192093ed0c0f6f151eab16bc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1848
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1888
We weren't attaching resources to the row batch when starting a new
row group, so it was possible for string data to be overwritten. This
patch removes CloseStreams() and merges its functionality with
AttachCompletedResources() so it's not possible to destroy streams
without transferring the resources first. It also merges and removes
ScannerContext::Close().
Also adds test cases for IMPALA-720.
Change-Id: Ia8f40c7d39d8702716f1d337fe797e2696bd0fcb
parquet-mr had a bug where it didn't include the dictionary page's
header in the total column size. We now compensate for this by
detecting these files and padding the scan range length. This required
changing how the scanner detects when it's finished: it now counts the
number of rows rather than checking eosr (since the scan range may be
longer than the column).
Change-Id: Id9933808b965003c0c3b3aa78c32fe29a0c4bcbe
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1097
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
We were previously wasting memory by always reading into 8MB IO
buffers, even when the data read was much less than 8MB. With this
patch, the IO manager picks a buffer size closer to the actual amount
being read (we don't use the exact size so we can continue to recycle
buffers). The minimum IO buffer size is determined via the
--min_buffer_size flag, and the max IO buffer size via the --read_size
flag.
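One way to pick such a size is sketched below in Python. The doubling
scheme is an assumption on my part (this message only says the size is
"closer to the actual amount" and not exact, so buffers can be
recycled); the function name is hypothetical, and the two bounds
correspond to the --min_buffer_size and --read_size flags.

```python
def pick_buffer_size(bytes_to_read, min_buffer_size, read_size):
    """Return a buffer size between min_buffer_size and read_size.

    Assumed scheme: start at the minimum and double until the request
    fits, so only a small set of sizes exists and buffers can be reused.
    """
    size = min_buffer_size
    while size < bytes_to_read and size < read_size:
        size *= 2
    return min(size, read_size)
```

Under this scheme a 100KB read with a 64KB minimum gets a 128KB buffer
instead of a full 8MB one.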
This technique also helps with IMPALA-652, since short columns will
not use as much memory as before (we will not use considerably more
memory than the size of the table).
This patch also changes StringBuffer to use a doubling strategy so it
doesn't end up allocating many large unused buffers, and has the
scanner context use the requested length as the sync read size if it's
larger than the size produced by read_past_size_cb(). These changes
help prevent the boundary buffer in the scanner context from
allocating excess memory.
Change-Id: I0efb3b023ddfddb08bca22d5cb5f9511fb4d6c50
Reviewed-on: http://gerrit.ent.cloudera.com:8080/938
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins