Commit Graph

8 Commits

Author SHA1 Message Date
Skye Wanderman-Milne
68fef6a5bf IMPALA-2213: make Parquet scanner fail query if the file size metadata is stale
This patch changes the Parquet scanner to fail the query when it cannot
read the full footer scan range, which indicates that the file has been
overwritten by a shorter file without refreshing the table metadata.
Previously it would hit a DCHECK. This patch adds a test for this case,
as well as for the case where the new file is longer than the metadata
states (which fails with an existing error).
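
A minimal sketch of the kind of check described above, written in Python rather than Impala's C++. The function name check_footer_read and its parameters are hypothetical and are not taken from the Impala source.

```python
def check_footer_read(bytes_read: int, footer_range_len: int, filename: str) -> None:
    """Raise if fewer bytes came back than the footer scan range asked for.

    Hypothetical helper: a short read of the footer range suggests the file
    on disk is now shorter than the cached file-size metadata claims.
    """
    if bytes_read < footer_range_len:
        raise IOError(
            f"{filename}: read {bytes_read} of {footer_range_len} footer bytes; "
            "the file may have been overwritten by a shorter file. "
            "Refresh the table metadata and retry."
        )
```

The point of the sketch is that the short read becomes a query error with an actionable message instead of an assertion failure.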

Change-Id: Ie2031ac2dc90e4f2573bd3ca8a3709db60424f07
Reviewed-on: http://gerrit.cloudera.org:8080/1084
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2015-10-01 13:58:39 -07:00
Ippokratis Pandis
e99c68fe52 IMPALA-2130: Wrong verification of Parquet file version
This patch corrects a mistake in the verification of the Parquet file magic
number and adds a test for it. Note that with this patch Impala may fail to
read Parquet files with a wrong magic number that it was able to read before.
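
For illustration, a strict magic-number check could look like the Python sketch below. It relies only on the Parquet format's 4-byte "PAR1" marker at the end of the file; the function name has_valid_magic is hypothetical, and the sketch does not reproduce the specific mistake this commit fixed.

```python
PARQUET_MAGIC = b"PAR1"  # 4-byte marker that ends (and starts) a Parquet file


def has_valid_magic(file_tail: bytes) -> bool:
    """Return True only if the last four bytes match the magic exactly."""
    if len(file_tail) < len(PARQUET_MAGIC):
        return False
    return file_tail[-len(PARQUET_MAGIC):] == PARQUET_MAGIC
```

Comparing all four bytes for equality is what makes files with a near-miss marker fail, which matches the behavior change noted above.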

Change-Id: Iff31accda1e1d541946ef1f750e38886ce4cb8d5
Reviewed-on: http://gerrit.cloudera.org:8080/515
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-07-14 02:52:02 +00:00
Casey Ching
ac0c075997 Parquet: Fix value def level when max def level is 0
When running with a release build, NULL would be returned when
reading values from required fields in Parquet files (with a debug
build a DCHECK would be hit).

Previously, when the max definition level for a field was 0 (which
happens if a field is required), the definition level for a value was
incorrectly set to 1. The max definition level is related to nested
data and is defined to be the number of nullable fields that will be
encountered when traversing a path to reach the desired end field.
For example, if a nested schema has a path a.b.c.d where b and d are
nullable then the max def level is 2. A def level is attached to each
value to indicate the number of optional values that are present (in
the previous example a def level of 2 means both b and d are not
null). So a value should never have a def level that is greater than
the max def level for its field.
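
The a.b.c.d example can be worked through in a short Python sketch. The helper names max_def_level and is_value_present, and the (name, nullable) pair representation of a schema path, are assumptions made for illustration and are not Impala's data structures.

```python
def max_def_level(path):
    """Count the nullable fields along a schema path.

    `path` is a list of (field_name, is_nullable) pairs, root first.
    """
    return sum(1 for _name, nullable in path if nullable)


def is_value_present(def_level, max_level):
    """Return True if the leaf value is non-null; reject impossible levels."""
    if not 0 <= def_level <= max_level:
        raise ValueError(f"def level {def_level} outside [0, {max_level}]")
    return def_level == max_level


# Path a.b.c.d where only b and d are nullable.
path = [("a", False), ("b", True), ("c", False), ("d", True)]
print(max_def_level(path))  # 2
```

For a required top-level field the max def level is 0, so the only valid def level is 0 and a def level of 1 is rejected, which is the situation this commit addresses.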

Change-Id: Ia91a97cf79e672c420d10416c6817f0930dcc920
(cherry picked from commit cdd67e4c7fd62d5b08adfaa303d7bb2382e6932c)
Reviewed-on: http://gerrit.cloudera.org:8080/386
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-05-15 06:41:02 +00:00
Dan Hecht
3735ea94a0 S3: Don't seek/read past file end
DistributedFileSystem is lenient about seeking past the end of the file.
Other FileSystem implementations, such as NativeS3FileSystem, return an
error on this condition. That leads to a scary-looking message in the
query warnings.

So, when creating scan ranges, let's require that the ranges fall within
the file bounds (at least according to what the HdfsFileDesc indicates
is the length). There were a couple of kinds of AllocateScanRange()
callsites that needed to be fixed up:

1) When a stream wants to read past a scan range, be careful not to read
past the end of the file.

2) When Impala needs to "guess" at the length of a range, use the
file_length as an upper bound on the guess. We were already doing this
in some places but not everywhere.

3) When the scan range is derived from parquet metadata, validate the
metadata against file_length and issue appropriate errors.  This will
give better diagnostics for corrupt files.

Note that we can't rely on this for safety (HdfsFileDesc file_length may
be stale), but it does mean that when metadata is up-to-date Impala will
no longer try to access beyond the end of files (and so we'll no longer
get false positive errors from the filesystem).
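
Cases 2) and 3) above can be sketched in Python as follows. The function names clamp_range_len and validate_metadata_range are hypothetical; only file_length corresponds to a name used in this commit message, and the real logic lives in Impala's C++ scanners.

```python
def clamp_range_len(offset: int, guessed_len: int, file_length: int) -> int:
    """Bound a guessed range length so offset + length stays inside the file."""
    if not 0 <= offset <= file_length:
        raise ValueError(f"offset {offset} outside file of length {file_length}")
    return min(guessed_len, file_length - offset)


def validate_metadata_range(offset: int, length: int, file_length: int) -> None:
    """Reject a range derived from file metadata that extends past file_length."""
    if offset < 0 or length < 0 or offset + length > file_length:
        raise ValueError(
            f"range [{offset}, {offset + length}) exceeds file length {file_length}"
        )
```

The guess is silently clamped because it was never authoritative, while a range taken from Parquet metadata is reported as an error because a mismatch there points to a corrupt file.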

Additionally, this change revealed a pre-existing problem with files
that have multiple row-groups.  The first time through InitColumns(),
stream_ was set to NULL.  But, stream_->filename could potentially be
accessed when constructing error statuses for subsequent row-groups.

Change-Id: Ia668fa8c261547f85a18a96422846edcea57043e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5424
Reviewed-by: Daniel Hecht <dhecht@cloudera.com>
Tested-by: jenkins
2015-01-08 16:19:35 -08:00
Skye Wanderman-Milne
4a722980e5 IMPALA-1401: raise MAX_PAGE_HEADER_SIZE and use scanner context to stitch together header buffer

Change-Id: I4f33b90e845e9bef1ac929bf4ebb8e98eaff985c
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4961
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
(cherry picked from commit c3a90183b2f03434a9604f3aa2ef6dd08c9ba97c)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4981
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-10-27 16:30:56 -07:00
Skye Wanderman-Milne
561da008c7 IMPALA-729: fix resource management in Parquet scanner for multiple row groups
We weren't attaching resources to the row batch when starting a new
row group, so it was possible for string data to be overwritten. This
patch removes CloseStreams() and merges its functionality with
AttachCompletedResources() so it's not possible to destroy streams
without transferring the resources first. It also merges and removes
ScannerContext::Close().

Also adds test cases for IMPALA-720.

Change-Id: Ia8f40c7d39d8702716f1d337fe797e2696bd0fcb
2014-01-08 10:56:26 -08:00
Skye Wanderman-Milne
9e17042185 Allow zero bit width dict/RLE decoders.
This allows us to read single-value dictionary-encoded columns
generated by parquet-mr.
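
To illustrate why a zero bit width arises: a dictionary with a single entry needs no bits per index, because every index is 0. The Python sketch below uses hypothetical helper names and a simplified least-significant-bit-first unpacker; it is not Impala's decoder.

```python
def bit_width_for(dict_size: int) -> int:
    """Bits needed to encode indices 0..dict_size-1 (0 for a single entry)."""
    return max(dict_size - 1, 0).bit_length()


def decode_bit_packed(data: bytes, bit_width: int, num_values: int) -> list:
    """Unpack num_values indices of bit_width bits each, low bits first."""
    if bit_width == 0:
        # No payload bits exist; every index refers to dictionary entry 0.
        return [0] * num_values
    values, buffer, bits = [], 0, 0
    mask = (1 << bit_width) - 1
    for byte in data:
        buffer |= byte << bits
        bits += 8
        while bits >= bit_width and len(values) < num_values:
            values.append(buffer & mask)
            buffer >>= bit_width
            bits -= bit_width
    return values
```

A decoder that rejects a bit width of 0 as invalid would be unable to read such a column, even though the encoded data is well defined.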

Change-Id: I80903d910d0cc3a3e4ebf02e34212d868e94feb4
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1098
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:27 -08:00
Skye Wanderman-Milne
de531e15bd IMPALA-694: Allow Impala to read files produced by parquet-mr version <= 1.2.8
parquet-mr had a bug where it didn't include the dictionary page's
header in the total column size. We now compensate for this by
detecting these files and padding the scan range length. This required
changing how the scanner detects when it's finished: it now counts the
number of rows rather than checking eosr (since the scan range may be
longer than the column).
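
The change in termination condition can be sketched in Python. Here a column is modeled as a sequence of pages (lists of values) that may be followed by unrelated bytes because of the padding; the function name read_column and this page model are assumptions for illustration only.

```python
def read_column(pages, expected_rows: int) -> list:
    """Read exactly expected_rows values, ignoring anything after them.

    The loop stops on the row count rather than on reaching the end of the
    (possibly padded) scan range.
    """
    rows = []
    for page in pages:
        if len(rows) >= expected_rows:
            break  # everything past this point is padding
        rows.extend(page[: expected_rows - len(rows)])
    if len(rows) < expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(rows)}")
    return rows
```

Counting rows makes the extra bytes harmless: the reader never interprets data beyond the last expected value.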

Change-Id: Id9933808b965003c0c3b3aa78c32fe29a0c4bcbe
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1097
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:27 -08:00