S3: Don't seek/read past file end

DistributedFileSystem is lenient about seeking past the end of a file.
Other FileSystem implementations, such as NativeS3FileSystem, return an
error in that case, which leads to a scary-looking message in the query
warnings.

So, when creating scan ranges, let's require that the ranges fall within
the file bounds (at least according to what the HdfsFileDesc indicates
is the length). A few kinds of AllocateScanRange() call sites needed to
be fixed up (a rough sketch of the clamping idea follows the list):

1) When a stream wants to read past a scan range, be careful not to read
past the end of the file.

2) When Impala needs to "guess" at the length of a range, use the
file_length as an upper bound on the guess.  We were already doing this
in some places, but not everywhere.

3) When the scan range is derived from parquet metadata, validate the
metadata against file_length and issue appropriate errors.  This will
give better diagnostics for corrupt files.

Note that we can't rely on this for safety (HdfsFileDesc file_length may
be stale), but it does mean that when metadata is up-to-date Impala will
no longer try to access beyond the end of files (and so we'll no longer
get false positive errors from the filesystem).

Additionally, this change revealed a pre-existing problem with files
that have multiple row groups: the first time through InitColumns(),
stream_ was set to NULL, but stream_->filename could then be accessed
when constructing error statuses for subsequent row groups.
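
A minimal standalone illustration of that row-group issue (simplified
types, not the actual scanner code): cache the file name while the
stream is still valid rather than dereferencing stream_ later.

  #include <iostream>
  #include <memory>
  #include <string>

  struct Stream { std::string filename; };

  struct Scanner {
    std::unique_ptr<Stream> stream;  // goes away after the first row group
    std::string cached_filename;     // captured while the stream is valid

    std::string ErrorContext() const {
      // Unsafe alternative: stream->filename would dereference a null
      // pointer once the stream has been released.
      return "Corrupt Parquet metadata in file " + cached_filename;
    }
  };

  int main() {
    Scanner s;
    s.stream.reset(new Stream{"bad_compressed_size.parquet"});
    s.cached_filename = s.stream->filename;  // capture before reset
    s.stream.reset();                        // subsequent row group: stream gone
    std::cout << s.ErrorContext() << "\n";   // still prints a usable file name
    return 0;
  }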

Change-Id: Ia668fa8c261547f85a18a96422846edcea57043e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5424
Reviewed-by: Daniel Hecht <dhecht@cloudera.com>
Tested-by: jenkins