Mirror of https://github.com/apache/impala.git (synced 2026-01-08 03:02:48 -05:00)
DistributedFileSystem is lenient about seeking past the end of the file. Other FileSystem implementations, such as NativeS3FileSystem, return an error on this condition. That leads to a scary-looking message in the query warnings. So, when creating scan ranges, require that the ranges fall within the file bounds (at least according to the length that the HdfsFileDesc reports).

There were a few kinds of AllocateScanRange() callsites that needed to be fixed up:

1) When a stream wants to read past a scan range, be careful not to read past the end of the file.

2) When Impala needs to "guess" at the length of a range, use the file_length as an upper bound on the guess. We were already doing this in some places but not everywhere.

3) When the scan range is derived from parquet metadata, validate the metadata against file_length and issue appropriate errors. This will give better diagnostics for corrupt files.

Note that we can't rely on this for safety (HdfsFileDesc file_length may be stale), but it does mean that when metadata is up-to-date, Impala will no longer try to access beyond the end of files (and so we'll no longer get false positive errors from the filesystem).

Additionally, this change revealed a pre-existing problem with files that have multiple row-groups. The first time through InitColumns(), stream_ was set to NULL. But stream_->filename could still be accessed when constructing error statuses for subsequent row-groups.

Change-Id: Ia668fa8c261547f85a18a96422846edcea57043e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5424
Reviewed-by: Daniel Hecht <dhecht@cloudera.com>
Tested-by: jenkins
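The three callsite fixes come down to two bounds checks against file_length, which can be sketched roughly as follows. This is an illustrative sketch only, not the code from this change: FileDesc, ClampRangeLength() and ValidateMetadataRange() are hypothetical names invented here, and the real AllocateScanRange() callsites and HdfsFileDesc type differ. It assumes C++17 for std::optional.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for the file descriptor; only the length matters here.
// As the commit message notes, this length may be stale, so the checks below
// improve diagnostics but are not a safety guarantee.
struct FileDesc {
  std::string filename;
  int64_t file_length;
};

// Cases 1 and 2: clamp a requested or guessed length so that the range
// [offset, offset + len) never extends past the end of the file.
// Returns 0 when nothing can be read at that offset.
int64_t ClampRangeLength(const FileDesc& file, int64_t offset, int64_t len) {
  if (offset < 0 || offset >= file.file_length || len <= 0) return 0;
  return std::min(len, file.file_length - offset);
}

// Case 3: a range taken from file metadata is not clamped silently. If it
// falls outside the file, return an error message so that a corrupt file is
// reported clearly. Returns std::nullopt when the range is within bounds.
std::optional<std::string> ValidateMetadataRange(
    const FileDesc& file, int64_t offset, int64_t len) {
  // Compare len against the remaining bytes instead of computing
  // offset + len, which could overflow for corrupt metadata values.
  if (offset < 0 || len < 0 || offset > file.file_length ||
      len > file.file_length - offset) {
    return "File '" + file.filename + "': metadata range at offset " +
           std::to_string(offset) + " with length " + std::to_string(len) +
           " exceeds file length " + std::to_string(file.file_length);
  }
  return std::nullopt;
}
```

The sketch clamps guessed lengths but rejects out-of-bounds metadata, since a guess that overshoots is expected while metadata that overshoots suggests a corrupt file.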