Commit Graph

5 Commits

Author SHA1 Message Date
Alex Behm
931bf49cd9 IMPALA-3905: HdfsScanner::GetNext() for Avro, RC, and Seq scans.
Implements HdfsScanner::GetNext() for the Avro, RC File, and
Sequence File scanners. Changes ProcessSplit() to repeatedly call
GetNext() to share the core scanning code between the legacy
ProcessSplit() interface and the new GetNext()
interface.

Summary of changes:
- Slightly change code flow for initial scan range that
  only parses the file header. The new code sets
  'only_parsing_header_' in Open() and then honors
  that flag in GetNextInternal(). Before, all the logic
  was inside ProcessSplit().
- Replace 'finished_' with 'eos_'.
- Add a RowBatch parameter to various functions.
- Change Close() to free all resources when a nullptr
  RowBatch is passed.
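
The control flow described above can be sketched roughly as follows. This is an illustrative Python model, not the actual C++ scanner code: the classes RowBatch and SketchScanner, their methods, and the 'sink' list are hypothetical names chosen only to show how a legacy process_split() can be reduced to a loop over get_next(), with a header-only flag set up front and an 'eos' flag ending the loop.

```python
class RowBatch:
    """Hypothetical stand-in for a fixed-capacity batch of rows."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = []

    def is_full(self):
        return len(self.rows) >= self.capacity


class SketchScanner:
    """Hypothetical scanner modelling the flow in the commit message."""

    def __init__(self, records, only_parsing_header=False):
        self._records = iter(records)
        # Decided once when the scanner is opened, honored in get_next().
        self.only_parsing_header = only_parsing_header
        self.eos = False  # end-of-stream flag

    def get_next(self, batch):
        """Fill 'batch' with rows until it is full or input is exhausted."""
        if self.only_parsing_header:
            # A header-only scan range produces no rows.
            self.eos = True
            return
        while not batch.is_full():
            try:
                batch.rows.append(next(self._records))
            except StopIteration:
                self.eos = True
                return

    def process_split(self, batch_size, sink):
        """Legacy-style interface: drive get_next() until end of stream."""
        while not self.eos:
            batch = RowBatch(batch_size)
            self.get_next(batch)
            if batch.rows:
                sink.append(batch.rows)
```

With this shape, both interfaces share one scanning path; only the driver loop differs.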

Testing:
- Exhaustive tests passed on debug
- Core tests passed on asan
- TODO: Perf testing on cluster

Change-Id: Ie18f57b0d3fe0052a8ccd361b6a5fcdf979d0669
Reviewed-on: http://gerrit.cloudera.org:8080/6527
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-07-01 21:59:34 +00:00
Tim Armstrong
6587c08f70 IMPALA-4387: validate decimal type in Avro file schema
This patch prevents an invalid decimal type in an Avro file schema from
crashing Impala. Most invalid Avro schemas are caught by the frontend,
but file schemas still need to be validated by the backend.

After this patch, files with bad schemas are skipped.
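
As a rough illustration of this kind of check, the sketch below validates the decimal attributes of a file schema and skips files that fail. It is written in Python for brevity and does not reproduce the backend's actual checks: the function names, the dict-based schema representation, and the MAX_PRECISION value of 38 are assumptions. The rules themselves (positive precision, scale defaulting to 0 and bounded by precision) follow the Avro decimal logical type.

```python
MAX_PRECISION = 38  # assumption: largest decimal precision the reader supports


def decimal_schema_error(schema):
    """Return an error message for an invalid decimal schema, else None.

    'schema' is a hypothetical dict such as
    {"logicalType": "decimal", "precision": 9, "scale": 2}.
    """
    precision = schema.get("precision")
    scale = schema.get("scale", 0)  # Avro: scale is optional, default 0
    if not isinstance(precision, int) or precision <= 0:
        return "precision must be a positive integer"
    if precision > MAX_PRECISION:
        return "precision exceeds the supported maximum"
    if not isinstance(scale, int) or scale < 0 or scale > precision:
        return "scale must be between 0 and precision"
    return None


def scannable_files(files):
    """Keep only files whose schema passes validation.

    'files' is an iterable of (filename, schema) pairs; files with a bad
    schema are skipped instead of aborting the whole scan.
    """
    return [name for name, schema in files
            if decimal_schema_error(schema) is None]
```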

Testing:
This was hit very rarely by the scanner fuzzing. Added a regression test that
scans a file with a bad schema.

Change-Id: I25a326ee2220bc14d3b5f887dc288b4adf859cfc
Reviewed-on: http://gerrit.cloudera.org:8080/4876
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-10-30 00:12:58 +00:00
Henry Robinson
34b5f1c416 IMPALA-(3895,3859): Don't log file data on parse errors
Logging file or table data is a bad idea, and doing it by default is
particularly bad. This patch changes HdfsScanNode::LogRowParseError() to
log a file and offset only.

Testing: See rewritten tests.

To support testing this change, we also fix IMPALA-3895, by introducing
a canonical string __HDFS_FILENAME__ that all Hadoop filenames in the ERROR
output are replaced with before comparing with the expected
results. This fixes a number of issues with the old way of matching
filenames which purported to be a regex, but really wasn't. In
particular, we can now match the rest of an ERROR line after the
filename, which was not possible before.

In some cases, we don't want to substitute filenames because the ERROR
output is looking for a very specific output. In that case we can write:

$NAMENODE/<filename>

and this patch will not perform _any_ filename substitutions on ERROR
sections that contain the $NAMENODE string.
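
The substitution rule can be sketched as below. This is a minimal illustration, not the test framework's real code: the function name, the regular expression, and the assumption that filenames look like hdfs://... paths without whitespace or commas are all hypothetical. Only the __HDFS_FILENAME__ token and the $NAMENODE opt-out come from the description above.

```python
import re

FILENAME_TOKEN = "__HDFS_FILENAME__"
NAMENODE_MARKER = "$NAMENODE"

# Assumption: a filename is an hdfs:// URI containing no whitespace or commas.
HDFS_PATH_RE = re.compile(r"hdfs://[^\s,]+")


def canonicalize_errors(section):
    """Replace every HDFS filename in an ERROR section with the canonical token.

    Sections that mention $NAMENODE are returned untouched, so a test can
    still match one specific filename.
    """
    if NAMENODE_MARKER in section:
        return section
    return HDFS_PATH_RE.sub(FILENAME_TOKEN, section)
```

Because only the filename is replaced, the remainder of the line after it stays available for exact comparison.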

Finally, this patch fixes a bug where a test that had an ERRORS section
but no RESULTS section would silently pass without testing anything.

Change-Id: I5a604f8784a9ff7b4bf878f82ee7f56697df3272
Reviewed-on: http://gerrit.cloudera.org:8080/4020
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2016-08-25 10:20:36 +00:00
Tim Armstrong
bc8c55afcd IMPALA-3729: batch_size=1 coverage for avro scanner
Also fix a stale comment in the avro scanner header.

The main work here is to fix the handling of empty result sets in the
test result verifier. This is a problem because we wanted to verify
that the results in the test file were a superset of the rows
returned, and this was thrown off by superfluous '' rows in the expected
and actual result sets.

The basic problem is that the way test file sections
were parsed conflated an empty result section with a non-empty result
section that had a single empty string. I.e.:

---- RESULTS
====

vs
---- RESULTS

====

both got resolved to [''].
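
The distinction can be illustrated with a small sketch. Both functions below are hypothetical and are not taken from the test framework; conflated_parse shows one way such a conflation can arise (joining the section into a string and splitting it again), while parse_results keeps the section as a list of lines so that an empty section stays empty.

```python
def _section_lines(test_text):
    """Return the lines between '---- RESULTS' and the closing '====' marker."""
    lines = test_text.split("\n")
    start = lines.index("---- RESULTS") + 1
    end = lines.index("====", start)
    return lines[start:end]


def conflated_parse(test_text):
    """Hypothetical buggy variant: round-trips the section through a string.

    "".split("\n") yields [''], so an empty section and a section holding
    one empty line become indistinguishable.
    """
    return "\n".join(_section_lines(test_text)).split("\n")


def parse_results(test_text):
    """Keeps the line list as-is: [] for no rows, [''] for one empty row."""
    return _section_lines(test_text)
```

With parse_results, the first example above yields [] and the second yields [''], so a superset check is no longer thrown off by a spurious '' row.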

Change-Id: Ia007e558d92c7e4ce30be90446fdbb1f50a0ebc4
Reviewed-on: http://gerrit.cloudera.org:8080/3413
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-07-19 23:30:02 -07:00
Skye Wanderman-Milne
01287a3ba9 IMPALA-3441, IMPALA-3659: check for malformed Avro data
This patch adds error checking to the Avro scanner (both the codegen'd
and interpreted paths), including out-of-bounds checks and data
validity checks.

I ran a local benchmark using the following queries:
  set num_scanner_threads=1;
  select count(i) from default.avro_bigints_big; # file contains only longs
  select max(l_orderkey) from biglineitem_avro; # file has tpch.lineitem schema

Both benchmark queries see negligible or no performance impact.

This patch adds a new Avro scanner unit test and an end-to-end test
that queries several corrupted files, as well as updates the zig-zag
varlen int unit test.
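
For readers unfamiliar with the encoding, the sketch below decodes an Avro-style zig-zag variable-length long with the two kinds of checks mentioned above: an out-of-bounds check on every byte read and a validity check on the encoded length. It is a Python illustration rather than the scanner's C++ code; the function name, the ValueError reporting, and the 10-byte limit constant are assumptions made for the example.

```python
MAX_VARINT_BYTES = 10  # a 64-bit value needs at most 10 groups of 7 bits


def read_zigzag_long(buf, pos=0):
    """Decode one zig-zag varint from 'buf' starting at 'pos'.

    Returns (value, new_pos). Raises ValueError if the buffer ends in the
    middle of the integer or if the encoding is longer than 10 bytes.
    This sketch does not additionally verify that the result fits in 64 bits.
    """
    result = 0
    shift = 0
    for i in range(MAX_VARINT_BYTES):
        if pos + i >= len(buf):
            raise ValueError("truncated varint")  # out-of-bounds check
        byte = buf[pos + i]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            # Continuation bit clear: undo the zig-zag mapping
            # (0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...).
            return (result >> 1) ^ -(result & 1), pos + i + 1
        shift += 7
    raise ValueError("varint longer than 10 bytes")  # validity check
```

For example, the single byte 0x02 decodes to 1, and the two bytes 0x80 0x01 decode to 64.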

Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
Reviewed-on: http://gerrit.cloudera.org:8080/3072
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-06-13 18:32:32 -07:00