This adds a test that performs some simple fuzz testing of HDFS
scanners. It creates a copy of a given HDFS table, with each
file in the table corrupted in a random way: either a single
byte is set to a random value, or the file is truncated to a
random length. It then runs a query that scans the whole table
with several different batch_size settings. I made some effort
to make the failures reproducible by explicitly seeding the
random number generator, and providing a mechanism to override
the seed.
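
A minimal sketch of the corruption step described above, not the test's
actual code. The names pick_seed, corrupt_file, and the FUZZ_SEED
environment variable are hypothetical, and the 50/50 split between the two
corruption modes is an assumption. It operates on a local file rather
than on HDFS.

```python
import os
import random
import time


def pick_seed(env_var="FUZZ_SEED"):
    """Return the seed from the (hypothetical) env var, else a time-based one."""
    override = os.environ.get(env_var)
    return int(override) if override is not None else int(time.time())


def corrupt_file(path, rng):
    """Corrupt one file in place and report which corruption was applied."""
    with open(path, "rb") as f:
        data = bytearray(f.read())
    if not data:
        return "empty"  # nothing to corrupt
    if rng.random() < 0.5:  # assumed split between the two modes
        # Set a single byte to a random value.
        data[rng.randrange(len(data))] = rng.randrange(256)
        action = "flip"
    else:
        # Truncate the file to a random length shorter than the original.
        del data[rng.randrange(len(data)):]
        action = "truncate"
    with open(path, "wb") as f:
        f.write(data)
    return action
```

Logging the value returned by pick_seed() and passing random.Random(seed)
to every corrupt_file() call means a failing run can be replayed by
exporting the same seed.
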

The fuzzer has found crashes resulting from corrupted or truncated
input files for RCFile, SequenceFile, Parquet, and Text LZO so far.
Avro only had a small buffer read overrun detected by ASAN.

Includes fixes for Parquet crashes found by the fuzzer, a small
buffer overrun in Avro, and a DCHECK in MemPool.

Initially it is only enabled for Avro, Parquet, and uncompressed
text. As follow-up work we should fix the bugs in the other scanners
and enable the test for them.

We also don't implement abort_on_error=0 correctly in Parquet, and
for some file formats corrupt headers result in the query being
aborted. The test therefore xfails when a query throws an exception.
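
A rough sketch of how the scan loop and the exception handling could fit
together, under stated assumptions: run_query is a hypothetical callable
standing in for the test's query client, and the table name, query text,
and batch_size values are illustrative only.

```python
def scan_with_batch_sizes(run_query, batch_sizes=(0, 1, 16, 10000)):
    """Run the same full-table scan once per batch_size.

    Errors are collected instead of raised so the caller can decide
    whether to treat them as expected failures.
    """
    failures = []
    for batch_size in batch_sizes:
        # abort_on_error=0 asks the scanners to skip bad data where they can.
        options = {"abort_on_error": 0, "batch_size": batch_size}
        try:
            run_query("select * from fuzzed_table", options)  # hypothetical table
        except Exception as e:
            failures.append((batch_size, str(e)))
    return failures
```

In a pytest-based test, a non-empty result could be passed to
pytest.xfail() so that an aborted query is recorded as an expected
failure, while a crash of the server still fails the run.
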

Testing:
Ran the test with exploration_strategy=exhaustive in a loop locally
with both DEBUG and ASAN builds for a couple of days over a weekend.
Also ran an exhaustive private build.

Change-Id: I50cf43195a7c582caa02c85ae400ea2256fa3a3b
Reviewed-on: http://gerrit.cloudera.org:8080/3833
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins