There are cases of Parquet files where the metadata indicates a wrong number of rows.
Until now, the parquet-scanner did not report any problem in this case; instead, it
kept reading as long as there were values for the read columns. With IMPALA-1016,
however, we now read at most as many rows as the metadata specifies.
With this patch, the parquet-scanner checks, right before it finishes scanning, whether
it read the expected number of rows (taken from the metadata). If the actual number of
rows read is less than or greater than the expected number, it either aborts or logs
an error.
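
For illustration, here is a minimal Python sketch of the shape of that check; the
names (validate_row_count, abort_on_error) are hypothetical, and this is not the
scanner's actual C++ code:

    import logging

    def validate_row_count(rows_read, metadata_num_rows, abort_on_error):
        """Compare the rows actually read against the file metadata's row count."""
        if rows_read == metadata_num_rows:
            return  # Metadata is consistent with what was read.
        msg = ("Parquet metadata claims %d rows, but %d rows were read"
               % (metadata_num_rows, rows_read))
        if abort_on_error:
            raise RuntimeError(msg)  # Abort the scan.
        logging.error(msg)           # Otherwise log the mismatch and continue.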
Change-Id: Ie6a66a38e8912730bf04762e6526ec1cadb2bcdc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2755
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2944

This is the first set of changes required to start moving our functional test
infrastructure from JUnit to Python. After investigating a number of options, I
decided to go with a Python test executor named py.test (http://pytest.org/). It is
very flexible, open source (MIT licensed), and will enable us to do some cool things
like parallel test execution.
As part of this change, we now use our "test vectors" for query test execution.
This is very useful because it means that if you load the "core" dataset, you know
you will be able to run the "core" query tests (selected via --exploration_strategy
when running the tests).
You will see that each combination of table format + query exec options is now
treated as an individual test case. This will make it much easier to pinpoint
exactly where something failed.
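
As an illustration of that model, here is a hypothetical py.test sketch; the vector
values and the run_query helper are made up for this example and are not the real
test harness:

    import pytest

    # Hypothetical stand-in for the real query harness.
    def run_query(sql, table_format, exec_options):
        return "%s @ %s %s" % (sql, table_format, exec_options)

    # Each (table format, exec options) pair in the test vector becomes its
    # own test case, so a failure report names the exact combination.
    TEST_VECTORS = [
        ("text/none", {"batch_size": 0}),
        ("parquet/none", {"batch_size": 0}),
        ("text/gzip", {"batch_size": 1}),
    ]

    @pytest.mark.parametrize("table_format,exec_options", TEST_VECTORS)
    def test_query(table_format, exec_options):
        result = run_query("select count(*) from alltypes",
                           table_format, exec_options)
        assert result is not None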
These new tests can be run using the script at tests/run-tests.sh.