CodegenWriteSlot() receives negative lengths for slots that contain
escape characters. (This currently applies only to non-string types,
as we do not codegen string types with escaped characters.) The
DelimitedTextParser is responsible for identifying escape characters
and assigning the negative lengths accordingly.
CodegenWriteCompleteTuple() passes this length to CodegenWriteSlot()
as-is. This differs from the behavior of WriteSlot(), whose callers
(including WriteCompleteTuple()) ensure that the length passed to it
is always positive.
The IrIsNullString() and IrGenericIsNullString() functions are
responsible for checking whether the given data matches a NULL
pattern. They are called by CodegenWriteSlot(). A NULL pattern usually
contains an escaped character, which means that the length of that
slot will be negative. However, IrIsNullString() and
IrGenericIsNullString(), which receive this length from
CodegenWriteSlot(), always expect a positive length argument. As a
result, no slots were ever marked as NULL by these NULL-checking
functions when codegen was enabled.
NULL slots were still detected, but only accidentally, because of
incorrect code in CodegenWriteSlot() that marked both invalid slots
and NULL slots as NULL. Due to this code, invalid slots were never
flagged as invalid and did not return an error; they were just
silently marked as NULL.
This patch ensures that only positive lengths are passed to
CodegenWriteSlot(), so that NULL checking is correct, and also ensures
that invalid slots are not silently marked as NULL.
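As a rough illustration of the corrected length handling, here is a
minimal Python sketch (the actual fix lives in Impala's C++ codegen;
parse_int_slot and its default NULL pattern are hypothetical):

  def parse_int_slot(data, length, null_pattern=b"\\N"):
      # Hedged sketch of WriteSlot-style NULL/error handling for one slot.
      # The DelimitedTextParser flags escaped fields with a negative length.
      needs_unescaping = length < 0
      length = -length if needs_unescaping else length
      field = data[:length]
      # The NULL check must see the positive length to match correctly.
      # (Actual unescaping of the field is omitted in this sketch.)
      if field == null_pattern:
          return None
      try:
          return int(field.decode())
      except ValueError:
          # Invalid slots surface an error instead of silently becoming NULL.
          raise ValueError("invalid slot value: %r" % (field,))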
Testing: Re-enabled an older hdfs-scan-node-errors test and
reformatted it to match the new error message format introduced by
IMPALA-3859 and IMPALA-3895.
Change-Id: I858e427ad7c2b2da8c2bb657be06b7443655781f
Reviewed-on: http://gerrit.cloudera.org:8080/5377
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
Logging file or table data is a bad idea, and doing it by default is
particularly bad. This patch changes HdfsScanNode::LogRowParseError()
to log only a filename and offset.
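As a rough illustration (the function shape and message format here
are assumptions, not Impala's actual code), the new behavior amounts
to:

  def log_row_parse_error(filename, row_offset):
      # Report where the bad row is, never the row's contents.
      return "Error parsing row: file: %s, before offset: %d" % (
          filename, row_offset)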
Testing: See rewritten tests.
To support testing this change, we also fix IMPALA-3895 by introducing
a canonical string, __HDFS_FILENAME__, with which all Hadoop filenames
in the ERROR output are replaced before comparing against the expected
results. This fixes a number of issues with the old way of matching
filenames, which purported to be a regex but really wasn't. In
particular, we can now match the rest of an ERROR line after the
filename, which was not possible before.
In some cases we don't want to substitute filenames, because the test
expects a very specific ERROR output. In that case we can write:
$NAMENODE/<filename>
and this patch will not perform _any_ filename substitutions on ERROR
sections that contain the $NAMENODE string.
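A minimal Python sketch of this canonicalization step (the regex and
the helper name are assumptions; the real pattern lives in the test
framework):

  import re

  # Matches an HDFS URI up to, but not including, trailing whitespace.
  HDFS_PATH_RE = re.compile(r"hdfs://[^\s]+")

  def canonicalize_errors(actual, expected):
      # If the expected ERRORS section pins an exact path via $NAMENODE,
      # skip substitution entirely and compare verbatim.
      if "$NAMENODE" in expected:
          return actual
      return HDFS_PATH_RE.sub("__HDFS_FILENAME__", actual)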
Finally, this patch fixes a bug where a test that had an ERRORS section
but no RESULTS section would silently pass without testing anything.
Change-Id: I5a604f8784a9ff7b4bf878f82ee7f56697df3272
Reviewed-on: http://gerrit.cloudera.org:8080/4020
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
Fix a bug in which Impala reads only the first stream of a
multi-stream bz2/gzip file. Also changes the bz2 decoder to read the
file in a streaming fashion rather than reading the entire file into
memory before decompressing it.
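The multi-stream idea, sketched with Python's bz2 module purely for
illustration (Impala's fix is in its C++ decompressor): a decompressor
object handles one stream, and any bytes left over after a stream ends
belong to the next stream, so the decoder must loop rather than stop
at the first end-of-stream.

  import bz2

  def decompress_all_streams(chunks):
      # One BZ2Decompressor per stream; .unused_data holds the bytes
      # that follow the current stream's end marker.
      decomp = bz2.BZ2Decompressor()
      for chunk in chunks:          # streaming: one input buffer at a time
          while chunk:
              yield decomp.decompress(chunk)
              if not decomp.eof:
                  break             # this stream needs more input
              chunk = decomp.unused_data
              decomp = bz2.BZ2Decompressor()   # start on the next stream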
Change-Id: Icbe617d03a69953f0bf3aa0f7c30d34bc612f9f8
(cherry picked from commit b6d0b4e059329633dc50f1f73ebe35b7ac317a8e)
Reviewed-on: http://gerrit.cloudera.org:8080/2219
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
I did a local benchmark and there is minimal performance impact (<1%).
Change-Id: I8d84a145acad886c52587258b27d33cff96ea399
(cherry picked from commit 7e750ad5d90007cc85ebe493af4dce7a537ad7c0)
Reviewed-on: http://gerrit.cloudera.org:8080/189
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
Re-enables data error tests which were not being included in
run-tests.py. Broken tests were updated, with one exception, which
is tracked by IMPALA-1862. Depends on a related change to
Impala-lzo.
Change-Id: I4c42498bdebf9155a8722695a3305b63ecc6e5f3
Reviewed-on: http://gerrit.cloudera.org:8080/194
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Compressed text formats currently require that entire compressed
files be read into memory and decompressed in a single call to the
decompression codec. This change makes the HdfsTextScanner drive gzip
in a streaming mode, i.e. it produces partial output as input is
consumed.
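The streaming pattern, sketched with Python's zlib for illustration
(the chunk size and helper are assumptions; 16 + MAX_WBITS tells zlib
to expect a gzip wrapper):

  import zlib

  def stream_gunzip(fileobj, chunk_size=64 * 1024):
      decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
      while True:
          chunk = fileobj.read(chunk_size)
          if not chunk:
              break
          # Emit whatever output this input buffer yields; the whole
          # file never needs to be held in memory.
          yield decomp.decompress(chunk)
      yield decomp.flush()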
Change-Id: Id5c0805e18cf6b606bcf27a5df4b5f58895809fd
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5233
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 05c3cc55e7a601d97adc4eebe03f878c68a33e56)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5385
This re-enables a subset of the stable data error tests and updates
them to work in our test framework, including support for updating
results via --update_results. This also lets us remove a lot of old
code that was there only to support these disabled tests.
Change-Id: I4c40c3976d00dfc710d59f3f96c99c1ed33e7e9b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1952
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2277
This is the first set of changes required to start moving our
functional test infrastructure from JUnit to Python. After
investigating a number of options, I decided to go with a Python test
executor named py.test (http://pytest.org/). It is very flexible, open
source (MIT licensed), and will enable us to do some cool things like
parallel test execution.
As part of this change, we now use our "test vectors" for query test
execution. This will be very nice because it means that if you load
the "core" dataset, you know you will be able to run the "core" query
tests (specified by --exploration_strategy when running the tests).
You will see that each combination of table format + query exec
options is now treated as an individual test case. This will make it
much easier to debug exactly where something failed.
These new tests can be run using the script at tests/run-tests.sh
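In pytest terms, the combination-per-test-case idea looks roughly like
the following (a hedged sketch; the real framework builds its vectors
dynamically rather than hard-coding them):

  import pytest

  def run_query_test(workload, **vector):
      # Stand-in for the framework's query runner.
      assert workload == "core"

  # Each (file_format, batch_size) pair becomes its own test case, so
  # a failure points at the exact combination that broke.
  @pytest.mark.parametrize("file_format", ["text", "seq", "rc"])
  @pytest.mark.parametrize("batch_size", [0, 1, 16])
  def test_scan(file_format, batch_size):
      run_query_test("core", file_format=file_format, batch_size=batch_size)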
- verify that the exception message contains the correct error
- verify that the expected exception is thrown
- verify that no exception is thrown when abort_on_error is set to false
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:
./run-benchmark --workloads=hive-benchmark,tpch
We look up the workload in the workloads directory, then read the
associated query .test files and start executing them.
To ensure the queries are not duplicated between benchmark and query
tests, I moved all existing queries (under fe/src/test/resources/*) to
the workloads directory. You do NOT need to look through all the .test
files; I've just moved them. The one new file is 'hive-benchmark.test',
which contains the hive benchmark queries.
Also added support for generating schemas for different scale factors
as well as executing against them. For example, let's say we have a
dataset with a scale factor called "SF3". We would first generate the
schema using:
./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables with names unique from those of the other
scale factors.
Run the generated .sql file to load the data. Alternatively, the data
can be loaded by running a new python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: ./bin/load-data.py -w tpch -e core -s SF3
Then run the benchmark against this scale factor:
./run-benchmark --workloads=<workload> --scale_factor=SF3
This changeset also includes a few other minor tweaks to some of the test
scripts.
Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6