impala

mirror of https://github.com/apache/impala.git synced 2026-01-04 09:00:56 -05:00

Author	SHA1	Message	Date
Alex Behm	931bf49cd9	IMPALA-3905: HdfsScanner::GetNext() for Avro, RC, and Seq scans. Implements HdfsScanner::GetNext() for the Avro, RC File, and Sequence File scanners. Changes ProcessSplit() to repeatedly call GetNext() to share the core scanning code between the legacy ProcessSplit() interface (ProcessSplit()) and the new GetNext() interface. Summary of changes: - Slightly change code flow for initial scan range that only parses the file header. The new code sets 'only_parsing_header_' in Open() and then honors that flag in GetNextInternal(). Before, all the logic was inside ProcessSplit(). - Replace 'finished_' with 'eos_'. - Add a RowBatch parameter to various functions. - Change Close() to free all resources when a nullptr RowBatch is passed. Testing: - Exhaustive tests passed on debug - Core tests passed on asan - TODO: Perf testing on cluster Change-Id: Ie18f57b0d3fe0052a8ccd361b6a5fcdf979d0669 Reviewed-on: http://gerrit.cloudera.org:8080/6527 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-07-01 21:59:34 +00:00
Sailesh Mukil	b00310ca89	IMPALA-4594: WriteSlot and CodegenWriteSlot handle escaped NULL slots differently CodegenWriteSlot() receives negative length values for the lengths of the slots passed to it if the slots contain escape characters. (This is currently only for non-string types, as we do not codegen string types with escaped characters). The DelimitedTextParser is responsible for identifying escape characters and assigning the negative lengths appropriately. CodegenWriteCompleteTuple() passes this length to CodegenWriteSlot() as it is. This differs from the behavior of WriteSlot() where the length passed to it is always positive, as all the callers of WriteSlot() make sure of that (including WriteCompleteTuple()). The IrIsNullString() and IrGenericIsNullString() functions are responsible for checking if the given data contains a NULL pattern. They are called by CodegenWriteSlot(). A NULL pattern usually contains an escaped character which means that the length of that slot will be a negative length. However, the IrIsNullString() and IrGenericIsNullString() that take the same length argument from CodegenWriteSlot() always expect a positive length argument. So, no slots were ever marked as NULL by these NULL-checking functions when codegen was enabled. NULL slots were still detected accidentally because of some incorrect code in CodegenWriteSlot() that marked invalid slots and NULL slots as NULL. Therefore, due to this code, even invalid slots were not marked as invalid and did not return an error. Instead they were just silently marked as NULL. This patch makes sure that only positive lengths are passed to CodegenWriteSlot() so that NULL checking is correct and it also makes sure that invalid slots are not silently marked as NULL. Testing: Re-enabled an older hdfs-scan-node-errors test. Formatted it to fit new error message format after IMPALA-3859 and IMPALA-3895. Change-Id: I858e427ad7c2b2da8c2bb657be06b7443655781f Reviewed-on: http://gerrit.cloudera.org:8080/5377 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-12-08 08:30:51 +00:00
Tim Armstrong	6587c08f70	IMPALA-4387: validate decimal type in Avro file schema This patch prevents an invalid decimal type in an Avro file schema from crashing Impala. Most invalid Avro schemas are caught by the frontend, but file schemas still need to be validated by the backend. After this patch files with bad schemas are skipped. Testing: This was hit very rarely by the scanner fuzzing. Added a regression test that scans a file with a bad schema. Change-Id: I25a326ee2220bc14d3b5f887dc288b4adf859cfc Reviewed-on: http://gerrit.cloudera.org:8080/4876 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-10-30 00:12:58 +00:00
Henry Robinson	34b5f1c416	IMPALA-(3895,3859): Don't log file data on parse errors Logging file or table data is a bad idea, and doing it by default is particularly bad. This patch changes HdfsScanNode::LogRowParseError() to log a file and offset only. Testing: See rewritten tests. To support testing this change, we also fix IMPALA-3895, by introducing a canonical string __HDFS_FILENAME__ that all Hadoop filenames in the ERROR output are replaced with before comparing with the expected results. This fixes a number of issues with the old way of matching filenames which purported to be a regex, but really wasn't. In particular, we can now match the rest of an ERROR line after the filename, which was not possible before. In some cases, we don't want to substitute filenames because the ERROR output is looking for a very specific output. In that case we can write: $NAMENODE/<filename> and this patch will not perform _any_ filename substitutions on ERROR sections that contain the $NAMENODE string. Finally, this patch fixes a bug where a test that had an ERRORS section but no RESULTS section would silently pass without testing anything. Change-Id: I5a604f8784a9ff7b4bf878f82ee7f56697df3272 Reviewed-on: http://gerrit.cloudera.org:8080/4020 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Internal Jenkins	2016-08-25 10:20:36 +00:00
Tim Armstrong	bc8c55afcd	IMPALA-3729: batch_size=1 coverage for avro scanner Also fix a stale comment in the avro scanner header. The main work here is to fix the handling of empty result sets in the test result verifier. This is a problem because we wanted to verify that the results in the test file were a superset of the rows returned, and this was thrown off by superfluous '' rows in the expected and actual result sets. The basic problem is that the way test file sections were parsed conflated an empty result section with a non-empty result section that had a single empty string. I.e.: ---- RESULTS ==== vs ---- RESULTS ==== both got resolved to ['']. Change-Id: Ia007e558d92c7e4ce30be90446fdbb1f50a0ebc4 Reviewed-on: http://gerrit.cloudera.org:8080/3413 Tested-by: Internal Jenkins Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>	2016-07-19 23:30:02 -07:00
Skye Wanderman-Milne	01287a3ba9	IMPALA-3441, IMPALA-3659: check for malformed Avro data This patch adds error checking to the Avro scanner (both the codegen'd and interpreted paths), including out-of-bounds checks and data validity checks. I ran a local benchmark using the following queries: set num_scanner_threads=1; select count(i) from default.avro_bigints_big; # file contains only longs select max(l_orderkey) from biglineitem_avro; # file has tpch.lineitem schema Both benchmark queries see negligible or no performance impact. This patch adds a new Avro scanner unit test and an end-to-end test that queries several corrupted files, as well as updates the zig-zag varlen int unit test. Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132 Reviewed-on: http://gerrit.cloudera.org:8080/3072 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-06-13 18:32:32 -07:00
Juan Yu	c9b33ddf63	IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files. Fix a bug in which Impala only reads the first stream of a multi-stream bz2/gzip file. Changes the bz2 decoder to read the file in a streaming fashion rather than reading the entire file into memory before it can be decompressed. Change-Id: Icbe617d03a69953f0bf3aa0f7c30d34bc612f9f8 (cherry picked from commit b6d0b4e059329633dc50f1f73ebe35b7ac317a8e) Reviewed-on: http://gerrit.cloudera.org:8080/2219 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Internal Jenkins	2016-02-28 21:31:37 -08:00
Juan Yu	41509ce3c1	IMPALA-2477: Parquet metadata randomly 'appears stale' Stream::ReadBytes() could fail for reasons other than 'stale metadata'. Adds an error code check to make sure Impala returns the proper error message. It also fixes IMPALA-2488: metadata.test_stale_metadata fails on non-hdfs filesystem. Change-Id: I9a25df3fb49f721bf68d1b07f42a96ce170abbaa Reviewed-on: http://gerrit.cloudera.org:8080/1166 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Internal Jenkins	2015-10-07 14:47:41 -07:00
Juan Yu	e121bc9b0a	IMPALA-1476: Impala incorrectly handles text data missing a newline on the last line. I did a local benchmark and there's minimal performance impact(<1%) Change-Id: I8d84a145acad886c52587258b27d33cff96ea399 (cherry picked from commit 7e750ad5d90007cc85ebe493af4dce7a537ad7c0) Reviewed-on: http://gerrit.cloudera.org:8080/189 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Internal Jenkins	2015-03-20 19:58:50 -07:00
Matthew Jacobs	7558a4752b	IMPALA-1502: Fix and re-enable broken data errors tests Re-enables data error tests which were not being included in run-tests.py. Broken tests were updated, with one exception which is tracked by IMPALA-1862. Depends on a related change to Impala-lzo. Change-Id: I4c42498bdebf9155a8722695a3305b63ecc6e5f3 Reviewed-on: http://gerrit.cloudera.org:8080/194 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-03-11 16:39:40 -07:00
ishaan	8369c3b13b	Remove explicit references to functional_hbase tables from .test files. Additionally, this patch also disables the hbase/none test dimension if the TARGET_FILESYSTEM environment variable is set to either s3 or isilon. Change-Id: I63aecaa478d2ba9eb68de729e9640071359a2eeb Reviewed-on: http://gerrit.cloudera.org:8080/74 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2015-02-23 23:32:41 +00:00
Matthew Jacobs	25428fdb21	Add support for streaming decompression of gzip text Compressed text formats currently require entire compressed files be read into memory to be decompressed in a single call to the decompression codec. This changes the HdfsTextScanner to drive gzip in a streaming mode, i.e. produce partial output as input is consumed. Change-Id: Id5c0805e18cf6b606bcf27a5df4b5f58895809fd Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5233 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: jenkins (cherry picked from commit 05c3cc55e7a601d97adc4eebe03f878c68a33e56) Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5385	2014-11-23 01:55:55 -08:00
Lenni Kuff	15327e8136	Migrate DataErrors tests to Python test framework, re-enable subset of tests This re-enables a subset of the stable data errors tests and updates them to work in our test framework. This includes support for updating results via --update_results. This also lets us remove a lot of old code that was there only to support these disabled tests. Change-Id: I4c40c3976d00dfc710d59f3f96c99c1ed33e7e9b Reviewed-on: http://gerrit.ent.cloudera.com:8080/1952 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/2277	2014-04-18 02:25:11 -07:00
ishaan	53cd9eadab	Treat HBase as a file format for functional tests Change-Id: Ia01181a1e10eb108419122d347e9d869a69e8922 Reviewed-on: http://gerrit.ent.cloudera.com:8080/102 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Ishaan Joshi <ishaan@cloudera.com>	2014-01-08 10:52:36 -08:00
Skye Wanderman-Milne	efac6f82fd	Print errors to shell in BaseSequenceScanner. Change-Id: I0d1b041695c0d61b8c4994833f0a703e3bfa9c6a Reviewed-on: http://gerrit.ent.cloudera.com:8080/278 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Skye Wanderman-Milne <skye@cloudera.com>	2014-01-08 10:52:20 -08:00
Alan Choi	15a3d92492	Qualify table with database	2014-01-08 10:50:57 -08:00
Alan Choi	58687d16b8	IMPALA-406 Raise an error when inserting into HBase table using a null row key.	2014-01-08 10:50:56 -08:00
Skye Wanderman-Milne	f4d8df7119	Don't suppress "incomplete read" bad status, revert DataErrorsTest accordingly.	2014-01-08 10:50:13 -08:00
Lenni Kuff	3f0252c9f3	Fix DataErrors test failures	2014-01-08 10:50:12 -08:00
Alex Behm	5db3f2cdf5	IMPALA-227: SELECT * on partitioned table returns columns in different order than Hive.	2014-01-08 10:49:48 -08:00
Skye Wanderman-Milne	8ef36831f6	Update DataErrorsTest to reflect LZO_MAX_BLOCK_SIZE rename	2014-01-08 10:49:16 -08:00
Lenni Kuff	831ee529be	Fixed data loading bugs, moved most tables out of load-dependent-tables	2014-01-08 10:48:56 -08:00
Lenni Kuff	328ceed4e7	Add support for generating lzo compressed text files and running tests against lzo	2014-01-08 10:48:38 -08:00
ishaan	09d6d931f4	Change the way data is loaded	2014-01-08 10:48:09 -08:00
Skye Wanderman-Milne	357327b5c0	Fix file offsets in DataErrorsTest	2014-01-08 10:48:06 -08:00
Nong Li	02c329b97a	Update RC files to use io mgr and remove scanner support for non-io mgr.	2014-01-08 10:47:11 -08:00
Lenni Kuff	ef48f65e76	Add test framework for running Impala query tests via Python This is the first set of changes required to start getting our functional test infrastructure moved from JUnit to Python. After investigating a number of options, I decided to go with a Python test executor named py.test (http://pytest.org/). It is very flexible, open source (MIT licensed), and will enable us to do some cool things like parallel test execution. As part of this change, we now use our "test vectors" for query test execution. This will be very nice because it means if you load the "core" dataset you know you will be able to run the "core" query tests (specified by --exploration_strategy when running the tests). You will see that now each combination of table format + query exec options is treated like an individual test case. This will make it much easier to debug exactly where something failed. These new tests can be run using the script at tests/run-tests.sh	2014-01-08 10:46:50 -08:00
Nong Li	adf36b81f9	Fix data errors test.	2014-01-08 10:46:45 -08:00
Michael Ubell	8a5297a526	Add HdfsLzoTextScanner	2014-01-08 10:46:35 -08:00
Michael Ubell	37aaf06f79	IMP-390 Get rid of test dependencies on InProcessQE and Runquery	2014-01-08 10:46:18 -08:00
Alan Choi	dbf1074066	Fragments report errors to coordinator. Enable multi-node DataErrorTest (IMP-250 resolved) Check fragment/coord errors in DataErrorTest	2014-01-08 10:46:00 -08:00
Alan Choi	69fcaadd5f	Added all the conversion errors in .test file. The errors come from run-query. Error message is now more consistent. Remove useless message from RC file.	2014-01-08 10:45:12 -08:00
Michael Ubell	d0dd13053a	Improve string to timestamp performance.	2014-01-08 10:45:08 -08:00
Alan Choi	8dae344ceb	Do not validate filename in DataErrorTest because it is not deterministic.	2014-01-08 10:44:45 -08:00
Alan Choi	22765fc33a	IMP-251: re-enable DataErrorTest verify that the exception message contains the correct error; verify that the expected exception is thrown; verify that no exception is thrown when abort_on_error is set to false	2014-01-08 10:44:45 -08:00
Lenni Kuff	04edc8f534	Update benchmark tests to run against generic workload, data loading with scale factor, +more This change updates the run-benchmark script to enable it to target one or more workloads. Now benchmarks can be run like: ./run-benchmark --workloads=hive-benchmark,tpch We look up the workload in the workloads directory, then read the associated query .test files and start executing them. To ensure the queries are not duplicated between benchmark and query tests, I moved all existing queries (under fe/src/test/resources/*) to the workloads directory. You do NOT need to look through all the .test files, I've just moved them. The one new file is the 'hive-benchmark.test' which contains the hive benchmark queries. Also added support for generating schema for different scale factors as well as executing against these scale factors. For example, let's say we have a dataset with a scale factor called "SF1". We would first generate the schema using: ./generate_schema_statements --workload=<workload> --scale_factor="SF3" This will create tables with names distinct from those of the other scale factors. Run the generated .sql file to load the data. Alternatively, the data can be loaded by running a new python script: ./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor] For example: load-data.sh -w tpch -e core -s SF3 Then run against this: ./run-benchmark --workloads=<workload> --scale_factor=SF3 This changeset also includes a few other minor tweaks to some of the test scripts. Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6	2014-01-08 10:44:22 -08:00
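
The IMPALA-3441/IMPALA-3659 entry in the log above mentions out-of-bounds checks in the Avro scanner and an updated zig-zag varlen int unit test. As a rough illustration of what such a check can look like, here is a minimal Python sketch of decoding one zig-zag variable-length integer with bounds checking. It is not Impala's C++ code: the function name `read_zigzag_long` and the error messages are hypothetical, and the sketch does not reject a tenth byte that would overflow 64 bits.

```python
def read_zigzag_long(buf, pos=0):
    """Decode one zig-zag varint from buf, starting at index pos.

    Returns (value, next_pos). Raises ValueError on malformed input.
    Illustrative sketch only; names and messages are assumptions.
    """
    max_bytes = 10  # a 64-bit value needs at most ten 7-bit groups
    raw = 0
    for i in range(max_bytes):
        # Out-of-bounds check: never read past the end of the buffer.
        if pos + i >= len(buf):
            raise ValueError("varint runs past end of buffer")
        byte = buf[pos + i]
        # The low 7 bits carry data, least-significant group first.
        raw |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            # High bit clear: last byte. Undo the zig-zag mapping
            # (0, 1, 2, 3, ... encodes 0, -1, 1, -2, ...).
            return (raw >> 1) ^ -(raw & 1), pos + i + 1
    # Validity check: the continuation bit was still set after ten bytes.
    raise ValueError("varint longer than 10 bytes")
```

For example, `read_zigzag_long(b"\x80\x01")` walks two bytes and yields the value 64 with the next position 2, while a truncated buffer such as `b"\x80"` raises `ValueError` instead of reading out of bounds.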

36 Commits
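
The IMPALA-1886/IMPALA-2154 entry in the log describes reading every stream of a multi-stream gzip file in a streaming fashion, and the earlier "streaming decompression of gzip text" entry describes producing partial output as input is consumed. The idea can be sketched with Python's standard `zlib` module; this is only an illustration of the technique, not the HdfsTextScanner implementation, and the names `decompress_multistream_gzip` and `demo` are hypothetical.

```python
import gzip
import zlib


def decompress_multistream_gzip(chunks):
    """Yield decompressed bytes from an iterable of compressed chunks.

    Handles a file made of several gzip streams concatenated back to back.
    Illustrative sketch only; truncated input is not reported as an error.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = chunk
        while data:
            out = decoder.decompress(data)
            if out:
                yield out  # partial output is available immediately
            if decoder.eof:
                # End of one stream: leftover bytes belong to the next one,
                # so continue with a fresh decoder instead of stopping.
                data = decoder.unused_data
                decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
            else:
                data = b""


def demo():
    # Two independent gzip streams concatenated into one blob.
    blob = gzip.compress(b"first stream\n") + gzip.compress(b"second stream\n")
    # Feed the data in small pieces to mimic streaming reads.
    pieces = [blob[i:i + 7] for i in range(0, len(blob), 7)]
    return b"".join(decompress_multistream_gzip(pieces))
```

Calling `demo()` returns the contents of both streams in order. A reader that stops at the first end-of-stream marker would return only the first stream, which is the bug the commit message describes.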