impala

mirror of https://github.com/apache/impala.git synced 2026-01-07 09:02:19 -05:00

Author	SHA1	Message	Date
Alex Behm	e57fd2d831	IMPALA-3491: Use unique_database fixture in test_local_fs.py Testing: Ran hdfs/core and localfs/core private builds. Change-Id: I0720458882ac3b1138deccf9af0ee57bf2eed7dc Reviewed-on: http://gerrit.cloudera.org:8080/3334 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com>	2016-06-08 16:30:32 -07:00
Alex Behm	025fd3bd7f	IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0. Adds handling and testing for a specific Parquet data corruption scenario with plain dictionary encoded values. The problematic scenario is when the repeat or literal count of the RLE-encoded dictionary indexes is decoded as 0 - an invalid value. There are several other cases of data corruption that are not yet handled gracefully. This patch only handles one specific case. Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63 Reviewed-on: http://gerrit.cloudera.org:8080/3299 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com>	2016-06-07 17:29:59 -07:00
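The zero-count check described in IMPALA-3646 can be sketched as follows. This is an illustration in Python rather than the C++ decoder Impala uses, and `read_run_header` is a hypothetical name. It assumes the Parquet RLE/bit-packed hybrid layout, in which each run starts with a ULEB128 header whose low bit selects a literal (bit-packed) or repeated run and whose remaining bits carry the count (in groups of 8 values for literal runs).

```python
def read_run_header(buf, pos=0):
    """Decode one run header; return (is_literal, value_count, next_pos).

    Hypothetical helper: raises ValueError on a truncated header or on a
    literal/repeat count of 0, which is never valid in well-formed data.
    """
    header, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated run header")
        byte = buf[pos]
        pos += 1
        header |= (byte & 0x7F) << shift  # ULEB128: 7 payload bits per byte
        if not byte & 0x80:               # high bit clear marks the last byte
            break
        shift += 7

    is_literal = bool(header & 1)
    count = header >> 1
    if is_literal:
        count *= 8                        # literal runs are counted in groups of 8
    if count == 0:
        raise ValueError("corrupt RLE data: literal or repeat count of 0")
    return is_literal, count, pos
```

Rejecting the header up front lets the caller report a corruption error instead of looping on a run that never produces values.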
Alex Behm	95064359cc	IMPALA-3491: Use unique_database fixture in test_delimited_text.py. Testing: Ran the test locally 10 times in a loop on exhaustive. Change-Id: Idedd5f03984e41a4b3ebf271e50863e980c66cb6 Reviewed-on: http://gerrit.cloudera.org:8080/3096 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com>	2016-06-07 09:34:30 -07:00
Tim Armstrong	d23e5505c8	IMPALA-3670: fix sorter buffer mgmt bugs Also make test_scratch_disk.py more deterministic, by using max_block_mgr_memory, which doesn't include scanner memory. The fixed test_scratch_disk.py exercises the other sorter bugs that occur when scratch cannot be written. Testing: Added a test that does a sort with various memory limits and consumes the whole output of the sorter (we have many tests of sorts with limits but limited coverage of sorts without limits). Ran an exhaustive test run before posting for review. This added test reproduced one of the sorter bugs, where var-len blocks were not always attached to the output batch. The other test was reproduced by the test change in IMPALA-3669: test_scratch_disk fix. Change-Id: Ia1a0ddffa0a5b157ab86a376b7b7360a923698d6 Reviewed-on: http://gerrit.cloudera.org:8080/3315 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Tim Armstrong <tarmstrong@cloudera.com>	2016-06-06 22:34:19 -07:00
Tim Armstrong	37ec25396f	IMPALA-3344: Simplify sorter and document/enforce invariants. Clarify relationships between classes, clean up the previous mess where every class was friends with the other so there's an actual distinction between public and private members. TupleIterator is now no longer tied to TupleSorter, just Run. Document and enforce invariants in many cases. Factor out some functions from large functions. Simplify and document iterator logic. Make management of buffers when iterating over output stream more explicitly correct: either use MarkNeedToReturn() or attach block to the batch as appropriate. The SortedRunMerger didn't handle resource transfer correctly, except if all the memory came from the batch's MemPool. This patch fixes the cases when resources are attached to the batches, but not the 'need_to_return' case. Document that SortedRunMerger requires 'deep_copy_input' to be true if batches can have the 'need_to_return' flag set. Also use the atomic block exchange operation when moving between blocks in unpinned runs to prevent pin failures at that point. I explicitly have avoided changing the hairy block management logic when allocating buffers for merging, that will need addressing in a follow-up patch. Add a SpilledRuns counter so that it's more explicit that spilling occurred. Testing: Added some tests for corner cases with empty and NULL strings. Fixed a test that previously failed with OOM but now succeeds. Performance: Benchmarking against old code initially revealed some regressions from changes in inlining. Force inlining the TupleComparator::operator() and iterator Next()/Prev() functions helped and performance seems similar or slightly better on the targeted orderby benchmarks. Change-Id: I9c619e81fd1b8ac50e257172c8bce101a112b52a Reviewed-on: http://gerrit.cloudera.org:8080/2826 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Tim Armstrong <tarmstrong@cloudera.com>	2016-06-02 21:33:08 -07:00
Alex Behm	32c40f9c5d	Remove redundant test in test_avro_schema_resolution.py Change-Id: I7123cd5e19d79122af3b4fef2c092442b7a098f1 Reviewed-on: http://gerrit.cloudera.org:8080/3095 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-05-31 23:32:11 -07:00
Huaisi Xu	816735a032	IMPALA-3092: Set default value to NULL in AvroSchemaConverter This change ensures that Avro tables created without column definitions remain queryable if columns are added via ALTER TABLE. The bug was that when synthesizing an Avro schema from the column definitions we used to not add default values. Change-Id: Ib86e9ba1f4329b285ae14ee299365f7291a7410e Reviewed-on: http://gerrit.cloudera.org:8080/3219 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-05-31 23:32:11 -07:00
Sailesh Mukil	6f1fe4ebe7	IMPALA-3577, IMPALA-3486: Partitions on multiple filesystems breaks with S3_SKIP_INSERT_STAGING The HdfsTableSink usually creates a HDFS connection to the filesystem that the base table resides in. However, if we create a partition in a FS different than that of the base table and set S3_SKIP_INSERT_STAGING to "true", the table sink will try to write to a different filesystem with the wrong filesystem connector. This patch allows the table sink itself to work with different filesystems by getting rid of a single FS connector and getting a connector per partition. This also reenables the multiple_filesystems test and modifies it to use the unique_database fixture so that parallel runs on the same bucket do not clash and end up in failures. This patch also introduces a SECONDARY_FILESYSTEM environment variable which will be set by the test to allow S3, Isilon and the localFS to be used as the secondary filesystems. All jobs with HDFS as the default filesystem need to set the appropriate environment for S3 and Isilon, i.e. the following: - export AWS_SECRET_ACCESS_KEY - export AWS_ACCESS_KEY_ID - export SECONDARY_FILESYSTEM (to whatever filesystem needs to be tested) TODO: SECONDARY_FILESYSTEM and FILESYSTEM_PREFIX and NAMENODE have a lot of similarities. Need to clean them up in a following patch. Change-Id: Ib13b610eb9efb68c83894786cea862d7eae43aa7 Reviewed-on: http://gerrit.cloudera.org:8080/3146 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-05-31 23:32:11 -07:00
Matthew Jacobs	f413e236a8	IMPALA-3579: Strict handling of numeric overflow in text parsing Adds a query option 'strict_mode' which treats integer and floating pt overflows as parse errors. In the past, overflows were ignored and the max value was returned. When this query option is set, overflowing values are treated as if they were completely invalid data, i.e. NULL is returned. When abort_on_error is enabled, this means the query is aborted. Notes: * DECIMAL overflow/underflow is already treated as an error. * The handling in text-converter treats underflows the same as overflows, so they would result in the same behavior. However, floating point parsing never returns an underflow today. * We may also want to handle numeric values that are truncated when parsing to integer types, e.g. 10.5 -> 10. Change-Id: I7409c31ec0cb6fe0b2d9842b9f58fe1670914836 Reviewed-on: http://gerrit.cloudera.org:8080/3150 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-05-23 08:40:20 -07:00
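The behavior change described in IMPALA-3579 can be illustrated with a small sketch. It is written in Python for brevity and is not Impala's text-converter code. `parse_int32` is a hypothetical name, and clamping negative overflow to the minimum value in non-strict mode is an assumption that mirrors the "max value was returned" wording.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def parse_int32(text, strict_mode=False):
    """Parse a text field as a 32-bit integer; None stands in for SQL NULL."""
    try:
        value = int(text.strip())
    except ValueError:
        return None  # not a number at all: always a parse error

    if INT32_MIN <= value <= INT32_MAX:
        return value
    if strict_mode:
        return None  # overflow is treated like completely invalid data
    # Legacy (non-strict) behavior: ignore the overflow and clamp.
    return INT32_MAX if value > INT32_MAX else INT32_MIN


print(parse_int32("2147483648"))                    # clamped to the max value
print(parse_int32("2147483648", strict_mode=True))  # None, i.e. NULL
```

With abort_on_error enabled, the NULL result in strict mode would correspond to the query being aborted, as the commit message notes.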
Bharath Vissapragada	49610e2cfa	IMPALA-3314/IMPALA-3513: Fix querying tables/partitions altered to Avro format Bug: Impalads crash if we query an Avro table with stale metadata Cause: This happens because avroSchema_ is not set in HdfsTable, which is not propagated to the avro scanner and it doesn't have appropriate checks to make sure the schema is non-null. The patch fixes the following. 1. Avro scanner should gracefully handle the case where the avro schema is not set. Appropriate null checks and a meaningful error message have been added. 2. This is a special case with multi-fileformat partitioned tables. avroSchema_ should be set in HdfsTable even if any subset of the partitions are backed by avro. Without this patch, we only set it if the base table file format is Avro. Change-Id: I09262d3a7b85a2263c721f3beafd0cab2a1bdf4b Reviewed-on: http://gerrit.cloudera.org:8080/3136 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Internal Jenkins	2016-05-23 08:40:20 -07:00
Matthew Jacobs	f067929f3a	IMPALA-3535: Ignore invalid per-pool default query options In 2.5 we added the ability to set per-pool default query options. A string of key-value pairs can be specified with a pool configuration. However, if any options fail to parse, then all the options are ignored. We want that behavior (and returning an error) when parsing the process-wide default query options on startup and when parsing the options sent from a client (e.g. in beeswax server) because an error can be returned immediately for the triggering action at that time (i.e. starting the impalad or submitting a query with the options set). This behavior is bad for the pool default query options because (a) the configuration is set by the administrator and there's nothing we can do until a query is submitted and (b) one invalid option shouldn't mean that other valid options aren't set. Change-Id: If04733b775963091b0314c65286df126fd812358 Reviewed-on: http://gerrit.cloudera.org:8080/3056 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-05-17 10:09:05 -07:00
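The lenient parsing that IMPALA-3535 asks for in per-pool defaults might look like the sketch below. It assumes a comma-separated `key=value` string. `parse_pool_default_options`, `parse_bool`, and the `VALIDATORS` table are hypothetical names for illustration, not Impala's implementation.

```python
def parse_bool(text):
    """Hypothetical validator: accept only 'true' or 'false'."""
    lowered = text.lower()
    if lowered not in ("true", "false"):
        raise ValueError(f"not a boolean: {text!r}")
    return lowered == "true"


# Illustrative subset of options, each mapped to a validating converter.
VALIDATORS = {"num_nodes": int, "abort_on_error": parse_bool}


def parse_pool_default_options(option_string, validators=VALIDATORS):
    """Apply every valid option and collect errors for the invalid ones."""
    applied, errors = {}, []
    for pair in option_string.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, value = pair.partition("=")
        key = key.strip().lower()
        validator = validators.get(key)
        if not sep or validator is None:
            errors.append(f"ignoring invalid option: {pair!r}")
            continue
        try:
            applied[key] = validator(value.strip())
        except ValueError as exc:
            errors.append(f"ignoring invalid value for {key}: {exc}")
    return applied, errors
```

A strict caller, such as startup parsing of process-wide defaults, could instead fail as soon as `errors` is non-empty, which matches the two behaviors the commit distinguishes.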
Casey Ching	e61b5bc119	IMPALA-3511: Fix race setting up TestKuduOperations A couple of tests could both attempt to create/destroy the same database if they were running in parallel. Several other related tests were marked as requiring serial execution, these needed to be marked for serial execution as well. Change-Id: If0573a755cd371363c2e43c001d5c1ba499793c6 Reviewed-on: http://gerrit.cloudera.org:8080/3063 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-05-14 01:30:01 -07:00
Skye Wanderman-Milne	9174dee395	IMPALA-1578: fix text scanner to handle "\r\n" delimiters split across blocks This patch modifies HdfsTextScanner to specifically check for split "\r\n" delimiters when the scan range ends with '\r'. If there does turn out to be a split delimiter, the next tuple is considered the responsibility of the next scan range's scanner, as if the delimiter appeared fully in the second scan range. This should not affect the overall performance characteristics of the text scanner since it already must do a remote read past the end of the scan range to read the last tuple. Change-Id: Id42b441674bb21517ad2788b99942a4b5dc55420 Reviewed-on: http://gerrit.cloudera.org:8080/2803 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 23:06:36 -07:00
Dan Hecht	a0d4249652	IMPALA-3337: fix "Cancelled" warnings when LIMIT clause is specified The cancelled status is propagated in scanner threads to cause them to shut down once the limit has been satisfied, but depending on the code path and when abort_on_error=false, this internal status would sometimes incorrectly end up in the error log. Fix this by factoring out the abort_on_error handling code so that it's handled more consistently across scanners. Parquet, RC, and Avro all suffered from this bug. Testing: exhaustive Change-Id: I4a91a22608e346ca21a23ea66c855eae54bbced6 Reviewed-on: http://gerrit.cloudera.org:8080/2964 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:57 -07:00
Sailesh Mukil	ed7f5ebf53	IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and has the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be used interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS performance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Reviewed-on: http://gerrit.cloudera.org:8080/2574 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:49 -07:00
Huaisi Xu	99879aad47	IMPALA-3385: Fix crashes on accessing error_log We used to check whether error_log is empty with error_log.empty(), but this may return false even when some of its members have empty messages, which crashes Impala. This commit: 1. Removes error_log from HdfsScanNode::ProcessSplit() since the logs may contain irrelevant errors. This also prevents a potential race condition when error_log is checked and enters the if clause, but is later changed by other threads. It also prevents crashes when error_log has a cleared entry. 2. Holds locks and returns a copy of the RuntimeState::error_log_, fixing a race in the coordinator. Change-Id: I3a7e3d22e26147ada780aae5aed1f2e25a515afc Reviewed-on: http://gerrit.cloudera.org:8080/2829 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Reviewed-by: Huaisi Xu <hxu@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:48 -07:00
Alex Behm	bce6b2b422	IMPALA-2736: Basic column-wise slot materialization in Parquet scanner. This change is a first step towards a more efficient Parquet scanner. The focus is on presenting the new code flow that materializes the table-level slots in a column-wise fashion, without going deep into actually improving scan efficiency. After these changes there are several obvious places that should be optimized to realize efficiency gains. Summary of changes - the table-level tuples are materialized in a column-wise fashion with new ColumnReader::ReadValueBatch() functions - this is done by materializing a 'scratch' batch, and transferring scratch tuples that survive filters/conjuncts to the output batch - the tuples of nested collections are still materialized in a row-wise fashion using the ColumnReader::ReadValue() function, just as before Mini benchmark I ran the following queries on a single impalad before and after my change using a synthetic 'huge_lineitem' table. I modified hdfs-scan-node.cc to set the number of rows of any row batch to 0 to focus the measurement on the scan time. Query options: set num_scanner_threads=1; set disable_codegen=true; set num_nodes=1; select * from huge_lineitem; Before: 22.39s After: 18.50s select * from huge_lineitem where l_linenumber < 0; Before: 25.11s After: 20.56s select * from huge_lineitem where l_linenumber % 2 = 0; Before: 26.32s After: 21.82s Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa Reviewed-on: http://gerrit.cloudera.org:8080/2779 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:48 -07:00
Lars Volker	b5570da405	IMPALA-1740: Add support for skip.header.line.count. HIVE-5795 introduced a parameter skip.header.line.count to skip header lines from input files. This change introduces the capability to skip an arbitrary number of header lines from csv input files on hdfs. The size of the total file header must be smaller than max_scan_range_length, otherwise an error will be reported. This is necessary because scan ranges are not read in disk order, so there is no way of identifying header lines except by counting from the start of the first scan range. [localhost:21000] > alter table t1 set tblproperties('skip.header.line.count'='1'); Query: alter table t1 set tblproperties('skip.header.line.count'='1') [localhost:21000] > select * from t1; Query: select * from t1 +----+----+ \| c1 \| c2 \| +----+----+ \| 1 \| 1 \| \| 2 \| 2 \| \| 3 \| 3 \| +----+----+ Fetched 3 row(s) in 0.32s [localhost:21000] > alter table t1 set tblproperties('skip.header.line.count'='0'); Query: alter table t1 set tblproperties('skip.header.line.count'='0') [localhost:21000] > select * from t1; Query: select * from t1 +------+------+ \| c1 \| c2 \| +------+------+ \| NULL \| NULL \| \| 1 \| 1 \| \| 2 \| 2 \| \| 3 \| 3 \| +------+------+ WARNINGS: Error converting column: 0 TO INT (Data is: num1) Error converting column: 1 TO DOUBLE (Data is: num2) file: hdfs://localhost:20500/test-warehouse/t1/test.txt record: num1,num2 Fetched 4 row(s) in 0.41s Change-Id: I595f01a165d41499ca1956fe748ba3840a6eb543 Reviewed-on: http://gerrit.cloudera.org:8080/2110 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:46 -07:00
Michael Brown	2826df93d2	IMPALA-3427: test_insert_wide_table: use unique_database fixture When run in parallel, test_insert_wide_table attempts to DROP/CREATE tables with the same name in the same database. A race thus exists in which tests can stomp on each other's tables. The fix is to use the unique_database fixture, in which each test can run in parallel within its own database, and thus its own table. Change-Id: If3b22f1b089d20c6024d6d680af82f102342394e Reviewed-on: http://gerrit.cloudera.org:8080/2871 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Michael Brown <mikeb@cloudera.com>	2016-05-12 14:17:44 -07:00
Anuj Phadke	a915293109	IMPALA-1850: Allow fs.defaultFS to be set to a non-HDFS filesystem This change whitelists the supported filesystems which can be set as Default FS for Impala to run on. This patch configures Impala to use S3 as the default filesystem, rather than a secondary filesystem as before. Change-Id: I2f45bef6c94ece634045acb906d12591587ccfed Reviewed-on: http://gerrit.cloudera.org:8080/1121 Reviewed-by: anujphadke <aphadke@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:40 -07:00
Tim Armstrong	89aa6597f4	IMPALA-3354: bad sorter pivot selection on some inputs Switch to a median of three random tuples that should be very robust to a range of inputs. It may be slightly worse than the existing pivot selection on some inputs where the original algorithm is close to optimal (e.g. already sorted inputs), but should be typically better overall. Always recurse on the smaller partition: this prevents stack overflow even with bad pivot selection. The overhead is minimal - in profiles for small sorts I'm seeing pivot selection take at most 0.5% of CPU time. The improved pivot selection gives modest improvements of 2-5% on the targeted perf order by benchmarks on a single node run with TPC-H scale factor 20. Change-Id: Iae50112b6deca3d6268e18b6f4daae1af279b452 Reviewed-on: http://gerrit.cloudera.org:8080/2824 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:39 -07:00
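The two ideas in IMPALA-3354, a median-of-three-random pivot and recursing only on the smaller partition, can be sketched as below. This is a generic quicksort over a Python list, not the sorter's C++ code. The function names and the Hoare-style partition loop are choices made for this illustration.

```python
import random


def median_of_three_random(items, lo, hi, rng):
    """Return the median of three elements sampled at random from items[lo..hi]."""
    samples = [items[rng.randint(lo, hi)] for _ in range(3)]
    return sorted(samples)[1]


def quicksort(items, lo=0, hi=None, rng=None):
    """Sort items[lo..hi] in place with bounded recursion depth."""
    if hi is None:
        hi = len(items) - 1
    if rng is None:
        rng = random.Random()

    while lo < hi:
        pivot = median_of_three_random(items, lo, hi, rng)
        i, j = lo, hi
        # Hoare-style partition: afterwards items[lo..j] <= pivot <= items[i..hi].
        while i <= j:
            while items[i] < pivot:
                i += 1
            while items[j] > pivot:
                j -= 1
            if i <= j:
                items[i], items[j] = items[j], items[i]
                i += 1
                j -= 1
        # Recurse on the smaller side and loop on the larger one, so the
        # stack depth stays O(log n) even when the pivot is poor.
        if j - lo < hi - i:
            quicksort(items, lo, j, rng)
            lo = i
        else:
            quicksort(items, i, hi, rng)
            hi = j
```

Because the larger partition is handled by the loop rather than by a recursive call, a run of bad pivots costs time but cannot grow the call stack beyond logarithmic depth.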
Henry Robinson	6629e79f32	IMPALA-3077: Don't run spilling / nested tests without PHJ Change-Id: Ide5e20f05b14aa19a0f570398712ac9297b525eb Reviewed-on: http://gerrit.cloudera.org:8080/2822 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:36 -07:00
Tim Armstrong	31d4103416	IMPALA-3317: fix crash in sorter when spilling zero-length strings The sorter converts string pointers to block offsets when spilling. There was a subtle bug in the logic that assumed if the offset was past the end of the current block, the data must necessarily be in the next block. This is not true for zero-length strings, because there is no backing storage so the pointer can point to the byte after the end of the block. This patch fixes the bug by using a simpler offset encoding scheme that packs the block number into the upper 32 bits and the offset within the block into the lower 32 bits. It also slightly refactors the functions so that the method signatures and types are more consistent with the rest of the impala codebase. Also fix a bug with handling of multiple query options in tests. Change-Id: I5f64593e94d367d6b6efb61a8b86e35516f18839 Reviewed-on: http://gerrit.cloudera.org:8080/2780 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:36 -07:00
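The offset encoding described in IMPALA-3317 can be sketched in a few lines. The helper names are hypothetical and the sketch uses Python integers in place of the sorter's 64-bit values. Only the layout, block number in the upper 32 bits and offset within the block in the lower 32 bits, comes from the commit message.

```python
OFFSET_BITS = 32
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def pack_offset(block_index, offset_in_block):
    """Pack a block number and an in-block offset into one 64-bit value."""
    if not 0 <= block_index <= OFFSET_MASK:
        raise ValueError("block index does not fit in 32 bits")
    if not 0 <= offset_in_block <= OFFSET_MASK:
        raise ValueError("offset does not fit in 32 bits")
    return (block_index << OFFSET_BITS) | offset_in_block


def unpack_offset(packed):
    """Recover (block_index, offset_in_block) from a packed value."""
    return packed >> OFFSET_BITS, packed & OFFSET_MASK


# A zero-length string may point one byte past the end of its block. With an
# explicit block number, that offset still decodes to the original block.
block_size = 8 * 1024 * 1024
print(unpack_offset(pack_offset(2, block_size)))
```

Storing the block number explicitly removes the need to infer the block from the offset, which is the inference that failed for zero-length strings.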
Henry Robinson	630f1ab270	IMPALA-3367: Ensure runtime filters tests run on 3 nodes Runtime filter tests require that each scan gets assigned to three backends. In our test environment (with Impala daemons on the same machine) concurrent load can cause these scans to get assigned to only two nodes, which breaks the runtime filter tests. See IMPALA-2479 for more details. This patch changes the tests to run sequentially. Change-Id: I1ca43ca0f215539d909a1dcabee5e87f2437420b Reviewed-on: http://gerrit.cloudera.org:8080/2802 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:33 -07:00
Tim Armstrong	8e0267f2a9	IMPALA-3328: xfail TPC-H q9 if memory limit exceeded The test is flaky due to nondeterministic memory consumption under ASAN. Xfail until we have more concrete guarantees on mem usage. Change-Id: Ieefcb8f8ecc179f483f6d06af80c814fe0ef728e Reviewed-on: http://gerrit.cloudera.org:8080/2770 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:30 -07:00
Sailesh Mukil	c083c79888	IMPALA-3256: TestUdfs.test_libs_with_same_filenames failure There was an observed race between TestUdfs.test_java_udfs and TestUdfs.test_libs_with_same_filenames, because they both used the same database name. This patch just changes the name of the database used by test_libs_with_same_filenames. Change-Id: Icc38cbe720a3b9d864935416eb10612171132e17 Reviewed-on: http://gerrit.cloudera.org:8080/2767 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:29 -07:00
casey	687a4373e0	Make test_low_mem_limit_low_selectivity_scan faster The test took ~7 mins to run on my machine. It was spending about half its time re-creating the same data set over and over. This may also help with the test failures during continuous integration. All that needed to be done was to change the setup/teardown from method level to class level. This should have no effect on the test coverage. Change-Id: Ic874a7e9c8305bef3e16df03a807493f069e9265 Reviewed-on: http://gerrit.cloudera.org:8080/2766 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:29 -07:00
casey	0d9028dd49	IMPALA-3179: Fix alter table properties for Kudu tables This is one of the merge follow up tasks. It seems like there was just a line missing to copy the metastore data into the Kudu table object. The HDFS table class does the same thing as in this change. Change-Id: I51c9942f2f398afb7dff2485da759a185ad7505f Reviewed-on: http://gerrit.cloudera.org:8080/2728 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-04-12 14:03:44 -07:00
Skye Wanderman-Milne	4208fdacc1	IMPALA-3301: TestParquet::test_resolution_by_name fails on legacy join/agg nodes Change-Id: I7ce4c1f28966a1b14c25df228338ef4db6851a5b Reviewed-on: http://gerrit.cloudera.org:8080/2723 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-04-12 14:02:35 -07:00
Henry Robinson	449901fac6	IMPALA-3283: Disable runtime filter tests for local filesystems The runtime filter tests assume 3 scans for alltypes* tables. For local filesystems this isn't a correct assumption. Fixing the tests to be resilient to different number of scans is hard, and filters aren't dependent on the filesystem implementation, so let's just disable them. Change-Id: Ibcd18c7e69355cde70e13e5190ed2503adb7532b Reviewed-on: http://gerrit.cloudera.org:8080/2688 Tested-by: Internal Jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2016-04-05 17:37:32 +00:00
Skye Wanderman-Milne	9b51b2b6e6	IMPALA-2835: introduce PARQUET_FALLBACK_SCHEMA_RESOLUTION query option This patch introduces a new query option, PARQUET_FALLBACK_SCHEMA_RESOLUTION which allows Parquet files' schemas to be resolved by either name or position. It's "fallback" because eventually field IDs will be the primary schema resolution scheme, and we don't want to create an option that we will have to change the name of later. The default is still by position. I chose to do a query option because it will make testing easier and also be easier to diagnose resolution problems quickly in the field. If users want to switch the default behavior to be by name (like Hive), they can use the --default_query_options flag. This patch also introduces a new test section, SHELL, which can be used to execute shell commands in a .test file. This is useful for copying files into test tables. Change-Id: Id0c715ea23792b2a6872610839a40532aabbb5a6 Reviewed-on: http://gerrit.cloudera.org:8080/2384 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Internal Jenkins	2016-04-02 04:04:25 +00:00
Casey Ching	3d5a21c047	Update another Kudu test to wait for modifying operations I checked that none of the other .test files contain "insert", "update", or "delete". This should be the last one. Eventually we should have a solution that provides a more robust way to "read-your-writes". Change-Id: I6561c32b1558e0a8685392cc5739e8d2e7900e7c Reviewed-on: http://gerrit.cloudera.org:8080/2680 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-31 12:30:19 +00:00
Michael Brown	b74e57a312	IMPALA-2650: UDF EE tests: use unique databases in some tests Some of the end-to-end tests in query_test/test_udfs.py create UDFs in the default database and leave them there. Other tests (e.g., test_functions_ddl) polling the default database and expecting to find no UDFs will fail. It turns out this wouldn't happen in our Jenkins builds (see IMPALA-2650 for more details as to why), but it manifests itself with repeated impala-py.test runs in specific order. The fix is to create the UDFs in databases unique to the test cases. This leaves the default database pristine during these tests. Testing: Before, the following sequence of impala-py.test commands would cause any subsequent runs of test_functions_ddl to fail: $ # simulate a subset of serial tests that expect default DB not to have UDFs $ impala-py.test -m "execute_serially" --workload_exploration_strategy \ functional-query:exhaustive -k test_functions_ddl metadata/test_ddl.py PASS $ # simulate a subset of parallel tests that create UDFs in default DB $ impala-py.test -n4 -m "not execute_serially" --workload_exploration_strategy \ functional-query:exhaustive query_test/test_udfs.py PASS $ # rerun a subset of serial tests that passed before $ impala-py.test -m "execute_serially" --workload_exploration_strategy \ functional-query:exhaustive -k test_functions_ddl metadata/test_ddl.py FAIL, because test_udfs left UDFs. Now, I can run these over and over, and they pass. Change-Id: Id4a8b4764fa310efaa4f6c6f06f64a4e18e44173 Reviewed-on: http://gerrit.cloudera.org:8080/2610 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Internal Jenkins	2016-03-30 04:50:15 +00:00
Casey Ching	39a28185e8	Re-enable Kudu in build using client stubs when needed The stubs in Impala broke during the merge commit. This commit removes the stubs in hopes of improving robustness of the build. The original problem (Kudu clients are only available for some OSs) is now addressed by moving the stubbing into a dummy Kudu client. The dummy client only allows linking to succeed, if any client method is called, Impala will crash. Before calling any such method, Kudu availability must be checked. Change-Id: I4bf1c964faf21722137adc4f7ba7f78654f0f712 Reviewed-on: http://gerrit.cloudera.org:8080/2585 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-03-29 23:57:54 +00:00
Sailesh Mukil	76b674850f	IMPALA-2466: Add more tests for the HDFS parquet scanner. These tests functionally test whether the following type of files are able to be scanned properly: 1) Add a parquet file with multiple blocks such that each node has to scan multiple blocks. 2) Add a parquet file with multiple blocks but only one row group that spans the entire file. Only one scan range should do any work in this case. Change-Id: I4faccd9ce3fad42402652c8f17d4e7aa3d593368 Reviewed-on: http://gerrit.cloudera.org:8080/1500 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-03-25 13:10:15 +00:00
Jim Apple	c0397911b5	IMPALA-2973: Loosen bound on join timer test Code coverage builds are slower than release or debug builds. This patch gives test_hash_join_timer extra time so that this test doesn't cause code coverage builds to fail in Jenkins. Change-Id: I5598e073d779f744d79c5292e80a8ed8f6aa9548 Reviewed-on: http://gerrit.cloudera.org:8080/2608 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Internal Jenkins	2016-03-23 22:25:13 +00:00
Skye Wanderman-Milne	a75d7dd282	IMPALA-3081: increase mem limit (again) for TestWideRow LZO occasionally needs > 80MB to run this query. We still need to investigate why it needs so much memory (the scan alone takes > 70MB), but for now bump the mem limit again. Change-Id: Ifdb7c6d33ab2de0f06e7322aa8f8ba107da84d49 Reviewed-on: http://gerrit.cloudera.org:8080/2602 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-03-23 05:14:29 +00:00
Henry Robinson	b3937295fb	Runtime filters tests This patch adds functional tests for runtime filters. It relies on setting RUNTIME_FILTER_WAIT_TIME_MS high enough to ensure that filters are received. To make the test files more readable, this patch also adds a new COMMENT section to the test syntax, and allows blank spaces between queries so that the separation of different test cases can be made more obvious. Currently missing is a test for disabling probe-side filters based on selectivity, as we lack suitable tables to trigger the disable condition. Change-Id: I94d617c6d23ffa394a6eb7ead56f1cfb701e0d90 Reviewed-on: http://gerrit.cloudera.org:8080/2603 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Internal Jenkins	2016-03-23 04:07:14 +00:00
Skye Wanderman-Milne	e999617719	IMPALA-3195: explicitly specify unique database in TestScanners.test_annotate_utf8_option test_annotate_utf8_option was failing in the exhaustive build because some other test was switching the session database from 'default' to 'functional_parquet'. This patch amends the patch to use the new 'unique_database' fixture, which both ensures the table will be created in the expected database and handles table cleanup. Change-Id: I981475581340edc0be68fa2813f5e63990202399 Reviewed-on: http://gerrit.cloudera.org:8080/2586 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Skye Wanderman-Milne <skye@cloudera.com>	2016-03-22 06:43:51 +00:00
Skye Wanderman-Milne	a78f3a8ca5	IMPALA-2069: add USE_UTF8_PARQUET_STRINGS query option This option toggles whether the parquet writer will use the UTF8 annotation for string columns. This patch includes a test that writes a table with or without this option, then verifies that the annotation is or isn't present using a new get_parquet_metadata Python utility. Change-Id: I030c9f5c6272e09c1ce133f66234e3cfb26b68d4 Reviewed-on: http://gerrit.cloudera.org:8080/2531 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-03-17 05:58:39 +00:00
Michael Brown	58219eac2c	IMPALA-2537: EE tests: create and use unique database fixture To speed up tests and reduce flakiness, introduce a pytest fixture whereby a test maintainer may request a database unique to his test. Such databases are suitable for tests that need to create tables within Python test code. Because the database name is unique to the test, the test can create any tables within that database it wants without fear that the same tables will be picked up by another test. Unique databases effectively guarantee a unique namespace for tables. To generate the database name, we use the CRC32 checksum of the test's so-called pytest test ID. This ID is a long string containing the test's module path, class (if applicable), function name, and parameter set (e.g., vector). We then concatenate the CRC32 checksum with the test function name, so that it's easier to identify the test to which the database belongs. The test author may also override the prefix by parametrizing the fixture. We then use a pytest fixture to create the database, hand the name to the test using the fixture, and clean up the database automatically after the test completes. The command `impala-py.test --fixtures` executed from the tests/ directory explains the full usage. Finally, we modify a few tests to show how test maintainers can use this fixture. Not supported here are databases used by .test files, creation of hive databases, databases with special CREATE parameters such as LOCATION and COMMENT, or asking the fixture to create multiple databases. Also not supported would be attempted parallel runs of the same test with the same test parameters. Testing: 1. Manual testing of the fixture usage, both in vanilla and parametrized context. 2. Manual runs of the tests modified. 3. An exhaustive exploration strategy test run. 
Change-Id: I74d200da8a59379388e1edfbb849828f92a1b3b7 Reviewed-on: http://gerrit.cloudera.org:8080/1821 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Internal Jenkins	2016-03-16 18:29:57 +00:00
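The naming scheme described in IMPALA-2537 above can be sketched roughly as follows. This is a minimal illustration, not the fixture's actual code: the function name `unique_database_name`, the underscore separator, the hex formatting, and the order of prefix and checksum are all assumptions made for the example. Only the use of a CRC32 checksum of the pytest test ID, combined with the test function name or an overriding prefix, comes from the commit message.

```python
import zlib
from typing import Optional


def unique_database_name(test_id: str, function_name: str,
                         prefix: Optional[str] = None) -> str:
    """Build a per-test database name (hypothetical helper, not Impala's code).

    test_id is the long pytest test ID (module path, class, function name,
    parameter set). Its CRC32 checksum makes the name unique per test and
    parameter combination; the readable stem shows which test owns it.
    """
    checksum = zlib.crc32(test_id.encode("utf-8")) & 0xFFFFFFFF
    # Assumption: an explicit prefix replaces the function name as the stem.
    stem = prefix if prefix is not None else function_name
    return "{0}_{1:08x}".format(stem, checksum)
```

Because the checksum depends only on the test ID, two runs of the same test with the same parameters produce the same name, which is consistent with the commit's note that parallel runs of an identical test are not supported.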
David Alves	7381304a23	Merge branch 'feature/kudu' into cdh5-trunk This is the final merge commit that merges the 'feature/kudu' branch into cdh5-trunk. Change-Id: Ib3dfb4fc7a69c5cb1c5789422ee52fa192ed677a	2016-03-13 19:28:43 -07:00
casey	804cfbdd64	Get and use Kudu from the toolchain by default This is for review purposes only. This patch will be merged with David's big merge patch. Changes: 1) Make Kudu compilation dependent on the OS since not all OSs support Kudu. 2) Only run Kudu related tests when Kudu is supported (see #1). 3) Look for Kudu locally, but in a different location. To use a local build of Kudu, set KUDU_BUILD_DIR to the path Kudu was built in and set KUDU_CLIENT_DIR to the path Kudu was installed in. Example: git clone https://github.com/cloudera/kudu.git ...build 3rd party etc... mkdir -p $KUDU_BUILD_DIR cd $KUDU_BUILD_DIR cmake <path to Kudu source dir> make DESTDIR=$KUDU_CLIENT_DIR make install 4) Look for Kudu in the toolchain if not using a local Kudu build. 5) Add Kudu service startup scripts. The Kudu in the toolchain is actually a parcel that has been renamed (the contents were not modified in any way), which means the Kudu service binaries are there. Those binaries are now used to run the Kudu service. Change-Id: I3db88cbd27f2ea2394f011bc8d1face37411ed58	2016-03-11 11:38:05 -08:00
David Alves	82222abaf5	Merge branch 'feature/kudu' into cdh5-trunk This merges the 'feature/kudu' branch with cdh5-trunk as of commit: 055500cc753f87f6d1c70627321fcc825044e183 This patch is not a pure merge patch in the sense that goes beyond conflict resolution to also address reviews to the 'feature/kudu' branch as a whole. The review items and their resolution can be inspected at: http://gerrit.cloudera.org:8080/#/c/1403/ Change-Id: I6dd4270cd17a4f5c02811c343726db3504275a92	2016-03-11 11:37:58 -08:00
Tim Armstrong	368d7be7e6	IMPALA-2728: reenable mem limit test now that it is stable The test has not xfailed in a long time, so we believe that various memory usage fixes have fixed the flakiness. Change-Id: Idff06791e9d880cc8ddf54c0c977a556d3701bea Reviewed-on: http://gerrit.cloudera.org:8080/2442 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-03-04 07:59:09 +00:00
Juan Yu	c9b33ddf63	IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files. Fix a bug in which Impala only reads the first stream of a multi-stream bz2/gzip file. Changes the bz2 decoder to read the file in a streaming fashion rather than reading the entire file into memory before it can be decompressed. Change-Id: Icbe617d03a69953f0bf3aa0f7c30d34bc612f9f8 (cherry picked from commit b6d0b4e059329633dc50f1f73ebe35b7ac317a8e) Reviewed-on: http://gerrit.cloudera.org:8080/2219 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Internal Jenkins	2016-02-28 21:31:37 -08:00
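The multi-stream behaviour fixed in IMPALA-1886/IMPALA-2154 above can be illustrated with a small Python sketch. It is not Impala's C++ decoder; the function name `iter_decompressed` and the chunked-input interface are assumptions made for the example. It shows the two ideas in the commit message for the gzip case: input is consumed chunk by chunk rather than loaded whole, and decoding continues with a fresh decompressor when one stream ends and more bytes follow.

```python
import gzip
import zlib

GZIP_WBITS = zlib.MAX_WBITS | 16  # tell zlib to expect gzip framing


def iter_decompressed(chunks):
    """Yield decompressed bytes from an iterable of compressed chunks.

    Handles files made of several concatenated gzip streams: when the
    current stream ends, leftover bytes are fed to a new decompressor.
    """
    decoder = zlib.decompressobj(GZIP_WBITS)
    for chunk in chunks:
        buf = chunk
        while buf:
            out = decoder.decompress(buf)
            if out:
                yield out
            if decoder.eof:
                # End of one stream: the rest of buf belongs to the next one.
                buf = decoder.unused_data
                decoder = zlib.decompressobj(GZIP_WBITS)
            else:
                buf = b""  # the whole chunk was consumed


# Usage: two gzip streams concatenated, read in 7-byte chunks.
data = gzip.compress(b"hello ") + gzip.compress(b"world")
pieces = [data[i:i + 7] for i in range(0, len(data), 7)]
result = b"".join(iter_decompressed(pieces))
```

A decoder that stops at the first end-of-stream marker would return only the first stream's contents here, which is the bug the commit describes.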
Bharath Vissapragada	393c65de6d	IMPALA-3070: Disable test_hive_udfs_missing_jar on local file system Local file system builds do not allow running more than one impalad in parallel. This test relies on that behavior and hence is disabled for such builds. Change-Id: I93fe6ae37018885ede4838f5a2ce0bf11148c4e6 Reviewed-on: http://gerrit.cloudera.org:8080/2315 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Internal Jenkins	2016-02-26 15:37:24 -08:00
Jim Apple	7f4db1e091	IMPALA-2973: Weaken lower bound on hash join timing. Because of IMPALA-2407, we use Linux's COARSE clockid_t on EC2. However, this has a resolution between 1 and 10 milliseconds, and so cannot be trusted to measure lower bounds as low as those in test_hash_join_timer.py. This is a temporary workaround. Change-Id: I1332b1d9aede129ea6c508e40e20960fad9414a8 Reviewed-on: http://gerrit.cloudera.org:8080/2298 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-02-24 13:31:00 -08:00
Juan Yu	97af107729	IMPALA-2914: Fix DCHECK Check failed: HasDateOrTime() Some TimestampValue converting functions assume caller ensures TimestampValue instance has a valid date or time but that's not true. Change those functions to return result in output parameter and return boolean to indicate the conversion is good or not. Change-Id: I7a68a1e14d9c4ee5d83da760d4d76c20c36bc359 (cherry picked from commit 47d8977f5976b9be405f44add966820138fbda6f) Reviewed-on: http://gerrit.cloudera.org:8080/2195 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Internal Jenkins	2016-02-24 13:31:00 -08:00
Bharath Vissapragada	ef0dac661c	IMPALA-2843: Persist hive udfs across catalog restarts This commit adds a new feature to persist hive/java udfs across catalog restarts. IMPALA-1748 already added this for non-java udfs by storing them in the parameters map of the Db object and reading them back at catalog startup. However we follow a different approach for hive udfs by converting them to Hive's function format and adding them as hive functions to the metastore. This makes it possible to share udfs between hive and Impala as the udfs added from one service are accessible to the other. This commit takes care of format conversions between hive and impala, so the user only needs to add a function once in either of the services. Background: Hive and impala treat udfs differently. Hive resolves the evaluate function in the udf class at runtime depending on the data types of the input arguments. So the user can add one function by name and can pass any arguments to it as long as there is a compatible evaluate function in the udf class. However Impala takes the input types of the udf as a part of the function definition (that maps to only one evaluate function) and loads the function only for that set of input argument types. If we have multiple 'evaluate' methods, we need to add multiple functions, one for each of them. This commit adds new variants of CREATE | DROP FUNCTION to Impala which let the user create and drop hive/java udfs without input argument types or return types. Catalog takes care of loading/dropping the udf signatures corresponding to each "evaluate" method in the udf symbol class. 
The syntax is as follows, CREATE FUNCTION [IF NOT EXISTS] <function name> <function_opts> DROP FUNCTION [IF EXISTS] <function name> Examples: CREATE FUNCTION IF NOT EXISTS foo location '/path/to/jar' SYMBOL='TestUdf'; CREATE FUNCTION bar location '/path/to/jar' SYMBOL='TestUdf2'; DROP FUNCTION foo; DROP FUNCTION IF EXISTS bar; The older way of creating hive/java udfs with specific signature is still supported, however they are not persisted across restarts. So a restart of catalog can wipe them out. Additionally this commit also loads all the compatible java udfs added outside of Impala and they needn't be separately loaded. One thing to note here is that the functions added using the new CREATE FUNCTION can only be dropped using the new DROP FUNCTION syntax (without signature). The same rule applies for the java udfs added using the old CREATE FUNCTION syntax (with signature). Change-Id: If31ed3d5ac4192e3bc2d57610a9a0bbe1f62b42d Reviewed-on: http://gerrit.cloudera.org:8080/2250 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Internal Jenkins	2016-02-19 23:04:03 -08:00


446 Commits