impala

mirror of https://github.com/apache/impala.git synced 2026-01-10 09:00:16 -05:00

Author	SHA1	Message	Date
Sailesh Mukil	0d46129458	IMPALA-1746: QueryExecState doesn't check for query cancellation or errors QueryExecState::FetchRowsInternal() doesn't check the query state after evaluating the select statement expressions with GetRowValue(). This means that, e.g., UDFs that call SetError() in the select list will not fail the query. Change-Id: I120d7abbee2a3ed5c5c66ec0a3a9b6e9a6ab10bf Reviewed-on: http://gerrit.cloudera.org:8080/815 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2015-09-22 10:58:33 -07:00
Alex Behm	1c528492d3	Nested Types: Fix projection of collection-typed slots. There was a bug with projecting collection-typed slots in the UnnestNode by setting them to NULL. The problem was that the same tuple/slot could be referenced by multiple input rows. As a result, all unnests after the first that operate on the same collection value would incorrectly return an empty row batch because the slot had been set to NULL by the first unnesting. The fix is to ignore the null bit when retrieving a collection-typed slot's value in the UnnestNode. We still set the null bit after retrieving the value for projection. This solution purposely ignores the conventional NULL semantics of slots. It is a temporary hack which must be removed eventually. We rely on the producer of collection-typed slot values (scan node) to write an empty array value into such slots when they are NULL in addition to setting the null bit. Change-Id: Ie6dc671b3d031f1dfe4d95090b1b6987c2c974da Reviewed-on: http://gerrit.cloudera.org:8080/859 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-22 10:58:33 -07:00
Alex Behm	0c90bf7ef5	IMPALA-2340: Fix NOT IN subquery planning and execution with nested types. Fixes: 1. Change the planner to not invert null-aware anti join because there is only a left version. Also, always use a hash join because the nested-loop join does not support that join mode. 2. Fix PartitionedJoinNode::Reset() and related calls to make the join usable in subplans with the left null-aware anti join mode. Change-Id: I8da50747f6a0412c5858fd32b9498f58ed779712 Reviewed-on: http://gerrit.cloudera.org:8080/847 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Internal Jenkins	2015-09-22 10:58:33 -07:00
Tim Armstrong	db7519df24	IMPALA-2207: memory corruption on build side of NLJ The NLJ node did not follow the expected protocol when need_to_return is set on a row batch, which means that memory referenced by a rowbatch can be freed or reused the next time GetNext() is called on the child. This patch changes the NLJ node to follow the protocol by deep copying all build side row batches when the need_to_return_ flag is set on the row batches. This prevents the row batches from referencing memory that may be freed or reused. Reenable test that was disabled because of IMPALA-2332 since this was the root cause. Change-Id: Idcbb8df12c292b9e2b243e1cef5bdfc1366898d1 Reviewed-on: http://gerrit.cloudera.org:8080/810 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2015-09-22 10:58:32 -07:00
Alex Behm	ef29c976df	IMPALA-2320: Use a separate MemPool for the FunctionContexts in AnalyticEvalNode. The bug: There was a MemPool in AnalyticEvalNode with a dual purpose: (1) Allocate temporary tuples. (2) Back the FunctionContexts of the aggregate function evaluators. FunctionContexts use FreePools to do their own memory management using a pointer-based structure that is stored in the memory blocks themselves. When calling AnalyticEvalNode::Reset() we reset that mem pool backing that pointer-based structure. Those pointers were then clobbered by subsequent allocations (and writes) for temporary tuples, ultimately resulting in the FreePool incorrectly reporting a double free while doing a Finalize() of an aggregate function. The fix: While there are several other ways to address this issue, I chose to use a different MemPool for the FunctionContexts because that seemed to be the most sane and minimally invasive fix. That MemPool is not reset during AnalyticEvalNode::Reset() because the memory is ultimately managed by the FreePools of the FunctionContexts. Change-Id: I42fd60785d3c6dec93436cd9ca64de58d1b15c7e Reviewed-on: http://gerrit.cloudera.org:8080/857 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-22 10:58:32 -07:00
Alex Behm	057b0b7dba	IMPALA-2322: Set new pointer for ArrayValue in Tuple::DeepCopyVarlenData(). The bug was a simple oversight where we copied the array data but forgot to update the pointer of the corresponding ArrayValue. Change-Id: Ib6ec0380f66194efc7ea3eb989535652eb8b526f Reviewed-on: http://gerrit.cloudera.org:8080/855 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-22 10:58:32 -07:00
Alex Behm	55374c2a60	Nested Types: Functional tests for join types and subqueries. Change-Id: I91891f477a2ae692392c2679d2dd054084773fe7 Reviewed-on: http://gerrit.cloudera.org:8080/835 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-22 10:58:32 -07:00
Tim Armstrong	9ebe92c4f9	IMPALA-2299: dedup zero-length tuples correctly The dedup logic in row batch serialisation incorrectly assumed that two distinct tuples must have two distinct memory addresses. This is not true if one tuple has zero length. Update the serialisation logic to check for this case and insert a NULL. Adds a unit test that exercises this bug prior to the fix and a query test that also hit a DCHECK prior to the fix. Change-Id: If163274b3a6c10f8ac6b6bc90eee9ec95830b7dd Reviewed-on: http://gerrit.cloudera.org:8080/849 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Internal Jenkins	2015-09-22 10:58:31 -07:00
Tim Armstrong	d7ae529dac	IMPALA-2335: DCHECK for zero-length tuple This DCHECK condition was overly strict - a non-nullable tuple pointer can be NULL if the tuple is zero bytes long (since there is no memory backing the tuple). Adds a test query that hit the DCHECK. Change-Id: I16e8bc0db747b83c239de931a4fc9677d5c85ae6 Reviewed-on: http://gerrit.cloudera.org:8080/836 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2015-09-15 08:38:09 -07:00
Alex Behm	365263b398	IMPALA-2326: Preserve tuple nullability information through SubplanNodes. Tuples are set to NULL for representing non-matches of outer joins. During planning, the FE identifies at which nodes in the plan which tuples can be NULL or not. In the BE, codegen uses the tuple nullability information to remove the runtime NULL checking of tuples. The bug here was that the tuple nullability information was not preserved through SubplanNodes, so subsequent nodes could get a SEGV in codegen'd parts of the execution when dereferencing a NULL tuple because the NULL check was incorrectly optimized out. Change-Id: I4356537c0a7153ec1247cc74b6b7952ed9e3d884 Reviewed-on: http://gerrit.cloudera.org:8080/827 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-14 13:43:02 -07:00
Alex Behm	41ef3a216d	Nested Types: Add functional tests. This patch adds basic end-to-end functional tests for nested types: 1. For exercising the Reset() of exec nodes when inside a subplan. 2. For asserting correct behavior when row batches with collection-typed slots flow through exec nodes. Most cases are covered, but there are a few known issues that prevent full coverage. The remaining tests will be added as part of the fixes for those existing JIRAs. Change-Id: I0140c1a32cb5edd189f283c68a24de8484b3f434 Reviewed-on: http://gerrit.cloudera.org:8080/823 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-14 13:43:01 -07:00
Alex Behm	f46aa38161	IMPALA-2325: Skip NULL tuples in Coordinator::ValidateCollectionSlots(). Change-Id: I4b3e07f1ebe0d4244f7386f54d7b74b403b798e2 Reviewed-on: http://gerrit.cloudera.org:8080/817 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Internal Jenkins	2015-09-14 13:43:01 -07:00
Ippokratis Pandis	f57ce3436c	IMPALA-2256: Handle joins with right side of high cardinality and zero materialized slots The hash join and tuple stream code did not correctly handle joins whose right side had very high cardinality but whose tuples had zero footprint. Any such join with more than 16M tuples on the right side would crash. In particular, if the tuple footprint is zero, an infinite number of rows fit in one block. But with the old way of iterating over the rows of the stream, we would increment the idx by 1 to get the next "row", eventually overflowing and hitting a DCHECK. A second problem was the calculation of the size of the hash table in cases where the footprint of tuples is zero. In such cases, a hash table of minimum size would suffice. Instead we would try to create a very large hash table to fit the large number of tuples, resulting in OOM errors. This patch fixes the two problems by using a specific calculation of the next idx in the stream, as well as of the size of the hash table, when the stream contains tuples with zero footprint. Change-Id: I12469b9c63581fcbc78c87200de7797eac3428c9 Reviewed-on: http://gerrit.cloudera.org:8080/811 Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com> Tested-by: Internal Jenkins	2015-09-14 13:43:01 -07:00
Alex Behm	ecdd5688b9	Nested Types: Tuple pointers are owned by the containing RowBatch by default. This patch makes the ownership of the memory backing the tuple pointers of a RowBatch dependent on whether the legacy joins and aggs are enabled: By default, the memory is malloc'd and owned by the RowBatch: If enable_partitioned_hash_join=true and enable_partitioned_aggregation=true then the memory is owned by the RowBatch and is freed upon its destruction. This mode is more performant especially with SubplanNodes in the ExecNode tree because the tuple pointers are not transferred and do not have to be re-created in every Reset(). Memory is allocated from MemPool: Otherwise, the memory is allocated from the RowBatch's tuple pool. As a result, the pointer memory is transferred just like tuple data, and must be re-created in Reset(). This mode is required for the legacy join and agg which rely on the tuple pointers being allocated from the RowBatch's tuple pool, so they can acquire ownership of the tuple pointers. Performance impact for nested types: Initial cluster runs and profiling on nested TPCH identified excessive malloc/frees as a major performance bottleneck. This change paves the way for further optimizations which yielded a 2x improvement in response time for most nested TPCH queries. Change-Id: I4ac58b18058ce46b4db89fbe117b0bcad19e9ee7 Reviewed-on: http://gerrit.cloudera.org:8080/807 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-14 13:43:01 -07:00
Tim Armstrong	14bd361143	Temporarily remove nested loop join with limit test This caused failures on the non-partitioned agg/join tests and the ASAN test. Remove the query that failed to unbreak the build. This query can be re-added when IMPALA-2207 is fixed (for the ASAN test) and the cause of the other failure is diagnosed and fixed. Change-Id: Idb1ca951e0de05aee3c1237392fff74ddd756ed7	2015-09-13 08:29:25 -07:00
Tim Armstrong	25e7454bc9	IMPALA-2319: correctly enforce NLJ limit In some cases in the NLJ node eos_ wasn't set even though the limit was reached. This prevented the limit from being handled correctly before returning rows to the caller of GetNext(). This could result in either too many rows being returned, or a crash when the row batch size was set to an invalid negative number. The fix is to always check for whether the limit was reached before returning from GetNext(). Change-Id: I660e774787870213ada9f2d3e6f10953d9937022 Reviewed-on: http://gerrit.cloudera.org:8080/797 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2015-09-11 22:45:39 +00:00
Alex Behm	1fe817f9f1	IMPALA-2318: Rework projection of collection-typed slots. The first attempt in b2b9a10dda942c7e4f2af01be28e819f71de146f was wrong, but the good news is that the validation checks caught the problem. Problem with original approach: In the SubplanNode we used to set the collection-typed slots of the current row to NULL after the subplan invocation for the current row was completed. The problem is that we may have already returned rows from the SubplanNode which still have the non-NULL slots, and some exec node consumers may have copied the data. Fixed approach: We now set a collection-slot to NULL in the UnnestNode that flattens it immediately after evaluating the corresponding SlotRef, before returning any rows from the UnnestNode. Setting the slot to NULL as early as possible ensures that all rows returned by the containing SubplanNode will have the slot set to NULL. Change-Id: Ie942d9b2c835589ed9d41c68a831795bbbff2895 Reviewed-on: http://gerrit.cloudera.org:8080/803 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Internal Jenkins	2015-09-11 21:45:16 +00:00
Vlad Berindei	ece7fed421	IMPALA-2316: Add RESTRICT to DROP DATABASE Change-Id: Iffad73175b49160ae049911bd33c110a830f932b Reviewed-on: http://gerrit.cloudera.org:8080/796 Reviewed-by: Vlad Berindei <vlad.berindei@cloudera.com> Tested-by: Internal Jenkins	2015-09-11 20:37:27 +00:00
Alex Behm	e9e43488cf	IMPALA-2297: Handle collection types in ExprContext::GetValue(). Change-Id: I6af780791e392c0431efdf5a513e4b1cb60d14cf Reviewed-on: http://gerrit.cloudera.org:8080/749 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-10 17:46:21 +00:00
Alex Behm	deb9c6f8e6	Nested Types: Poor man's projection for collection-typed slots. Collection-typed slots are expensive to copy, e.g., during data exchanges or when writing into a buffered-tuple-stream. Even worse, such slots could be duplicated many times after unnesting in a subplan. To alleviate this problem, this patch implements a poor man's projection where collection-typed slots are set to NULL inside the SubplanNode that flattens them. The FE guarantees that the contents of an array-typed slot are never referenced outside of the single UnnestNode that accesses them, so when returning eos in UnnestNode::GetNext() we also set the unnested array slot to NULL to avoid those expensive copies in downstream exec nodes. The FE provides that guarantee by creating a new slot in the parent scan for every relative CollectionTableRef. For example, for a table 't' with a collection-typed column 'c' the following query would have two separate slots in the tuple of 't', one for 'c1' and one for 'c2': select * from t, t.c c1, t.c c2 Change-Id: I90e5b86463019c9ed810c299945c831c744ff563 Reviewed-on: http://gerrit.cloudera.org:8080/763 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-10 05:44:55 +00:00
Alex Behm	361da01152	Fail queries that require a SubplanNode when using legacy joins and aggs. We will not provide full nested types support if any of these options are set: --enable_partitioned_aggregation=false --enable_partitioned_hash_join=false Change-Id: I0f8607914faf9691d5f7b1a4327609fefba22e56 Reviewed-on: http://gerrit.cloudera.org:8080/792 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Internal Jenkins	2015-09-10 04:50:31 +00:00
Henry Robinson	8809567e82	IMPALA-2290: Fix btrim() thread-safety. By not using THREAD_LOCAL for its state, btrim() invocations in multi-threaded contexts (i.e. pushed to the scanner) would have threads trampling over each other's bitset used to check for trimmed characters. Testing: See new test in expr.test: select count(*) from functional.alltypes where btrim(string_col, string_col) != "" .. should give 0 results, but would give > 0 with this bug. Change-Id: I595e25b1d4fb7c76b846fce837b4ec140f47d43c Reviewed-on: http://gerrit.cloudera.org:8080/748 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Henry Robinson <henry@cloudera.com>	2015-09-09 04:15:30 +00:00
Tim Armstrong	5ac55f24cc	IMPALA-2296: missing DeepCopy array support Implement Tuple-to-Tuple DeepCopy for collections. Add query test that uses the TOP-N node, which deep copies tuples in this way. Confirmed that the query test failed before this fix. Change-Id: I3fea860d8251038d7b5eb85c77973939abe9dbf8 Reviewed-on: http://gerrit.cloudera.org:8080/757 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2015-09-08 23:40:53 +00:00
Tim Armstrong	d73683b320	Fix nested types tpch test formatting Invalid test file format caused tpch tests to fail. Change-Id: Ibf523d071bb14db72689e39645fd1724897543c7 Reviewed-on: http://gerrit.cloudera.org:8080/766 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-08 21:57:52 +00:00
Alex Behm	662bc24c79	IMPALA-2100: Exclude explain header from expected results of test_partitioning.py. HDFS acknowledges writes when the first replica is written. As a result, the estimated memory requirements for an Impala query may vary depending on how many replicas existed at the time of table loading. This racy behavior caused a few tests to sometimes fail due to different actual and expected memory requirements. The fix is to exclude the explain header from the expected results. Change-Id: Ifb13de937a104a48960d35745df521de66596837 Reviewed-on: http://gerrit.cloudera.org:8080/762 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-08 19:57:55 +00:00
aacalfa	5e733e8d62	IMPALA-2190: Complete conversion functions between timestamp, unixtime, and string dates Change-Id: I48a446f19c7634477f175d0defa8779dd70a392f Reviewed-on: http://gerrit.cloudera.org:8080/654 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2015-09-07 07:07:20 +00:00
Dimitris Tsirogiannis	f647b36e58	IMPALA-2289: Properly set eos_ in the BlockingJoinNode when the probe side is exhausted This commit fixes an issue where BlockingJoinNode will incorrectly set eos_ flag to true when the probe side is exhausted without considering the join mode that is executed. This would cause the NestedLoopJoinNode to sometimes return wrong results when a right-outer, right-anti or full-outer join mode is used. This issue appeared in nested TPC-H Q22. Change-Id: I01f2118d4db3d8739201d5c3f475f5b7e328555a Reviewed-on: http://gerrit.cloudera.org:8080/753 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-06 05:29:22 +00:00
Alex Behm	d48ec4b8b3	IMPALA-2289: Properly handle AtCapacity() in SubplanNode. After this patch we get correct results for nested TPCH Q13. The bug: Since we were not properly handling AtCapacity() of the output batch in SubplanNode, we sometimes passed a row batch that was already at capacity into GetNext() on the second child of the SubplanNode. In this particular case, that batch was passed into the NestedLoopJoinNode which may return incomplete results if the output batch is already at capacity (e.g., ProcessUnmatchedBuildRows() was not called). The fix is to return from SubplanNode::GetNext() if the output batch is at capacity due to resources being transferred to it from the input batch used to fetch from the first child. Change-Id: Ib97821e8457867dc0d00fd37149a3f0a75872297 Reviewed-on: http://gerrit.cloudera.org:8080/742 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-04 20:26:52 +00:00
Vlad Berindei	cfc3952a83	IMPALA-898: Support explicit column names in WITH-clause views. Example: WITH t(c1, c2) AS (SELECT int_col, bool_col FROM functional.alltypes) SELECT * FROM t This will create a local view with the 'int_col' and 'bool_col' columns labeled as 'c1' and 'c2'. If the number of labels is less than the number of columns, then the remaining columns in the local view will be labeled as the corresponding columns in the query statement. Therefore, this is also a valid query (only 'int_col' will be labeled as 'c1'): WITH t(c1) AS (SELECT int_col, bool_col FROM functional.alltypes) SELECT * FROM t Change-Id: Ie3a559ca9eaf95c6980c5695a49f02010c42899b Reviewed-on: http://gerrit.cloudera.org:8080/717 Reviewed-by: Vlad Berindei <vlad.berindei@cloudera.com> Tested-by: Internal Jenkins	2015-09-03 01:19:43 +00:00
Skye Wanderman-Milne	bcc73a36da	Nested types: read and materialize nested types in Parquet scanner This patch modifies the Parquet scanner to resolve nested schemas, and read and materialize collection types. The high-level modification is to create a CollectionColumnReader that recursively materializes map- and array-type slots. This patch also adds many tests, most of which query a new table called complextypestbl. This table contains hand-generated data that is meant to expose edge cases in the scanner. The tests mostly test the scanner, with a few tests of other functionality (e.g. array serialization). I ran a local benchmark comparing this scanner code to the original scanner code on an expanded version of tpch_parquet.lineitem with 48009720 rows. My benchmark involved selecting different numbers of columns with a single scanner thread, and I looked at the HDFS scan node time in the query profiles. This code introduces a 10%-20% regression in single-threaded scan time. Change-Id: Id27fb728934e8346444f61752c9278d8010e5f3a Reviewed-on: http://gerrit.cloudera.org:8080/576 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-02 19:23:54 +00:00
Dimitris Tsirogiannis	f6985772dc	IMPALA-2275: S3: authorization.test_grant_revoke failure due to stale grant_revoke_no_insert.test This commit updates the test file of grant/revoke statements running against S3 to include column-level privileges. Change-Id: Ia21595740fd37c88040d9a692444c6009591a188 Reviewed-on: http://gerrit.cloudera.org:8080/735 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2015-09-02 04:29:41 +00:00
Juan Yu	c66785be4a	IMPALA-2227: S3:query_test.test_queries.TestQueries.test_exprs failure Use select query instead of insert query to verify constant expression on partition column. Change-Id: I442111225e8df29bcc5fe89500d023559bb1c1fb Reviewed-on: http://gerrit.cloudera.org:8080/707 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2015-08-29 00:40:41 +00:00
Dimitris Tsirogiannis	fdb90ed753	CDH-23206: Impala support for column-level authorization (part 1) This commit adds partial support for column-level authorization in Impala using the Sentry Service. The following changes are included: * Added support for parsing and analyzing GRANT/REVOKE statements with column-level privileges. The supporting syntax is: - GRANT SELECT (<col_names>) ON TABLE <table_name> TO [ROLE] <role_name> [WITH GRANT OPTION] - REVOKE [GRANT OPTION FROM] SELECT (<col_names>) ON TABLE <table_name> FROM [ROLE] <role_name> * Added support for storing column-level privileges in the Catalog Service and updating the Sentry Service when GRANT/REVOKE statements are executed. * Modified the SHOW GRANT ROLE statement to include information about column-level privileges. Subsequent patches will add support for enforcing column-level privileges in SQL queries and other statements. Change-Id: I0fd9daa92cc5147cb6f4b25eb9651aab8bf3049f Reviewed-on: http://gerrit.cloudera.org:8080/607 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2015-08-28 23:58:36 +00:00
Juan Yu	d42ecb310a	IMPALA-1756: Add test case for partition insert query Change-Id: I4879d8fe7221b551898fa9fa94076bb9b0804f06 Reviewed-on: http://gerrit.cloudera.org:8080/696 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Internal Jenkins	2015-08-27 18:50:58 +00:00
Martin Grund	60c5140ea7	IMPALA-1983: Warn if table stats are potentially corrupt. When the `numRows` parameter stored in the table properties is erroneously set to 0 and a number of non-empty files are present, the table statistics are considered to be corrupt. To hint that there might be a problem, the explain statement will emit an additional warning if it detects potentially corrupt table stats, as in the following example: Estimated Per-Host Requirements: Memory=42.00MB VCores=1 WARNING: The following tables have potentially corrupt table and/or column statistics. compute_stats_db.corrupted 03:AGGREGATE [FINALIZE] | output: count:merge() | 02:EXCHANGE [UNPARTITIONED] | 01:AGGREGATE | output: count() | 00:SCAN HDFS [compute_stats_db.corrupted] partitions=1/2 files=1 size=24B In addition, the small query optimization is disabled for such queries. Change-Id: I0fa911f5132aa62195b854248663a94dcd8b14de Reviewed-on: http://gerrit.cloudera.org:8080/689 Reviewed-by: Martin Grund <mgrund@cloudera.com> Tested-by: Internal Jenkins	2015-08-26 22:19:33 +00:00
Sailesh Mukil	1a9fc47295	IMPALA-2227: S3: query_test.test_queries.TestQueries.test_exprs failure The test file testdata/workloads/functional-query/queries/QueryTest/exprs.test had INSERT statements in it, which are not supported on S3. This commit gets rid of those statements and rewrites them with SELECT [...] FROM VALUES(...) so that the tests are compatible on S3. Change-Id: I25faacf9fae3780f627afee86dc8c1ede7f6e2a2 Reviewed-on: http://gerrit.cloudera.org:8080/670 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2015-08-26 00:36:51 +00:00
Vlad Berindei	452ebee59d	IMPALA-1906: PARQUET_FILE_SIZE query option overflows for values >= 2GB. The value of PARQUET_FILE_SIZE overflows when RoundUp() is called because this function returns an int32. Even with this change, this value will still overflow when calling the HDFS API since it is passed to hdfsOpenFile() as blocksize, which is an int32 parameter (see HDFS-8949). Changes: - Return an error if PARQUET_FILE_SIZE is set to a value greater than or equal to 2GB. - If PARQUET_FILE_SIZE is set in an Impala session to a value greater than or equal to 2GB, then every query will fail with an error message. - If PARQUET_FILE_SIZE is changed to a value greater than or equal to 2GB as an impalad argument, impalad will not start and log an error. - Ceil(), RoundUp(), RoundDown() return int64. Change-Id: Ie4f2551b72954e2a57db5594e4789e3f7434d578 Reviewed-on: http://gerrit.cloudera.org:8080/678 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Vlad Berindei <vlad.berindei@cloudera.com> Tested-by: Internal Jenkins	2015-08-25 23:28:13 +00:00
Alex Behm	6f0b255c5a	Address several shortcomings with respect to the usability of Avro tables. Addressed JIRAs: IMPALA-1947 and IMPALA-1813 New Feature: Adds support for creating an Avro table without an explicit Avro schema with the following syntax. CREATE TABLE <table_name> column_defs STORED AS AVRO Fixes and Improvements: This patch fixes and unifies the logic for reconciling differences between an Avro table's Avro Schema and its column definitions. This reconciliation logic is executed during Impala's CREATE TABLE and when loading a table's metadata. Impala generally performs the schema reconciliation during table creation, but Hive does not. In many cases, Hive's CREATE TABLE stores the original column definitions in the HMS (in the StorageDescriptor) instead of the reconciled column definitions. The reconciliation logic considers the field/column names and follows this conflict resolution policy which is similar to Hive's: Mismatched number of columns -> Prefer Avro columns. Mismatched name/type -> Prefer Avro column, except: A CHAR/VARCHAR column definition maps to an Avro STRING, and is preserved as a CHAR/VARCHAR in the reconciled schema. Behavior for TIMESTAMP: A TIMESTAMP column definition maps to an Avro STRING and is presented as a STRING in the reconciled schema, because Avro has no binary TIMESTAMP representation. As a result, no Avro table may have a TIMESTAMP column (existing behavior). Change-Id: I8457354568b6049b2dd2794b65fadc06e619d648 Reviewed-on: http://gerrit.cloudera.org:8080/550 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-08-25 09:52:18 +00:00
Taras Bobrovytsky	75691156be	IMPALA-2239: update misc.test to match the new .test file format Change-Id: Ia5b9925628b415c306f320ef186246179e38f73b Reviewed-on: http://gerrit.cloudera.org:8080/684 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-08-25 00:12:52 +00:00
Alex Behm	ae9fd52c51	IMPALA-2089: Retain eq predicates bound by grouping slots with complex grouping exprs. The bug: When enforcing slot equivalences at an aggregation node, we used to incorrectly assume that equivalences among grouping slots must have already been enforced below the aggregation (e.g., in a scan). This assumption is correct if the grouping slots are produced by simple SlotRef grouping exprs, because then there is certainly a value transfer between the grouping slot and another slot below the aggregation. However, for grouping slots with complex grouping exprs this assumption is not correct, and as a result, we would incorrectly remove eq predicates bound by grouping slots with complex grouping exprs because we assumed they were redundant. The fix is to enforce slot equivalences among grouping slots with complex grouping exprs as usual, and not assume that they have already been enforced below the agg. Change-Id: Idcd44acccb9326a35c9121025dc88c2c70c7c7c7 Reviewed-on: http://gerrit.cloudera.org:8080/656 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-08-23 04:43:37 +00:00
Alex Behm	14a8cadcf6	Nested Types: Pretty print complex types in DESCRIBE. The current DESCRIBE prints the column type as a single string without whitespace. As a result, the DESCRIBE output for tables with complex types is basically unreadable/unusable, e.g., from the Impala shell. This patch adds a prettyPrint() function to the FE Type and uses that for generating a nicely formatted DESCRIBE output. The output of DESCRIBE FORMATTED is intentionally not modified because exact Hive-compatibility has been and presumably continues to be very important to our users. Change-Id: Ida810facdffd970948b837b83a60f9ddcd95f44d Reviewed-on: http://gerrit.cloudera.org:8080/633 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-08-22 09:26:35 +00:00
Taras Bobrovytsky	b8b7930377	Add nested types support to Create Table Like File Add support for creating a table based on a parquet file which contains arrays, structs and/or maps. Change-Id: I56259d53a3d9b82f318228e864c783b48a03f9ae Reviewed-on: http://gerrit.cloudera.org:8080/582 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-08-22 01:46:26 +00:00
Vlad Berindei	e4c42fa8bf	IMPALA-595: Add CASCADE to DROP DATABASE and use it in cleanup_db Change-Id: Idfa5b6943bc797e10d542487c31b8f1b527d8c97 Reviewed-on: http://gerrit.cloudera.org:8080/635 Reviewed-by: Vlad Berindei <vlad.berindei@cloudera.com> Tested-by: Internal Jenkins	2015-08-20 03:34:31 +00:00
Skye Wanderman-Milne	7906ed44ac	IMPALA-2015: Add support for nested loop join Implement nested-loop join in Impala with support for multiple join modes, including inner, outer, semi and anti joins. Null-aware left anti-join is not currently supported. Summary of changes: Introduced the NestedLoopJoinNode class in the FE that represents the nested loop join. Common functionality between NestedLoopJoinNode and HashJoinNode (e.g. cardinality estimation) was moved to the JoinNode class. In the BE, introduced the NestedLoopJoinNode class that implements the nested-loop join execution strategy. Change-Id: I238ec7dc0080f661847e5e1b84e30d61c3b0bb5c Reviewed-on: http://gerrit.cloudera.org:8080/652 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2015-08-19 08:40:14 +00:00
Tim Armstrong	5350d49f8c	IMPALA-1829: UDAs with different intermediate type Previously the frontend rejected UDAs with different intermediate and result type. The backend supports these, so this change enables support in the frontend and adds tests. This patch adds a test UDA function with different intermediate type and a simple end-to-end test that exercises it. It modifies an existing unused test UDA that used a currently unsupported intermediate type - BufferVal. Change-Id: I5675ec7f275ea698c24ea8e92de7f469a950df83 Reviewed-on: http://gerrit.cloudera.org:8080/655 Tested-by: Internal Jenkins Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>	2015-08-19 04:37:39 +00:00
Sailesh Mukil	1c46cab5c6	IMPALA-2084: SPLIT_PART and REGEXP_LIKE functions for Tableau pushdown Added the SPLIT_PART and the REGEXP_LIKE builtin functions and tests for both. The REGEXP_LIKE has an optional third parameter which if used, uses a different 'prepare' function (RegexpLikePrepare in like-predicate.cc) so that the appropriate options can be set in the RE2 library. Added a patch for the RE2 library so that the 'dot matches all' option is exposed via the RE2 class. Fixed a bug in the case when the function to be evaluated for the WHERE clause operates on constants, proper cleanup isn't guaranteed on certain edge cases. Change-Id: Ia2a8de9eeb2854100a2d949f612cfaba317c5a7b Reviewed-on: http://gerrit.cloudera.org:8080/501 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2015-08-18 09:07:34 +00:00
Alex Behm	f9d26fb896	IMPALA-2203: Set an InsertStmt's result exprs from the source statement's result exprs. This patch fixes an issue where incorrect results are produced by a CTAS or IAS that is fed from a QueryStmt that has outer-joined inline views with constants or conditionals in the select list. The regression was introduced in this commit: b8f642710ea9d311a7aca32611eaa7cac6cd86df Now that the final expression substitution with TupleIsNullPredicate() wrapping is performed in planning, the InsertStmt's result expressions should be taken from the feeding QueryStmt's result expressions, and not the QueryStmt's (already substituted) base table result expressions. Change-Id: Iae29683638df01f140d0f74976cca8ca9ba0852d Reviewed-on: http://gerrit.cloudera.org:8080/637 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-08-18 01:44:45 +00:00
Casey Ching	cf60967b7e	IMPALA-1675: Avoid overflow when adding large intervals to TIMESTAMPs It turns out there is a variety of cases where boost incorrectly adds intervals if the interval is at (or beyond) an edge case value. This change defines a max interval and returns NULL if the user supplies an interval beyond the max. Change-Id: I4fb6869be22ab06089b66eeffaea04b0c0880080 Reviewed-on: http://gerrit.cloudera.org:8080/492 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2015-08-16 12:09:24 +00:00
Christopher Channing	9ea5caf0ef	IMPALA-2199: Row count not set for empty partition when spec is used with compute incremental stats This patch resolves an issue where row count is not set to 0 when a partition spec is used with 'compute incremental stats' on a partition that contains no data. The fix is to populate the partition 'expected list' in the frontend with the partition spec, the backend keeps track of which partitions had statistics generated. In the scenario where no statistics are generated for a partition, the backend will fall back to the 'expected list' to zero out the statistics. Change-Id: If4aac131dbe44e14a0477afa58e980da9e235d6b Reviewed-on: http://gerrit.cloudera.org:8080/627 Reviewed-by: Christopher Channing <cchanning@cloudera.com> Tested-by: Internal Jenkins	2015-08-13 09:38:30 +00:00
Dimitris Tsirogiannis	47c5ae405a	Revert "IMPALA-2015: Add support for nested loop join" This reverts commit 6837cdec7f6a7e1c7e8157e323f3ab68277689aa. Change-Id: I2fd6424c553a701fcbfd425b4486af7280820b23 Reviewed-on: http://gerrit.cloudera.org:8080/636 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-08-13 02:20:07 +00:00

577 Commits