impala

mirror of https://github.com/apache/impala.git synced 2026-01-06 06:01:03 -05:00

Author	SHA1	Message	Date
Alex Behm	263f222557	IMPALA-4490: Only generate runtime filters for hash join nodes. Change-Id: I167725e260bd0f91c2bfc164eb044321192d5b95 Reviewed-on: http://gerrit.cloudera.org:8080/5117 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-18 00:26:35 +00:00
Alex Behm	f5e660dd6e	IMPALA-4470: Avoid creating a NumericLiteral from NaN/infinity/-0. Our NumericLiteral is backed by a BigDecimal which cannot represent the special float values NaN, infinity or negative zero. As a result, when evaluating constant expressions from the FE we hit an exception when trying to create a NumericLiteral from a NaN or infinity value. Before, negative zero would silently get converted to zero which is dangerous. The fix is to treat the expr evaluation as a failure and not replace the constant Expr with a LiteralExpr. Change-Id: I8243b2ee9fa9c470d078b385583f2f48b606a230 Reviewed-on: http://gerrit.cloudera.org:8080/5050 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-16 23:55:42 +00:00
Alex Behm	91b5264e52	IMPALA-4479: Use correct isSet() thrift function when evaluating constant bool exprs. Change-Id: Ie3ba195a5241ca630bd0cf71b83d423733b06546 Reviewed-on: http://gerrit.cloudera.org:8080/5088 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-15 11:17:43 +00:00
Alex Behm	0aeb68050b	IMPALA-1286: Extract common conjuncts from disjunctions. Adds a new ExprRewriteRule to extract common conjuncts from disjunctions. Examples: (a AND b AND c) OR (b AND d) ==> b AND ((a AND c) OR (d)) (a AND b) OR (a AND b) ==> a AND b (a AND b AND c) OR (c) ==> c Adds a new query option ENABLE_EXPR_REWRITES to enable/disable non-essential expr rewrites in the FE. Note that some rewrites are required, e.g., BetweenToCompoundRule. Disabling the rewrites is useful for testing, in particular, to make sure that the exprs specified in expr-test.cc are executed as written. Testing: Added a new unit test in ExprRewriteRulesTest. Change-Id: I3cf9b950afaa3fd753d1b09ba5e540b5258940ad Reviewed-on: http://gerrit.cloudera.org:8080/4877 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-09 09:44:59 +00:00
Matthew Jacobs	08d89a5cc3	IMPALA-3710: Kudu DML should ignore conflicts by default Removes the non-standard IGNORE syntax that was allowed for DML into Kudu tables to indicate that certain errors should be ignored, i.e. not fail the query and continue. However, because there is no way to 'roll back' mutations that occurred before an error occurs, tables are left in an inconsistent state and it's difficult to know what rows were successfully modified vs which rows were not. Instead, this change makes it so that we always 'ignore' these conflicts, i.e. a 'best effort'. In the future, when Kudu will provide the mechanisms Impala needs to provide a notion of isolation levels, then Impala will be able to provide options for more traditional semantics. After this change, the following errors are ignored: * INSERT where the PK already exists * UPDATE/DELETE where the PK doesn't exist Another follow-up patch will change other violations to be handled in this way as well, e.g. nulls inserted in non-nullable cols. Reporting: The number of rows inserted is reported to the coordinator, which makes the aggregate available to the shell and via the profile. TODO: Return rows modified for INSERT via HS2 (IMPALA-1789). TODO: Return rows modified for other CRUD (beeswax+hs2) (IMPALA-3713). TODO: Return error counts for specific warnings (IMPALA-4416). Testing: Updated tests. Ran all functional tests. More tests will be needed when other conflicts are handled in the same way. Change-Id: I83b5beaa982d006da4997a2af061ef7c22cad3f1 Reviewed-on: http://gerrit.cloudera.org:8080/4911 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-08 20:34:00 +00:00
Thomas Tauber-Marshall	832fb53763	IMPALA-3725 Support Kudu UPSERT in Impala This patch introduces a new query statement, UPSERT, for Kudu tables which operates like an INSERT and uses all of the analysis, planning, and execution machinery as INSERT, except that if there's a primary key collision instead of returning an error an update is performed. New syntax: [with_clause] UPSERT INTO [TABLE] table_name [(column list)] { query_stmt \| VALUES (value [, value...]) [, (value [, (value...)]) ...] } where column list must contain all of the key columns in table_name, if specified, and table_name must be a Kudu table. This patch also improves the behavior of INSERTing into Kudu tables without specifying all of the key columns - this now results in an analysis exception, rather than attempting the INSERT and receiving an error back from Kudu. Change-Id: I8df5cea36b642e267f85ff6b163f3dd96b8386e9 Reviewed-on: http://gerrit.cloudera.org:8080/4047 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-11-05 04:16:54 +00:00
Dimitris Tsirogiannis	d802f321b2	IMPALA-3724: Support Kudu non-covering range partitions This commit adds support for non-covering range partitions in Kudu tables. The SPLIT ROWS clause is now deprecated and no longer supported. The following new syntax provides more flexibility in creating range partitions and it supports bounded and unbounded ranges as well as single value partitions; multi-column range partitions are supported as well. The new syntax is: DISTRIBUTE BY RANGE (col_list) ( PARTITION lower_1 <[=] VALUES <[=] upper_1, PARTITION lower_2 <[=] VALUES <[=] upper_2, .... PARTITION lower_n <[=] VALUES <[=] upper_n, PARTITION VALUE = val_1, .... PARTITION VALUE = val_n ) Multi-column range partitions are specified as follows: DISTRIBUTE BY RANGE (col1, col2,..., coln) ( PARTITION VALUE = (col1_val, col2_val, ..., coln_val), .... PARTITION VALUE = (col1_val, col2_val, ..., coln_val) ) Change-Id: I6799c01a37003f0f4c068d911a13e3f060110a06 Reviewed-on: http://gerrit.cloudera.org:8080/4856 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-11-04 22:02:22 +00:00
Alex Behm	c5f49ec9bb	IMPALA-4423: Correct but conservative implementation of Subquery.equals(). The underlying problem was for trivial/constant [NOT] EXISTS subqueries we substituted out Subqueries with bool literals using an ExprSubstitutionMap, but the Subquery.equals() function was not implemented properly, so we ended up matching Subqueries to the wrong entry in the ExprSubstitutionMap. This could ultimately lead to wrong plans and results. Testing: Corrected an existing test and modified an existing test for extra coverage. Change-Id: I5562d98ce36507aa5e253323e184fd42b54f27ed Reviewed-on: http://gerrit.cloudera.org:8080/4923 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-04 00:19:33 +00:00
Alex Behm	1d8cdb02c6	IMPALA-4309: Introduce Expr rewrite phase and supporting classes. Introduces a new phase for rewriting Exprs after analysis and before subquery rewriting. The transformed Exprs replace the original ones in analyzed statements. If Exprs were changed, the whole statement is reset() and re-analyzed, similar to how subqueries are rewritten. If both Exprs and subqueries are rewritten there is only one re-analysis of the changed statement. The following new classes work together to perform transformations: 1. ExprRewriteRule - base class for Expr transformation rules 2. ExprRewriter - drives the transformation of Exprs using a list of ExprRewriteRules Statements that have exprs to be rewritten need to implement a new method rewriteExprs() that accepts an ExprRewriter. As an example, this patch adds a rule for converting BetweenPredicates into their equivalent CompoundPredicates. The BetweenPredicate has been notoriously buggy due to a lack of such a separate rewrite phase and is now cleaned up. Testing: 1. Added a new test for checking that the rewrite framework covers all relevant statements, clauses and can properly handle nested statements and subqueries. 2. Added a new test for ExprRewriteRules and implemented tests for the BetweenPredicate rewrite. 3. There are many existing tests for BetweenPredicates and they all exercise the new rewrite rule/phase. 4. Ran a private core/hdfs run and it passed. Change-Id: I2279dc984bcf7742db4fa3b1aa67283ecbb05e6e Reviewed-on: http://gerrit.cloudera.org:8080/4746 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-03 20:08:37 +00:00
Marcel Kornacker	0d857237a8	IMPALA-4314: Standardize on MT-related data structures This removes the data structures that were "superceded" in IMPALA-3903 and changes all control flow to utilize the new data structures. The new data structures are renamed to remove the "Mt" prefix. Change-Id: I465d0e15e2cf17cafe4c747d34c8f595d3645151 Reviewed-on: http://gerrit.cloudera.org:8080/4853 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Tim Armstrong <tarmstrong@cloudera.com>	2016-10-31 16:03:32 +00:00
Lars Volker	c24e9da914	IMPALA-2521: Add clustered hint to insert statements This change introduces a clustered/noclustered hint for insert statements. Specifying this hint adds an additional sort node to the plan, just before the table sink. This has the effect that data will be clustered by its partition prior to writing partitions, which therefore can be written sequentially. Change-Id: I412153bd8435d792bd61dea268d7a3b884048f14 Reviewed-on: http://gerrit.cloudera.org:8080/4745 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-26 04:56:14 +00:00
Dimitris Tsirogiannis	041fa6d946	IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables With this commit we simplify the syntax and handling of CREATE TABLE statements for both managed and external Kudu tables. Syntax example: CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b)) DISTRIBUTE BY HASH (a) INTO 3 BUCKETS, RANGE (b) SPLIT ROWS (('abc', 'def')) STORED AS KUDU Changes: 1) Remove the requirement to specify table properties such as key columns in tblproperties. 2) Read table schema (column definitions, primary keys, and distribution schemes) from Kudu instead of the HMS. 3) For external tables, the Kudu table is now required to exist at the time of creation in Impala. 4) Disallow table properties that could conflict with an existing table. Ex: key_columns cannot be specified. 5) Add KUDU as a file format. 6) Add a startup flag to impalad to specify the default Kudu master addresses. The flag is used as the default value for the table property kudu_master_addresses but it can still be overridden using TBLPROPERTIES. 7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE wasn't implemented for Kudu tables and silently ignored. The Kudu tables wouldn't be removed in Kudu. 8) Remove DDL delegates. There was only one functional delegate (for Kudu); the existence of the other delegate and the use of delegates in general have led to confusion. The Kudu delegate only exists to provide functionality missing from Hive. 9) Add PRIMARY KEY at the column and table level. This syntax is fairly standard. When used at the column level, only one column can be marked as a key. When used at the table level, multiple columns can be used as a key. Only Kudu tables are allowed to use PRIMARY KEY. The old "kudu.key_columns" table property is no longer accepted though it is still used internally. "PRIMARY" is now a keyword. The ident style declaration is used for "KEY" because it is also used for nested map types. 
10) For managed tables, infer a Kudu table name if none was given. The table property "kudu.table_name" is optional for managed tables and is required for external tables. If for a managed table a Kudu table name is not provided, a table name will be generated based on the HMS database and table name. 11) Use Kudu master as the source of truth for table metadata instead of HMS when a table is loaded or refreshed. Table/column metadata are cached in the catalog and are stored in HMS in order to be able to use table and column statistics. Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1 Reviewed-on: http://gerrit.cloudera.org:8080/4414 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 10:52:25 +00:00
Yuanhao Luo	f8d48b8582	IMPALA-4325: StmtRewrite lost parentheses of CompoundPredicate StmtRewrite lost the parentheses of a CompoundPredicate in pushNegationToOperands(), which led to an incorrect toSql() result. Although this issue does not lead to incorrect query results, it confuses users about the logical operator precedence of the predicates shown in the EXPLAIN statement. Change-Id: I79bfc67605206e0e026293bf7032a88227a95623 Reviewed-on: http://gerrit.cloudera.org:8080/4753 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 07:45:53 +00:00
Alex Behm	0480253566	IMPALA-4270: Gracefully fail unsupported queries with mt_dop > 0. MT_DOP > 0 is only supported for plans without distributed joins or table sinks. Adds validation to fail unsupported queries gracefully in planning. For scans in queries that are executable with MT_DOP > 0 we either use the optimized MT scan node BE implementation (only Parquet), or we use the conventional scan node with num_scanner_threads=1. TODO: Still need to add end-to-end tests. Change-Id: I91a60ea7b6e3ae4ee44be856615ddd3cd0af476d Reviewed-on: http://gerrit.cloudera.org:8080/4677 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-17 09:22:57 +00:00
Henry Robinson	9f61397fc4	IMPALA-2905: Handle coordinator fragment lifecycle like all others The plan-root fragment instance that runs on the coordinator should be handled like all others: started via RPC and run asynchronously. Without this, the fragment requires special-case code throughout the coordinator, and does not show up in system metrics etc. This patch adds a new sink type, PlanRootSink, to the root fragment instance so that the coordinator can pull row batches that are pushed by the root instance. The coordinator signals completion to the fragment instance via closing the consumer side of the sink, whereupon the instance is free to complete. Since the root instance now runs asynchronously wrt to the coordinator, we add several coordination methods to allow the coordinator to wait for a point in the instance's execution to be hit - e.g. to wait until the instance has been opened. Done in this patch: * Add PlanRootSink * Add coordination to PFE to allow coordinator to observe lifecycle * Make FragmentMgr a singleton * Removed dead code from Coordinator::Wait() and elsewhere. * Moved result output exprs out of QES and into PlanRootSink. * Remove special-case limit-based teardown of coordinator fragment, and supporting functions in PlanFragmentExecutor. * Simplified lifecycle of PlanFragmentExecutor by separating Open() into Open() and Exec(), the latter of which drives the sink by reading rows from the plan tree. * Add child profile to PlanFragmentExecutor to measure time spent in each lifecycle phase. * Removed dependency between InitExecProfiles() and starting root fragment. * Removed mostly dead-code handling of LIMIT 0 queries. * Ensured that SET returns a result set in all cases. * Fix test_get_log() HS2 test. Errors are only guaranteed to be visible after fetch calls return EOS, but test was assuming this would happen after first fetch. 
Change-Id: Ibb0064ec2f085fa3a5598ea80894fb489a01e4df Reviewed-on: http://gerrit.cloudera.org:8080/4402 Tested-by: Internal Jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2016-10-16 15:55:29 +00:00
Lars Volker	1a5c43ef5e	IMPALA-3644 Make predicate order deterministic This adds a tie-break to make sure that we sort predicates in a deterministic order on Java 7 and 8. This was suggested by Alex in IMPALA-3644. There are still three broken tests when run in Java 8, but it seems best to address them in a subsequent change. Change-Id: Id11010bfeaff368869e6d430eeb4773ddf41faff Reviewed-on: http://gerrit.cloudera.org:8080/4671 Reviewed-by: Jim Apple <jbapple@cloudera.com> Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-10-14 22:04:30 +00:00
Lars Volker	ef4c9958d0	IMPALA-4047: Remove occurrences of 'CDH'/'cdh' from repo This change removes some of the occurrences of the strings 'CDH'/'cdh' from the Impala repository. References to Cloudera-internal Jiras have been replaced with upstream Jira issues on issues.cloudera.org. For several categories of occurrences (e.g. pom.xml files, DOWNLOAD_CDH_COMPONENTS) I also created a list of follow-up Jiras to remove the occurrences left after this change. Change-Id: Icb37e2ef0cd9fa0e581d359c5dd3db7812b7b2c8 Reviewed-on: http://gerrit.cloudera.org:8080/4187 Reviewed-by: Jim Apple <jbapple@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-13 00:40:41 +00:00
Taras Bobrovytsky	acb25a6d16	IMPALA-4076: Fix runtime filter sort compare method Fixed 2 issues: - The getSelectivity() method sometimes returned NaN double values which could not be sorted properly. - The compare method for sorting runtime filters was switched to use the builtin Double comparison method. Change-Id: Iad433f2ece423ea29e79e81b68fa53cb0af18378 Reviewed-on: http://gerrit.cloudera.org:8080/4652 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-07 05:59:50 +00:00
Matthew Jacobs	2b5d1344c9	IMPALA-4213: Fix Kudu predicates that need constant folding Folding const exprs where there were implicit casts on the slot resulted in the predicate not being pushed to Kudu. Change-Id: I3bab22d90ee00a054c847de6c734b4f24a3f5a85 Reviewed-on: http://gerrit.cloudera.org:8080/4613 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-10-06 04:06:38 +00:00
Yonghyun Hwang	112ff68edd	IMPALA-4042: Preserve root types when substituting grouping exprs In case of count(distinct), FunctionCallExpr.analyze() changes type for "NULL" into "BOOLEAN" to make sure that BE doesn't see any "NULL_TYPE" exprs. In the meantime, Expr substitution, happening in Expr.substituteImpl() reverts this change back to original type, "NULL_TYPE". This causes an issue when AggregateInfo.checkConsistency() performs precondition check where slot types from AggregateInfo.outputTupleDesc_ should be matched with the types from AggregateInfo.groupingExpr_. The slot type shows "BOOLEAN" while type from groupingExpr_ is "NULL_TYPE", which makes the precondition fail and throws an exception. To resolve the issue, preserveRootType is set to true when Expr.substituteList() gets called in AggregateInfo.substitute() Change-Id: Icf3b4511234e473e5b9548fbf3e97f333c9980f1 (cherry picked from commit b17785b4890bedd1c825140ce3c48cd7d9734295) Reviewed-on: http://gerrit.cloudera.org:8080/4600 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-05 03:04:17 +00:00
Alex Behm	a5e84ac014	IMPALA-4206: Add column lineage regression test. The underlying issue was already fixed in IMPALA-3940. This patch adds a new regression test to cover the IMPALA-4206. Change-Id: I5b164000c7b0ce7e2f296d168d75a6860f5963d8 Reviewed-on: http://gerrit.cloudera.org:8080/4556 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-09-29 07:45:19 +00:00
Matthew Jacobs	c7fa03286b	IMPALA-3718: Support subset of functional-query for Kudu Adds initial support for the functional-query test workload for Kudu tables. There are a few issues that make loading the functional schema difficult on Kudu: 1) Kudu tables must have one or more columns that together constitute a unique primary key. a) Primary key columns must currently be the first columns in the table definition (KUDU-1271). b) Primary key columns cannot be nullable (KUDU-1570). 2) Kudu tables must be specified with distribution parameters. (1) limits the tables that can be loaded without ugly workarounds. This patch only includes important tables that are used for relevant tests, most notably the alltypes* family. In particular, alltypesagg is important but it does not have a set of columns that are non-nullable and form a unique primary key. As a result, that table is created in Kudu with a different name and an additional BIGINT column for a PK that is a unique index and is generated at data loading time using the ROW_NUMBER analytic function. A view is then wrapped around the underlying table that matches the alltypesagg schema exactly. When KUDU-1570 is resolved, this can be simplified. (2) requires some additional considerations and custom syntax. As a result, the DDL to create the tables is explicitly specified in CREATE_KUDU sections in the functional_schema_constraints.csv, and an additional DEPENDENT_LOAD_KUDU section was added to specify custom data loading DML that differs from the existing DEPENDENT_LOAD. TODO: IMPALA-4005: generate_schema_statements.py needs refactoring Tests that are not relevant or not yet supported have been marked with xfail and a skip where appropriate. TODO: Support remaining functional tables/tests when possible. Change-Id: Iada88e078352e4462745d9a9a1b5111260d21acc Reviewed-on: http://gerrit.cloudera.org:8080/4175 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-09-14 22:11:04 +00:00
Matthew Jacobs	157c80056c	IMPALA-3481: Use Kudu ScanToken API for scan ranges Switches the planner and KuduScanNode to use Kudu's new ScanToken API instead of explicitly constructing scan ranges for all tablets of a table, regardless of whether they were needed. The ScanToken API allows Impala to specify the projected columns and predicates during planning, and Kudu returns a set of 'scan tokens' that represent a scanner for each tablet that needs to be scanned. The scan tokens can be serialized and distributed to the scan nodes, which can then deserialize them into Kudu scanner objects. Upon deserialization, the scan token has all scan parameters already, including the 'pushed down' predicates. Impala no longer needs to send the Kudu predicates to the BE and convert them at the scan node. This change also fixes: 1) IMPALA-4016: Avoid materializing slots only referenced by Kudu conjuncts 2) IMPALA-3874: Predicates are not always pushed to Kudu TODO: Consider additional planning improvements. Testing: Updated the existing tests, verified everything works as expected. Some BE tests no longer make sense and they were removed. TODO: When KUDU-1065 is resolved, add tests that demonstrate pruning. Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Reviewed-on: http://gerrit.cloudera.org:8080/4120 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-09-08 01:50:51 +00:00
Alex Behm	5adedc6a1a	IMPALA-3930,IMPALA-2570: Fix shuffle insert hint with constant partition exprs. Fixes inserts into partitioned tables that have a shuffle hint and only constant partition exprs. The rows to be inserted are merged at the coordinator where the table sink is executed. There is no need to hash exchange rows. Now accepts insert hints when inserting into unpartitioned tables. The shuffle hint leads to a plan where all rows are merged at the coordinator where the table sink is executed. Change-Id: I1084d49c95b7d867eeac3297fd2016daff0ab687 Reviewed-on: http://gerrit.cloudera.org:8080/4162 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Internal Jenkins	2016-08-31 09:59:00 +00:00
Thomas Tauber-Marshall	d72353d0c9	IMPALA-2932: Extend DistributedPlanner to account for hash table build cost When deciding between a broadcast or repartition join, Impala calculates the cost of each join as the total amount of data that is sent over the network. This ignores some relevant costs, and can lead to bad plans. One such relevant cost is the work to create the hash table used in the join. This patch accounts for this by adding the amount of data inserted into the hash table (the size of the right side of the join) to the previous cost. This generally increases the estimated cost of broadcast joins relative to repartitioning joins, as the broadcast join must build the hash table on each node the data was broadcast to, so its effect will be to make repartitioning joins more likely to be chosen, especially in large clusters. This patch has not yet been performance tested. Change-Id: I03a0f56f69c8deae68d48dfdb9dc95b71aec11f1 Reviewed-on: http://gerrit.cloudera.org:8080/4098 Tested-by: Internal Jenkins Reviewed-by: Matthew Jacobs <mj@cloudera.com>	2016-08-29 16:44:22 +00:00
Alex Behm	1a430fe40c	IMPALA-2540: Add regression test. The bug was also fixed by the fix for IMPALA-3063: `532b1fe118` This patch adds a regression test for IMPALA-2540. Change-Id: I7c7dececfee90540fe7d5f8a606381ec50a3b241 Reviewed-on: http://gerrit.cloudera.org:8080/4071 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-08-25 00:43:05 +00:00
Alex Behm	1bbd667fd3	IMPALA-3828: Enable inversion for inner joins. Testing: Ran the FE planner tests. Examined all the changed plans to verify that the changes are beneficial according to our cardinality estimates. Still need to do a real perf run. Change-Id: I8ba903f1df2446350cca7e71fdb13f550bf9de72 Reviewed-on: http://gerrit.cloudera.org:8080/4035 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-08-19 05:40:01 +00:00
Matthew Jacobs	d113205cee	IMPALA-3650: DISTRIBUTE BY required for managed Kudu tables As of Kudu 0.9, DISTRIBUTE BY is now required when creating a new Kudu table. Create table analysis, data loading, and tests are updated to reflect this. This also bumps the Kudu version to 0.10.0. Change-Id: Ieb15110b10b28ef6dd8ec136c2522b5f44dca43e Reviewed-on: http://gerrit.cloudera.org:8080/3987 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-08-19 02:14:39 +00:00
Matthew Jacobs	0983da92ba	IMPALA-3856,IMPALA-3871: Fix BinaryPredicate normalization for Kudu Change-Id: Iae7612433a2e27f8887abe6624f9ee0f4867b934 Reviewed-on: http://gerrit.cloudera.org:8080/3986 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-08-18 04:03:00 +00:00
Alex Behm	532b1fe118	IMPALA-3063: Separate join inversion from join ordering. Before this change joins were inverted while doing join ordering. That approach was unnecessarily complex because it required modifying the global analysis state for correct conjunct placement, etc. However, join inversion is independent of join ordering, and the existing approach could lead to generating invalid plans with distributed non-equi right outer/semi joins, which we cannot execute in the backend. After this change joins are inverted in a separate pass over the single-node plan. This simplifies the inversion logic and allows us to avoid generating those invalid plans. Note that this change is not only a separation of functionality for the following reasons: 1. Our join cardinality estimation is not symmetric, i.e., A JOIN B may not give the same estimate as B JOIN A due to our FK/PK detection heuristic. In the context of this patch this means that an inverted join may have a different cardinality estimate, so plans may change depending on whether the inversion is done during join ordering or after. 2. We currently only invert outer/semi/anti joins based on the rhs table ref join op. In this patch I want to preserve the existing behavior as much as possible, but when doing the join ordering in a separate pass we may see a join op in a JoinNode that is different from the rhs table ref. So in some situations the inversion behavior based on the join op could be different and there are some examples in this patch. This patch also moves the logic of converting hash joins to nested-loop joins into a separate pass over the single-node plan. Change-Id: If86db7753fc585bb4c69612745ec0103278888a4 Reviewed-on: http://gerrit.cloudera.org:8080/3846 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-08-18 03:25:16 +00:00
Alex Behm	286da59219	IMPALA-3940: Fix getting column stats through views. The bug: During join ordering we rely on the column stats of join predicates for estimating the join cardinality. We have code that tries to find the stats of a column through views but there was a bug in identifying slots that belong to base table scans. The bug led us to incorrectly accept slots of view references which do not have stats. This patch fixes the above issue and adds new test infrastructure for creating test-local views. It adds a TPCH-equivalent database that contains views of the form "select * from tpch_basetbl" for all TPCH tables and adds tests for the plans of all TPCH queries on the view database. Change-Id: Ie3b62a5e7e7d0e84850749108c13991647cedce6 Reviewed-on: http://gerrit.cloudera.org:8080/3865 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-08-11 08:22:30 +00:00
Alex Behm	ac1215fd31	IMPALA-3861: Replace BetweenPredicates with their equivalent CompoundPredicate. The bug: Our BetweenPredicate has a complex object structure that is unlike most other Exprs because we generate an equivalent CompoundPredicate during analysis and replace the original children. Keeping the various members in sync and preserving the object structure during clone() and substitute() is very difficult and error-prone. In particular, subquery rewriting is difficult because we extract and replace correlated BinaryPredicates. Substituting BinaryPredicates in a BetweenPredicate's children is not equivalent to a substitution on the BetweenPredicate's original children, so keeping the various redundant members in sync is quite difficult. The fix is to replace BetweenPredicates with their equivalent CompoundPredicates before performing subquery rewrites. We ultimately still want to fix clone() and substitute() for BetweenPredicates, but an elegant solution is likely to be more involved. Change-Id: I0838b30444ed9704ce6a058d30718a24caa7444a Reviewed-on: http://gerrit.cloudera.org:8080/3804 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-07-30 07:23:52 +00:00
Dimitris Tsirogiannis	0e88f0d7aa	Add TPC-H based planner tests for Kudu tables This commit adds a set of planner tests for Kudu tables based on the 22 TPC-H queries. Change-Id: I6c40534b72b9aa1ee582b9679c2a63cad52df703 Reviewed-on: http://gerrit.cloudera.org:8080/3790 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-07-29 02:49:50 +00:00
Alex Behm	55b43ba8c4	IMPALA-3084: Cache the sequence of table ref and materialized tuple ids during analysis. The bug: For correct predicate assignment we rely on TableRef.getAllTupleIds() and TableRef.getMaterializedTupleIds(). The implementation of those functions used to traverse the chain of table refs and collect the appropriate ids. However, during plan generation we alter the chain of table refs, in particular, for dealing with nested collections, so those altered TableRefs do not return the expected list of ids, leading to wrong decisions in predicate assignment. The fix: Cache the lists of ids during analysis, so we are free to alter the chain of TableRefs during plan generation. Change-Id: I298b8695c9f26644a395ca9f0e86040e3f5f3846 Reviewed-on: http://gerrit.cloudera.org:8080/2415 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-07-22 22:35:18 -07:00
Thomas Tauber-Marshall	343bdad866	IMPALA-3210: last/first_value() support for IGNORE NULLS Added support for the 'ignore nulls' keyword to the last_value and first_value analytic functions, eg. 'last_value(col ignore nulls)', which would return the last value from the window that is not null, or null if all of the values in the window are null. We handle 'ignore nulls' in the FE in the same way that we handle 'distinct' - by adding isIgnoreNulls as a field in FunctionParams. To avoid affecting performance when 'ignore nulls' is not used, and to avoid having to special case 'ignore nulls' on the backend, this patch adds 'last_value_ignore_nulls' and 'first_value_ignore_nulls' builtin analytic functions that wrap 'last_value' and 'first_value' respectively. Change-Id: Ic27525e2237fb54318549d2674f1610884208e9b Reviewed-on: http://gerrit.cloudera.org:8080/3328 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Internal Jenkins	2016-07-18 08:28:09 -07:00
Alex Behm	45740c8bcc	IMPALA-3678: Fix migration of predicates into union operands with an order by + limit. There were two separate issues: First, the SortNode incorrectly picked up unassigned conjuncts, and expected those to be empty. In this case where predicates are migrated into union operands, there could actually be unassigned conjuncts bound by the SortNode's tuple id (and so would be incorrectly picked up). The fix is to not pick up unassigned conjuncts in the SortNode, and allow them to be picked up later (into a SelectNode). Second, when generating the plan for union operands we were missing a call to graft a SelectNode on top of the operand plan to capture unassigned conjuncts. Change-Id: I95d105ac15a3dc975e52dfd418890e13f912dfce Reviewed-on: http://gerrit.cloudera.org:8080/3600 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com>	2016-07-15 18:27:05 +00:00
Jim Apple	d70ffa455d	IMPALA-3450: LIMITs on plan nodes are reflected in cardinality estimates PlanNode includes a 'capAtLimit()' method that can be used in 'computeStats()' on PlanNodes to ensure they do not estimate their cardinality to be more than a pushed-down LIMIT clause. This patch ensures that 'capAtLimit()' is used in all of the relevant classes descending from PlanNode. Change-Id: Ic06dcb93bbb2510c0d40151302bd817ef340b825 Reviewed-on: http://gerrit.cloudera.org:8080/3127 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-05-24 14:40:52 -07:00
Dimitris Tsirogiannis	fa30a0c818	IMPALA-3574: Handle runtime filters with TupleIsNull predicates This commit fixes an issue where an IllegalStateException is thrown while generating runtime filters if a target expr of a join conjunct is wrapped in an IF(TupleIsNull, NULL, e) expr. As this is not a valid expr to be assigned to a scan node (target of a runtime filter), we unwrap these exprs and replace exprs of the form IF(TupleIsNull, NULL, e) with 'e' while producing the target exprs for runtime filters. The original expr of the join conjunct is not modified. Change-Id: I2e3e207b4c8522283a1cd0d14be83d42eba58f5a Reviewed-on: http://gerrit.cloudera.org:8080/3147 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-05-23 08:40:20 -07:00
Dimitris Tsirogiannis	f992dc7f88	IMPALA-2956: Filters should be able to target multiple scan nodes With this commit runtime filters can be assigned to multiple destination nodes (scans). For each filter, the destination nodes are determined using equivalence classes during planning. For each filter, all its destination nodes are in the left subtree rooted at the join node that constructs this filter. A runtime filter may have both local and remote targets. The backend determines how to route each filter depending on the number and type (local, remote) of its destination nodes. With this commit, we enable runtime filter propagation in all the operands of UNION [ALL|DISTINCT] nodes. Change-Id: Iad2ce4e579a30616c469312a4e658140d317507b Reviewed-on: http://gerrit.cloudera.org:8080/2932 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-05-18 01:40:22 -07:00
Taras Bobrovytsky	46c3e43edb	IMPALA-3232: Allow not-exists uncorrelated subqueries Before this patch, correlated exists and not exists subqueries were rewritten as left semi and anti joins respectively. Uncorrelated exists subqueries were rewritten as cross joins, and uncorrelated not-exists subqueries were not supported at all. This patch takes advantage of the nested loop join that was recently introduced, which allows us to rewrite both correlated and uncorrelated exists subqueries as left semi joins and both correlated and uncorrelated not-exists subqueries as anti joins. Change-Id: I52ae12f116d026190f3a2a7575cda855317d11e8 Reviewed-on: http://gerrit.cloudera.org:8080/2792 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 23:06:36 -07:00
Marcel Kornacker	3b7d5b7c17	MT: Planner for multi-threaded execution New classes: - ParallelPlanner: creates build plans, assigns plans to cohorts - JoinBuildSink: DataSink for plan fragments that materialize build sides - ids for plans, hash tables, plan fragments Tests: this adds a new test file section PARALLELPLANS and augments the tpc-h/-ds tests with those sections. In the interest of keeping this patch small I didn't augment other test files with that section yet (which will happen at a later date, to cover more corner cases). Change-Id: Ic3c34dd3f9190a131e6f03d901b4bfcd164a5174 Reviewed-on: http://gerrit.cloudera.org:8080/2846 Tested-by: Internal Jenkins Reviewed-by: Marcel Kornacker <marcel@cloudera.com>	2016-05-12 14:17:56 -07:00
Thomas Tauber-Marshall	8c2bf9769a	IMPALA-2805: Order conjuncts based on selectivity and cost Added costs to all Exprs, which estimate the relative cost of evaluating an expression and all of its children. Costs are calculated during analysis. For now, these costs are intended as a simple way to order expressions from cheap to expensive, not necessarily to be a precise reflection of running times. In general, expressions that deal with variable length types like strings will have higher cost than those dealing with fixed length types like numbers and booleans. Additionally, expressions with complicated subexpressions will have higher cost than simpler expressions. Also added PlanNode.orderConjunctsByCost, which takes a list of Exprs and returns a new list sorted according to an estimate of the cheapest order to evaluate the conjuncts in, based on their cost and selectivity. The conjuncts are sorted by repeatedly iterating over them and choosing the conjunct that would result in the least total estimated work were it to be applied before the remaining conjuncts. Selectivities are exponentially backed off, and Exprs without selectivity estimates are given a reasonable default. Change-Id: I02279a26fbc6308ac5eb819d78345fc010469034 Reviewed-on: http://gerrit.cloudera.org:8080/2598 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:53 -07:00
Todd Lipcon	4bdd0b976d	IMPALA-3148. Fix selectivity computation for pushed Kudu predicates This follows up on a TODO from the Kudu merge and also fixes a bug: IMPALA-976 changed the computation of selectivities for a combined list of conjuncts to better handle expressions with no selectivity estimate. The Kudu implementation was forked from before this change and thus did not have an equivalent change. This refactors the algorithm to a new static method and calls it from both PlanNode and KuduScanNode so that the selectivity estimate behavior is the same regardless of whether Kudu can evaluate the predicate server-side. Todd tested this on TPCH 3TB and verified that the plans are reasonable now where they used to be nonsense. Change-Id: Id507077b577ed5804fc80517f33ea185f2bff41a Reviewed-on: http://gerrit.cloudera.org:8080/2628 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-04-12 14:03:44 -07:00
Bharath Vissapragada	5cd7ada727	IMPALA-3194: Allow queries materializing scalar type columns in RC/sequence files This commit unblocks queries materializing only scalar typed columns on tables backed by RC/sequence files containing complex typed columns. This worked prior to 2.3.0 release. Change-Id: I3a89b211bdc01f7e07497e293fafd75ccf0500fe Reviewed-on: http://gerrit.cloudera.org:8080/2580 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-31 12:06:57 +00:00
Tim Armstrong	f5b7842414	IMPALA-2502: don't redundantly repartition grouping aggregations Grouping aggregations previously always repartitioned their input, even if preceding joins or aggs had already partitioned the data on the required key (or an equivalent key). This patch checks to see if data is already partitioned on the required exprs (or equivalent ones), and if so skips the preaggregation and only does a merge aggregation. The patch also does some refactoring of the aggregation planning in DistributedPlanner to make it easier to implement the change. Includes planner tests for the three cases that are affected: grouping aggregations, non-grouping distinct aggregations and grouping distinct aggregations. Change-Id: Iffdcfd3629b8a69bd23915e1adba3b8323cbbaef Reviewed-on: http://gerrit.cloudera.org:8080/2414 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-03-15 09:21:22 +00:00
David Alves	7381304a23	Merge branch 'feature/kudu' into cdh5-trunk This is the final merge commit that merges the 'feature/kudu' branch into cdh5-trunk. Change-Id: Ib3dfb4fc7a69c5cb1c5789422ee52fa192ed677a	2016-03-13 19:28:43 -07:00
David Alves	82222abaf5	Merge branch 'feature/kudu' into cdh5-trunk This merges the 'feature/kudu' branch with cdh5-trunk as of commit: 055500cc753f87f6d1c70627321fcc825044e183 This patch is not a pure merge patch in the sense that goes beyond conflict resolution to also address reviews to the 'feature/kudu' branch as a whole. The review items and their resolution can be inspected at: http://gerrit.cloudera.org:8080/#/c/1403/ Change-Id: I6dd4270cd17a4f5c02811c343726db3504275a92	2016-03-11 11:37:58 -08:00
Alex Behm	4a25f87d5c	Improve the SQL for nested TPCH-Q18. Marcel spotted that nested TPCH-Q18 can be expressed with more efficient SQL. Results on nested TPCH-300: Before 160s After 100s Change-Id: I8b351b7f467e8bef0c256dc43cea325d7f177edf Reviewed-on: http://gerrit.cloudera.org:8080/2418 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-04 04:35:54 +00:00
Alex Behm	54a46e9459	IMPALA-3065/IMPALA-3062: Restrict !empty() predicates to scan nodes. The bug: Evaluating !empty() predicates at non-scan nodes interacts poorly with our BE projection of collection slots. For example, rows could incorrectly be filtered if a !empty() predicate is assigned to a plan node that comes after the unnest of the collection that also performs the projection. The fix: This patch reworks the generation of !empty() predicates introduced in IMPALA-2663 for correctness purposes. The predicates are generated in cases where we can ensure that they will be assigned only by the parent scan, and no other plan node. The conditions are as follows: - collection table ref is relative and non-correlated - collection table ref represents the rhs of an inner/cross/semi join - collection table ref's parent tuple is not outer joined Change-Id: Ie975ce139a103285c4e9f93c59ce1f1d2aa71767 Reviewed-on: http://gerrit.cloudera.org:8080/2399 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Silvius Rus <srus@cloudera.com> Tested-by: Internal Jenkins	2016-03-02 23:23:05 -08:00
Alex Behm	a303f25256	IMPALA-3071: Fix assignment of On-clause predicates belonging to an inner join. The bug: On-clause predicates belonging to an inner join were not always assigned correctly if they referenced an outer-joined tuple. Specifically, our logic for detecting whether a predicate can be assigned below an outer join if also left at the outer-join node was not correct, and so we assigned the predicate below the join, but did not also leave it at the outer join. The fix: Assign an inner join On-clause conjunct that references an outer-joined tuple to the join that the On-clause belongs to. Change-Id: Iffef7718679d48f866fa90fd3257f182cbb385ae Reviewed-on: http://gerrit.cloudera.org:8080/2309 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-29 22:22:41 -08:00
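
The greedy ordering described in commit 8c2bf9769a (IMPALA-2805) can be illustrated with a small sketch. This is not the code from PlanNode.orderConjunctsByCost; it is a Python approximation built only from the commit message. The `Conjunct` class, the `DEFAULT_SELECTIVITY` value of 0.1, and the exact backoff formula (the k-th chosen conjunct has its selectivity raised to the power 1/k) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed default for conjuncts without a selectivity estimate (hypothetical value).
DEFAULT_SELECTIVITY = 0.1


@dataclass(frozen=True)
class Conjunct:
    """Hypothetical stand-in for an Expr with a cost and an optional selectivity."""
    name: str
    cost: float
    selectivity: Optional[float] = None


def order_conjuncts_by_cost(conjuncts: List[Conjunct]) -> List[Conjunct]:
    """Greedily pick, at each step, the conjunct that minimizes estimated work.

    Estimated work of applying candidate c first is its own cost plus the cost
    of the remaining conjuncts, scaled by the fraction of rows c lets through.
    """
    remaining = list(conjuncts)
    ordered: List[Conjunct] = []
    while remaining:
        # Assumed backoff: later picks have their selectivity dampened
        # (pushed toward 1) by taking the k-th root.
        backoff = len(ordered) + 1
        total_cost = sum(c.cost for c in remaining)

        def estimated_work(c: Conjunct) -> float:
            sel = c.selectivity if c.selectivity is not None else DEFAULT_SELECTIVITY
            backed_off_sel = sel ** (1.0 / backoff)
            return c.cost + backed_off_sel * (total_cost - c.cost)

        best = min(remaining, key=estimated_work)
        ordered.append(best)
        remaining.remove(best)
    return ordered


if __name__ == "__main__":
    preds = [
        Conjunct("string_like", cost=10.0, selectivity=0.9),
        Conjunct("int_eq", cost=1.0, selectivity=0.1),
    ]
    print([c.name for c in order_conjuncts_by_cost(preds)])
```

In this sketch a cheap, highly selective predicate tends to be evaluated before an expensive, unselective one, which matches the intent stated in the commit message; the real cost model and default values may differ.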


347 Commits