Cross joins should be handled like outer joins in the join order
optimization in that the right table referenced by a cross join may not
be reordered anywhere before tables referenced to the left of the cross
join. If there are inner joins to the right of the cross join, those
tables may be reordered before the cross join.
E.g., if we have A JOIN B CROSS JOIN C JOIN D, then C must come after A
and B, but D may be reordered to come before C.
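To make the constraint concrete, a hedged sketch using hypothetical tables a, b, c, and d:
    -- c is referenced by the cross join, so it must stay after a and b;
    -- d is inner-joined and may be reordered to come before c.
    SELECT *
    FROM a
      JOIN b ON a.id = b.id
      CROSS JOIN c
      JOIN d ON a.id = d.id;
    -- Legal join orders include a, b, d, c and a, b, c, d;
    -- any order that places c before a or b is not considered.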
Also adds test cases for join order optimization and predicate propagation.
Change-Id: I6b1022dd3e862efbff81e283b43284d846c8eca4
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1096
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Impala now exploits existing partitioning of the lhs and/or rhs of joins. Existing
partitioning affects the cost of choosing between a partitioned and a broadcast join,
as well as the final plan, because Impala can avoid repartitioning the data.
An existing data partition is exploitable iff the lhs/rhs input partition is equivalent
to the target partition of the join. This matching procedure considers equivalence classes.
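As a hedged illustration (hypothetical tables and columns): when both join inputs are already hash-partitioned on the join keys, e.g. as outputs of grouping aggregations, the input partitions match the join's target partition and no repartitioning exchange is needed:
    -- Each GROUP BY hash-partitions its output on cust_id, which is
    -- also the join key, so the planner can use a partitioned join
    -- without re-exchanging either input.
    SELECT o.cust_id, o.num_orders, v.num_visits
    FROM (SELECT cust_id, COUNT(*) AS num_orders
          FROM orders GROUP BY cust_id) o
    JOIN (SELECT cust_id, COUNT(*) AS num_visits
          FROM visits GROUP BY cust_id) v
      ON o.cust_id = v.cust_id;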
Change-Id: Ica080f35cf5063bea828963bc234ba69797e2030
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1070
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Adds the TPC-DS queries as planner tests and fixes a few small issues
with the planner test file parser. The queries use SQL-92 style joins
with a hand-optimized (although not perfect) join order.
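For reference, a SQL-92 style join spells out the join predicate in an ON clause rather than listing tables with commas and filtering in WHERE (TPC-DS table names shown for illustration; the queries in the tests differ):
    -- SQL-92 style, with the join order written out by hand:
    SELECT d.d_year, SUM(ss.ss_ext_sales_price)
    FROM store_sales ss
      JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
    GROUP BY d.d_year;
    -- SQL-89 comma style equivalent:
    --   FROM store_sales ss, date_dim d
    --   WHERE ss.ss_sold_date_sk = d.d_date_sk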
Change-Id: I2d81e66af740b2d826b8ebd0c5ba8553b5faf0a2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1019
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Adds a CROSS JOIN (Cartesian product). Common join code is moved to
a new abstract base class, BlockingJoinNode. We must keep all build RowBatches in
memory in order to iterate over them for every row from the left child. The
TupleRowList provides a convenient way to iterate over all of the rows.
A future change will address codegen for the CrossJoinNode.
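A usage sketch of the execution trade-off (hypothetical tables; assumes the right child is the build side, as with our other join nodes):
    -- All rows of small_dim are buffered in memory and iterated once
    -- per row of big_table, producing COUNT(a) * COUNT(b) output rows,
    -- so the smaller input should go on the right.
    SELECT a.x, b.y
    FROM big_table a
    CROSS JOIN small_dim b;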
Change-Id: I5e0caa6fb4ec802a9c87e700f9dd6238cea8cdf2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/970
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Introduces STRAIGHT_JOIN keyword to prevent join order optimization.
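A hedged usage sketch (hypothetical tables); the hint goes right after SELECT and makes the planner join the tables in the order they appear in the FROM clause:
    SELECT STRAIGHT_JOIN c.name, SUM(o.total)
    FROM customers c
      JOIN orders o ON c.id = o.cust_id  -- joined exactly in this order
    GROUP BY c.name;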
Structural changes to the planning framework:
- slot materialization: the decision whether to materialize a slot now happens *prior* to
plan generation. This is needed to generate accurate cost estimates
at plan generation time. See QueryStmt.materializeRequiredSlots().
- added PlanNode.init(), which initializes the entire state of a PlanNode; this subsumes
finalize()
* computeMemLayout() now happens per-tuple in the corresponding ScanNode's init()
* init() calls computeStats() by default; also marks slots as materialized and calls
TupleDescriptor.computeMemLayout()
- added PlanNode.tblRefIds_
- restructured UnionStmt and union plan generation to fit pred propagation model:
all tuples are created (and equiv predicates registered) prior to plan generation
- added Expr.isAuxExpr
Change-Id: I475c1645bfca9e84ae6e5f529e7781d9532e5c9a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/955
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Updates our compute stats script to execute using Impala. This allows us
to easily compute stats on all tables in a database or all tables in the
metastore.
The updated stats caused one of the TPCH plans to change so this also
updates the TPCH planner test results.
Change-Id: I17e5dcd1036a35e40eb4eb2c8e4a20702db9049c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1024
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Adds support for skipping a number of rows with an ORDER BY clause and a LIMIT. Hive
does not support OFFSET, so creating a view with an OFFSET will not work in Hive.
For example, "SELECT * FROM T1 ORDER BY ID LIMIT 20 OFFSET 5" will do the sorting, skip
5 rows, then return the next 20. OFFSET requires an ORDER BY clause.
Note this is not very efficient, as we must actually keep (limit + offset) rows in memory
in the top-n node, and all child sort nodes must do the same. Users should be careful when
using this feature.
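A minimal sketch of the behavior described above (hypothetical table t1):
    -- Sorts by id, skips the first 5 rows, returns the next 20.
    -- The top-n node buffers limit + offset = 25 rows in memory,
    -- and so does every child sort node.
    SELECT * FROM t1
    ORDER BY id
    LIMIT 20 OFFSET 5;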
Change-Id: I4d7021c278296e7bdbfa0e6f2699cd6f23eef59d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/900
Tested-by: jenkins
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
This patch redoes how the aggregation node is implemented. The functionality is
now split between aggregation-node, agg-expr and aggregate-functions. This is a work in
progress (there's still a lot of debug stuff I added that needs to be cleaned up) but
it does pass the tests.
The aggregation node is now very simple and only deals with the grouping part.
Aggregate-expr serves as the glue between the agg node and the aggregate functions.
The aggregation functions are implemented with the UDA interface. I've reimplemented
our existing aggregate functions with this setup. For true UDAs, the binaries would be
loaded in aggregate-expr.
This also includes some preliminary changes in the FE. We now need to annotate each
AggNode as executing the update vs. merge phase (root aggs execute update, others
execute merge) and if it needs a finalize step (only the root does). This is more
general than our builtins which are too simple to need this structure.
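As a rough illustration of the three UDA phases on a simple query (hypothetical TPC-H-style schema; the exact plan shape depends on the planner):
    -- Update consumes raw input rows to build intermediate values;
    -- merge combines intermediate values coming from different nodes;
    -- finalize turns the final intermediate value into the output
    -- (e.g., sum/count for AVG).
    SELECT l_returnflag, AVG(l_quantity)
    FROM lineitem
    GROUP BY l_returnflag;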
There is a big TODO here to allow the intermediate types between agg nodes to change.
For example, in distinct estimate, the input type is the column type and the output type
is a bigint. We'd like the intermediate type to be CHAR(256). This is a departure, since
until now the intermediate type and output type have always been the same. We've hacked
around this by having both the intermediate and output type be TYPE_STRING. I've left
this for another patch (changing the BE to support this is trivial).
For aggregates that result in strings, we used to store some additional stuff past the
end of the tuple. The layout was:
<tuple> <length of 1st string buffer>,<length of 2nd string buffer>, etc
The rationale for this is that we want to reuse the buffer for min/max and grow the buffer
more quickly for group_concat. This breaks down the abstraction between agg-expr and
agg-node and is not something UDAs can use in general. Rather than try to hack around
this, I think the proper solution is for the intermediate type to not be StringValue but
a type that contains the buffer length itself.
This patch also resurrects the distinct estimate code. The distinct estimate functions
exercise all of the code paths.
Change-Id: Ic152a2cd03bc1713967673681e1e6204dcd80346
Reviewed-on: http://gerrit.ent.cloudera.com:8080/564
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
Fixed cost estimation of union queries and exchange nodes.
Fixed propagation of stats through cloning of exprs and plan nodes.
Fixed propagation of expr stats to slots they are materialized into (e.g., grouping columns in multi-level aggs).
Improved explain output for constant selects.
Change-Id: I96d1652c00d48e4093b85ae7fc8bad28d74b8b81
Reviewed-on: http://gerrit.ent.cloudera.com:8080/547
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
This fix contains two parts:
- functionality inside the analyzer to compute a value transfer graph (from equality predicates
between slotrefs) and, from that, equivalence classes for all slots; this functionality is required for this fix
but will be generally useful when adding propagation of binding predicates in the future
- a "shortest path" implementation inside the planner of a fix for the problem at hand; this leaves a lot to be desired:
* correct handling of assigned predicates: the added test case shows that the planner will try to assign all predicates
to some node in the tree, even if that predicate is superfluous because it was subsumed by an equality derived from
equivalence class membership
* complete lack of propagation of binding predicates (e.g., propagate "col1 = 5" to all slotrefs that are in the same
equivalence class as col1; see the example below)
This is beyond what can be accomplished for 1.1 and therefore will have to wait for 1.2.
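A hedged sketch of what the value transfer graph makes possible (hypothetical tables):
    -- a.id = b.id places a.id and b.id in one equivalence class, so
    -- the binding predicate a.id = 5 could in principle also be
    -- applied as b.id = 5 at the scan of b. As noted above, that
    -- propagation is not part of this fix.
    SELECT *
    FROM a JOIN b ON a.id = b.id
    WHERE a.id = 5;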
This is solved by repartitioning the input to the HDFS table sinks on the partition key columns of the HDFS
table, so that each partition is only written by a single node.
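A hedged example (hypothetical partitioned table): the plan for the insert below gets an exchange that hash-partitions rows on year and month before the table sink, so each (year, month) partition is written by exactly one node:
    -- Dynamic partition insert; the partition key columns come last
    -- in the select list.
    INSERT INTO sales_part PARTITION (year, month)
    SELECT amount, year, month FROM sales_staging;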
- added PlanNode.numNodes, PlanNode.avgRowSize and PlanNode.computeStats()
- fixing up some cardinality estimates
- Planner now tries to make a cost-based decision between a broadcast join and a join with full repartitioning (both inputs); see the cost sketch after this list
- ExchangeNode now distinguishes between its input and output row descriptor: the output potentially contains more tuples
- fixed problem related to cancellation and concurrent hash table builds.
Not included:
- partitioned joins that take advantage of existing partitions of the inputs; those will have to wait for a follow-on change
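For the cost-based broadcast-vs-partitioned decision mentioned above, a hedged sketch of the kind of comparison involved (assumed cost model and illustrative numbers, not the exact formula):
    -- Broadcast sends the rhs to every node; partitioning hashes both
    -- inputs across the network once. With lhs ~ 10 GB, rhs ~ 100 MB,
    -- and 20 nodes:
    --   broadcast   ~ 100 MB * 20 nodes = 2 GB moved
    --   partitioned ~ 10 GB + 0.1 GB   = 10.1 GB moved
    -- so broadcasting the small rhs is cheaper here.
    SELECT f.id, d.attr
    FROM fact_10gb f
      JOIN dim_100mb d ON f.key = d.key;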