The issue: Computing the full transitive closure for all slots can be very
expensive (10s of seconds for >2k slots, minutes for >4k slots).
Queries with many views and/or unions were affected most because each
union/view adds a new tuple with slots, increasing the total number of slots.
The fix: The new algorithm exploits the sparse structure of the value transfer
graph for a significant speedup (>100x). The high-level steps are:
1. Identify complete subgraphs based on bi-directional value transfers, and
coalesce the slots of each complete subgraph into a single slot.
2. Map the remaining uni-directional value transfers into the new slot domain.
3. Identify the connected components of the uni-directional value transfers.
This step partitions the value transfers into disjoint sets.
4. Compute the transitive closure of each partition from (3) in the new slot
domain separately. Hopefully, the partitions are small enough to afford
the O(N^3) complexity of the brute-force transitive closure computation.
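For illustration, a compact sketch of the four steps with toy types (all names
are hypothetical, not Impala's actual analyzer classes; slots are ints in
[0, n) and a value transfer is a directed {from, to} edge):

  import java.util.*;

  public final class SparseTransitiveClosure {
    static final class UnionFind {
      final int[] p;
      UnionFind(int n) { p = new int[n]; for (int i = 0; i < n; ++i) p[i] = i; }
      int find(int x) { while (p[x] != x) { p[x] = p[p[x]]; x = p[x]; } return x; }
      void union(int a, int b) { p[find(a)] = find(b); }
    }

    public static boolean[][] compute(int n, List<int[]> transfers) {
      // Step 1: coalesce slots connected by bi-directional transfers.
      Set<Long> edges = new HashSet<>();
      for (int[] t : transfers) edges.add((long) t[0] << 32 | t[1]);
      UnionFind groups = new UnionFind(n);
      for (int[] t : transfers) {
        if (edges.contains((long) t[1] << 32 | t[0])) groups.union(t[0], t[1]);
      }

      // Step 2: map the remaining uni-directional transfers onto the
      // coalesced slot domain.
      Set<Long> condensed = new HashSet<>();
      for (int[] t : transfers) {
        int a = groups.find(t[0]), b = groups.find(t[1]);
        if (a != b) condensed.add((long) a << 32 | b);
      }

      // Step 3: partition the condensed graph into connected components.
      UnionFind comps = new UnionFind(n);
      for (long e : condensed) comps.union((int) (e >>> 32), (int) e);
      Map<Integer, List<Integer>> partitions = new HashMap<>();
      for (int i = 0; i < n; ++i) {
        if (groups.find(i) == i) {
          partitions.computeIfAbsent(comps.find(i), k -> new ArrayList<>()).add(i);
        }
      }

      // Step 4: brute-force O(N^3) closure, but only within each partition.
      boolean[][] closure = new boolean[n][n];
      for (long e : condensed) closure[(int) (e >>> 32)][(int) e] = true;
      for (List<Integer> part : partitions.values()) {
        for (int k : part) for (int i : part) for (int j : part) {
          if (closure[i][k] && closure[k][j]) closure[i][j] = true;
        }
      }
      return closure;
    }
  }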
Change-Id: I35b57295d8f04b92f00ac48c04d1ef1be4daf41b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2360
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
This patch simplifies the complex slot materialization logic for unions by
making the materialization independent of conjuncts assigned to MergeNodes.
When 'pushing down' predicates into union operands, we drop union operands
with constant predicates evaluating to false. Constant predicates that
evaluate to true are simply ignored.
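Roughly, the pruning rule looks like this sketch (toy types standing in for
Impala's Expr conjuncts):

  import java.util.*;

  // A conjunct that is constant after being pushed into an operand either
  // eliminates the operand (false) or is discarded (true).
  public final class UnionOperandPruning {
    interface Conjunct { boolean isConstant(); boolean constantValue(); }
    static final class Operand { final List<Conjunct> conjuncts = new ArrayList<>(); }

    static List<Operand> prune(List<Operand> operands) {
      List<Operand> kept = new ArrayList<>();
      for (Operand op : operands) {
        boolean alwaysFalse = op.conjuncts.stream()
            .anyMatch(c -> c.isConstant() && !c.constantValue());
        if (alwaysFalse) continue;  // drop the whole operand
        op.conjuncts.removeIf(c -> c.isConstant() && c.constantValue());  // ignore TRUE
        kept.add(op);
      }
      return kept;
    }
  }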
Change-Id: I0e7ccfb206bed29db2b5d667e2bb61310980e80a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2327
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
The reported issue is that we can have redundant hash expressions in exchanges.
The underlying cause is that we fail to remove redundant join predicates.
This patch enforces slot equivalences based on our computed equivalence classes
at the lowest possible plan node by generating new equality predicates.
Each plan subtree now has a minimal set of equality predicates that express
all known equivalences between slots belonging to tuples materialized at that
plan node.
As a result, eliminating redundant join predicates becomes trivial: It is
sufficient to pick a single representative predicate of each relevant equivalence
class. All predicates beyond that are redundant.
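Sketched with toy types (the equivClass array stands in for the planner's
computed equivalence classes; all names are hypothetical):

  import java.util.*;

  // Among equality predicates whose two sides fall into the same
  // equivalence class, one representative per class suffices.
  public final class JoinPredicateDedup {
    static final class EqPredicate {
      final int lhsSlot, rhsSlot;
      EqPredicate(int lhs, int rhs) { lhsSlot = lhs; rhsSlot = rhs; }
    }

    static List<EqPredicate> dedup(List<EqPredicate> candidates, int[] equivClass) {
      Set<Integer> covered = new HashSet<>();
      List<EqPredicate> kept = new ArrayList<>();
      for (EqPredicate p : candidates) {
        int lhsClass = equivClass[p.lhsSlot], rhsClass = equivClass[p.rhsSlot];
        // Predicates joining two different classes are never redundant here.
        if (lhsClass == rhsClass && !covered.add(lhsClass)) continue;
        kept.add(p);
      }
      return kept;
    }
  }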
Change-Id: I7998fe8d7bdf84cc8eb129d32c86269bedeab68e
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2177
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2278
To keep the predicate assignment/propagation logic simple, we assign conjuncts
whose underlying base table exprs are constant in at least one union operand
to the evaluating MergeNode, rather than into the operand(s) whose corresponding
base table exprs are constant.
The JIRA describes two different bugs:
The first bug was that the slots required for evaluating such predicates in the
MergeNode were not marked as materialized. The second bug was that predicates
'pushed' into union operands did not get re-analyzed after substituting the
predicate's exprs with the result exprs of that union operand. The missing
casts led to a crash. The new test covers both bugs.
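The second fix boils down to this pattern (hypothetical Expr/Analyzer
interfaces, not the real FE API):

  import java.util.Map;

  // After substituting a predicate's exprs with a union operand's result
  // exprs, re-analysis must run so that implicit casts are inserted again;
  // skipping it was bug 2 and crashed the backend.
  public final class PushdownSketch {
    interface Expr { Expr substitute(Map<String, Expr> smap); }
    interface Analyzer { Expr analyze(Expr e); }  // inserts implicit casts

    static Expr pushIntoOperand(Expr predicate, Map<String, Expr> operandSmap,
        Analyzer analyzer) {
      Expr substituted = predicate.substitute(operandSmap);
      return analyzer.analyze(substituted);  // the missing re-analysis step
    }
  }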
Change-Id: I0f5b8a366b32f7d4b2587e13793b6103cdf7e8b3
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2162
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
The bug: It is generally incorrect to re-order joins across outer/semi joins.
For example, an inner join following an outer join may reduce the cardinality,
so moving the inner join before the outer join during join re-ordering
would be incorrect: the outer join preserves rows (on one or both sides),
so the reordered plan can produce a different result. The same argument holds
for semi joins.
The fix: Place outer and semi joins at a fixed position in the plan based
on where they appeared in the original query. Inner joins to the left/right of
outer/semi joins are still re-ordered properly.
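In sketch form (toy types; sorting by cardinality is only a stand-in for the
actual re-ordering heuristic):

  import java.util.*;

  // Outer/semi joins are barriers that keep the position the query gave
  // them; only runs of inner joins between barriers are re-ordered.
  public final class JoinPlacement {
    static final class TableRef {
      final boolean isOuterOrSemiJoined;
      final long cardinality;
      TableRef(boolean barrier, long card) { isOuterOrSemiJoined = barrier; cardinality = card; }
    }

    static List<TableRef> reorder(List<TableRef> refs) {
      List<TableRef> plan = new ArrayList<>();
      List<TableRef> run = new ArrayList<>();  // consecutive inner-joined refs
      for (TableRef ref : refs) {
        if (ref.isOuterOrSemiJoined) {
          run.sort(Comparator.comparingLong(r -> r.cardinality));
          plan.addAll(run);
          run.clear();
          plan.add(ref);  // fixed position
        } else {
          run.add(ref);
        }
      }
      run.sort(Comparator.comparingLong(r -> r.cardinality));
      plan.addAll(run);
      return plan;
    }
  }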
Change-Id: Idae837097b9376473d7f8124eef69b51f612b210
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1909
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1922
count(x) with no distinct and no group-by expressions returns NULL on empty input
if other distinct aggs (e.g., COUNT(DISTINCT x)) are present.
This happens because the COUNT is transformed to SUM(COUNT()),
with the inner COUNT being evaluated WITH a group-by expression (e.g. x).
SUM over empty input returns NULL, but COUNT should return 0.
This patch fixes this by replacing COUNT with zeroifnull(COUNT) before AggregateInfo
is generated if there are distinct aggs and no group-bys. The logic in AggregateInfo
itself has not been modified.
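The rewrite, sketched with toy expression types (hypothetical names):

  import java.util.*;

  // With distinct aggs present and no GROUP BY, a plain COUNT is wrapped
  // in zeroifnull() before AggregateInfo is generated, so the outer
  // SUM(COUNT(...)) yields 0 on empty input instead of NULL.
  public final class CountRewrite {
    interface Expr {}
    static final class AggExpr implements Expr {
      final String fn; final boolean isDistinct;
      AggExpr(String fn, boolean isDistinct) { this.fn = fn; this.isDistinct = isDistinct; }
    }
    static final class ZeroIfNull implements Expr {
      final Expr child;
      ZeroIfNull(Expr child) { this.child = child; }
    }

    static List<Expr> rewrite(List<Expr> aggExprs, boolean hasGroupBy) {
      boolean hasDistinct = aggExprs.stream()
          .anyMatch(e -> e instanceof AggExpr && ((AggExpr) e).isDistinct);
      if (!hasDistinct || hasGroupBy) return aggExprs;  // rewrite not needed
      List<Expr> out = new ArrayList<>();
      for (Expr e : aggExprs) {
        boolean plainCount = e instanceof AggExpr
            && ((AggExpr) e).fn.equals("count") && !((AggExpr) e).isDistinct;
        out.add(plainCount ? new ZeroIfNull(e) : e);
      }
      return out;
    }
  }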
Change-Id: I902e3fdd95767135b2f3fe423e8802ef57366af1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1921
Reviewed-by: Srinath Shankar <sshankar@cloudera.com>
Tested-by: jenkins
The bug: Slot materialization on distinct aggs inside an inline view did not work
if the only reference to the 2nd-phase agg-tuple slots was in a predicate from an
outer query block (e.g., Where-clause of the block with the inline view ref).
The reason was that bound predicates were fetched from the wrong tuple
(from the 1st phase agg).
The fix: Assign predicates to the top-most agg in the single-node plan that can
evaluate them, as follows: For non-distinct aggs place them in the 1st phase agg
node. For distinct aggs place them in the 2nd phase agg node.
Change-Id: I0f6ab53cf7bb0c6aed9524ad2e24a849d2dc0ec4
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1843
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1881
The bug: The slots of multi-slot predicates bound to a single
outer-joined tuple were not marked as materialized.
In addition, such predicates were not picked up by nodes
under the join via getBoundPredicates() even if it
would be correct to do so.
The fix: Always mark slots of predicates that must be
evaluated by a join in SelectStmt.materializeRequiredSlots(),
regardless of whether the predicate can also be safely
evaluated below the join.
This patch also generalizes getBoundPredicates() to handle
multi-slot predicates and fixes some issues with redundant
predicate assignment. Still, the new approach has several
limitations, which are documented in the predicate propagation
planner test to ease future improvements.
Change-Id: If5da0354a83c00a9766fc63b7780ed4d5a9c46e5
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1717
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1819
The bug was that the number of materialized agg-tuple slots did not correspond to the number
of materialized agg functions, due to binding predicates against an AggNode causing slot
materialization after SelectStmt.materializeRequiredSlots().
This patch fixes the issue by taking binding predicates (bound to a slot in an agg tuple)
into consideration in SelectStmt.materializeRequiredSlots().
A new sanity check I added in AggregationNode.toThrift() surfaced another issue with slot
materialization that is also fixed in this patch: the ordering exprs must be marked before
the agg exprs in SelectStmt.materializeRequiredSlots() because the ordering exprs may contain
agg exprs that are only referenced inside the ORDER BY clause.
Change-Id: I1bdc0466f583907bed625ce6608938e59faee83f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1639
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1818
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Having predicates need to be transferred to the 2nd phase merge agg
for distinct + non-distinct aggregates without group by.
For distinct + non-distinct aggregates with group by, it is correct
to evaluate the predicates at the 2nd phase (non-merge) agg.
Change-Id: I71d73c4ef92becbb81e142bc0cb5f54e790b1fb5
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1743
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1817
This patch includes several changes to predicate assignment and propagation.
First, we now only register as outer joined those tuples of TableRefs
directly participating in an outer join. In particular,
materialized tuples referenced inside an outer-joined InlineView are not
registered as outer joined - only the InlineView's tuple is registered.
The other major change is that we detect when it is correct to propagate
predicates to scan nodes participating (directly or indirectly) in an outer
join by testing whether a predicate can evaluate to true when the tuple
is NULL. If it can, it is generally not safe to propagate the predicate,
because evaluating it below the join would change the final result of the
outer join.
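Sketched with a stand-in for constant folding (hypothetical types):

  import java.util.Set;
  import java.util.function.Function;

  // Before propagating a predicate below an outer join, evaluate it with
  // the outer-joined tuple's slots replaced by NULL. If the predicate can
  // still come out true, pushing it down would drop rows the outer join
  // is supposed to null-extend.
  public final class NullSafetyCheck {
    enum Tristate { TRUE, FALSE, NULL }

    // evalWithNulls stands in for constant folding after substituting
    // NULL for the given slots of the predicate.
    static boolean safeToPropagate(
        Function<Set<Integer>, Tristate> evalWithNulls, Set<Integer> outerJoinedSlots) {
      Tristate result = evalWithNulls.apply(outerJoinedSlots);
      return result == Tristate.FALSE || result == Tristate.NULL;  // never true
    }
  }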
Change-Id: Ia135ab15ec8c6ef756a908f797f96812d28c84c1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1567
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1606
Our code for eliminating redundant join predicates based on equivalence
classes is not quite right. I've commented out the relevant code
to ensure we don't incorrectly remove predicates. Left a TODO
to fix and re-enable this feature.
Change-Id: Ie76b365903dff6df271a378cbb4fd327ffa0631f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1569
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1572
This patch cleans up analysis and execution of scalar and aggregate functions
so that there is no difference between how builtins and user functions are
handled. The only difference is that the catalog is populated with the builtins
all the time.
The BE always gets a TFunction object and just executes it (builtins will have
an empty hdfs file location).
This removes the opcode registry; its functionality is subsumed by the catalog,
where most of it was already duplicated anyway.
This also introduces the concept of a system database: a database that the
user cannot modify and that is populated automatically on startup.
Change-Id: Iaa3f84dad0a1a57691f5c7d8df7305faf01d70ed
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1386
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1577
The overall goal of this change is to allow table metadata to be loaded in the background,
but also to allow prioritization of loading on an as-needed basis. As part of analysis,
any tables that are not loaded are tracked, and if analysis fails the Impalad will make
an RPC to the CatalogServer to request that the metadata loading of these tables be
prioritized, and analysis will be restarted.
To support this, the CatalogServer now has a deque of the tables to load. For
background loading, tables to load are added to the tail of the deque. However, a new
CatalogServer RPC was added that can prioritize the loading of one or more tables in
which case they will get added to the head of the deque. The next table to load is
always taken from the head. This helps prioritize loading but is admittedly not the most
fair approach.
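A sketch of the deque discipline (hypothetical names, using a standard
LinkedBlockingDeque):

  import java.util.concurrent.LinkedBlockingDeque;

  // Background loading appends to the tail, the prioritization RPC pushes
  // to the head, and the loader threads always take from the head.
  public final class TableLoadQueue {
    private final LinkedBlockingDeque<String> deque = new LinkedBlockingDeque<>();

    void enqueueBackground(String tableName) { deque.addLast(tableName); }

    // Called by the new CatalogServer RPC for tables analysis is waiting on.
    void prioritize(String tableName) {
      deque.remove(tableName);  // avoid queueing the same table twice
      deque.addFirst(tableName);
    }

    String nextTableToLoad() throws InterruptedException {
      return deque.takeFirst();  // blocks until a table is queued
    }
  }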
To support the prioritized loading, some changes had to be made on the Impalad side during
analysis:
- During analysis, any tables that are missing metadata are tracked.
- Analysis now runs in a loop. If it fails due to an AnalysisException AND at least one
table/view was missing metadata, the missing tables are requested to be loaded by
calling the CatalogServer.
- The impalad waits until the required tables are received (it is notified each time
there is a call to updateCatalog()) and defers re-running analysis until all tables
are available. Once they are, analysis restarts.
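The loop, in sketch form (hypothetical interfaces, not the actual Impalad code):

  import java.util.Set;

  public final class AnalysisLoop {
    interface Frontend {
      void analyze(String stmt) throws AnalysisException;
      Set<String> getMissingTables();
      void prioritizeLoad(Set<String> tables);        // RPC to the CatalogServer
      void waitForCatalogUpdate(Set<String> tables);  // notified via updateCatalog()
    }
    static final class AnalysisException extends Exception {}

    static void analyzeWithRetry(Frontend fe, String stmt) throws AnalysisException {
      while (true) {
        try {
          fe.analyze(stmt);
          return;
        } catch (AnalysisException e) {
          Set<String> missing = fe.getMissingTables();
          if (missing.isEmpty()) throw e;  // a real analysis error
          fe.prioritizeLoad(missing);
          fe.waitForCatalogUpdate(missing);
          // loop: re-run analysis now that the metadata is available
        }
      }
    }
  }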
This change also introduces two new flags:
--load_catalog_in_background (bool). When this is true (the default), the catalog server
runs a periodic background thread that queues all unloaded tables for loading. This is
generally the desired behavior, but there may be some cases (very large metastores) where
it needs to be disabled.
--num_metadata_loading_threads (int32). The number of threads to use when loading catalog
metadata (degree of parallelism). The default is 16, but it can be increased to improve
performance at the cost of stressing the Hive metastore/HDFS.
Change-Id: Ib94dbbf66ffcffea8c490f50f5c04d19fb2078ad
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1476
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1538
This change adds support for lazy loading of table metadata to the
CatalogService/Impalad. The way this works is that the CatalogService initially
sends out an update with only the databases and table names (wrapped as
IncompleteTables). When an Impalad encounters one of these tables, it will contact
the catalog service to get the metadata, possibly triggering a metadata load if the
catalog server has not yet loaded this table.
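In sketch form (toy types; IncompleteTable is the wrapper named above, the
rest is hypothetical):

  // A catalog update only carries an IncompleteTable stub; the first
  // statement that touches it fetches the real metadata from the catalog
  // service, which may in turn trigger a metastore load.
  public final class LazyCatalogSketch {
    interface Table {}
    static final class IncompleteTable implements Table {}  // name only, no metadata
    interface CatalogServiceClient { Table getOrLoadTable(String db, String tbl); }

    private final CatalogServiceClient client;
    LazyCatalogSketch(CatalogServiceClient client) { this.client = client; }

    Table resolve(String db, String tbl, Table cached) {
      if (!(cached instanceof IncompleteTable)) return cached;  // already loaded
      return client.getOrLoadTable(db, tbl);  // may trigger a metastore load
    }
  }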
With these changes the catalog server starts up in just seconds, even for large
metastores since it only needs to call into the metastore to get the list of tables
and databases. The performance of "invalidate metadata" also improves for the same reason.
I also picked up the catalog cleanup patch I had pending, to make the APIs a bit more
consistent and to remove the need for using a LoadingCache for databases.
This also fixes up the FE tests to run in a more realistic fashion. The FE tests now run
against catalog objects received from the catalog server. This actually turned up some bugs
in our previous test configuration where we were not running with the correct column stats
(we were always running with avgSerializedSize = slotSize). This changed some plans so the
planner tests needed to be updated.
Still TODO:
This does not include the changes to perform background metadata loading. I will send
that out as a separate patch on top of this.
Change-Id: Ied16f8a7f3a3393e89d6bfea78f0ba708d0ddd0e
Saving changes
Change-Id: I48c34408826b7396004177f5fc61a9523e664acc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1328
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1338
Tested-by: Lenni Kuff <lskuff@cloudera.com>
There are now 4 explain levels summarized as follows:
- Level 0: MINIMAL
Non-fragmented parallel plan only showing plan nodes with minimal attributes
- Level 1: STANDARD
Non-fragmented parallel plan with some details in plan nodes
- Level 2: EXTENDED
Non-fragmented parallel plan with full details in plan nodes including
the table/column stats, row size, #hosts, cardinality,
and estimated per-host memory requirement
- Level 3: VERBOSE
Fragmented parallel plan with full details (like level 2)
This patch also includes several bugfixes related to plan costing and/or
testing of explain plans.
Change-Id: I622310f01d1b3d53ea1031adaf3b3ffdd94eba30
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1211
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
This patch fixes a few row key issues:
1. We used to assert that the row key filter must be a string literal.
However, it can also be a constant function. We need to eval the expr
and then use the result as the start/stop key.
2. Cast(row_key as int) simply failed.
Such an expr should not be transformed into a start/stop key.
3. We used to assert that lower bound < upper bound.
This query:
select * from tbl where row_key > 'b' and row_key < 'a'
would trip the assert. Instead, we should return no rows.
4. Handle NULL predicates.
An HBase row key can't be null, so if either the upper or lower bound is null,
we don't need to return any rows.
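Sketched with string keys (toy types; fix 2 simply means such exprs are never
turned into a range, so it does not appear here, and unbounded sides are
omitted for brevity):

  public final class RowKeyRange {
    final String startKey, stopKey;
    private RowKeyRange(String start, String stop) { startKey = start; stopKey = stop; }

    static final RowKeyRange EMPTY = new RowKeyRange(null, null);  // scan nothing

    // lower/upper are the results of evaluating the constant bound exprs
    // (fix 1), or null if a bound evaluated to NULL (fix 4).
    static RowKeyRange create(String lower, String upper) {
      if (lower == null || upper == null) return EMPTY;  // row keys can't be NULL
      if (lower.compareTo(upper) >= 0) return EMPTY;     // e.g. > 'b' AND < 'a' (fix 3)
      return new RowKeyRange(lower, upper);
    }
  }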
Change-Id: Ia03590a862888b377bf1f48bcb838b99193fa241
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1180
Reviewed-by: Alan Choi <alan@cloudera.com>
Tested-by: jenkins
Cross joins should be handled like outer joins in the join order
optimization in that the right table referenced by a cross join may not
be reordered anywhere before tables referenced to the left of the cross
join. If there are inner joins to the right of the cross join, those
tables may be reordered before the cross join.
E.g., if we have A JOIN B CROSS JOIN C JOIN D, then C must come after A
and B, but D may be reordered to come before C.
Also adds test cases for join order optimization and predicate propagation.
Change-Id: I6b1022dd3e862efbff81e283b43284d846c8eca4
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1096
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Impala now exploits existing partitioning of the lhs and/or rhs of joins. Existing
partitioning affects the cost of choosing between a partitioned and a broadcast join,
as well as the final plan, because Impala can avoid repartitioning the data.
An existing data partition is exploitable iff the lhs/rhs input partition is equivalent
to the target partition of the join. This matching procedure considers equivalence classes.
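The matching test, sketched (hypothetical types):

  import java.util.List;
  import java.util.function.ToIntFunction;

  // An existing input partition is exploitable iff its exprs are pairwise
  // equivalent to the join's target partition exprs, where two exprs are
  // equivalent when they map to the same equivalence class.
  public final class PartitionMatching {
    interface Expr {}

    static boolean isExploitable(List<Expr> inputPartition, List<Expr> targetPartition,
        ToIntFunction<Expr> equivClassOf) {
      if (inputPartition.size() != targetPartition.size()) return false;
      for (int i = 0; i < inputPartition.size(); ++i) {
        if (equivClassOf.applyAsInt(inputPartition.get(i))
            != equivClassOf.applyAsInt(targetPartition.get(i))) {
          return false;  // would require repartitioning
        }
      }
      return true;  // the exchange can be skipped
    }
  }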
Change-Id: Ica080f35cf5063bea828963bc234ba69797e2030
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1070
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Adds the TPCDS queries as planner tests and fixes a few small issues
with the Planner test file parser. This adds the TPC-DS queries using
SQL-92 style joins that have a hand-optimized (although
not perfect) join order.
Change-Id: I2d81e66af740b2d826b8ebd0c5ba8553b5faf0a2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1019
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Adds a CROSS JOIN (cartesian product). Common join code is moved into
a new abstract base class BlockingJoinNode. We must keep all build RowBatches in
memory in order to iterate over them for every row from the left child. The
TupleRowList provides a convenient way to iterate over all of the rows.
A future change will address codegen for the CrossJoinNode.
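In sketch form (toy row types):

  import java.util.ArrayList;
  import java.util.List;

  // The whole build side stays buffered in memory and is re-scanned once
  // per row of the left child, yielding the cartesian product.
  public final class CrossJoinSketch {
    static <L, R> List<Object[]> crossJoin(Iterable<L> leftRows, List<R> buildRows) {
      List<Object[]> out = new ArrayList<>();
      for (L left : leftRows) {
        // This per-probe-row scan over the buffered build rows is what the
        // TupleRowList makes convenient in the real node.
        for (R right : buildRows) out.add(new Object[] { left, right });
      }
      return out;
    }
  }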
Change-Id: I5e0caa6fb4ec802a9c87e700f9dd6238cea8cdf2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/970
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Introduces STRAIGHT_JOIN keyword to prevent join order optimization.
Structural changes to the planning framework:
- slot materialization: the decision whether to materialize a slot now happens *prior* to
plan generation. This is needed in order to be able to generate accurate cost estimates
at plan generation time. see QueryStmt.materializeRequiredSlots()
- added PlanNode.init(), which initializes the entire state of a PlanNode; this subsumes
finalize()
* computeMemLayout() now happens per-tuple in the corresponding ScanNode's init()
* init() calls computeStats() by default; also marks slots as materialized and calls
TupleDescriptor.computeMemLayout()
- added PlanNode.tblRefIds_
- restructured UnionStmt and union plan generation to fit pred propagation model:
all tuples are created (and equiv predicates registered) prior to plan generation
- added Expr.isAuxExpr
Change-Id: I475c1645bfca9e84ae6e5f529e7781d9532e5c9a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/955
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Updates our compute stats script to execute using Impala. This allows us
to easily compute stats on all tables in a database or all tables in the
metastore.
The updated stats caused one of the TPCH plans to change so this also
updates the TPCH planner test results.
Change-Id: I17e5dcd1036a35e40eb4eb2c8e4a20702db9049c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1024
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Adds support for skipping a number of rows with an ORDER BY clause and a LIMIT. Hive
does not support OFFSET so creating a view with an OFFSET will not work in Hive.
For example, "SELECT * FROM T1 ORDER BY ID LIMIT 20 OFFSET 5" will do the sorting, skip
5 rows, then return the next 20. OFFSET requires an ORDER BY clause.
Note this is not very efficient as we must actually keep (limit+offset) rows in memory
in the topn-node, and all child sort nodes must as well. Users should be careful when
using this feature.
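Sketched with a max-heap of size limit + offset (toy code, not the actual
topn-node):

  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;
  import java.util.PriorityQueue;

  // Buffer the (limit + offset) smallest rows in a max-heap, sort them,
  // then discard the first offset rows.
  public final class TopNWithOffset {
    static <T> List<T> evaluate(Iterable<T> rows, Comparator<T> order,
        int limit, int offset) {
      int capacity = limit + offset;  // this is why OFFSET costs memory
      PriorityQueue<T> heap = new PriorityQueue<>(order.reversed());
      for (T row : rows) {
        heap.add(row);
        if (heap.size() > capacity) heap.poll();  // evict the current largest
      }
      List<T> sorted = new ArrayList<>(heap);
      sorted.sort(order);
      // Skip offset rows; at most limit rows remain.
      return sorted.subList(Math.min(offset, sorted.size()), sorted.size());
    }
  }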
Change-Id: I4d7021c278296e7bdbfa0e6f2699cd6f23eef59d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/900
Tested-by: jenkins
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
This patch redoes how the aggregation node is implemented. The functionality is
now split between aggregation-node, agg-expr and aggregate-functions. This is a work in
progress (there's still a lot of debug stuff I added that needs to be cleaned up), but
it does pass the tests.
Aggregation-node is now very simple and only deals with the grouping part.
Aggregate-expr serves as the glue between the agg node and the aggregate functions.
The aggregation functions are implemented with the UDA interface. I've reimplemented
our existing aggregate functions with this setup. For true UDAs, the binaries would be
loaded in aggregate-expr.
This also includes some preliminary changes in the FE. We now need to annotate each
AggNode as executing the update vs. merge phase (root aggs execute update, others
execute merge) and if it needs a finalize step (only the root does). This is more
general than our builtins which are too simple to need this structure.
There is a big TODO here to allow the intermediate types between agg nodes to change.
For example, in distinct estimate, the input type is the column type and the output type
is a bigint. We'd like the intermediate type to be CHAR(256). This is different since
currently, the intermediate type and output type have always been the same. We've hacked
around this by having both the intermediate and output type be TYPE_STRING. I've left
this for another patch (changing the BE to support this is trivial).
For aggregates that result in strings, we used to store some additional stuff past the
end of the tuple. The layout was:
<tuple> <length of 1st string buffer>,<length of 2nd string buffer>, etc
The rationale for this is that we want to reuse the buffer for min/max and grow the buffer
more quickly for group_concat. This breaks down the abstraction between agg-expr and
agg-node and is not something UDAs can use in general. Rather than try to hack around
this, I think the proper solution is for the intermediate type to not be StringValue and
to contain the buffer length itself.
This patch also resurrects the distinct estimate code. The distinct estimate functions
exercise all of the code paths.
Change-Id: Ic152a2cd03bc1713967673681e1e6204dcd80346
Reviewed-on: http://gerrit.ent.cloudera.com:8080/564
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
Fixed cost estimation of union queries and exchange nodes.
Fixed propagation of stats through cloning of exprs and plan nodes.
Fixed propagation of expr stats to slots they are materialized into (e.g., grouping columns in multi-level aggs).
Improved explain output for constant selects.
Change-Id: I96d1652c00d48e4093b85ae7fc8bad28d74b8b81
Reviewed-on: http://gerrit.ent.cloudera.com:8080/547
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>