impala

mirror of https://github.com/apache/impala.git synced 2026-01-05 12:01:11 -05:00

Author	SHA1	Message	Date
Jim Apple	d70ffa455d	IMPALA-3450: LIMITs on plan nodes are reflected in cardinality estimates PlanNode includes a 'capAtLimit()' method that can be used in 'computeStats()' on PlanNodes to ensure they do not estimate their cardinality to be more than a pushed-down LIMIT clause. This patch ensures that 'capAtLimit()' is used in all of the relevant classes descending from PlanNode. Change-Id: Ic06dcb93bbb2510c0d40151302bd817ef340b825 Reviewed-on: http://gerrit.cloudera.org:8080/3127 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-05-24 14:40:52 -07:00
Dimitris Tsirogiannis	fa30a0c818	IMPALA-3574: Handle runtime filters with TupleIsNull predicates This commit fixes an issue where an IllegalStateException is thrown while generating runtime filters if a target expr of a join conjunct is wrapped in an IF(TupleIsNull, NULL, e) expr. As this is not a valid expr to be assigned to a scan node (target of a runtime filter), we unwrap these exprs and replace exprs of the form IF(TupleIsNull, NULL, e) with 'e' while producing the target exprs for runtime filters. The original expr of the join conjunct is not modified. Change-Id: I2e3e207b4c8522283a1cd0d14be83d42eba58f5a Reviewed-on: http://gerrit.cloudera.org:8080/3147 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-05-23 08:40:20 -07:00
Dimitris Tsirogiannis	f992dc7f88	IMPALA-2956: Filters should be able to target multiple scan nodes With this commit runtime filters can be assigned to multiple destination nodes (scans). For each filter, the destination nodes are determined using equivalent classes during planning. For each filter, all its destination nodes are in the left subtree rooted at the join node that constructs this filter. A runtime filter may have both local and remote targets. The backend determines how to route each filter depending on the number and type (local, remote) of its destination nodes. With this commit, we enable runtime filter propagation in all the operands of UNION [ALL\|DISTINCT] nodes. Change-Id: Iad2ce4e579a30616c469312a4e658140d317507b Reviewed-on: http://gerrit.cloudera.org:8080/2932 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-05-18 01:40:22 -07:00
Taras Bobrovytsky	46c3e43edb	IMPALA-3232: Allow not-exists uncorrelated subqueries Before this patch, correlated exists and not exists subqueries were rewritten as left semi and anti joins respectively. Uncorrelated exists subqueries were rewritten as cross joins, and uncorrelated not-exists subqueries were not supported at all. This patch takes advantage of the nested loop join that was recently introduced, which allows us to rewrite both correlated and uncorrelated exists subqueries as left semi joins and both correlated and uncorrelated not-exists subqueries as anti joins. Change-Id: I52ae12f116d026190f3a2a7575cda855317d11e8 Reviewed-on: http://gerrit.cloudera.org:8080/2792 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 23:06:36 -07:00
Marcel Kornacker	3b7d5b7c17	MT: Planner for multi-threaded execution New classes: - ParallelPlanner: creates build plans, assigns plans to cohorts - JoinBuildSink: DataSink for plan fragments that materialize build sides - ids for plans, hash tables, plan fragments Tests: this adds a new test file section PARALLELPLANS and augments the tpc-h/-ds tests with those sections. In the interest of keeping this patch small I didn't augment other test files with that section yet (which will happen at a later date, to cover more corner cases). Change-Id: Ic3c34dd3f9190a131e6f03d901b4bfcd164a5174 Reviewed-on: http://gerrit.cloudera.org:8080/2846 Tested-by: Internal Jenkins Reviewed-by: Marcel Kornacker <marcel@cloudera.com>	2016-05-12 14:17:56 -07:00
Thomas Tauber-Marshall	8c2bf9769a	IMPALA-2805: Order conjuncts based on selectivity and cost Added costs to all Exprs, which estimate the relative cost of evaluating an expression and all of its children. Costs are calculated during analysis. For now, these costs are intended as a simple way to order expressions from cheap to expensive, not necessarily to be a precise reflection of running times. In general, expressions that deal with variable length types like strings will have higher cost than those dealing with fixed length types like numbers and booleans. Additionally, expressions with complicated subexpressions will have higher cost than simpler expressions. Also added PlanNode.orderConjunctsByCost, which takes a list of Exprs and returns a new list sorted according to an estimate of the cheapest order to evaluate the conjuncts in, based on their cost and selectivity. The conjuncts are sorted by repeatedly iterating over them and choosing the conjunct that would result in the least total estimated work were it to be applied before the remaining conjuncts. Selectivities are exponentially backed off, and Exprs without selectivity estimates are given a reasonable default. Change-Id: I02279a26fbc6308ac5eb819d78345fc010469034 Reviewed-on: http://gerrit.cloudera.org:8080/2598 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:53 -07:00
Todd Lipcon	4bdd0b976d	IMPALA-3148. Fix selectivity computation for pushed Kudu predicates This follows up on a TODO from the Kudu merge and also fixes a bug: IMPALA-976 changed the computation of selectivities for a combined list of conjuncts to better handle expressions with no selectivity estimate. The Kudu implementation was forked from before this change and thus did not have an equivalent change. This refactors the algorithm to a new static method and calls it from both PlanNode and KuduScanNode so that the selectivity estimate behavior is the same regardless of whether Kudu can evaluate the predicate server-side. Todd tested this on TPCH 3TB and verified that the plans are reasonable now where they used to be nonsense. Change-Id: Id507077b577ed5804fc80517f33ea185f2bff41a Reviewed-on: http://gerrit.cloudera.org:8080/2628 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-04-12 14:03:44 -07:00
Bharath Vissapragada	5cd7ada727	IMPALA-3194: Allow queries materializing scalar type columns in RC/sequence files This commit unblocks queries materializing only scalar typed columns on tables backed by RC/sequence files containing complex typed columns. This worked prior to 2.3.0 release. Change-Id: I3a89b211bdc01f7e07497e293fafd75ccf0500fe Reviewed-on: http://gerrit.cloudera.org:8080/2580 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-31 12:06:57 +00:00
Tim Armstrong	f5b7842414	IMPALA-2502: don't redundantly repartition grouping aggregations Grouping aggregations previously always repartitioned their input, even if preceding joins or aggs had already partitioned the data on the required key (or an equivalent key). This patch checks to see if data is already partitioned on the required exprs (or equivalent ones), and if so skips the preaggregation and only does a merge aggregation. The patch also does some refactoring of the aggregation planning in DistributedPlanner to make it easier to implement the change. Includes planner tests for the three cases that are affected: grouping aggregations, non-grouping distinct aggregations and grouping distinct aggregations. Change-Id: Iffdcfd3629b8a69bd23915e1adba3b8323cbbaef Reviewed-on: http://gerrit.cloudera.org:8080/2414 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-03-15 09:21:22 +00:00
David Alves	7381304a23	Merge branch 'feature/kudu' into cdh5-trunk This is the final merge commit that merges the 'feature/kudu' branch into cdh5-trunk. Change-Id: Ib3dfb4fc7a69c5cb1c5789422ee52fa192ed677a	2016-03-13 19:28:43 -07:00
David Alves	82222abaf5	Merge branch 'feature/kudu' into cdh5-trunk This merges the 'feature/kudu' branch with cdh5-trunk as of commit: 055500cc753f87f6d1c70627321fcc825044e183 This patch is not a pure merge patch in the sense that it goes beyond conflict resolution to also address reviews to the 'feature/kudu' branch as a whole. The review items and their resolution can be inspected at: http://gerrit.cloudera.org:8080/#/c/1403/ Change-Id: I6dd4270cd17a4f5c02811c343726db3504275a92	2016-03-11 11:37:58 -08:00
Alex Behm	4a25f87d5c	Improve the SQL for nested TPCH-Q18. Marcel spotted that nested TPCH-Q18 can be expressed with more efficient SQL. Results on nested TPCH-300: Before 160s After 100s Change-Id: I8b351b7f467e8bef0c256dc43cea325d7f177edf Reviewed-on: http://gerrit.cloudera.org:8080/2418 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-04 04:35:54 +00:00
Alex Behm	54a46e9459	IMPALA-3065/IMPALA-3062: Restrict !empty() predicates to scan nodes. The bug: Evaluating !empty() predicates at non-scan nodes interacts poorly with our BE projection of collection slots. For example, rows could incorrectly be filtered if a !empty() predicate is assigned to a plan node that comes after the unnest of the collection that also performs the projection. The fix: This patch reworks the generation of !empty() predicates introduced in IMPALA-2663 for correctness purposes. The predicates are generated in cases where we can ensure that they will be assigned only by the parent scan, and no other plan node. The conditions are as follows: - collection table ref is relative and non-correlated - collection table ref represents the rhs of an inner/cross/semi join - collection table ref's parent tuple is not outer joined Change-Id: Ie975ce139a103285c4e9f93c59ce1f1d2aa71767 Reviewed-on: http://gerrit.cloudera.org:8080/2399 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Silvius Rus <srus@cloudera.com> Tested-by: Internal Jenkins	2016-03-02 23:23:05 -08:00
Alex Behm	a303f25256	IMPALA-3071: Fix assignment of On-clause predicates belonging to an inner join. The bug: On-clause predicates belonging to an inner join were not always assigned correctly if they referenced an outer-joined tuple. Specifically, our logic for detecting whether a predicate can be assigned below an outer join if also left at the outer-join node was not correct, and so we assigned the predicate below the join, but did not also leave it at the outer join. The fix: Assign an inner join On-clause conjunct that references an outer-joined tuple to the join that the On-clause belongs to. Change-Id: Iffef7718679d48f866fa90fd3257f182cbb385ae Reviewed-on: http://gerrit.cloudera.org:8080/2309 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-29 22:22:41 -08:00
Dimitris Tsirogiannis	2c37d99fed	IMPALA-3089: Perform static partition pruning in the FE with disjunctive BETWEEN predicates This commit fixes an issue where the slow path is employed during static partition pruning for disjunctive BETWEEN predicates, introducing significant latency during planning, especially for tables with a large number of partitions. Change-Id: I66ef566fa176a859d126d49152921a176a491b0a Reviewed-on: http://gerrit.cloudera.org:8080/2320 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-02-26 15:37:24 -08:00
Dimitris Tsirogiannis	197eb43477	IMPALA-3074: AnalysisError when runtime filter has incompatible source and target exprs This commit fixes an issue where an AnalysisError is thrown when a runtime filter has incompatible source and target exprs. This is triggered when a runtime filter has multiple candidate target scan nodes not all of which produce a target expr which is cast-compatible with the source expr. Change-Id: I544c8fc66915f684ba24d20de525563638c4039d Reviewed-on: http://gerrit.cloudera.org:8080/2307 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-02-24 19:54:40 -08:00
Dimitris Tsirogiannis	d3b92b0d9f	IMPALA-3039: Restrict the number of runtime filters generated This commit adds a query option, MAX_NUM_RUNTIME_FILTERS, to restrict the number of runtime filters generated per query. If more than MAX_NUM_RUNTIME_FILTERS are generated, the runtime filters are sorted by the selectivity of the associated source join nodes and the MAX_NUM_RUNTIME_FILTERS most selective filters are applied. Also with this commit, non-selective filters are automatically discarded, irrespective of the value of MAX_NUM_RUNTIME_FILTERS. Change-Id: Ifd41ef6919a6d2b283a8801861a7179c96ed87c6 Reviewed-on: http://gerrit.cloudera.org:8080/2262 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-02-24 19:54:40 -08:00
Alex Behm	2c8f41b7d4	IMPALA-2832: Fix cloning of FunctionCallExpr. The bug was that we were not properly cloning the params of a FunctionCallExpr. In a CTAS we analyze the underlying query stmt twice, the first time on a clone of the original stmt. The problem was that the first analysis affected the second analysis due to an improper clone, leading to missing slots in a scan because the corresponding SlotRefs were already analyzed. Change-Id: I0025c0ee54b2f2cb3ba470b26a9de5aa5a3a3ade Reviewed-on: http://gerrit.cloudera.org:8080/2291 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-24 13:31:00 -08:00
Alex Behm	a99e17457b	Fix a non-deterministic test in complex-types-file-formats.test. Change-Id: I98cc3045a6a6131dba8b0a475d5d51de7bdba455 Reviewed-on: http://gerrit.cloudera.org:8080/2268 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Internal Jenkins	2016-02-22 20:16:24 -08:00
Alex Behm	c6fd5a0fe4	IMPALA-2844: Allow count(*) on RC files with complex types. This patch also fixes the incorrect error message reported in the JIRA. Change-Id: I2c7b732767d154c36bc7189df5177d27a35d0d7b Reviewed-on: http://gerrit.cloudera.org:8080/2267 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-22 20:16:24 -08:00
Henry Robinson	2212240106	IMPALA-2552: Runtime filter forwarding between joins and scans This patch adds the ability for operators to compute and forward bitmap filters from one operator to another, across fragment and machine boundaries. Filters are provided as part of the plan from the frontend. In this patch hash join nodes produce filters from their build side input, and propagate them, via the query's coordinator, to the scan nodes which provide the probe side for that join. The scan nodes may then filter their rows before they are sent to the join, reducing the amount of work the join has to do. Filters are attached to the local RuntimeState's RuntimeFilterBank by the join node. When complete, they are asynchronously sent to the coordinator via a new UpdateFilter() RPC. The coordinator maintains a routing table that maps incoming filters to their recipient backends. For partitioned joins, the filters must be aggregated from all providers. The coordinator performs this aggregation and transmits the completed filter only when all inputs have been received. In this patch, filtering can occur in up to four places in a scan: 1. Before initial scan ranges are issued (all file formats, partition columns only) 2. Before each scan range is processed (all file formats, partition columns only) 3. Before a row group is processed (Parquet, partition columns only) 4. During assembly of every row (Parquet, any column) This patch also replaces the existing bitmap-based filters with Bloom Filter based ones. The Bloom Filters are statically sized to have an expected false positive rate of 0.1 on 2^20 distinct items. This yields Bloom Filters of 1MB in size. This is configurable by setting --bloom_filter_size, and we will perform tests to determine a good default. The query option RUNTIME_BLOOM_FILTER_SIZE can override the command-line flag on a per-query basis. This patch also simplifies and improves the memory handling for allocated filters by the RuntimeFilterBank. 
New filters are tracked through the query memory tracker, and owned by the fragment instance's RuntimeState::obj_pool(). This patch also adds a simple heuristic to disable filter creation based on estimated false-positive rate for the Bloom Filter. By default the maximum FP rate is set to 75%. It can be controlled by setting --max_filter_error_rate. Finally, this patch adds short-circuit publication for filters that are broadcast, and does so always even when distributed runtime filter propagation is disabled. To avoid cross-compilation problems, bloom-filter.h was rewritten in C++98. Change-Id: Icea03a87cf1705c1b4aa46f86f13141c4b58da10 Reviewed-on: http://gerrit.cloudera.org:8080/1861 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Henry Robinson <henry@cloudera.com>	2016-02-13 16:19:41 +00:00
Alex Behm	d7ee6fa7a4	IMPALA-2663: Filter out tuples with empty collections in scan. We now generate predicates for filtering out empty collections directly in the parent scan that materializes the collections. This optimization is conservatively applied only for uncorrelated relative table references because that makes it safe/easy to determine the join type (the optimization is incorrect for outer and anti joins). The change provides a substantial improvement for queries that have selective predicates on nested collections, or for data sets that naturally have many empty collections. The performance improvement comes from: (1) The new predicates are assigned to a scan, so we get multi-threading. (2) We avoid expensive subplan iterations for collections that would yield an empty subplan result anyway. Performance measurements on 10-node using nested TPCH-300 on some of the queries originally mentioned in the JIRA: TPCH-Q12, 10x speedup Before: 111s After: 11s TPCH-Q7 Before: 205s After: 128s TPCH-Q5 Before: 48s After: 40s TPCH-Q3 Before: 18s After: 14s The following microbenchmark query designed to highlight the improvement also gets a ~10x speedup. select c_custkey, o_orderkey from customer c, c.c_orders where o_orderkey = 1884930 Before: 11.3s After: 1.8s Change-Id: I0d0dc90442a61d62cc8f7dad186560490b62441a Reviewed-on: http://gerrit.cloudera.org:8080/2118 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-12 00:23:43 +00:00
Dimitris Tsirogiannis	c943d6ab7d	IMPALA-2552: Add support for runtime filter propagation (FE) This commit adds support for runtime filter propagation in the frontend. During planning, the frontend computes a set of filters that are constructed by join operators and are applied at scan operators in order to filter scanned tuples or scan ranges. The filters are identified from equi-join predicates by traversing the single-node plan tree in a top-down fashion. A query option, termed enable_runtime_filter_propagation, is added to enable/disable runtime filter propagation (disabled by default). When runtime filter propagation is enabled, the output of EXPLAIN is modified to include information about the runtime filters that are constructed/applied. Also, an event is added to the query timeline to track the time spent in the planner while computing runtime filters. Testing: Functional planner tests are added. Change-Id: Id79a38313051d95da32c897b176a40d26b0dda1d Reviewed-on: http://gerrit.cloudera.org:8080/1532 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Henry Robinson <henry@cloudera.com>	2016-02-12 00:11:45 +00:00
Tim Armstrong	2c2670e389	IMPALA-1305: streaming pre-aggregations Aggregations are implemented as a distributed pre-aggregation, an exchange, then a final aggregation that produces the results of the aggregation. In many cases the pre-aggregation significantly reduces the amount of data to be exchanged. However, in other cases, the preaggregation does not greatly reduce the amount of data exchanged or can use a lot of memory and starve other operators that would benefit more from the additional memory. In these cases we would be better off "passing through" some input tuples by transforming them into intermediate tuples without aggregating them. This patch adds a streaming pre-aggregation mode to PartitionedAggregationNode that tries to aggregate input rows with a hash table, but can switch to passing through the input tuples (after transforming them into the appropriate tuple format). It does this if it hits a memory limit or if the aggregation is not sufficiently reducing the node's output (specifically, if the number of aggregated rows in the hash table is more than half the number of unaggregated rows consumed by the pre-aggregation). Pre-aggregations never need to spill because they can pass through rows when under memory pressure. This initial implementation is quite conservative: it retains the partitioning of the previous implementation because switching to a single partition proved to regress performance of some queries while improving others. It also always keeps hash tables around and updates them with matching input rows so that reduction statistics are updated and early decisions to pass through data can be reversed. Future work could explore different approaches within the new framework to get larger performance gains. Currently we see significant performance benefits for queries with a very low reduction factor, e.g. group by on a nearly unique column Includes codegen support for the passthrough streaming. 
Adds a query option, disable_streaming_preaggregations, in case a user wants to revert to the old behaviour. Adds TPC-H tests to exercise the new passthrough code path and updates planner tests to include the new [STREAMING] detail added by the planner. Change-Id: Ia40525340cba89a8c4e70164ae11447e96494664 Reviewed-on: http://gerrit.cloudera.org:8080/1698 Tested-by: Internal Jenkins Reviewed-by: Dan Hecht <dhecht@cloudera.com>	2016-02-11 19:03:51 +00:00
Anuj Phadke	d787e1e3a7	IMPALA-2425: Broadcast join hint not enforced when low memory limit is set. Broadcast joins are disabled if the size of the rhs hash table exceeds the per node mem_limit. This change forces a broadcast join if the broadcast join hint is enforced. Change-Id: Iff9bd4d01736c48e52306ac79f74ab6ef0938f2a Reviewed-on: http://gerrit.cloudera.org:8080/1967 Reviewed-by: Huaisi Xu <hxu@cloudera.com> Tested-by: Internal Jenkins	2016-02-10 11:30:19 +00:00
Alex Behm	ecf46a5af8	IMPALA-976: Improvements to scan and join cardinality estimates. 1. Improved join cardinality estimation. For each equi join predicate we try to determine whether it is a foreign/primary key (FK/PK) join condition, and either use a special FK/PK estimation or a generic estimation method. We maintain the minimum cardinality for each method separately, and finally return in order of preference: - the FK/PK estimate, if there was at least one FK/PK predicate - the generic estimate, if there was at least one predicate with sufficient stats - otherwise, we optimistically assume a FK/PK join with a join selectivity of 1, and return the left-hand side cardinality 2. More robust handling of conjuncts with unknown selectivities, and conjuncts that are not independent. Uses exponential backoff. 3. More accurate broadcast vs. partitioned join cost estimation. We now account for the 4 byte per-tuple overhead when serializing rows over an exchange. This change is especially helpful in cases where one side of the join has no materialized slots, i.e., it has a row size of 0, and an exchange used to appear free. We are obviously not done with improving join cardinality estimates. This patch is merely a step in the right direction, in particular, the code and behavior are now more explicit and easier to reason about than before, and better reflects the original intent (i.e., fixes the IMPALA-976 bug). Change-Id: I00d8e8230e2844cb807d128d82b35ee78db7d774 Reviewed-on: http://gerrit.cloudera.org:8080/1668 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-06 09:26:46 +00:00
Michael Ho	3d7a4477ee	IMPALA-2948: Fix a bug in the planner when fast partition key scan is enabled When the query option OPTIMIZE_PARTITION_KEY_SCANS is true, we may acquire the partition key values from the metadata and generate a union node containing constant expressions only. There is a bug in the planner when generating the union node as it skips evaluating the constant expressions for unmaterialized slots but union node expects an entry in the constant expression lists for each slot in the tuple descriptor even if the slot is not materialized. This change fixes the problem by inserting dummy null values in the constant expression list for unmaterialized slots and lets the union node filter them out. A test is also added to verify the fix. Change-Id: I9ed49dca0101b96bd9b20e6d1e5b1d56f654e911 Reviewed-on: http://gerrit.cloudera.org:8080/2067 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-06 05:28:28 +00:00
Alex Behm	733d135212	IMPALA-852,IMPALA-2215: Analyze HAVING clause before aggregation. In SelectStmt.analyzeAggregation(), we need to analyze the HAVING clause first so we can check if it contains aggregates. Also, we need to analyze/register it even if we are not computing aggregates. Change-Id: Ieedfb64bf9a8f1390c0231a8b4aa25120ee5542b Reviewed-on: http://gerrit.cloudera.org:8080/2066 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-06 01:31:34 +00:00
Dimitris Tsirogiannis	ccf1f8f73f	IMPALA-2734: Correlated EXISTS subqueries with HAVING clause return wrong results This commit fixes an issue where wrong results are returned if an EXISTS subquery contains a HAVING clause and non-equality correlated binary predicates. This case does not have a valid rewrite as the HAVING clause needs to be applied after the correlated predicates have been evaluated. With this fix, we detect cases like this and throw an AnalysisException. Change-Id: I159f956e2b01f408601829b5d2afcf11d76bedcd Reviewed-on: http://gerrit.cloudera.org:8080/1927 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-02-04 01:06:03 +00:00
Michael Ho	0853ea1a7d	IMPALA-2499: Evaluate a SELECT block using partition metadata. This patch implements an optimization for evaluating table refs with metadata instead of table scans for queries which satisfy the following conditions: (1) All scan slots being materialized are partition columns. (2) The SELECT block only contains aggregate expressions bound by these slots. The aggregate expressions should have distinct semantics such as SELECT MIN(X) from T; or SELECT COUNT(DISTINCT x) FROM T; If there are no aggregate expressions in the SELECT block, the query block must contain grouping expressions. If the above conditions are satisfied, the scan nodes in the plan of the SELECT block will be replaced with union nodes which materialize the partition key values. The query "select min(year), max(month) from functional.alltypes;" went from 440ms on average to 10ms (44x speed up) with this change. The speed-up depends on the number of rows per partition. The following are plans before and after this change for the following query: select month, min(year) from functional.alltypes group by month Before: 01:AGGREGATE [FINALIZE] \| output: min(year) \| group by: month \| 00:SCAN HDFS [functional.alltypes] partitions=24/24 files=24 size=478.45KB After: 01:AGGREGATE [FINALIZE] \| output: min(year) \| group by: month \| 00:UNION constant-operands=24 This optimization is enabled by the query option 'optimize_partition_key_scans'. Note that there are some caveats with this optimization. In particular, the returned values may be inconsistent with that of the conventional plans in the following two cases when this optimization is enabled. 1. If a user deletes a file without doing a refresh, the metadata becomes stale. The conventional plan will return an error in the scan (due to missing files). With this optimization, the partition key values of deleted partitions may be returned. 2. 
With the conventional plan, an empty partition will not be included in the evaluation of the return values. A partition is empty if it either contains (1) no file or (2) the file contains no row. This optimization may return different values in the second case above when there are no rows in the file. Due to the potential inconsistencies above, users need to opt-in for this optimization. Change-Id: I30d4c7dab7610a30773fc60044499c468684dc9a Reviewed-on: http://gerrit.cloudera.org:8080/1638 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Internal Jenkins	2016-01-26 22:06:26 +00:00
Alex Behm	95951a36e8	IMPALA-2539: Unmark collections slots of empty union operands. Change-Id: I401f9b9a5e5457120600a7cb5b54f84adb8477f7 Reviewed-on: http://gerrit.cloudera.org:8080/1895 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-01-26 11:32:40 +00:00
Juan Yu	708ab3c669	IMPALA-2565: Planner tests are flaky due to file size mismatches. Fix flaky test by ignoring file size in explain plan comparison. Change-Id: I38871e5e16a6b60860aed4ea89c108fecdfd60d0 Reviewed-on: http://gerrit.cloudera.org:8080/1767 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-01-23 09:38:40 +00:00
Alex Behm	30b861708f	IMPALA-2790: Exclude non-materialized aggregate exprs from explain plan. Change-Id: I6ce1ebf0c6e0c0a3a5148e2f0a00c95cc680010c Reviewed-on: http://gerrit.cloudera.org:8080/1679 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-01-20 04:27:01 +00:00
Huaisi Xu	b46b0f48ad	IMPALA-2643: Prevent migrating incorrectly inferred identity predicates into inline views. Inferred predicates implicitly referring to the same expr can be migrated into an inline view where the post-substitution predicate is an identity predicate. If such a predicate is assigned to a plan node later, it incorrectly drops NULLs. Change-Id: Id40524b30799d1ac994c0a44efcc1acce4ad1daf Reviewed-on: http://gerrit.cloudera.org:8080/1720 Reviewed-by: Huaisi Xu <hxu@cloudera.com> Tested-by: Internal Jenkins	2016-01-19 02:51:44 +00:00
Jim Apple	1a3d7ffd4f	IMPALA-2147: Support IS [NOT] DISTINCT FROM and "<=>" predicates Enforces that the planner treats IS NOT DISTINCT FROM as eligible for hash joins, but does not find the minimum spanning tree of equivalences for use in optimizing query plans; this is left as future work. Change-Id: I62c5300b1fbd764796116f95efe36573eed4c8d0 Reviewed-on: http://gerrit.cloudera.org:8080/710 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-01-14 05:45:22 +00:00
Alex Behm	f09050bedd	IMPALA-2801: Modify HBase planner test to use a table that has stats computed. The problem was that a recently added planner test was assuming stats were computed for functional_hbase.alltypes, but we purposely do not compute stats for that table. The fix is to use the alltypessmall table instead. The modified test still covers the same issue. Change-Id: I043c485489c7868b4f320048eb383943627f620b Reviewed-on: http://gerrit.cloudera.org:8080/1705 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com>	2016-01-06 02:58:49 +00:00
Bharath Vissapragada	775d464fdc	IMPALA-2765: Preserve return type of subexpressions substituted in isTrueWithNullSlots() This commit fixes the issue where queries with outerjoins and case expressions in predicates can fail with an AnalysisException. This is due to the method isTrueWithNullSlots() not preserving the return types of sub-expressions during substitution with null literals and the resulting predicate can fail to analyze even though the original predicate succeeds. The fix is to preserve the type of each slot in the predicate that we substitute with the null literal. Change-Id: I8cd827b460620355db6fd518464418e701a724f1 Reviewed-on: http://gerrit.cloudera.org:8080/1656 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Internal Jenkins	2015-12-19 09:52:51 +00:00
Alex Behm	aec7ad9050	IMPALA-2144: Add regression test. Bug was fixed in 892abca. Change-Id: I163566ff6141b5d810b00082167a379ce86049bd Reviewed-on: http://gerrit.cloudera.org:8080/1515 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-12-17 23:44:35 +00:00
Michael Giambalvo	cf9d2485dd	Update Impala to HBase 1.0 APIs. Remove all use of deprecated HBase APIs and bring everything up to 1.0. This work is done to enable use of the Google Cloud Bigtable HBase client, which does not support the deprecated APIs. However, nothing in this change depends on Cloud Bigtable, and should work fine for HBase 1.x and greater. This involves two major changes. 1) HBase is trying to move away from unmanaged connections. Thus, catalogd and the backend are updated to create a Connection object which is used to create the Table client objects. Since the Connection object owns a threadpool, there is no need to create a separate ExecutorService and pass it to the HTables on creation. 2) Instead of reading the size on disk of the different Region servers, we use a single call to the HBase master to get the ClusterStatus, which contains the storefile sizes for all the region servers the master knows about. This enables the HBase table implementation to work with Cloud Bigtable, which of course does not keep its data in HDFS. To use the Cloud Bigtable driver, simply update hbase-site.xml to set the "hbase.client.connection.impl" property to the appropriate Cloud Bigtable client implementation class. Further details can be found here: https://cloud.google.com/bigtable/docs/connecting-hbase Tested by running: py.test tests/query_test/test_hbase_queries.py \ --exploration_strategy=exhaustive Change-Id: I6c758502126884670bb6dd3153aea5aa5b41aab6 Reviewed-on: http://gerrit.cloudera.org:8080/775 Readability: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-12-15 04:45:50 +00:00
Alex Behm	096073472f	IMPALA-1459: Fix migration/assignment of On-clause predicates inside inline views. The bug: Our predicate assignment logic used to rely on a flag isWhereClauseConjunct set in Exprs to determine whether assigning a predicate at a certain plan node was correct or not. This flag was intended to distinguish between predicates from the On-clause of a join and other predicates, and the bug was that we used !isWhereClauseConjunct to imply that a predicate originated from an On-clause. For example, predicates migrated into inline views apply to the post-join, post-aggregation, post-analytic result of the inline view, but the existing flag and logic were insufficient to correctly assign predicates coming from the On-clause of an enclosing block. This patch removes the isWhereClauseConjunct flag in favor of an isOnClauseConjunct flag which can more directly capture the originally intended logic. This patch also adds a test to cover the same issue reported in IMPALA-2665. Change-Id: I9ff9086b8a4e0bd2090bf88319cb917245ae73d2 Reviewed-on: http://gerrit.cloudera.org:8080/1453 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-11-20 04:53:28 +00:00
Bharath Vissapragada	084b9b1692	IMPALA-2432: Add query endtime to impalad's lineage This commit adds query endtime to impalad's lineage log entries consumed by navigator. The lineage graph is constructed in the frontend and is then passed to the backend as a serialized thrift object. When the query terminates (includes cancellations and aborts), the backend appends the query endtime ("endTime") to the lineage graph and generates the lineage log entry in JSON format. Change-Id: I2236e98895ae9a159ad6e78b0e18e3622fdc3306 Reviewed-on: http://gerrit.cloudera.org:8080/934 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Internal Jenkins	2015-11-04 08:39:12 +00:00
Martin Grund	65772cf9ce	IMPALA-2527: Disable small-query optimization for collection types When scanning tables with collection types the limit applied to the scan node is not sufficient to safely enable the small-query optimization. This patch adds an additional check to the MaxRowsProcessedVisitor that will abort checking the number of processed rows once a scan node accesses a collection type. Change-Id: Ic43baf3f97acfb8d7b53b0591c215046179d18b3 Reviewed-on: http://gerrit.cloudera.org:8080/1235 Reviewed-by: Silvius Rus <srus@cloudera.com> Tested-by: Internal Jenkins	2015-10-14 17:51:38 -07:00
Skye Wanderman-Milne	cfe1e38d6e	IMPALA-2495: make Expr::IsConstant() recurse on children Before, Expr::IsConstant() manually specified the constant Expr classes, but TupleIsNullPredicate and AnalyticExpr overrode IsConstant() to always return false (which Expr::IsConstant() didn't specify). This meant that unless the TupleIsNullPredicate was the root expr, TupleIsNullPredicate::IsConstant() would never be called and Expr::IsConstant() would return true. This patch changes Expr::IsConstant() to recurse on its children, rather than having it contain the constant logic for all expr types. Change-Id: I756eb945e04c791eff39c33305fe78d957ec29f4 Reviewed-on: http://gerrit.cloudera.org:8080/1214 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Internal Jenkins	2015-10-09 16:47:46 -07:00
Alex Behm	bbaef98281	IMPALA-2445: Preserve chain of table refs until end of computeParentAndSubplanRefs(). The bug: While separating the parent table refs from the subplan refs, we immediately changed the left table link of a chosen ref. However, we rely on the link structure to correctly determine the required table ref ids, so in some cases we missed a required table ref id due to a broken table ref chain. The fix: Preserve the original chain of table refs until the end of computeParentAndSubplanRefs(). TODO: This fix has the unfortunate consequence of making the plan for nested TPCH Q21 worse. It should be possible to change the planning logic to be correct and also get the original better plan for Q21, but this needs some more thought. Change-Id: Ib8dc13c950f7783b62ce6ab7c8a6f534f9a9bb31 Reviewed-on: http://gerrit.cloudera.org:8080/1177 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins Conflicts: testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test	2015-10-09 16:47:00 -07:00
Dimitris Tsirogiannis	38f29b048d	IMPALA-2474: PlannerTest fails due to nested types file size mismatch (part2) With this commit we use regex when comparing the file sizes of table 'tpch_nested_parquet.region' in the PlannerTest. Change-Id: I03fa177c9d36d60bcb5ce7eece8a5a7c98bb7985 Reviewed-on: http://gerrit.cloudera.org:8080/1216 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins Conflicts: testdata/workloads/functional-planner/queries/PlannerTest/nested-collections.test testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test	2015-10-09 16:39:28 -07:00
Dan Hecht	13738e26e6	IMPALA-2474: PlannerTest fails due to nested types file size mismatch For some reason, on each full data load, Hive seems to slightly change the file size of the tpch_nested.customer files. The PlannerTest result comparator already supports regex:, so use that to regex away the file size to get the builds working. Change-Id: If84ac71bc3a309407efa6c597be71f83993c5533 Reviewed-on: http://gerrit.cloudera.org:8080/1148 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2015-10-08 15:16:33 -07:00
Alex Behm	c153d094d4	IMPALA-2478: Unset the expr id of bound conjuncts. The bug: When assigning bound collection conjuncts to scan, we incorrectly marked the source of the bound conjunct as assigned even though the source conjunct must also be evaluated by a join (source conjunct is a where-clause conjunct bound by an outer-joined tuple). The root issue was that a bound conjunct retained the expr id of its source. The fix: Unset the expr id of bound conjuncts to prevent callers from inadvertently marking the source conjunct as assigned. Change-Id: Ica775adfc551d9fc0457a2392c4988cb2eb7de72 Reviewed-on: http://gerrit.cloudera.org:8080/1149 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2015-10-07 14:47:38 -07:00
Dimitris Tsirogiannis	a4d24954b5	IMPALA-2446: Fix wrong predicate assignment in outer joins This commit fixes an issue where a predicate from the WHERE clause that can be evaluated at a join node is incorrectly assigned to that node's join conjuncts even if it is an outer join, thereby causing the join to return wrong results. Change-Id: Ibf83e4e2c7b618532b3635b312a70a2fa12a0286 Reviewed-on: http://gerrit.cloudera.org:8080/1129 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2015-10-06 10:54:10 -07:00
Alex Behm	06c96e4074	IMPALA-2349,IMPALA-2412: Planner fixes to subplan ordering. IMPALA-2349: The bug was that we were not adding the parent tuple ids of a relative or correlated table ref to the list of required tuple ids if that table ref itself depended on a relative table ref (nested subplans). This patch simplifies and fixes the planning with straight_join. The required tuple ids are properly set, and the ordering requirement is enforced by adding the last parent table ref's id to the list of required table ref ids. IMPALA-2412: The bug was that we were relying on both the required materialized tuple ids as well as the table ref ids to determine whether a table ref belongs in a subplan at a certain level. However, as the existing comments in the code actually already state, the subplan placement should be determined only based on whether the required parent tuple ids are materialized. The correct join/subplan ordering is independent, and is handled by the required table ref ids. Change-Id: I922fcbd0039242bf5940534d667926cdbdf72946 Reviewed-on: http://gerrit.cloudera.org:8080/907 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-10-05 11:30:51 -07:00
Dan Hecht	749c19c5fc	Revert "IMPALA-2474: update file sizes in planner tests" This reverts commit 54409d82a42623e156cd775810b9a76fc5ae7407. Change-Id: I47b7a73a7a9741f27b13e8983efc9e3ddf0d67f7 Reviewed-on: http://gerrit.cloudera.org:8080/1147 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Dan Hecht <dhecht@cloudera.com>	2015-10-05 11:30:49 -07:00

311 Commits