impala

mirror of https://github.com/apache/impala.git synced 2025-12-31 15:00:10 -05:00

Author	SHA1	Message	Date
ishaan	2b5df0c6ff	[CDH5] Convert tpch schemas to decimal and change the queries where possible. I used the following document for reference: http://www.tpc.org/tpch/spec/tpch2.1.0.pdf Change-Id: Ic84db0628323c90e89552707f214bbb9fa2f2ae0 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3132 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2014-07-08 14:51:43 -07:00
Alex Behm	21c9eb68b1	Restore casts stripped from grouping exprs by substitution. Change-Id: I2a317025f9a8549beed7cf79b463239e11a6a2d0 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3352 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3432 Reviewed-by: Alex Behm <alex.behm@cloudera.com>	2014-07-08 10:45:43 -07:00
Dimitris Tsirogiannis	630d90392e	CDH-20089: Query planning failed in HdfsScanNode.evalBinaryPredicate This commit fixes issue CDH-20089 where an error is thrown when we have a binary predicate on a partition key that has no values. Change-Id: I3b5cefb4d7193045fc6fc5e94766589c2299b5b1 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3327 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3335	2014-06-30 15:05:31 -07:00
Dimitris Tsirogiannis	5a6f53db16	Add partition pruning tests The following changes are included in this commit: 1. Modified the alltypesagg table to include an additional partition key that has nulls. 2. Added a number of tests in hdfs.test that exercise the partition pruning logic (see IMPALA-887). 3. Modified all the tests that are affected by the change in alltypesagg. Change-Id: I1a769375aaa71273341522eb94490ba5e4c6f00d Reviewed-on: http://gerrit.ent.cloudera.com:8080/2874 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3236	2014-06-24 02:14:27 -07:00
Alex Behm	bf85225911	IMPALA-881: Tests for joins with union inputs. Change-Id: I4be6821ac3938345ca95c542d868c87512ff66da Reviewed-on: http://gerrit.ent.cloudera.com:8080/3229 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-06-23 15:38:06 -07:00
Alex Behm	881f3a8c33	Re-order union operands descending by their estimated per-host memory. Re-order union operands descending by their estimated per-host memory, s.t. parent nodes can gauge the peak memory consumption of a MergeNode after opening it during execution (a MergeNode opens its first operand in Open()). Scan nodes are always ordered last because they can dynamically scale down their memory usage, whereas many other nodes cannot (e.g., joins, aggregations). One goal is to decrease the likelihood of a SortNode parent claiming too much memory in its Open(), possibly causing the mem limit to be hit when subsequent union operands are executed. Change-Id: Ia51caaffd55305ea3dbd2146cd55acc7da67f382 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3146 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com> Reviewed-on: http://gerrit.ent.cloudera.com:8080/3213 Tested-by: jenkins	2014-06-20 18:46:10 -07:00
Alex Behm	70d7ff07af	CDH-19856: Disable Hive's stats autogathering. Change-Id: I04e91f91d29b7863848a750e362c9d94469df7f2 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3156 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3169	2014-06-19 16:48:34 -07:00
Alex Behm	ef6705d7e0	Rename MergeNode to UnionNode. Change-Id: I9e3675a103757db1345b04bd1d102d2719efddd0 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3128 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3154 Tested-by: Alex Behm <alex.behm@cloudera.com>	2014-06-19 12:44:21 -07:00
Alex Behm	677062be3d	Rework planning of unions s.t. a UnionStmt produces a single MergeNode. This patch changes the planning of a UnionStmt s.t. it always produces a single fragment with a MergeNode connecting all child fragments as its root. The data partition of the returned fragment and how the child fragments are merged depends on the data partitions of the child fragments: - All child fragments are unpartitioned or partitioned: The returned fragment has an UNPARTITIONED or RANDOM data partition, respectively. The MergeNode absorbs the plan trees of all child fragments. - Mixed partitioned/unpartitioned child fragments: The returned fragment is RANDOM partitioned. The plan trees of all partitioned child fragments are absorbed into the MergeNode. All unpartitioned child fragments are connected to the MergeNode via a RANDOM exchange, and remain unchanged otherwise. Also adds support for random partitioned data exchanges. Change-Id: I82b2d12c104d98c4e7133234653ee1b67658ef7a Reviewed-on: http://gerrit.ent.cloudera.com:8080/2876 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3143	2014-06-19 00:56:58 -07:00
Alex Behm	4be9611474	Temporarily disable insert planner tests (CDH-19856). Change-Id: Ibcf914b87fb0ae958c5039a7cd2e8be72aa4295e Reviewed-on: http://gerrit.ent.cloudera.com:8080/3110 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com>	2014-06-17 23:34:07 -07:00
Alex Behm	eed829f778	Fix misleading test to unblock full data loading. Change-Id: I98c218188a0cf459cacb96363e7a65ebb4525f07 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3100 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com>	2014-06-17 17:45:04 -07:00
Srinath Shankar	895bdeddd8	Ignore order-by without limit in INSERT and CTAS Order-by without limit in the query statement corresponding to an INSERT or CTAS must be ignored because i) There is no guarantee on row ordering when the target table is scanned again i.e. 'select * from table' may return rows in any order, regardless of how the rows were inserted, and ii) Ignoring (and not flagging an error) is consistent with the treatment of order-by w/o limit in nested queries, union operands etc. Currently, an order-by w/o limit in a QueryStmt is only evaluated if the analyzer is the root analyzer (has no ancestors). However, a new child analyzer is not created for the QueryStmt in an InsertStmt, so this technique fails for inserts. The correct thing to do is to use a child analyzer for that QueryStmt, but this has spill-over scoping effects for analysis of with clauses. This patch adds a flag, similar to the isExplain flag to the analyzer to identify insert statements. Change-Id: I9ded587cfea75eca0b7a43ee9b0df0a6c8ecb602 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3044 Reviewed-by: Srinath Shankar <sshankar@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3060	2014-06-14 18:36:43 -07:00
Nong Li	5d903efca3	ExecSummary The runtime profile as we present it is not very useful and I think the structure of it makes it hard to consume. This patch adds a new client facing schemed set of counters that are collected from the runtime profiles. For example, with this structure it would be easy to have the shell get the stats of a running query and print a useful progress report or to check the most relevant metrics for diagnosing issues. Here's an example of the output for one of the tpch queries: Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail ------------------------------------------------------------------------------------------------------------------------ 09:MERGING-EXCHANGE 1 79.738us 79.738us 5 5 0 -1.00 B UNPARTITIONED 05:TOP-N 3 84.693us 88.810us 5 5 12.00 KB 120.00 B 04:AGGREGATE 3 5.263ms 6.432ms 5 5 44.00 KB 10.00 MB MERGE FINALIZE 08:AGGREGATE 3 16.659ms 27.444ms 52.52K 600.12K 3.20 MB 15.11 MB MERGE 07:EXCHANGE 3 2.644ms 5.1ms 52.52K 600.12K 0 0 HASH(o_orderpriority) 03:AGGREGATE 3 342.913ms 966.291ms 52.52K 600.12K 10.80 MB 15.11 MB 02:HASH JOIN 3 2s165ms 2s171ms 144.87K 600.12K 13.63 MB 941.01 KB INNER JOIN, BROADCAST \|--06:EXCHANGE 3 8.296ms 8.692ms 57.22K 15.00K 0 0 BROADCAST \| 01:SCAN HDFS 2 1s412ms 1s978ms 57.22K 15.00K 24.21 MB 176.00 MB tpch.orders o 00:SCAN HDFS 3 8s032ms 8s558ms 3.79M 600.12K 32.29 MB 264.00 MB tpch.lineitem l Change-Id: Iaad4b9dd577c375006313f19442bee6d3e27246a Reviewed-on: http://gerrit.ent.cloudera.com:8080/2964 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-06-11 03:10:11 -07:00
Srinath Shankar	5755b0bdee	Order by without limit for Impala Enable order-by without limit Added BufferedBlockMgr to allocate buffers and spill to disk. Added Sorter for the external sort implementation Added new SortNode execution node that completely sorts its input Changes to enable writing in IoMgr went in a separate patch. Reviewed-on: http://gerrit.ent.cloudera.com:8080/1539 Reviewed-by: Srinath Shankar <sshankar@cloudera.com> Tested-by: jenkins Conflicts: testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test Change-Id: I3ece32affe5b006f53bbdfcc03ded01471e818ac Reviewed-on: http://gerrit.ent.cloudera.com:8080/2900 Reviewed-by: Srinath Shankar <sshankar@cloudera.com> Tested-by: jenkins	2014-06-09 16:58:08 -07:00
ishaan	db97981ab9	[CDH5] Switch the tpcds schemas to use decimal instead of float/double. This patch converts the tpcds schemas to use decimal instead of float/double. Currently, Impala can only r/w decimal in text, therefore, the tables are constrained to text. The schemas were obtained from the official tpc spec: http://www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf Change-Id: I1ef0113dcb48bad52af75ee93b47b08adf9e1a69 Reviewed-on: http://gerrit.ent.cloudera.com:8080/2403 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2014-06-08 11:47:23 -07:00
Dimitris Tsirogiannis	0348a36b49	IMPALA-887: Improve partition pruning time (final) This commit contains the final set of changes for improving the performance of partition pruning. For each HdfsTable, we materialize a set of partition value metadata that allows the efficient evaluation of simple predicates on partition attributes without invoking the BE. These changes result in three orders of magnitude performance improvement during partition pruning. Change-Id: I5b405f0f45a470f2ba7b2191e0d46632c354d5ae Reviewed-on: http://gerrit.ent.cloudera.com:8080/2700 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/2823	2014-06-03 23:17:44 -07:00
Nong Li	8f4dc0f2f0	IMPALA-974: Switch from FloatLiteral to DecimalLiteral. Float/Doubles are lossy so using those as the default literal type is problematic. Change-Id: I5a619dd931d576e2e6cd7774139e9bafb9452db9 Reviewed-on: http://gerrit.ent.cloudera.com:8080/2758 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-05-31 22:19:06 -07:00
Dimitris Tsirogiannis	ca86e470de	IMPALA-887: Improve partition pruning time This commit is the first step in improving the performance of partition pruning. Currently, Impala can prune approximately 10K partitions per sec, thereby introducing significant overhead for huge tables with a large number of partitions. With this commit we reduce that overhead by 3X by batching the partition pruning calls to the backend. Change-Id: I3303bfc7fb6fe014790f58a5263adeea94d0fe7d Reviewed-on: http://gerrit.ent.cloudera.com:8080/2608 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/2687	2014-05-26 13:10:12 -07:00
Nong Li	5729024fe9	IMPALA-984: Fix missing reanalyze in InlineViewRef and NULL handling. Change-Id: Ia80035c5456630aeef7a24288a998fe08546a282 Reviewed-on: http://gerrit.ent.cloudera.com:8080/2652 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-05-21 18:18:29 -07:00
Alex Behm	1b9a8020bf	IMPALA-996: Exclude non-materialized slots from a tuple's avgSerializedSize. Change-Id: Ic7936c6b5c5e6d4c162d91105128cda2b1b7284c Reviewed-on: http://gerrit.ent.cloudera.com:8080/2617 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/2626	2014-05-20 16:21:59 -07:00
Matthew Jacobs	6ccd56bc1f	Enforce slot equivalences at data source scan nodes Change-Id: I2ed606ba398990ab05afa3301b6356c6a636e2bb Reviewed-on: http://gerrit.ent.cloudera.com:8080/2521 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: jenkins (cherry picked from commit 55061f6953956f45d433fe227ded539a648e3f9c) Reviewed-on: http://gerrit.ent.cloudera.com:8080/2536	2014-05-19 14:37:44 -07:00
Matthew Jacobs	ebc6c5894e	External Data Source: Frontend and catalog changes Initial frontend and catalog changes for external data sources. Change-Id: Ia0e61ef97cfd7a4e138ef555c17f2e45bbf08c18 Reviewed-on: http://gerrit.ent.cloudera.com:8080/2224 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: jenkins (cherry picked from commit dfa14c828957f751db9c89bae0bdc040ce6f648c) Reviewed-on: http://gerrit.ent.cloudera.com:8080/2485	2014-05-08 14:56:19 -07:00
Dimitris Tsirogiannis	1a21bb9b9e	IMPALA-642: Conjunctive predicates on HBase table not working... This commit fixes IMPALA-642 issue where conjunctive predicates are returning incorrect results from HBase in the presence of NULL values. The following changes are included: 1. Modified the HBaseScanNode to re-apply the "pushed-down" predicates. 2. Added tests in QueryTest/hbase-filters.test 3. Added tests in PlannerTest/hbase.test Change-Id: I598b325ad63b043b325fba74448698ed71a3cd78 Reviewed-on: http://gerrit.ent.cloudera.com:8080/2414 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/2489 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>	2014-05-08 13:59:00 -07:00
Henry Robinson	f968bb6087	IMPALA-923: Boolean slotrefs not marked as assigned in inline views A boolean slotref predicate that could be pushed into an inline view would not be correctly marked as assigned, leading to an extra select node being introduced to evaluate it. This was because the id of the expression after substitution would change (see createInlineViewPlan()), but only the post-substitution conjunct IDs were marked as assigned. This bug only affected standalone slotrefs; other exprs (like casts, or explicit predicates referencing a slotref) would not change their ID under substitution. Change-Id: I4127528b4aec25c966a4d186ddc98a68502b90c1 Reviewed-on: http://gerrit.ent.cloudera.com:8080/2430 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: jenkins (cherry picked from commit b49bfdf57769615d43d86fcfce2269531640788a) Reviewed-on: http://gerrit.ent.cloudera.com:8080/2435	2014-05-02 18:45:21 -07:00
Victor Bittorf	808f9a661a	IMPALA-939: Regex should match anywhere in string. Change-Id: I8dcd337c3b06b632017270670a4f199ec7ada648 Reviewed-on: http://gerrit.ent.cloudera.com:8080/2296 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins (cherry picked from commit c97f82eaaf0efe9bd4c3da3d005464f425696a62) Reviewed-on: http://gerrit.ent.cloudera.com:8080/2371	2014-04-25 16:16:15 -07:00
Alex Behm	91e1eb0789	CDH-18563: Speed up the computation of transitive value transfers. The issue: Computing the full transitive closure for all slots can be very expensive (10s of seconds for >2k slots, minutes for >4k slots). Queries with many views and/or unions were affected most because each union/view adds a new tuple with slots, increasing the total number of slots. The fix: The new algorithm exploits the sparse structure of the value transfer graph for a significant speedup (>100x). The high-level steps are: 1. Identify complete subgraphs based on bi-directional value transfers, and coalesce the slots of each complete subgraph into a single slot. 2. Map the remaining uni-directional value transfers into the new slot domain. 3. Identify the connected components of the uni-directional value transfers. This step partitions the value transfers into disjoint sets. 4. Compute the transitive closure of each partition from (3) in the new slot domain separately. Hopefully, the partitions are small enough to afford the O(N^3) complexity of the brute-force transitive closure computation. Change-Id: I35b57295d8f04b92f00ac48c04d1ef1be4daf41b Reviewed-on: http://gerrit.ent.cloudera.com:8080/2360 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-04-24 23:53:28 -07:00
Alex Behm	121fab8fdf	IMPALA-888: Drop union operands with constant conjuncts evaluating to false. This patch simplifies the complex slot materialization logic for unions by making the materialization independent of conjuncts assigned to MergeNodes. When 'pushing down' predicates into union operands, we drop union operands with constant predicates evaluating to false. Constant predicates that evaluate to true are simply ignored. Change-Id: I0e7ccfb206bed29db2b5d667e2bb61310980e80a Reviewed-on: http://gerrit.ent.cloudera.com:8080/2327 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-04-23 18:25:14 -07:00
Alex Behm	c8e928119d	IMPALA-912: Enforce slot equivalences at the lowest possible plan node. The reported issue is that we can have redundant hash expressions in exchanges. The underlying cause is that we fail to remove redundant join predicates. This patch enforces slot equivalences based on our computed equivalence classes at the lowest possible plan node by generating new equality predicates. Each plan subtree now has a minimal set of equality predicates that express all known equivalences between slots belonging to tuples materialized at that plan node. As a result, eliminating redundant join predicates becomes trivial: It is sufficient to pick a single representative predicate of each relevant equivalence class. All predicates beyond that are redundant. Change-Id: I7998fe8d7bdf84cc8eb129d32c86269bedeab68e Reviewed-on: http://gerrit.ent.cloudera.com:8080/2177 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/2278	2014-04-18 13:28:49 -07:00
Alex Behm	0585dfb546	IMPALA-888: Materialize union slots referenced by constant predicates. To keep the predicate assignment/propagation logic simple, we assign conjuncts whose underlying base table exprs are constant in at least one union operand to the evaluating MergeNode, and not in the operand(s) whose corresponding base table exprs are constant. The JIRA describes two different bugs: The first bug was that the slots required for evaluating such predicates in the MergeNode were not marked as materialized. The second bug was that predicates 'pushed' into union operands did not get re-analyzed after substituting the predicate's exprs with the result exprs of that union operand. Missing casts lead to a crash. The new test covers both bugs. Change-Id: I0f5b8a366b32f7d4b2587e13793b6103cdf7e8b3 Reviewed-on: http://gerrit.ent.cloudera.com:8080/2162 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-04-07 18:32:29 -07:00
Alex Behm	8b319f8959	IMPALA-935: Make PlanFragment.getDestFragment() return null if no destination is set. Change-Id: I269a7f552d7ff67ff4d65e86e8c6df9c41d0fca1 Reviewed-on: http://gerrit.ent.cloudera.com:8080/2159 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-04-07 16:21:24 -07:00
Alex Behm	f4cfc75544	IMPALA-860: Place outer and semi joins at a fixed position in the plan. The bug: It is generally incorrect to re-order joins across outer/semi joins. For example, an inner join following an outer join may reduce the cardinality, so placing the inner join before the outer join during join re-ordering would be incorrect because the outer join is cardinality preserving (on one or both sides). The same argument holds for semi joins. The fix: Place outer and semi joins at a fixed position in the plan based on where they appeared in the original query. Inner joins to the left/right of outer/semi joins are still re-ordered properly. Change-Id: Idae837097b9376473d7f8124eef69b51f612b210 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1909 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1922	2014-03-15 05:18:11 -07:00
Srinath Shankar	74a975c45b	IMPALA-862: count(x) may return null when a similar count(distinct x) is also used count(x) with no distinct and no group-by expressions returns NULL on empty input if other distinct aggs (e.g. COUNT(distinct x)) are present. This happens because the COUNT is transformed to SUM(COUNT()), with the inner COUNT being evaluated WITH a group-by expression (e.g. x). SUM over empty input returns NULL, but COUNT should return 0. This patch fixes this by replacing COUNT with zeroifnull(COUNT) before AggregateInfo is generated if there are distinct aggs and no group-bys. The logic in AggregateInfo itself has not been modified. Change-Id: I902e3fdd95767135b2f3fe423e8802ef57366af1 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1921 Reviewed-by: Srinath Shankar <sshankar@cloudera.com> Tested-by: jenkins	2014-03-14 23:35:55 -07:00
Alex Behm	15e05082c0	IMPALA-831: Distributed aggregation and top-n over unions. Change-Id: I056e8271421008378db93e8b2393861cc9dd4b90 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1840 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1886	2014-03-13 15:42:31 -07:00
Alex Behm	d640273a3f	IMPALA-861: Proper slot materialization on distinct aggs inside an inline view. The bug: Slot materialization on distinct aggs inside an inline view did not work if the only reference to the 2nd-phase agg-tuple slots was in a predicate from an outer query block (e.g., Where-clause of the block with the inline view ref). The reason was that bound predicates were fetched from the wrong tuple (from the 1st phase agg). The fix: Assign predicates to the top-most agg in the single-node plan that can evaluate them, as follows: For non-distinct aggs place them in the 1st phase agg node. For distinct aggs place them in the 2nd phase agg node. Change-Id: I0f6ab53cf7bb0c6aed9524ad2e24a849d2dc0ec4 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1843 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1881	2014-03-13 04:39:48 -07:00
Alex Behm	58950a52a3	IMPALA-798: Distributed execution of CTAS and explain CTAS. Change-Id: I32004a4b31c54cf5c185169fece143a61213d12d Reviewed-on: http://gerrit.ent.cloudera.com:8080/1850 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1867	2014-03-12 16:51:50 -07:00
Alex Behm	47c52ade84	IMPALA-866: Make HdfsScanNode.computeStats() idempotent with respect to totalBytes_. Change-Id: I1c243b089db82c0544586a2a1428081aa2dbcd52 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1844 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1852	2014-03-11 18:20:15 -07:00
Alex Behm	fe4f3babe5	IMPALA-824: Generalize getBoundPredicates() to handle multi-slot predicates. The bug: Multi-slot predicates bound to a single outer-joined tuple were not marked as materialized. In addition, such predicates were not picked up by nodes under the join via getBoundPredicates() even if it would be correct to do so. The fix: Always mark slots of predicates that must be evaluated by a join in SelectStmt.materializeRequiredSlots(), regardless of whether the predicate can also be safely evaluated below the join. This patch also generalizes getBoundPredicates() to handle multi-slot predicates and fixes some issues with redundant predicate assignment. Still, the new approach has several limitations which are documented in the predicate propagation planner test to ease future improvements. Change-Id: If5da0354a83c00a9766fc63b7780ed4d5a9c46e5 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1717 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1819	2014-03-08 04:22:39 -08:00
Alex Behm	a615ebc549	IMPALA-822,IMP-1271: Binding predicates on an aggregation now properly trigger slot materialization. The bug was that the number of materialized agg-tuple slots did not correspond to the number of materialized agg functions, due to binding predicates against an AggNode causing slot materialization after SelectStmt.materializeRequiredSlots(). This patch fixes the issue by taking binding predicates (bound to a slot in an agg tuple) into consideration in SelectStmt.materializeRequiredSlots(). A new sanity check that I added in AggregationNode.toThrift() surfaced another issue with slot materialization that is also fixed in this patch. The ordering exprs must be marked before the agg exprs in SelectStmt.materializeRequiredSlots() because the ordering exprs may contain agg exprs that are only referenced inside the ORDER BY clause. Change-Id: I1bdc0466f583907bed625ce6608938e59faee83f Reviewed-on: http://gerrit.ent.cloudera.com:8080/1639 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1818 Reviewed-by: Alex Behm <alex.behm@cloudera.com>	2014-03-08 00:25:26 -08:00
Alex Behm	f7c2781afe	IMPALA-845: Transfer predicates to 2nd phase merge agg in some cases. Having predicates need to be transferred to the 2nd phase merge agg for distinct + non-distinct aggregates without group by. For distinct + non-distinct aggregates with group by, it is correct to evaluate the predicates at the 2nd phase (non-merge) agg. Change-Id: I71d73c4ef92becbb81e142bc0cb5f54e790b1fb5 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1743 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1817	2014-03-07 21:45:16 -08:00
Alex Behm	f71767c612	IMPALA-846: Add additional regression test. This issue has coincidentally been resolved by the fix for IMPALA-820. This patch adds an additional regression test for explicitly covering IMPALA-846. Change-Id: Ib60174676e5bb53de543a1db30adc05cef4d6593 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1719 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1730	2014-03-03 16:50:36 -08:00
Alex Behm	75e89263fb	IMPALA-820: Register Having-clause predicates with analyzer after expr substitution. Change-Id: I638a7324b10007f7f55564b82def3fd05f2c9fa4 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1607 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1701	2014-02-28 10:40:55 -08:00
Alex Behm	cb8150e8ee	IMPALA-817: Check equality of function name in Function.equals(). Change-Id: Ib9b4ee3a21f90fdb0d7ebccd89462dc67040bd1e Reviewed-on: http://gerrit.ent.cloudera.com:8080/1594 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1611 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Marcel Kornacker <marcel@cloudera.com>	2014-02-19 17:13:51 -08:00
Alex Behm	15f4f2c33a	IMP-1258: Fix incorrect assignment/propagation of predicates with outer joins. This patch includes several changes to predicate assignment and propagation. First, we now only register as outer joined those tuples of TableRefs directly participating in an outer join. In particular, materialized tuples referenced inside an outer-joined InlineView are not registered as outer joined - only the InlineView's tuple is registered. The other major change is that we detect when it is correct to propagate predicates to scan nodes participating (directly or indirectly) in an outer join by testing whether a predicate can become true if a tuple is NULL. If that is the case, then it is generally not safe to propagate a predicate because it would change the final result of the outer join. Change-Id: Ia135ab15ec8c6ef756a908f797f96812d28c84c1 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1567 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1606	2014-02-19 15:27:44 -08:00
Alex Behm	1fc3b384ed	IMPALA-805: Disable elimination of redundant join predicates. Our code for eliminating redundant join predicates based on equivalence classes is not quite right. I've commented out the relevant code to ensure we don't incorrectly remove predicates. Left a TODO to fix and re-enable this feature. Change-Id: Ie76b365903dff6df271a378cbb4fd327ffa0631f Reviewed-on: http://gerrit.ent.cloudera.com:8080/1569 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1572	2014-02-19 13:23:11 -08:00
Nong Li	0d2919fe7f	Refactor scalar and aggregate function analysis and execution. This patch cleans up analysis and execution of scalar and aggregate functions so that there is no difference between how builtins and user functions are handled. The only difference is that the catalog is populated with the builtins all the time. The BE always gets a TFunction object and just executes it (builtins will have an empty hdfs file location). This removes the opcode registry and all of the functionality is subsumed by the catalog, most of which was already duplicated there anyway. This also introduces the concept of a system database; databases that the user cannot modify and is populated automatically on startup. Change-Id: Iaa3f84dad0a1a57691f5c7d8df7305faf01d70ed Reviewed-on: http://gerrit.ent.cloudera.com:8080/1386 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1577	2014-02-18 18:40:08 -08:00
Lenni Kuff	95404d4888	Support prioritized background table loading The overall goal of this change is to allow table metadata to be loaded in the background but also to allow prioritization of loading on an as-needed basis. As part of analysis, any tables that are not loaded are tracked and if analysis fails the Impalad will make an RPC to the CatalogServer to request that the metadata loading of these tables be prioritized and analysis will be restarted. To support this, the CatalogServer now has a deque of the tables to load. For background loading, tables to load are added to the tail of the deque. However, a new CatalogServer RPC was added that can prioritize the loading of one or more tables in which case they will get added to the head of the deque. The next table to load is always taken from the head. This helps prioritize loading but is admittedly not the most fair approach. To support the prioritized loading, some changes had to be made on the Impalad side during analysis: - During analysis, any tables that are missing metadata are tracked. - Analysis now runs in a loop. If it fails due to an AnalysisException AND at least 1 table/view was missing metadata, these tables missing metadata are requested to be loaded by calling the CatalogServer. - The impalad will wait until the required tables are received (by getting notified each time there is a call to updateCatalog()), and waiting to run analysis until all tables are available. Once the tables are available, analysis will restart. This change also introduces two new flags: --load_catalog_in_background (bool). When this is true (the default) the catalog server will run a periodic background thread to queue all unloaded tables for loading. This is generally the desired behavior, but there may be some cases (very large metastores) where this may need to be disabled. --num_metadata_loading_threads (int32). The number of threads to use when loading catalog metadata (degree of parallelism). The default is 16, but it can be increased to improve performance at the cost of stressing the Hive metastore/HDFS. Change-Id: Ib94dbbf66ffcffea8c490f50f5c04d19fb2078ad Reviewed-on: http://gerrit.ent.cloudera.com:8080/1476 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1538	2014-02-13 23:43:06 -08:00
Lenni Kuff	b4f5c1edcf	Enable lazy loading of table metadata for the CatalogService/Impalad This change adds support for lazy loading of table metadata to the CatalogService/Impalad. The way this works is that the CatalogService initially sends out an update with only the databases and table names (wrapped as IncompleteTables). When an Impalad encounters one of these tables, it will contact the catalog service to get the metadata, possibly triggering a metadata load if the catalog server has not yet loaded this table. With these changes the catalog server starts up in just seconds, even for large metastores, since it only needs to call into the metastore to get the list of tables and databases. The performance of "invalidate metadata" also improves for the same reason. I also picked up the catalog cleanup patch I had to make the APIs a bit more consistent and remove the need for using a LoadingCache for databases. This also fixes up the FE tests to run in a more realistic fashion. The FE tests now run against catalog objects received from the catalog server. This actually turned up some bugs in our previous test configuration where we were not running with the correct column stats (we were always running with avgSerializedSize = slotSize). This changed some plans so the planner tests needed to be updated. Still TODO: This does not include the changes to perform background metadata loading. I will send that out as a separate patch on top of this. Change-Id: Ied16f8a7f3a3393e89d6bfea78f0ba708d0ddd0e Saving changes Change-Id: I48c34408826b7396004177f5fc61a9523e664acc Reviewed-on: http://gerrit.ent.cloudera.com:8080/1328 Tested-by: jenkins Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Reviewed-on: http://gerrit.ent.cloudera.com:8080/1338 Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-21 21:43:29 -08:00
Alex Behm	6799c93922	Simplified/enhanced explain plans with a total of four explain levels. There are now 4 explain levels summarized as follows: - Level 0: MINIMAL Non-fragmented parallel plan only showing plan nodes with minimal attributes - Level 1: STANDARD Non-fragmented parallel plan with some details in plan nodes - Level 2: EXTENDED Non-fragmented parallel plan with full details in plan nodes including the table/column stats, row size, #hosts, cardinality, and estimated per-host memory requirement - Level 3: VERBOSE Fragmented parallel plan with full details (like level 2) This patch also includes several bugfixes related to plan costing and/or testing of explain plans. Change-Id: I622310f01d1b3d53ea1031adaf3b3ffdd94eba30 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1211 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-01-10 19:17:59 -08:00
Alan Choi	57b961168d	IMP-1188 Fix HBase row key predicate issues This patch fixes a few row key issues: 1. We used to assert that the row key filter must be a string literal. However, it can also be a constant function. We need to eval the expr and then use the result as the start/stop key. 2. Cast(row_key as int) simply failed. This should not be transformed into start/stop key. 3. We used to assert that lower bound < upper bound. This query: select * from tbl where row_key > 'b' and row_key < 'a' would simply ASSERT. We should simply not return any rows. 4. Handle NULL predicate: HBase row key can't be null. If either upper/lower bound is null, we simply don't need to return any rows. Change-Id: Ia03590a862888b377bf1f48bcb838b99193fa241 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1180 Reviewed-by: Alan Choi <alan@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:40 -08:00
Alex Behm	e4ad086dee	Added max/avg length for string columns in COMPUTE STATS. Change-Id: I6f61de2323ee12681642684ec633ed4bb7506de2 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1079 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:30 -08:00
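
The load-ordering scheme described in commit 95404d4888 (background loads appended to the tail of a deque, prioritized loads pushed to the head, next table always taken from the head) can be illustrated with a minimal sketch. This is not Impala's code: the class `TableLoadQueue` and its method names are hypothetical, and moving an already-queued table to the head when it is prioritized is an assumption the commit message does not state.

```python
from collections import deque
import threading


class TableLoadQueue:
    """Hypothetical illustration of the deque-based ordering; not Impala's class."""

    def __init__(self):
        self._tables = deque()
        self._lock = threading.Lock()

    def add_background(self, table_name):
        # Background loading: tables go to the tail of the deque.
        with self._lock:
            if table_name not in self._tables:
                self._tables.append(table_name)

    def prioritize(self, table_names):
        # Prioritized loading: tables go to the head of the deque.
        # Iterating in reverse keeps the caller's order at the head.
        with self._lock:
            for name in reversed(table_names):
                if name in self._tables:
                    # Assumption: an already-queued table is moved, not duplicated.
                    self._tables.remove(name)
                self._tables.appendleft(name)

    def next_table(self):
        # The next table to load is always taken from the head.
        with self._lock:
            return self._tables.popleft() if self._tables else None
```

As the commit message notes, this ordering is not fair: a steady stream of prioritized requests can keep background tables waiting at the tail.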


126 Commits