impala

mirror of https://github.com/apache/impala.git synced 2026-01-07 09:02:19 -05:00

Author	SHA1	Message	Date
Dimitris Tsirogiannis	5a6f53db16	Add partition pruning tests The following changes are included in this commit: 1. Modified the alltypesagg table to include an additional partition key that has nulls. 2. Added a number of tests in hdfs.test that exercise the partition pruning logic (see IMPALA-887). 3. Modified all the tests that are affected by the change in alltypesagg. Change-Id: I1a769375aaa71273341522eb94490ba5e4c6f00d Reviewed-on: http://gerrit.ent.cloudera.com:8080/2874 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3236	2014-06-24 02:14:27 -07:00
Alex Behm	881f3a8c33	Re-order union operands descending by their estimated per-host memory. Re-order union operands descending by their estimated per-host memory, s.t. parent nodes can gauge the peak memory consumption of a MergeNode after opening it during execution (a MergeNode opens its first operand in Open()). Scan nodes are always ordered last because they can dynamically scale down their memory usage, whereas many other nodes cannot (e.g., joins, aggregations). One goal is to decrease the likelihood of a SortNode parent claiming too much memory in its Open(), possibly causing the mem limit to be hit when subsequent union operands are executed. Change-Id: Ia51caaffd55305ea3dbd2146cd55acc7da67f382 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3146 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com> Reviewed-on: http://gerrit.ent.cloudera.com:8080/3213 Tested-by: jenkins	2014-06-20 18:46:10 -07:00
Alex Behm	ef6705d7e0	Rename MergeNode to UnionNode. Change-Id: I9e3675a103757db1345b04bd1d102d2719efddd0 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3128 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3154 Tested-by: Alex Behm <alex.behm@cloudera.com>	2014-06-19 12:44:21 -07:00
Alex Behm	677062be3d	Rework planning of unions s.t. a UnionStmt produces a single MergeNode. This patch changes the planning of a UnionStmt s.t. it always produces a single fragment with a MergeNode connecting all child fragments as its root. The data partition of the returned fragment and how the child fragments are merged depend on the data partitions of the child fragments: - All child fragments are unpartitioned or partitioned: The returned fragment has an UNPARTITIONED or RANDOM data partition, respectively. The MergeNode absorbs the plan trees of all child fragments. - Mixed partitioned/unpartitioned child fragments: The returned fragment is RANDOM partitioned. The plan trees of all partitioned child fragments are absorbed into the MergeNode. All unpartitioned child fragments are connected to the MergeNode via a RANDOM exchange, and remain unchanged otherwise. Also adds support for random partitioned data exchanges. Change-Id: I82b2d12c104d98c4e7133234653ee1b67658ef7a Reviewed-on: http://gerrit.ent.cloudera.com:8080/2876 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3143	2014-06-19 00:56:58 -07:00
Srinath Shankar	895bdeddd8	Ignore order-by without limit in INSERT and CTAS Order-by without limit in the query statement corresponding to an INSERT or CTAS must be ignored because i) There is no guarantee on row ordering when the target table is scanned again i.e. 'select * from table' may return rows in any order, regardless of how the rows were inserted, and ii) Ignoring (and not flagging an error) is consistent with the treatment of order-by w/o limit in nested queries, union operands etc. Currently, an order-by w/o limit in a QueryStmt is only evaluated if the analyzer is the root analyzer (has no ancestors). However, a new child analyzer is not created for the QueryStmt in an InsertStmt, so this technique fails for inserts. The correct thing to do is to use a child analyzer for that QueryStmt, but this has spill-over scoping effects for analysis of with clauses. This patch adds a flag, similar to the isExplain flag, to the analyzer to identify insert statements. Change-Id: I9ded587cfea75eca0b7a43ee9b0df0a6c8ecb602 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3044 Reviewed-by: Srinath Shankar <sshankar@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3060	2014-06-14 18:36:43 -07:00
Nong Li	5d903efca3	ExecSummary The runtime profile as we present it is not very useful and I think the structure of it makes it hard to consume. This patch adds a new client facing schemed set of counters that are collected from the runtime profiles. For example, with this structure it would be easy to have the shell get the stats of a running query and print a useful progress report or to check the most relevant metrics for diagnosing issues. Here's an example of the output for one of the tpch queries: Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail ------------------------------------------------------------------------------------------------------------------------ 09:MERGING-EXCHANGE 1 79.738us 79.738us 5 5 0 -1.00 B UNPARTITIONED 05:TOP-N 3 84.693us 88.810us 5 5 12.00 KB 120.00 B 04:AGGREGATE 3 5.263ms 6.432ms 5 5 44.00 KB 10.00 MB MERGE FINALIZE 08:AGGREGATE 3 16.659ms 27.444ms 52.52K 600.12K 3.20 MB 15.11 MB MERGE 07:EXCHANGE 3 2.644ms 5.1ms 52.52K 600.12K 0 0 HASH(o_orderpriority) 03:AGGREGATE 3 342.913ms 966.291ms 52.52K 600.12K 10.80 MB 15.11 MB 02:HASH JOIN 3 2s165ms 2s171ms 144.87K 600.12K 13.63 MB 941.01 KB INNER JOIN, BROADCAST \|--06:EXCHANGE 3 8.296ms 8.692ms 57.22K 15.00K 0 0 BROADCAST \| 01:SCAN HDFS 2 1s412ms 1s978ms 57.22K 15.00K 24.21 MB 176.00 MB tpch.orders o 00:SCAN HDFS 3 8s032ms 8s558ms 3.79M 600.12K 32.29 MB 264.00 MB tpch.lineitem l Change-Id: Iaad4b9dd577c375006313f19442bee6d3e27246a Reviewed-on: http://gerrit.ent.cloudera.com:8080/2964 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-06-11 03:10:11 -07:00
Srinath Shankar	5755b0bdee	Order by without limit for Impala Enable order-by without limit Added BufferedBlockMgr to allocate buffers and spill to disk. Added Sorter for the external sort implementation Added new SortNode execution node that completely sorts its input Changes to enable writing in IoMgr went in a separate patch. Reviewed-on: http://gerrit.ent.cloudera.com:8080/1539 Reviewed-by: Srinath Shankar <sshankar@cloudera.com> Tested-by: jenkins Conflicts: testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test Change-Id: I3ece32affe5b006f53bbdfcc03ded01471e818ac Reviewed-on: http://gerrit.ent.cloudera.com:8080/2900 Reviewed-by: Srinath Shankar <sshankar@cloudera.com> Tested-by: jenkins	2014-06-09 16:58:08 -07:00
Alex Behm	15e05082c0	IMPALA-831: Distributed aggregation and top-n over unions. Change-Id: I056e8271421008378db93e8b2393861cc9dd4b90 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1840 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1886	2014-03-13 15:42:31 -07:00
Alex Behm	6799c93922	Simplified/enhanced explain plans with a total of four explain levels. There are now 4 explain levels summarized as follows: - Level 0: MINIMAL Non-fragmented parallel plan only showing plan nodes with minimal attributes - Level 1: STANDARD Non-fragmented parallel plan with some details in plan nodes - Level 2: EXTENDED Non-fragmented parallel plan with full details in plan nodes including the table/column stats, row size, #hosts, cardinality, and estimated per-host memory requirement - Level 3: VERBOSE Fragmented parallel plan with full details (like level 2) This patch also includes several bugfixes related to plan costing and/or testing of explain plans. Change-Id: I622310f01d1b3d53ea1031adaf3b3ffdd94eba30 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1211 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-01-10 19:17:59 -08:00
Marcel Kornacker	fd182201bc	Predicate propagation plus join order optimization. Introduces STRAIGHT_JOIN keyword to prevent join order optimization. Structural changes to the planning framework: - slot materialization: the decision whether to materialize a slot now happens prior to plan generation. This is needed in order to be able to generate accurate cost estimates at plan generation time. see QueryStmt.materializeRequiredSlots() - added PlanNode.init(), which initializes the entire state of a PlanNode; this subsumes finalize() * computeMemLayout() now happens per-tuple in the corresponding ScanNode's init() * init() calls computeStats() by default; also marks slots as materialized and calls TupleDescriptor.computeMemLayout() - added PlanNode.tblRefIds_ - restructured UnionStmt and union plan generation to fit pred propagation model: all tuples are created (and equiv predicates registered) prior to plan generation - added Expr.isAuxExpr Change-Id: I475c1645bfca9e84ae6e5f529e7781d9532e5c9a Reviewed-on: http://gerrit.ent.cloudera.com:8080/955 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:24 -08:00
Matthew Jacobs	8a55982105	Add OFFSET to skip rows returned with a LIMIT Adds support for skipping a number of rows with an ORDER BY clause and a LIMIT. Hive does not support OFFSET so creating a view with an OFFSET will not work in Hive. For example, "SELECT * FROM T1 ORDER BY ID LIMIT 20 OFFSET 5" will do the sorting, skip 5 rows, then return the next 20. OFFSET requires an ORDER BY clause. Note this is not very efficient as we must actually keep (limit+offset) rows in memory in the topn-node, and all child sort nodes must as well. Users should be careful when using this feature. Change-Id: I4d7021c278296e7bdbfa0e6f2699cd6f23eef59d Reviewed-on: http://gerrit.ent.cloudera.com:8080/900 Tested-by: jenkins Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Matthew Jacobs <mj@cloudera.com>	2014-01-08 10:54:02 -08:00
Alex Behm	3f54240fed	PlannerTest uses explain level 'normal'. Only add stats and costs to explain output in 'verbose' mode. Change-Id: I827b4c7085b5aa2dc5521f8748d8973178f43f4c Reviewed-on: http://gerrit.ent.cloudera.com:8080/678 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:23 -08:00
Alex Behm	4bb8b38cde	Added stats and cost estimates to explain output. Change-Id: I1273745a439fd25cefa4e08ecc075c98cc8bfc45 Reviewed-on: http://gerrit.ent.cloudera.com:8080/602 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com>	2014-01-08 10:53:22 -08:00
Nong Li	2b9105cd11	IMPALA-487: don't compact data from rhs of join if it is going through an exchange node. Change-Id: I442445e7370218352cd6d3137f2a454c9afb73ba Reviewed-on: http://gerrit.ent.cloudera.com:8080/476 Tested-by: jenkins Reviewed-by: Nong Li <nong@cloudera.com>	2014-01-08 10:52:50 -08:00
ishaan	53cd9eadab	Treat HBase as a file format for functional tests Change-Id: Ia01181a1e10eb108419122d347e9d869a69e8922 Reviewed-on: http://gerrit.ent.cloudera.com:8080/102 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Ishaan Joshi <ishaan@cloudera.com>	2014-01-08 10:52:36 -08:00
Alan Choi	2d25f11ec3	IMPALA-91 new explain plan output	2014-01-08 10:50:10 -08:00
Marcel Kornacker	5bfc477ccc	IMPALA-291: Plans should explicitly mention the join strategy	2014-01-08 10:49:59 -08:00
Marcel Kornacker	398e725a23	make broadcast joins the default join strategy	2014-01-08 10:49:34 -08:00
Marcel Kornacker	d7e22f44bb	Partitioned hash joins - added PlanNode.numNodes, PlanNode.avgRowSize and PlanNode.computeStats() - fixing up some cardinality estimates - Planner now tries to do a cost-based decision between broadcast join and join with full repartitioning (both inputs) - ExchangeNode now distinguishes between its input and output row descriptor: the output potentially contains more tuples - fixed problem related to cancellation and concurrent hash table builds. Not included: - partitioned joins that take advantage of existing partitions of the inputs; those will have to wait for a follow-on change	2014-01-08 10:49:29 -08:00
Marcel Kornacker	c02d25baa8	IMPALA-20: Limit clause in inline view not handled correctly by planner - this adds a SelectNode that evaluates conjuncts and enforces the limit - all limits are now distributed: enforced both by the child plan fragment and by the merging ExchangeNode - all limits w/ Order By are now distributed: enforced both by the child plan fragment and by the merging TopN node	2014-01-08 10:48:29 -08:00
ishaan	09d6d931f4	Change the way data is loaded	2014-01-08 10:48:09 -08:00
Alan Choi	476a665763	IMP-620: print number of scanned partitions and total scanned bytes	2014-01-08 10:46:57 -08:00
Marcel Kornacker	fd77f06f15	Moving functional-newplanner back to functional-planner (and renaming NewPlanner to Planner)	2014-01-08 10:46:20 -08:00
Marcel Kornacker	04d12f03fc	cleaning up logging output	2014-01-08 10:44:28 -08:00
Lenni Kuff	04edc8f534	Update benchmark tests to run against generic workload, data loading with scale factor, +more This change updates the run-benchmark script to enable it to target one or more workloads. Now benchmarks can be run like: ./run-benchmark --workloads=hive-benchmark,tpch We look up the workload in the workloads directory, then read the associated query .test files and start executing them. To ensure the queries are not duplicated between benchmark and query tests, I moved all existing queries (under fe/src/test/resources/*) to the workloads directory. You do NOT need to look through all the .test files, I've just moved them. The one new file is the 'hive-benchmark.test' which contains the hive benchmark queries. Also added support for generating schema for different scale factors as well as executing against these scale factors. For example, let's say we have a dataset with a scale factor called "SF1". We would first generate the schema using: ./generate_schema_statements --workload=<workload> --scale_factor="SF3" This will create tables with unique names from the other scale factors. Run the generated .sql file to load the data. Alternatively, the data can be loaded by running a new python script: ./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor] For example: load-data.sh -w tpch -e core -s SF3 Then run against this: ./run-benchmark --workloads=<workload> --scale_factor=SF3 This changeset also includes a few other minor tweaks to some of the test scripts. Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6	2014-01-08 10:44:22 -08:00

25 Commits