impala

mirror of https://github.com/apache/impala.git synced 2026-01-07 00:02:28 -05:00

Author	SHA1	Message	Date
Alex Behm	15e05082c0	IMPALA-831: Distributed aggregation and top-n over unions. Change-Id: I056e8271421008378db93e8b2393861cc9dd4b90 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1840 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1886	2014-03-13 15:42:31 -07:00
Alex Behm	a615ebc549	IMPALA-822,IMP-1271: Binding predicates on an aggregation now properly trigger slot materialization. The bug was that the number of materialized agg-tuple slots did not correspond to the number of materialized agg functions, due to binding predicates against an AggNode causing slot materialization after SelectStmt.materializeRequiredSlots(). This patch fixes the issue by taking binding predicates (bound to a slot in an agg tuple) into consideration in SelectStmt.materializeRequiredSlots(). A new sanity check that I added in AggregationNode.toThrift() surfaced another issue with slot materialization that is also fixed in this patch. The ordering exprs must be marked before the agg exprs in SelectStmt.materializeRequiredSlots() because the ordering exprs may contain agg exprs that are only referenced inside the ORDER BY clause. Change-Id: I1bdc0466f583907bed625ce6608938e59faee83f Reviewed-on: http://gerrit.ent.cloudera.com:8080/1639 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1818 Reviewed-by: Alex Behm <alex.behm@cloudera.com>	2014-03-08 00:25:26 -08:00
Alex Behm	f71767c612	IMPALA-846: Add additional regression test. This issue has coincidentally been resolved by the fix for IMPALA-820. This patch adds an additional regression test for explicitly covering IMPALA-846. Change-Id: Ib60174676e5bb53de543a1db30adc05cef4d6593 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1719 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1730	2014-03-03 16:50:36 -08:00
Alex Behm	cb8150e8ee	IMPALA-817: Check equality of function name in Function.equals(). Change-Id: Ib9b4ee3a21f90fdb0d7ebccd89462dc67040bd1e Reviewed-on: http://gerrit.ent.cloudera.com:8080/1594 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1611 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Marcel Kornacker <marcel@cloudera.com>	2014-02-19 17:13:51 -08:00
Nong Li	0d2919fe7f	Refactor scalar and aggregate function analysis and execution. This patch cleans up analysis and execution of scalar and aggregate functions so that there is no difference between how builtins and user functions are handled. The only difference is that the catalog is populated with the builtins all the time. The BE always gets a TFunction object and just executes it (builtins will have an empty hdfs file location). This removes the opcode registry and all of the functionality is subsumed by the catalog, most of which was already duplicated there anyway. This also introduces the concept of a system database: databases that the user cannot modify and that are populated automatically on startup. Change-Id: Iaa3f84dad0a1a57691f5c7d8df7305faf01d70ed Reviewed-on: http://gerrit.ent.cloudera.com:8080/1386 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1577	2014-02-18 18:40:08 -08:00
Alex Behm	6799c93922	Simplified/enhanced explain plans with a total of four explain levels. There are now 4 explain levels summarized as follows: - Level 0: MINIMAL Non-fragmented parallel plan only showing plan nodes with minimal attributes - Level 1: STANDARD Non-fragmented parallel plan with some details in plan nodes - Level 2: EXTENDED Non-fragmented parallel plan with full details in plan nodes including the table/column stats, row size, #hosts, cardinality, and estimated per-host memory requirement - Level 3: VERBOSE Fragmented parallel plan with full details (like level 2) This patch also includes several bugfixes related to plan costing and/or testing of explain plans. Change-Id: I622310f01d1b3d53ea1031adaf3b3ffdd94eba30 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1211 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-01-10 19:17:59 -08:00
Marcel Kornacker	fd182201bc	Predicate propagation plus join order optimization. Introduces STRAIGHT_JOIN keyword to prevent join order optimization. Structural changes to the planning framework: - slot materialization: the decision whether to materialize a slot now happens prior to plan generation. This is needed in order to be able to generate accurate cost estimates at plan generation time. see QueryStmt.materializeRequiredSlots() - added PlanNode.init(), which initializes the entire state of a PlanNode; this subsumes finalize() * computeMemLayout() now happens per-tuple in the corresponding ScanNode's init() * init() calls computeStats() by default; also marks slots as materialized and calls TupleDescriptor.computeMemLayout() - added PlanNode.tblRefIds_ - restructured UnionStmt and union plan generation to fit pred propagation model: all tuples are created (and equiv predicates registered) prior to plan generation - added Expr.isAuxExpr Change-Id: I475c1645bfca9e84ae6e5f529e7781d9532e5c9a Reviewed-on: http://gerrit.ent.cloudera.com:8080/955 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:24 -08:00
Alex Behm	3f54240fed	PlannerTest uses explain level 'normal'. Only add stats and costs to explain output in 'verbose' mode. Change-Id: I827b4c7085b5aa2dc5521f8748d8973178f43f4c Reviewed-on: http://gerrit.ent.cloudera.com:8080/678 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:23 -08:00
Alex Behm	4bb8b38cde	Added stats and cost estimates to explain output. Change-Id: I1273745a439fd25cefa4e08ecc075c98cc8bfc45 Reviewed-on: http://gerrit.ent.cloudera.com:8080/602 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com>	2014-01-08 10:53:22 -08:00
Nong Li	15db34e356	AggregationNode refactoring. This patch redoes how the aggregation node is implemented. The functionality is now split between aggregation-node, agg-expr and aggregate-functions. This is a work in progress (there's still a lot of debug stuff I added that needs to be cleaned up) but it does pass the tests. Aggregation-node is now very simple and now only deals with the grouping part. Aggregate-expr serves as the glue between the agg node and the aggregate functions. The aggregation functions are implemented with the UDA interface. I've reimplemented our existing aggregate functions with this setup. For true UDAs, the binaries would be loaded in aggregate-expr. This also includes some preliminary changes in the FE. We now need to annotate each AggNode as executing the update vs. merge phase (root aggs execute update, others execute merge) and if it needs a finalize step (only the root does). This is more general than our builtins which are too simple to need this structure. There is a big TODO here to allow the intermediate types between agg nodes to change. For example, in distinct estimate, the input type is the column type and the output type is a bigint. We'd like the intermediate type to be CHAR(256). This is different since currently, the intermediate type and output type have always been the same. We've hacked around this by having both the intermediate and output type be TYPE_STRING. I've left this for another patch (changing the BE to support this is trivial). For aggregates that result in strings, we used to store some additional stuff past the end of the tuple. The layout was: <tuple> <length of 1st string buffer>,<length of 2nd string buffer>, etc. The rationale for this is that we want to reuse the buffer for min/max and grow the buffer more quickly for group_concat. This breaks down the abstraction between agg-expr and agg-node and is not something UDAs can use in general.
Rather than try to hack around this, I think the proper solution is for the intermediate type not to be StringValue and to contain the buffer length itself. This patch also resurrects the distinct estimate code. The distinct estimate functions exercise all of the code paths. Change-Id: Ic152a2cd03bc1713967673681e1e6204dcd80346 Reviewed-on: http://gerrit.ent.cloudera.com:8080/564 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:13 -08:00
Marcel Kornacker	d85b90cb22	SlotDescriptor.label plus repartitioning for inserts when column stats are missing.	2014-01-08 10:51:56 -08:00
Alan Choi	2d25f11ec3	IMPALA-91 new explain plan output	2014-01-08 10:50:10 -08:00
Marcel Kornacker	0c36c7f327	Partitioned merge aggregation.	2014-01-08 10:48:59 -08:00
ishaan	09d6d931f4	Change the way data is loaded	2014-01-08 10:48:09 -08:00
Alan Choi	476a665763	IMP-620: print number of scanned partitions and total scanned bytes	2014-01-08 10:46:57 -08:00
Marcel Kornacker	fd77f06f15	Moving functional-newplanner back to functional-planner (and renaming NewPlanner to Planner)	2014-01-08 10:46:20 -08:00
Marcel Kornacker	04d12f03fc	cleaning up logging output	2014-01-08 10:44:28 -08:00
Lenni Kuff	04edc8f534	Update benchmark tests to run against generic workload, data loading with scale factor, +more. This change updates the run-benchmark script to enable it to target one or more workloads. Now benchmarks can be run like: ./run-benchmark --workloads=hive-benchmark,tpch We look up the workload in the workloads directory, then read the associated query .test files and start executing them. To ensure the queries are not duplicated between benchmark and query tests, I moved all existing queries (under fe/src/test/resources/*) to the workloads directory. You do NOT need to look through all the .test files, I've just moved them. The one new file is the 'hive-benchmark.test' which contains the hive benchmark queries. Also added support for generating schema for different scale factors as well as executing against these scale factors. For example, let's say we have a dataset with a scale factor called "SF1". We would first generate the schema using: ./generate_schema_statements --workload=<workload> --scale_factor="SF3" This will create tables with names distinct from those of the other scale factors. Run the generated .sql file to load the data. Alternatively, the data can be loaded by running a new python script: ./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor] For example: load-data.sh -w tpch -e core -s SF3 Then run against this: ./run-benchmark --workloads=<workload> --scale_factor=SF3 This changeset also includes a few other minor tweaks to some of the test scripts. Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6	2014-01-08 10:44:22 -08:00

18 Commits