Commit Graph

18 Commits

Author SHA1 Message Date
Alex Behm
15e05082c0 IMPALA-831: Distributed aggregation and top-n over unions.
Change-Id: I056e8271421008378db93e8b2393861cc9dd4b90
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1840
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1886
2014-03-13 15:42:31 -07:00
Alex Behm
a615ebc549 IMPALA-822,IMP-1271: Binding predicates on an aggregation now properly trigger slot materialization.
The bug was that the number of materialized agg-tuple slots did not correspond to the number
of materialized agg functions, due to binding predicates against an AggNode causing slot
materialization after SelectStmt.materializeRequiredSlots().
This patch fixes the issue by taking binding predicates (bound to a slot in an agg tuple)
into consideration in SelectStmt.materializeRequiredSlots().
I added a new sanity check in AggregationNode.toThrift() surfaced another issue with slot
materialization that is also fixed in this patch. The ordering exprs must be marked before
the agg exprs in SelectStmt.materializeRequiredSlots() because the odering exprs may contain
agg exprs that are only referenced inside the ORDER BY clause.

Change-Id: I1bdc0466f583907bed625ce6608938e59faee83f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1639
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1818
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
2014-03-08 00:25:26 -08:00
Alex Behm
f71767c612 IMPALA-846: Add additional regression test.
This issue has coincidentally been resolved by the fix for IMPALA-820.
This patch adds an additional regression test for explicitly covering
IMPALA-846.

Change-Id: Ib60174676e5bb53de543a1db30adc05cef4d6593
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1719
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1730
2014-03-03 16:50:36 -08:00
Alex Behm
cb8150e8ee IMPALA-817: Check equality of function name in Function.equals().
Change-Id: Ib9b4ee3a21f90fdb0d7ebccd89462dc67040bd1e
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1594
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1611
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
2014-02-19 17:13:51 -08:00
Nong Li
0d2919fe7f Refactor scalar and aggregate function analysis and execution.
This patch cleans up analysis and execution of scalar and aggregate functions
so that there is no difference between how builtins and user functions are
handled. The only difference is that the catalog is populated with the builtins
all the time.

The BE always gets a TFunction object and just executes it (builtins will have
an empty hdfs file location).

This removes the opcode registry and all of the functionality is subsumed by
the catalog, most of which was already duplicated there anyway.

This also introduces the concept of a system database; databases that the
user cannot modify and is populated automatically on startup.

Change-Id: Iaa3f84dad0a1a57691f5c7d8df7305faf01d70ed
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1386
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1577
2014-02-18 18:40:08 -08:00
Alex Behm
6799c93922 Simplified/enhanced explain plans with a total of four explain levels.
There are now 4 explain levels summarized as follows:
- Level 0: MINIMAL
  Non-fragmented parallel plan only showing plan nodes with minimal attributes
- Level 1: STANDARD
  Non-fragmented parallel plan with some details in plan nodes
- Level 2: EXTENDED
  Non-fragmented parallel plan with full details in plan nodes including
  the table/column stats, row size, #hosts, cardinality,
  and estimated per-host memory requirement
- Level 3: VERBOSE
  Fragmented parallel plan with full details (like level 2)

This patch also includes several bugfixes related to plan costing and/or
testing of explain plans.

Change-Id: I622310f01d1b3d53ea1031adaf3b3ffdd94eba30
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1211
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-01-10 19:17:59 -08:00
Marcel Kornacker
fd182201bc Predicate propagation plus join order optimization.
Introduces STRAIGHT_JOIN keyword to prevent join order optimization.

Structural changes to the planning framework:
- slot materialization: the decision whether to materialize a slot now happens *prior* to
  plan generation. This is needed in order to be able to generate accurate cost estimates
  at plan generation time. see QueryStmt.materializeRequiredSlots()
- added PlanNode.init(), which initializes the entire state of a PlanNode; this subsumes
  finalize()
  * computeMemLayout() now happens per-tuple in the corresponding ScanNode's init()
  * init() calls computeStats() by default; also marks slots as materialized and calls
    TupleDescriptor.computeMemLayout()
- added PlanNode.tblRefIds_
- restructured UnionStmt and union plan generation to fit pred propagation model:
  all tuples are created (and equiv predicates registered) prior to plan generation
- added Expr.isAuxExpr

Change-Id: I475c1645bfca9e84ae6e5f529e7781d9532e5c9a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/955
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:24 -08:00
Alex Behm
3f54240fed PlannerTest uses explain level 'normal'. Only add stats and costs to explain output in 'verbose' mode.
Change-Id: I827b4c7085b5aa2dc5521f8748d8973178f43f4c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/678
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
2014-01-08 10:53:23 -08:00
Alex Behm
4bb8b38cde Added stats and cost estimates to explain output.
Change-Id: I1273745a439fd25cefa4e08ecc075c98cc8bfc45
Reviewed-on: http://gerrit.ent.cloudera.com:8080/602
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:53:22 -08:00
Nong Li
15db34e356 AggregationNode refactoring
This patch redoes how the aggregation node is implemented. The functionality is
now split between aggregation-node, agg-expr and aggregate-functions. This is a working
progress (there's still a lot of debug stuff I added that needs to be cleaned up) but
it does pass the tests.

Aggregation-node is now very simple and now only deals with the grouping part.
Aggregate-expr serves as the glue between the agg node and the aggregate functions.
The aggregation functions are implemented with the UDA interface. I've reimplemented
our existing aggregate functions with this setup. For true UDAs, the binaries would be
loaded in aggregate-expr.

This also includes some preliminary changes in the FE. We now need to annotate each
AggNode as executing the update vs. merge phase (root aggs execute update, others
execute merge) and if it needs a finalize step (only the root does). This is more
general than our builtins which are too simple to need this structure.

There is a big TODO here to allow the intermediate types between agg nodes to change.
For example, in distinct estimate, the input type is the column type and the output type
is a bigint. We'd like the intermediate type to be CHAR(256). This is different since
currently, the intermediate type and output type have always been the same. We've hacked
around this by having both the intermediate and output type be TYPE_STRING. I've left
this for another patch (changing the BE to support this is trivial).
For aggregates that result in strings, we used to store some additional stuff past the
end of the tuple. The layout was:
<tuple> <length of 1st string buffer>,<length of 2nd string buffer>, etc

The rationale for this is that we want to reuse the buffer for min/max and grow the buffer
more quickly for group_concat. This breaks down the abstraction between agg-expr and
agg-node and is not something UDAs can use in general. Rather than try to hack around
this, I think the proper solution is to the intermediate type not be StringValue and
to contain the buffer length itself.

This patch also resurrects the distinct estimate code. The distinct estimate functions
exercise all of the code paths.

Change-Id: Ic152a2cd03bc1713967673681e1e6204dcd80346
Reviewed-on: http://gerrit.ent.cloudera.com:8080/564
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:53:13 -08:00
Marcel Kornacker
d85b90cb22 SlotDescriptor.label plus repartitioning for inserts when column stats are missing. 2014-01-08 10:51:56 -08:00
Alan Choi
2d25f11ec3 IMPALA-91 new explain plan output 2014-01-08 10:50:10 -08:00
Marcel Kornacker
0c36c7f327 Partitioned merge aggregation. 2014-01-08 10:48:59 -08:00
ishaan
09d6d931f4 Change the way data is loaded 2014-01-08 10:48:09 -08:00
Alan Choi
476a665763 IMP-620: print number of scanned partition and total scaned bytes 2014-01-08 10:46:57 -08:00
Marcel Kornacker
fd77f06f15 Moving functional-newplanner back to functional-planner (and renaming NewPlanner to Planner) 2014-01-08 10:46:20 -08:00
Marcel Kornacker
04d12f03fc cleaning up logging output 2014-01-08 10:44:28 -08:00
Lenni Kuff
04edc8f534 Update benchmark tests to run against generic workload, data loading with scale factor, +more
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:

./run-benchmark --workloads=hive-benchmark,tpch

We lookup the workload in the workloads directory, then read the associated
query .test files and start executing them.

To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/* to the workloads
directory. You do NOT need to look through all the .test files, I've just moved
them. The one new file is the 'hive-benchmark.test' which contains the hive
benchmark queries.

Also added support for generating schema for different scale factors as well as
executing against these scale factors. For example, let's say we have a dataset
with a scale factor called "SF1". We would first generate the schema using:

./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables with a unique names from the other scale factors.

Run the generated .sql file to load the data. Alternatively, the data can loaded
by running a new python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: load-data.sh -w tpch -e core -s SF3

Then run against this:
./run-benchmark --workloads=<workload> --scale_factor=SF3

This changeset also includes a few other minor tweaks to some of the test
scripts.

Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6
2014-01-08 10:44:22 -08:00