31 Commits

Author SHA1 Message Date
Alex Behm
70d7ff07af CDH-19856: Disable Hive's stats autogathering.
Change-Id: I04e91f91d29b7863848a750e362c9d94469df7f2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3156
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3169
2014-06-19 16:48:34 -07:00
Alex Behm
ef6705d7e0 Rename MergeNode to UnionNode.
Change-Id: I9e3675a103757db1345b04bd1d102d2719efddd0
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3128
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3154
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-06-19 12:44:21 -07:00
Alex Behm
4be9611474 Temporarily disable insert planner tests (CDH-19856).
Change-Id: Ibcf914b87fb0ae958c5039a7cd2e8be72aa4295e
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3110
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-06-17 23:34:07 -07:00
Alex Behm
eed829f778 Fix misleading test to unblock full data loading.
Change-Id: I98c218188a0cf459cacb96363e7a65ebb4525f07
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3100
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-06-17 17:45:04 -07:00
Nong Li
5d903efca3 ExecSummary
The runtime profile as we present it is not very useful, and I think its structure
makes it hard to consume. This patch adds a new client-facing set of counters with a
fixed schema, collected from the runtime profiles. For example, with this structure
it would be easy to have the shell get the stats of a running query and print a useful
progress report, or to check the most relevant metrics for diagnosing issues.

Here's an example of the output for one of the tpch queries:
Operator              #Hosts   Avg Time   Max Time    #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail
------------------------------------------------------------------------------------------------------------------------
09:MERGING-EXCHANGE        1   79.738us   79.738us        5           5         0        -1.00 B  UNPARTITIONED
05:TOP-N                   3   84.693us   88.810us        5           5  12.00 KB       120.00 B
04:AGGREGATE               3    5.263ms    6.432ms        5           5  44.00 KB       10.00 MB  MERGE FINALIZE
08:AGGREGATE               3   16.659ms   27.444ms   52.52K     600.12K   3.20 MB       15.11 MB  MERGE
07:EXCHANGE                3    2.644ms      5.1ms   52.52K     600.12K         0              0  HASH(o_orderpriority)
03:AGGREGATE               3  342.913ms  966.291ms   52.52K     600.12K  10.80 MB       15.11 MB
02:HASH JOIN               3    2s165ms    2s171ms  144.87K     600.12K  13.63 MB      941.01 KB  INNER JOIN, BROADCAST
|--06:EXCHANGE             3    8.296ms    8.692ms   57.22K      15.00K         0              0  BROADCAST
|  01:SCAN HDFS            2    1s412ms    1s978ms   57.22K      15.00K  24.21 MB      176.00 MB  tpch.orders o
00:SCAN HDFS               3    8s032ms    8s558ms    3.79M     600.12K  32.29 MB      264.00 MB  tpch.lineitem l
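
As an illustration of how a client could consume this output, here is a minimal sketch that
parses summary rows into dictionaries. It assumes the column layout of the sample above; the
name parse_exec_summary and the regular expressions are hypothetical and not part of this patch.

```python
import re

# Memory cells in the sample look like "0", "12.00 KB" or "-1.00 B" (assumption).
_MEM = r"-?\d+(?:\.\d+)?(?: [KMGT]?B)?"

_ROW = re.compile(
    r"^(?P<operator>.+?)\s+(?P<hosts>\d+)"
    r"\s+(?P<avg_time>\S+)\s+(?P<max_time>\S+)"
    r"\s+(?P<rows>\S+)\s+(?P<est_rows>\S+)"
    rf"\s+(?P<peak_mem>{_MEM})\s+(?P<est_peak_mem>{_MEM})"
    r"(?:\s+(?P<detail>.*\S))?\s*$"
)


def parse_exec_summary(text):
    """Return one dict per operator row; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = _ROW.match(line.rstrip())
        if match is None:
            continue  # header, dashed separator, or blank line
        row = match.groupdict()
        # Drop the tree-drawing prefix such as "|--" or "|  ".
        row["operator"] = row["operator"].lstrip("|- ")
        row["hosts"] = int(row["hosts"])
        rows.append(row)
    return rows


if __name__ == "__main__":
    sample = (
        "02:HASH JOIN               3    2s165ms    2s171ms  144.87K"
        "     600.12K  13.63 MB      941.01 KB  INNER JOIN, BROADCAST\n"
    )
    for parsed in parse_exec_summary(sample):
        print(parsed["operator"], parsed["hosts"], parsed["detail"])
```

Times, row counts and memory sizes are kept as strings here; converting them to numbers
would need unit handling that this sketch leaves out.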

Change-Id: Iaad4b9dd577c375006313f19442bee6d3e27246a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2964
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-06-11 03:10:11 -07:00
Alex Behm
47c52ade84 IMPALA-866: Make HdfsScanNode.computeStats() idempotent with respect to totalBytes_.
Change-Id: I1c243b089db82c0544586a2a1428081aa2dbcd52
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1844
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1852
2014-03-11 18:20:15 -07:00
Nong Li
0d2919fe7f Refactor scalar and aggregate function analysis and execution.
This patch cleans up analysis and execution of scalar and aggregate functions
so that there is no difference between how builtins and user functions are
handled. The only difference is that the catalog is populated with the builtins
all the time.

The BE always gets a TFunction object and just executes it (builtins will have
an empty hdfs file location).

This removes the opcode registry and all of the functionality is subsumed by
the catalog, most of which was already duplicated there anyway.

This also introduces the concept of a system database: a database that the
user cannot modify and that is populated automatically on startup.

Change-Id: Iaa3f84dad0a1a57691f5c7d8df7305faf01d70ed
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1386
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1577
2014-02-18 18:40:08 -08:00
Alex Behm
6799c93922 Simplified/enhanced explain plans with a total of four explain levels.
There are now 4 explain levels summarized as follows:
- Level 0: MINIMAL
  Non-fragmented parallel plan only showing plan nodes with minimal attributes
- Level 1: STANDARD
  Non-fragmented parallel plan with some details in plan nodes
- Level 2: EXTENDED
  Non-fragmented parallel plan with full details in plan nodes including
  the table/column stats, row size, #hosts, cardinality,
  and estimated per-host memory requirement
- Level 3: VERBOSE
  Fragmented parallel plan with full details (like level 2)
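
The four levels above can be modeled as an ordered enumeration. This is only an illustrative
sketch: the names ExplainLevel, is_fragmented and shows_full_details are hypothetical and do
not mirror the actual classes touched by this patch.

```python
from enum import IntEnum


class ExplainLevel(IntEnum):
    """Explain levels as listed in the commit message (numeric values from the list)."""
    MINIMAL = 0
    STANDARD = 1
    EXTENDED = 2
    VERBOSE = 3


def is_fragmented(level: ExplainLevel) -> bool:
    # Only level 3 is described as showing the fragmented plan.
    return level == ExplainLevel.VERBOSE


def shows_full_details(level: ExplainLevel) -> bool:
    # Levels 2 and 3 are described as showing full plan-node details.
    return level >= ExplainLevel.EXTENDED


if __name__ == "__main__":
    for level in ExplainLevel:
        print(level.name, is_fragmented(level), shows_full_details(level))
```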

This patch also includes several bugfixes related to plan costing and/or
testing of explain plans.

Change-Id: I622310f01d1b3d53ea1031adaf3b3ffdd94eba30
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1211
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-01-10 19:17:59 -08:00
Alex Behm
a94f1efe29 Cleanup related to [shuffle]/[noshuffle] insert hints.
Change-Id: I360c8fdc9be4346d148c2daadc36e90072b42b29
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1071
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:28 -08:00
Marcel Kornacker
fd182201bc Predicate propagation plus join order optimization.
Introduces STRAIGHT_JOIN keyword to prevent join order optimization.

Structural changes to the planning framework:
- slot materialization: the decision whether to materialize a slot now happens *prior* to
  plan generation. This is needed in order to be able to generate accurate cost estimates
  at plan generation time. See QueryStmt.materializeRequiredSlots().
- added PlanNode.init(), which initializes the entire state of a PlanNode; this subsumes
  finalize()
  * computeMemLayout() now happens per-tuple in the corresponding ScanNode's init()
  * init() calls computeStats() by default; also marks slots as materialized and calls
    TupleDescriptor.computeMemLayout()
- added PlanNode.tblRefIds_
- restructured UnionStmt and union plan generation to fit pred propagation model:
  all tuples are created (and equiv predicates registered) prior to plan generation
- added Expr.isAuxExpr

Change-Id: I475c1645bfca9e84ae6e5f529e7781d9532e5c9a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/955
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:24 -08:00
Alex Behm
2325c8c923 Added [shuffle]/[noshuffle] plan hints for forcing/preventing repartitioning before an insert.
Change-Id: I0647366815f4488cabbcb1fc7bc3cf851960c44e
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1007
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:16 -08:00
Alex Behm
3f54240fed PlannerTest uses explain level 'normal'. Only add stats and costs to explain output in 'verbose' mode.
Change-Id: I827b4c7085b5aa2dc5521f8748d8973178f43f4c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/678
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
2014-01-08 10:53:23 -08:00
Alex Behm
4bb8b38cde Added stats and cost estimates to explain output.
Change-Id: I1273745a439fd25cefa4e08ecc075c98cc8bfc45
Reviewed-on: http://gerrit.ent.cloudera.com:8080/602
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:53:22 -08:00
Nong Li
15db34e356 AggregationNode refactoring
This patch redoes how the aggregation node is implemented. The functionality is
now split between aggregation-node, agg-expr and aggregate-functions. This is a work in
progress (there's still a lot of debug stuff I added that needs to be cleaned up) but
it does pass the tests.

Aggregation-node is now very simple and only deals with the grouping part.
Aggregate-expr serves as the glue between the agg node and the aggregate functions.
The aggregation functions are implemented with the UDA interface. I've reimplemented
our existing aggregate functions with this setup. For true UDAs, the binaries would be
loaded in aggregate-expr.

This also includes some preliminary changes in the FE. We now need to annotate each
AggNode as executing the update vs. merge phase (root aggs execute update, others
execute merge) and if it needs a finalize step (only the root does). This is more
general than our builtins which are too simple to need this structure.
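
To make the update, merge and finalize phases concrete, here is a minimal sketch using an
average aggregate. It is an assumption-laden illustration, not the UDA interface itself: the
AvgState type and the init/update/merge/finalize function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AvgState:
    """Intermediate state; note it differs from both the input and the output type."""
    total: float = 0.0
    count: int = 0


def init() -> AvgState:
    return AvgState()


def update(state: AvgState, value: Optional[float]) -> None:
    """Update phase: fold one input value into the state (NULLs are ignored)."""
    if value is not None:
        state.total += value
        state.count += 1


def merge(dst: AvgState, src: AvgState) -> None:
    """Merge phase: combine two intermediate states produced elsewhere."""
    dst.total += src.total
    dst.count += src.count


def finalize(state: AvgState) -> Optional[float]:
    """Finalize step: turn the intermediate state into the output value."""
    return state.total / state.count if state.count else None


if __name__ == "__main__":
    partials = []
    for chunk in ([1.0, 2.0], [3.0, 4.0]):
        state = init()
        for value in chunk:
            update(state, value)
        partials.append(state)
    merged = init()
    for partial in partials:
        merge(merged, partial)
    print(finalize(merged))
```

An average is used because its intermediate state (sum and count) cannot be expressed as the
output type, which is the situation the following paragraph discusses.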

There is a big TODO here to allow the intermediate types between agg nodes to change.
For example, in distinct estimate, the input type is the column type and the output type
is a bigint. We'd like the intermediate type to be CHAR(256). This is new, since until
now the intermediate type and output type have always been the same. We've hacked
around this by having both the intermediate and output type be TYPE_STRING. I've left
this for another patch (changing the BE to support this is trivial).

For aggregates that result in strings, we used to store some additional data past the
end of the tuple. The layout was:
<tuple> <length of 1st string buffer>,<length of 2nd string buffer>, etc

The rationale for this is that we want to reuse the buffer for min/max and grow the buffer
more quickly for group_concat. This breaks the abstraction between agg-expr and
agg-node and is not something UDAs can use in general. Rather than try to hack around
this, I think the proper solution is for the intermediate type not to be StringValue and
to contain the buffer length itself.

This patch also resurrects the distinct estimate code. The distinct estimate functions
exercise all of the code paths.

Change-Id: Ic152a2cd03bc1713967673681e1e6204dcd80346
Reviewed-on: http://gerrit.ent.cloudera.com:8080/564
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:53:13 -08:00
Alex Behm
9065648d77 Improvements to cost estimation and explain output.
Fixed cost estimation of union queries and exchange nodes.
Fixed propagation of stats through cloning of exprs and plan nodes.
Fixed propagation of expr stats to slots they are materialized into (e.g., grouping columns in multi-level aggs).
Improved explain output for constant selects.

Change-Id: I96d1652c00d48e4093b85ae7fc8bad28d74b8b81
Reviewed-on: http://gerrit.ent.cloudera.com:8080/547
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:53:08 -08:00
Alex Behm
e52ed0800b IMPALA-524: Fix computation of stats for ExchangeNode and merge AggregationNodes. The issue caused unnecessary repartitioning for static partition insert queries having grouped aggregation in the feeding query stmt.
Change-Id: I5f4017e2c4d5a1bf88f51c4e0ff7ab28911e14f1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/202
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:11 -08:00
Marcel Kornacker
d85b90cb22 SlotDescriptor.label plus repartitioning for inserts when column stats are missing. 2014-01-08 10:51:56 -08:00
Alex Behm
937a44f9f8 IMPALA-68: Support Values() statement. 2014-01-08 10:50:31 -08:00
Alex Behm
c7819f4db7 IMPALA-87: Support INSERT from SELECT without FROM. 2014-01-08 10:50:30 -08:00
Alan Choi
2d25f11ec3 IMPALA-91 new explain plan output 2014-01-08 10:50:10 -08:00
Marcel Kornacker
21ec49e810 IMPALA-150: Performing dynamic partition insert via Impala on "large" table fails and takes down HDFS
This is solved by repartitioning the input to the hdfs table sinks on the partition key columns of the hdfs
table, so that each partition is only written by a single node.
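
The idea can be sketched as hash-partitioning rows on the partition key columns, so that all
rows of one partition land on the same writer. This is a simplified illustration only: the
function names and the use of CRC32 as the hash are assumptions, not what the planner does.

```python
import zlib
from collections import defaultdict


def writer_for_partition(partition_key, num_nodes):
    """Map a tuple of partition key values to a node index in a stable way."""
    encoded = "\x00".join(str(value) for value in partition_key).encode("utf-8")
    return zlib.crc32(encoded) % num_nodes


def repartition(rows, key_columns, num_nodes):
    """Group rows by the node that will write their target partition."""
    by_node = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        by_node[writer_for_partition(key, num_nodes)].append(row)
    return dict(by_node)


if __name__ == "__main__":
    rows = [
        {"year": 2013, "month": 12, "v": 1},
        {"year": 2014, "month": 1, "v": 2},
        {"year": 2013, "month": 12, "v": 3},
    ]
    for node, assigned in sorted(repartition(rows, ["year", "month"], 3).items()):
        print(node, assigned)
```

Because the node index depends only on the key values, no partition is written by more than
one node, although a single node may still write several partitions.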
2014-01-08 10:50:07 -08:00
ishaan
09d6d931f4 Change the way data is loaded 2014-01-08 10:48:09 -08:00
Henry Robinson
15228f945f IMP-503: INSERTS into unpartitioned tables should be checked for union compatibility 2014-01-08 10:46:57 -08:00
Alan Choi
476a665763 IMP-620: print number of scanned partitions and total scanned bytes 2014-01-08 10:46:57 -08:00
Michael Ubell
0750384b41 IMP-497 Insert with limit, remove extra files from test. 2014-01-08 10:46:33 -08:00
Michael Ubell
116241f1d1 IMP-497 Insert with limit. 2014-01-08 10:46:33 -08:00
Marcel Kornacker
fd77f06f15 Moving functional-newplanner back to functional-planner (and renaming NewPlanner to Planner) 2014-01-08 10:46:20 -08:00
Henry Robinson
e7348a209b IMP-232: Parallel INSERT OVERWRITE 2014-01-08 10:45:04 -08:00
Henry Robinson
c472213eeb Parallel INSERT, sink-per-scan-node plan 2014-01-08 10:44:35 -08:00
Marcel Kornacker
04d12f03fc cleaning up logging output 2014-01-08 10:44:28 -08:00
Lenni Kuff
04edc8f534 Update benchmark tests to run against generic workload, data loading with scale factor, +more
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:

./run-benchmark --workloads=hive-benchmark,tpch

We look up the workload in the workloads directory, then read the associated
query .test files and start executing them.

To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/*) to the workloads
directory. You do NOT need to look through all the .test files, I've just moved
them. The one new file is the 'hive-benchmark.test' which contains the hive
benchmark queries.

Also added support for generating schema for different scale factors as well as
executing against these scale factors. For example, let's say we have a dataset
with a scale factor called "SF3". We would first generate the schema using:

./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables with names that are unique across scale factors.

Run the generated .sql file to load the data. Alternatively, the data can be loaded
by running a new Python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: ./bin/load-data.py -w tpch -e core -s SF3

Then run against this:
./run-benchmark --workloads=<workload> --scale_factor=SF3

This changeset also includes a few other minor tweaks to some of the test
scripts.

Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6
2014-01-08 10:44:22 -08:00