The issue: Computing the full transitive closure for all slots can be very
expensive (10s of seconds for >2k slots, minutes for >4k slots).
Queries with many views and/or unions were affected most because each
union/view adds a new tuple with slots, increasing the total number of slots.
The fix: The new algorithm exploits the sparse structure of the value transfer
graph for a significant speedup (>100x). The high-level steps are:
1. Identify complete subgraphs based on bi-directional value transfers, and
coalesce the slots of each complete subgraph into a single slot.
2. Map the remaining uni-directional value transfers into the new slot domain.
3. Identify the connected components of the uni-directional value transfers.
This step partitions the value transfers into disjoint sets.
4. Compute the transitive closure of each partition from (3) in the new slot
domain separately. Hopefully, the partitions are small enough to afford
the O(N^3) complexity of the brute-force transitive closure computation.
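For illustration, a compact sketch of the four steps with toy types (all names
are hypothetical, not Impala's actual analyzer classes; slots are ints in
[0, n) and a value transfer is a directed {from, to} edge):

  import java.util.*;

  public final class SparseTransitiveClosure {
    static final class UnionFind {
      final int[] p;
      UnionFind(int n) { p = new int[n]; for (int i = 0; i < n; ++i) p[i] = i; }
      int find(int x) { while (p[x] != x) { p[x] = p[p[x]]; x = p[x]; } return x; }
      void union(int a, int b) { p[find(a)] = find(b); }
    }

    public static boolean[][] compute(int n, List<int[]> transfers) {
      // Step 1: coalesce slots connected by bi-directional transfers.
      Set<Long> edges = new HashSet<>();
      for (int[] t : transfers) edges.add((long) t[0] << 32 | t[1]);
      UnionFind groups = new UnionFind(n);
      for (int[] t : transfers) {
        if (edges.contains((long) t[1] << 32 | t[0])) groups.union(t[0], t[1]);
      }

      // Step 2: map the remaining uni-directional transfers onto the
      // coalesced slot domain.
      Set<Long> condensed = new HashSet<>();
      for (int[] t : transfers) {
        int a = groups.find(t[0]), b = groups.find(t[1]);
        if (a != b) condensed.add((long) a << 32 | b);
      }

      // Step 3: partition the condensed graph into connected components.
      UnionFind comps = new UnionFind(n);
      for (long e : condensed) comps.union((int) (e >>> 32), (int) e);
      Map<Integer, List<Integer>> partitions = new HashMap<>();
      for (int i = 0; i < n; ++i) {
        if (groups.find(i) == i) {
          partitions.computeIfAbsent(comps.find(i), k -> new ArrayList<>()).add(i);
        }
      }

      // Step 4: brute-force O(N^3) closure, but only within each partition.
      boolean[][] closure = new boolean[n][n];
      for (long e : condensed) closure[(int) (e >>> 32)][(int) e] = true;
      for (List<Integer> part : partitions.values()) {
        for (int k : part) for (int i : part) for (int j : part) {
          if (closure[i][k] && closure[k][j]) closure[i][j] = true;
        }
      }
      return closure;
    }
  }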
Change-Id: I35b57295d8f04b92f00ac48c04d1ef1be4daf41b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2360
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
This patch simplifies the complex slot materialization logic for unions by
making the materialization independent of conjuncts assigned to MergeNodes.
When 'pushing down' predicates into union operands, we drop union operands
with constant predicates evaluating to false. Constant predicates that
evaluate to true are simply ignored.
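Roughly, the pruning rule looks like this sketch (toy types standing in for
Impala's Expr conjuncts):

  import java.util.*;

  // A conjunct that is constant after being pushed into an operand either
  // eliminates the operand (false) or is discarded (true).
  public final class UnionOperandPruning {
    interface Conjunct { boolean isConstant(); boolean constantValue(); }
    static final class Operand { final List<Conjunct> conjuncts = new ArrayList<>(); }

    static List<Operand> prune(List<Operand> operands) {
      List<Operand> kept = new ArrayList<>();
      for (Operand op : operands) {
        boolean alwaysFalse = op.conjuncts.stream()
            .anyMatch(c -> c.isConstant() && !c.constantValue());
        if (alwaysFalse) continue;  // drop the whole operand
        op.conjuncts.removeIf(c -> c.isConstant() && c.constantValue());  // ignore TRUE
        kept.add(op);
      }
      return kept;
    }
  }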
Change-Id: I0e7ccfb206bed29db2b5d667e2bb61310980e80a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2327
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
The reported issue is that we can have redundant hash expressions in exchanges.
The underlying cause is that we fail to remove redundant join predicates.
This patch enforces slot equivalences based on our computed equivalence classes
at the lowest possible plan node by generating new equality predicates.
Each plan subtree now has a minimal set of equality predicates that express
all known equivalences between slots belonging to tuples materialized at that
plan node.
As a result, eliminating redundant join predicates becomes trivial: It is
sufficient to pick a single representative predicate of each relevant equivalence
class. All predicates beyond that are redundant.
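Sketched with toy types (the equivClass array stands in for the planner's
computed equivalence classes; all names are hypothetical):

  import java.util.*;

  // Among equality predicates whose two sides fall into the same
  // equivalence class, one representative per class suffices.
  public final class JoinPredicateDedup {
    static final class EqPredicate {
      final int lhsSlot, rhsSlot;
      EqPredicate(int lhs, int rhs) { lhsSlot = lhs; rhsSlot = rhs; }
    }

    static List<EqPredicate> dedup(List<EqPredicate> candidates, int[] equivClass) {
      Set<Integer> covered = new HashSet<>();
      List<EqPredicate> kept = new ArrayList<>();
      for (EqPredicate p : candidates) {
        int lhsClass = equivClass[p.lhsSlot], rhsClass = equivClass[p.rhsSlot];
        // Predicates joining two different classes are never redundant here.
        if (lhsClass == rhsClass && !covered.add(lhsClass)) continue;
        kept.add(p);
      }
      return kept;
    }
  }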
Change-Id: I7998fe8d7bdf84cc8eb129d32c86269bedeab68e
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2177
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2278
To keep the predicate assignment/propagation logic simple, we assign conjuncts
whose underlying base table exprs are constant in at least one union operand
to the evaluating MergeNode, rather than into the operand(s) whose corresponding
base table exprs are constant.
The JIRA describes two different bugs:
The first bug was that the slots required for evaluating such predicates in the
MergeNode were not marked as materialized. The second bug was that predicates
'pushed' into union operands did not get re-analyzed after substituting the
predicate's exprs with the result exprs of that union operand. The missing
casts led to a crash. The new test covers both bugs.
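The second fix boils down to this pattern (hypothetical Expr/Analyzer
interfaces, not the real FE API):

  import java.util.Map;

  // After substituting a predicate's exprs with a union operand's result
  // exprs, re-analysis must run so that implicit casts are inserted again;
  // skipping it was bug 2 and crashed the backend.
  public final class PushdownSketch {
    interface Expr { Expr substitute(Map<String, Expr> smap); }
    interface Analyzer { Expr analyze(Expr e); }  // inserts implicit casts

    static Expr pushIntoOperand(Expr predicate, Map<String, Expr> operandSmap,
        Analyzer analyzer) {
      Expr substituted = predicate.substitute(operandSmap);
      return analyzer.analyze(substituted);  // the missing re-analysis step
    }
  }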
Change-Id: I0f5b8a366b32f7d4b2587e13793b6103cdf7e8b3
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2162
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
The bug: It is generally incorrect to re-order joins across outer/semi joins.
For example, an inner join following an outer join may reduce the cardinality,
so moving the inner join before the outer join during join re-ordering
would be incorrect: the outer join preserves rows (on one or both sides),
so the reordered plan can produce a different result. The same argument holds
for semi joins.
The fix: Place outer and semi joins at a fixed position in the plan based
on where they appeared in the original query. Inner joins to the left/right of
outer/semi joins are still re-ordered properly.
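In sketch form (toy types; sorting by cardinality is only a stand-in for the
actual re-ordering heuristic):

  import java.util.*;

  // Outer/semi joins are barriers that keep the position the query gave
  // them; only runs of inner joins between barriers are re-ordered.
  public final class JoinPlacement {
    static final class TableRef {
      final boolean isOuterOrSemiJoined;
      final long cardinality;
      TableRef(boolean barrier, long card) { isOuterOrSemiJoined = barrier; cardinality = card; }
    }

    static List<TableRef> reorder(List<TableRef> refs) {
      List<TableRef> plan = new ArrayList<>();
      List<TableRef> run = new ArrayList<>();  // consecutive inner-joined refs
      for (TableRef ref : refs) {
        if (ref.isOuterOrSemiJoined) {
          run.sort(Comparator.comparingLong(r -> r.cardinality));
          plan.addAll(run);
          run.clear();
          plan.add(ref);  // fixed position
        } else {
          run.add(ref);
        }
      }
      run.sort(Comparator.comparingLong(r -> r.cardinality));
      plan.addAll(run);
      return plan;
    }
  }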
Change-Id: Idae837097b9376473d7f8124eef69b51f612b210
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1909
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1922
count(x) with no distinct and no group-by expressions returns NULL on empty input
if other distinct aggs (e.g., COUNT(DISTINCT x)) are present.
This happens because the COUNT is transformed to SUM(COUNT()),
with the inner COUNT being evaluated WITH a group-by expression (e.g. x).
SUM over empty input returns NULL, but COUNT should return 0.
This patch fixes this by replacing COUNT with zeroifnull(COUNT) before AggregateInfo
is generated if there are distinct aggs and no group-bys. The logic in AggregateInfo
itself has not been modified.
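The rewrite, sketched with toy expression types (hypothetical names):

  import java.util.*;

  // With distinct aggs present and no GROUP BY, a plain COUNT is wrapped
  // in zeroifnull() before AggregateInfo is generated, so the outer
  // SUM(COUNT(...)) yields 0 on empty input instead of NULL.
  public final class CountRewrite {
    interface Expr {}
    static final class AggExpr implements Expr {
      final String fn; final boolean isDistinct;
      AggExpr(String fn, boolean isDistinct) { this.fn = fn; this.isDistinct = isDistinct; }
    }
    static final class ZeroIfNull implements Expr {
      final Expr child;
      ZeroIfNull(Expr child) { this.child = child; }
    }

    static List<Expr> rewrite(List<Expr> aggExprs, boolean hasGroupBy) {
      boolean hasDistinct = aggExprs.stream()
          .anyMatch(e -> e instanceof AggExpr && ((AggExpr) e).isDistinct);
      if (!hasDistinct || hasGroupBy) return aggExprs;  // rewrite not needed
      List<Expr> out = new ArrayList<>();
      for (Expr e : aggExprs) {
        boolean plainCount = e instanceof AggExpr
            && ((AggExpr) e).fn.equals("count") && !((AggExpr) e).isDistinct;
        out.add(plainCount ? new ZeroIfNull(e) : e);
      }
      return out;
    }
  }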
Change-Id: I902e3fdd95767135b2f3fe423e8802ef57366af1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1921
Reviewed-by: Srinath Shankar <sshankar@cloudera.com>
Tested-by: jenkins
The bug: Slot materialization on distinct aggs inside an inline view did not work
if the only reference to the 2nd-phase agg-tuple slots was in a predicate from an
outer query block (e.g., Where-clause of the block with the inline view ref).
The reason was that bound predicates were fetched from the wrong tuple
(from the 1st phase agg).
The fix: Assign predicates to the top-most agg in the single-node plan that can
evaluate them, as follows: For non-distinct aggs place them in the 1st phase agg
node. For distinct aggs place them in the 2nd phase agg node.
Change-Id: I0f6ab53cf7bb0c6aed9524ad2e24a849d2dc0ec4
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1843
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1881
The bug: The slots of multi-slot predicates bound to a single
outer-joined tuple were not marked as materialized.
In addition, such predicates were not picked up by nodes
under the join via getBoundPredicates() even if it
would be correct to do so.
The fix: Always mark slots of predicates that must be
evaluated by a join in SelectStmt.materializeRequiredSlots(),
regardless of whether the predicate can also be safely
evaluated below the join.
This patch also generalizes getBoundPredicates() to handle
multi-slot predicates and fixes some issues with redundant
predicate assignment. Still, the new approach has several
limitations, which are documented in the predicate propagation
planner test to ease future improvements.
Change-Id: If5da0354a83c00a9766fc63b7780ed4d5a9c46e5
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1717
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1819
The bug was that the number of materialized agg-tuple slots did not correspond to the number
of materialized agg functions, due to binding predicates against an AggNode causing slot
materialization after SelectStmt.materializeRequiredSlots().
This patch fixes the issue by taking binding predicates (bound to a slot in an agg tuple)
into consideration in SelectStmt.materializeRequiredSlots().
A new sanity check I added in AggregationNode.toThrift() surfaced another issue with slot
materialization that is also fixed in this patch: the ordering exprs must be marked before
the agg exprs in SelectStmt.materializeRequiredSlots() because the ordering exprs may contain
agg exprs that are only referenced inside the ORDER BY clause.
Change-Id: I1bdc0466f583907bed625ce6608938e59faee83f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1639
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1818
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Having predicates need to be transferred to the 2nd phase merge agg
for distinct + non-distinct aggregates without group by.
For distinct + non-distinct aggregates with group by, it is correct
to evaluate the predicates at the 2nd phase (non-merge) agg.
Change-Id: I71d73c4ef92becbb81e142bc0cb5f54e790b1fb5
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1743
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1817
This patch includes several changes to predicate assignment and propagation.
First, we now only register as outer joined those tuples of TableRefs
directly participating in an outer join. In particular,
materialized tuples referenced inside an outer-joined InlineView are not
registered as outer joined - only the InlineView's tuple is registered.
The other major change is that we detect when it is correct to propagate
predicates to scan nodes participating (directly or indirectly) in an outer
join by testing whether a predicate can evaluate to true when the tuple
is NULL. If it can, it is generally not safe to propagate the predicate,
because evaluating it below the join would change the final result of the
outer join.
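Sketched with a stand-in for constant folding (hypothetical types):

  import java.util.Set;
  import java.util.function.Function;

  // Before propagating a predicate below an outer join, evaluate it with
  // the outer-joined tuple's slots replaced by NULL. If the predicate can
  // still come out true, pushing it down would drop rows the outer join
  // is supposed to null-extend.
  public final class NullSafetyCheck {
    enum Tristate { TRUE, FALSE, NULL }

    // evalWithNulls stands in for constant folding after substituting
    // NULL for the given slots of the predicate.
    static boolean safeToPropagate(
        Function<Set<Integer>, Tristate> evalWithNulls, Set<Integer> outerJoinedSlots) {
      Tristate result = evalWithNulls.apply(outerJoinedSlots);
      return result == Tristate.FALSE || result == Tristate.NULL;  // never true
    }
  }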
Change-Id: Ia135ab15ec8c6ef756a908f797f96812d28c84c1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1567
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1606
Our code for eliminating redundant join predicates based on equivalence
classes is not quite right. I've commented out the relevant code
to ensure we don't incorrectly remove predicates. Left a TODO
to fix and re-enable this feature.
Change-Id: Ie76b365903dff6df271a378cbb4fd327ffa0631f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1569
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1572
This patch cleans up analysis and execution of scalar and aggregate functions
so that there is no difference between how builtins and user functions are
handled. The only difference is that the catalog is populated with the builtins
all the time.
The BE always gets a TFunction object and just executes it (builtins will have
an empty hdfs file location).
This removes the opcode registry; its functionality is subsumed by the catalog,
where most of it was already duplicated anyway.
This also introduces the concept of a system database: a database that the
user cannot modify and that is populated automatically on startup.
Change-Id: Iaa3f84dad0a1a57691f5c7d8df7305faf01d70ed
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1386
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1577
The overall goal of this change is to allow table metadata to be loaded in the background,
but also to allow prioritization of loading on an as-needed basis. As part of analysis,
any tables that are not loaded are tracked, and if analysis fails the Impalad will make
an RPC to the CatalogServer to request that the metadata loading of these tables be
prioritized, and analysis will be restarted.
To support this, the CatalogServer now has a deque of the tables to load. For
background loading, tables to load are added to the tail of the deque. However, a new
CatalogServer RPC was added that can prioritize the loading of one or more tables in
which case they will get added to the head of the deque. The next table to load is
always taken from the head. This helps prioritize loading but is admittedly not the most
fair approach.
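A sketch of the deque discipline (hypothetical names, using a standard
LinkedBlockingDeque):

  import java.util.concurrent.LinkedBlockingDeque;

  // Background loading appends to the tail, the prioritization RPC pushes
  // to the head, and the loader threads always take from the head.
  public final class TableLoadQueue {
    private final LinkedBlockingDeque<String> deque = new LinkedBlockingDeque<>();

    void enqueueBackground(String tableName) { deque.addLast(tableName); }

    // Called by the new CatalogServer RPC for tables analysis is waiting on.
    void prioritize(String tableName) {
      deque.remove(tableName);  // avoid queueing the same table twice
      deque.addFirst(tableName);
    }

    String nextTableToLoad() throws InterruptedException {
      return deque.takeFirst();  // blocks until a table is queued
    }
  }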
To support the prioritized loading, some changes had to be made on the Impalad side during
analysis:
- During analysis, any tables that are missing metadata are tracked.
- Analysis now runs in a loop. If it fails due to an AnalysisException AND at least one
table/view was missing metadata, the missing tables are requested to be loaded by
calling the CatalogServer.
- The impalad waits until the required tables are received (it is notified each time
there is a call to updateCatalog()) and defers re-running analysis until all tables
are available. Once they are, analysis restarts.
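The loop, in sketch form (hypothetical interfaces, not the actual Impalad code):

  import java.util.Set;

  public final class AnalysisLoop {
    interface Frontend {
      void analyze(String stmt) throws AnalysisException;
      Set<String> getMissingTables();
      void prioritizeLoad(Set<String> tables);        // RPC to the CatalogServer
      void waitForCatalogUpdate(Set<String> tables);  // notified via updateCatalog()
    }
    static final class AnalysisException extends Exception {}

    static void analyzeWithRetry(Frontend fe, String stmt) throws AnalysisException {
      while (true) {
        try {
          fe.analyze(stmt);
          return;
        } catch (AnalysisException e) {
          Set<String> missing = fe.getMissingTables();
          if (missing.isEmpty()) throw e;  // a real analysis error
          fe.prioritizeLoad(missing);
          fe.waitForCatalogUpdate(missing);
          // loop: re-run analysis now that the metadata is available
        }
      }
    }
  }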
This change also introduces two new flags:
--load_catalog_in_background (bool). When this is true (the default), the catalog server
runs a periodic background thread that queues all unloaded tables for loading. This is
generally the desired behavior, but there may be some cases (very large metastores) where
it needs to be disabled.
--num_metadata_loading_threads (int32). The number of threads to use when loading catalog
metadata (degree of parallelism). The default is 16, but it can be increased to improve
performance at the cost of stressing the Hive metastore/HDFS.
Change-Id: Ib94dbbf66ffcffea8c490f50f5c04d19fb2078ad
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1476
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1538
This change adds support for lazy loading of table metadata to the
CatalogService/Impalad. The way this works is that the CatalogService initially
sends out an update with only the databases and table names (wrapped as
IncompleteTables). When an Impalad encounters one of these tables, it will contact
the catalog service to get the metadata, possibly triggering a metadata load if the
catalog server has not yet loaded this table.
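In sketch form (toy types; IncompleteTable is the wrapper named above, the
rest is hypothetical):

  // A catalog update only carries an IncompleteTable stub; the first
  // statement that touches it fetches the real metadata from the catalog
  // service, which may in turn trigger a metastore load.
  public final class LazyCatalogSketch {
    interface Table {}
    static final class IncompleteTable implements Table {}  // name only, no metadata
    interface CatalogServiceClient { Table getOrLoadTable(String db, String tbl); }

    private final CatalogServiceClient client;
    LazyCatalogSketch(CatalogServiceClient client) { this.client = client; }

    Table resolve(String db, String tbl, Table cached) {
      if (!(cached instanceof IncompleteTable)) return cached;  // already loaded
      return client.getOrLoadTable(db, tbl);  // may trigger a metastore load
    }
  }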
With these changes the catalog server starts up in just seconds, even for large
metastores since it only needs to call into the metastore to get the list of tables
and databases. The performance of "invalidate metadata" also improves for the same reason.
I also picked up the catalog cleanup patch I had pending, to make the APIs a bit more
consistent and to remove the need for using a LoadingCache for databases.
This also fixes up the FE tests to run in a more realistic fashion. The FE tests now run
against catalog objects received from the catalog server. This actually turned up some bugs
in our previous test configuration where we were not running with the correct column stats
(we were always running with avgSerializedSize = slotSize). This changed some plans so the
planner tests needed to be updated.
Still TODO:
This does not include the changes to perform background metadata loading. I will send
that out as a separate patch on top of this.
Change-Id: Ied16f8a7f3a3393e89d6bfea78f0ba708d0ddd0e
Saving changes
Change-Id: I48c34408826b7396004177f5fc61a9523e664acc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1328
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1338
Tested-by: Lenni Kuff <lskuff@cloudera.com>
There are now 4 explain levels summarized as follows:
- Level 0: MINIMAL
Non-fragmented parallel plan only showing plan nodes with minimal attributes
- Level 1: STANDARD
Non-fragmented parallel plan with some details in plan nodes
- Level 2: EXTENDED
Non-fragmented parallel plan with full details in plan nodes including
the table/column stats, row size, #hosts, cardinality,
and estimated per-host memory requirement
- Level 3: VERBOSE
Fragmented parallel plan with full details (like level 2)
This patch also includes several bugfixes related to plan costing and/or
testing of explain plans.
Change-Id: I622310f01d1b3d53ea1031adaf3b3ffdd94eba30
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1211
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
This patch fixes a few row key issues:
1. We used to assert that the row key filter must be a string literal.
However, it can also be a constant function. We need to eval the expr
and then use the result as the start/stop key.
2. Cast(row_key as int) simply failed.
Such an expr should not be transformed into a start/stop key.
3. We used to assert that lower bound < upper bound.
This query:
select * from tbl where row_key > 'b' and row_key < 'a'
would trip the assert. Instead, we should return no rows.
4. Handle NULL predicates.
An HBase row key can't be null, so if either the upper or lower bound is null,
we don't need to return any rows.
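Sketched with string keys (toy types; fix 2 simply means such exprs are never
turned into a range, so it does not appear here, and unbounded sides are
omitted for brevity):

  public final class RowKeyRange {
    final String startKey, stopKey;
    private RowKeyRange(String start, String stop) { startKey = start; stopKey = stop; }

    static final RowKeyRange EMPTY = new RowKeyRange(null, null);  // scan nothing

    // lower/upper are the results of evaluating the constant bound exprs
    // (fix 1), or null if a bound evaluated to NULL (fix 4).
    static RowKeyRange create(String lower, String upper) {
      if (lower == null || upper == null) return EMPTY;  // row keys can't be NULL
      if (lower.compareTo(upper) >= 0) return EMPTY;     // e.g. > 'b' AND < 'a' (fix 3)
      return new RowKeyRange(lower, upper);
    }
  }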
Change-Id: Ia03590a862888b377bf1f48bcb838b99193fa241
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1180
Reviewed-by: Alan Choi <alan@cloudera.com>
Tested-by: jenkins
Cross joins should be handled like outer joins in the join order
optimization in that the right table referenced by a cross join may not
be reordered anywhere before tables referenced to the left of the cross
join. If there are inner joins to the right of the cross join, those
tables may be reordered before the cross join.
E.g., if we have A JOIN B CROSS JOIN C JOIN D, then C must come after A
and B, but D may be reordered to come before C.
Also adds test cases for join order optimization and predicate propagation.
Change-Id: I6b1022dd3e862efbff81e283b43284d846c8eca4
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1096
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Impala now exploits existing partitioning of the lhs and/or rhs of joins. Existing
partitioning affects the cost of choosing between a partitioned and a broadcast join,
as well as the final plan, because Impala can avoid repartitioning the data.
An existing data partition is exploitable iff the lhs/rhs input partition is equivalent
to the target partition of the join. This matching procedure considers equivalence classes.
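The matching test, sketched (hypothetical types):

  import java.util.List;
  import java.util.function.ToIntFunction;

  // An existing input partition is exploitable iff its exprs are pairwise
  // equivalent to the join's target partition exprs, where two exprs are
  // equivalent when they map to the same equivalence class.
  public final class PartitionMatching {
    interface Expr {}

    static boolean isExploitable(List<Expr> inputPartition, List<Expr> targetPartition,
        ToIntFunction<Expr> equivClassOf) {
      if (inputPartition.size() != targetPartition.size()) return false;
      for (int i = 0; i < inputPartition.size(); ++i) {
        if (equivClassOf.applyAsInt(inputPartition.get(i))
            != equivClassOf.applyAsInt(targetPartition.get(i))) {
          return false;  // would require repartitioning
        }
      }
      return true;  // the exchange can be skipped
    }
  }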
Change-Id: Ica080f35cf5063bea828963bc234ba69797e2030
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1070
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Adds the TPCDS queries as planner tests and fixes a few small issues
with the Planner test file parser. This adds the TPC-DS queries using
SQL-92 style joins that have a hand-optimized (although
not perfect) join order.
Change-Id: I2d81e66af740b2d826b8ebd0c5ba8553b5faf0a2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1019
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Adds a CROSS JOIN (cartesian product). Common join code is moved into
a new abstract base class BlockingJoinNode. We must keep all build RowBatches in
memory in order to iterate over them for every row from the left child. The
TupleRowList provides a convenient way to iterate over all of the rows.
A future change will address codegen for the CrossJoinNode.
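In sketch form (toy row types):

  import java.util.ArrayList;
  import java.util.List;

  // The whole build side stays buffered in memory and is re-scanned once
  // per row of the left child, yielding the cartesian product.
  public final class CrossJoinSketch {
    static <L, R> List<Object[]> crossJoin(Iterable<L> leftRows, List<R> buildRows) {
      List<Object[]> out = new ArrayList<>();
      for (L left : leftRows) {
        // This per-probe-row scan over the buffered build rows is what the
        // TupleRowList makes convenient in the real node.
        for (R right : buildRows) out.add(new Object[] { left, right });
      }
      return out;
    }
  }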
Change-Id: I5e0caa6fb4ec802a9c87e700f9dd6238cea8cdf2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/970
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Introduces STRAIGHT_JOIN keyword to prevent join order optimization.
Structural changes to the planning framework:
- slot materialization: the decision whether to materialize a slot now happens *prior* to
plan generation. This is needed in order to be able to generate accurate cost estimates
at plan generation time. see QueryStmt.materializeRequiredSlots()
- added PlanNode.init(), which initializes the entire state of a PlanNode; this subsumes
finalize()
* computeMemLayout() now happens per-tuple in the corresponding ScanNode's init()
* init() calls computeStats() by default; also marks slots as materialized and calls
TupleDescriptor.computeMemLayout()
- added PlanNode.tblRefIds_
- restructured UnionStmt and union plan generation to fit pred propagation model:
all tuples are created (and equiv predicates registered) prior to plan generation
- added Expr.isAuxExpr
Change-Id: I475c1645bfca9e84ae6e5f529e7781d9532e5c9a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/955
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Updates our compute stats script to execute using Impala. This allows us
to easily compute stats on all tables in a database or all tables in the
metastore.
The updated stats caused one of the TPCH plans to change so this also
updates the TPCH planner test results.
Change-Id: I17e5dcd1036a35e40eb4eb2c8e4a20702db9049c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1024
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Adds support for skipping a number of rows with an ORDER BY clause and a LIMIT. Hive
does not support OFFSET so creating a view with an OFFSET will not work in Hive.
For example, "SELECT * FROM T1 ORDER BY ID LIMIT 20 OFFSET 5" will do the sorting, skip
5 rows, then return the next 20. OFFSET requires an ORDER BY clause.
Note this is not very efficient as we must actually keep (limit+offset) rows in memory
in the topn-node, and all child sort nodes must as well. Users should be careful when
using this feature.
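Sketched with a max-heap of size limit + offset (toy code, not the actual
topn-node):

  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;
  import java.util.PriorityQueue;

  // Buffer the (limit + offset) smallest rows in a max-heap, sort them,
  // then discard the first offset rows.
  public final class TopNWithOffset {
    static <T> List<T> evaluate(Iterable<T> rows, Comparator<T> order,
        int limit, int offset) {
      int capacity = limit + offset;  // this is why OFFSET costs memory
      PriorityQueue<T> heap = new PriorityQueue<>(order.reversed());
      for (T row : rows) {
        heap.add(row);
        if (heap.size() > capacity) heap.poll();  // evict the current largest
      }
      List<T> sorted = new ArrayList<>(heap);
      sorted.sort(order);
      // Skip offset rows; at most limit rows remain.
      return sorted.subList(Math.min(offset, sorted.size()), sorted.size());
    }
  }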
Change-Id: I4d7021c278296e7bdbfa0e6f2699cd6f23eef59d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/900
Tested-by: jenkins
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
This patch redoes how the aggregation node is implemented. The functionality is
now split between aggregation-node, agg-expr and aggregate-functions. This is a work in
progress (there's still a lot of debug stuff I added that needs to be cleaned up), but
it does pass the tests.
Aggregation-node is now very simple and only deals with the grouping part.
Aggregate-expr serves as the glue between the agg node and the aggregate functions.
The aggregation functions are implemented with the UDA interface. I've reimplemented
our existing aggregate functions with this setup. For true UDAs, the binaries would be
loaded in aggregate-expr.
This also includes some preliminary changes in the FE. We now need to annotate each
AggNode as executing the update vs. merge phase (root aggs execute update, others
execute merge) and if it needs a finalize step (only the root does). This is more
general than our builtins which are too simple to need this structure.
There is a big TODO here to allow the intermediate types between agg nodes to change.
For example, in distinct estimate, the input type is the column type and the output type
is a bigint. We'd like the intermediate type to be CHAR(256). This is different since
currently, the intermediate type and output type have always been the same. We've hacked
around this by having both the intermediate and output type be TYPE_STRING. I've left
this for another patch (changing the BE to support this is trivial).
For aggregates that result in strings, we used to store some additional stuff past the
end of the tuple. The layout was:
<tuple> <length of 1st string buffer>,<length of 2nd string buffer>, etc
The rationale for this is that we want to reuse the buffer for min/max and grow the buffer
more quickly for group_concat. This breaks down the abstraction between agg-expr and
agg-node and is not something UDAs can use in general. Rather than try to hack around
this, I think the proper solution is for the intermediate type to not be StringValue and
to contain the buffer length itself.
This patch also resurrects the distinct estimate code. The distinct estimate functions
exercise all of the code paths.
Change-Id: Ic152a2cd03bc1713967673681e1e6204dcd80346
Reviewed-on: http://gerrit.ent.cloudera.com:8080/564
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
Fixed cost estimation of union queries and exchange nodes.
Fixed propagation of stats through cloning of exprs and plan nodes.
Fixed propagation of expr stats to slots they are materialized into (e.g., grouping columns in multi-level aggs).
Improved explain output for constant selects.
Change-Id: I96d1652c00d48e4093b85ae7fc8bad28d74b8b81
Reviewed-on: http://gerrit.ent.cloudera.com:8080/547
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>