With the prefetching changes, the probe expressions' local
allocations are no longer freed via QueryMaintenance() in
PHJ. Instead, they are freed explicitly in GetNext() after
an entire probe batch has been processed. Due to this
change in how we handle local allocations of probe expressions,
a DCHECK was added to verify that there is no local allocation
from the probe expression in ProcessBuildInput(). It turns out that
Expr::Open() called in ConstructBuildSide() on the probe
expressions may have caused local allocations to occur for
certain UDFs (e.g. extract()).
This change handles the situation above by freeing local
allocations of the probe expressions once before calling
ProcessBuildInput() in ConstructBuildSide(). A new regression
test is also added for this specific case.
Change-Id: I2096ca3e2093c5ab0ecc0e7ca4cd1b5f3c1ed1ed
Reviewed-on: http://gerrit.cloudera.org:8080/3253
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
Enforces that the planner treats IS NOT DISTINCT FROM as eligible for
hash joins, but does not find the minimum spanning tree of
equivalences for use in optimizing query plans; this is left as future
work.
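
As an illustration only, the semantics that make IS NOT DISTINCT FROM usable
as a hash-join predicate can be sketched as a null-safe equi-join. This is a
minimal Python sketch, not the planner or executor code; the function name
null_safe_hash_join and the dict-based row layout are hypothetical.

```python
from collections import defaultdict

def null_safe_hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner hash join in which NULL (None) keys match each other,
    mirroring IS NOT DISTINCT FROM rather than plain '='."""
    table = defaultdict(list)
    for row in build_rows:
        # None is hashable, so NULL keys get a bucket of their own.
        table[row[build_key]].append(row)
    out = []
    for row in probe_rows:
        # .get() avoids creating empty buckets for unmatched probe keys.
        for match in table.get(row[probe_key], []):
            out.append((row, match))
    return out

build = [{"y": 1}, {"y": None}]
probe = [{"x": None}, {"x": 2}, {"x": 1}]
result = null_safe_hash_join(build, probe, "y", "x")
```

With a plain '=' predicate the NULL row would be dropped; here it pairs with
the NULL row on the build side.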
Change-Id: I62c5300b1fbd764796116f95efe36573eed4c8d0
Reviewed-on: http://gerrit.cloudera.org:8080/710
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
In some cases in the NLJ node eos_ wasn't set even though the limit was
reached. This prevented the limit from being handled correctly before
returning rows to the caller of GetNext(). This could result in either
too many rows being returned, or a crash when the row batch size was
set to an invalid negative number.
The fix is to always check whether the limit was reached before
returning from GetNext().
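
The idea can be sketched with a toy operator. This is a simplified Python
model under assumed names (LimitedScan, get_next), not the actual NLJ node:
the limit is checked on the single exit path, so the batch is always trimmed
and eos is always set once the limit is reached.

```python
class LimitedScan:
    """Toy operator: returns rows in batches and enforces a row limit."""

    def __init__(self, rows, limit, batch_size):
        self.rows = list(rows)
        self.limit = limit
        self.batch_size = batch_size
        self.pos = 0
        self.num_returned = 0
        self.eos = False

    def get_next(self):
        batch = self.rows[self.pos:self.pos + self.batch_size]
        self.pos += len(batch)
        if self.pos >= len(self.rows):
            self.eos = True
        # Always check the limit before returning to the caller.
        remaining = self.limit - self.num_returned
        if len(batch) >= remaining:
            batch = batch[:remaining]
            self.eos = True
        self.num_returned += len(batch)
        return batch

node = LimitedScan(range(10), limit=5, batch_size=4)
out = []
while not node.eos:
    out.extend(node.get_next())
```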
Change-Id: I660e774787870213ada9f2d3e6f10953d9937022
Reviewed-on: http://gerrit.cloudera.org:8080/797
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Implement nested-loop join in Impala with support for multiple join
modes, including inner, outer, semi and anti joins. Null-aware left
anti-join is not currently supported.
Summary of changes:
Introduced the NestedLoopJoinNode class in the FE that represents the nested
loop join. Common functionality between NestedLoopJoinNode and HashJoinNode
(e.g. cardinality estimation) was moved to the JoinNode class.
In the BE, introduced the NestedLoopJoinNode class that implements the nested-loop
join execution strategy.
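
For readers unfamiliar with the strategy, a minimal Python sketch of a
nested-loop join with a few of the join modes follows. It is illustrative
only; the function name, the mode strings and the list-based inputs are
assumptions and do not reflect the BE class.

```python
def nested_loop_join(left, right, pred, mode="inner"):
    """Join two lists by testing pred on every (left, right) pair."""
    modes = ("inner", "left_outer", "left_semi", "left_anti")
    if mode not in modes:
        raise ValueError("unsupported mode: %s" % mode)
    out = []
    for l in left:
        matched = False
        for r in right:
            if pred(l, r):
                matched = True
                if mode in ("inner", "left_outer"):
                    out.append((l, r))
                else:
                    # Semi and anti joins only need to know a match exists.
                    break
        if mode == "left_outer" and not matched:
            out.append((l, None))
        elif mode == "left_semi" and matched:
            out.append(l)
        elif mode == "left_anti" and not matched:
            out.append(l)
    return out

left, right = [1, 2, 3], [2, 3, 3]
less_than = lambda l, r: l < r  # a non-equi predicate, which a hash join cannot use
```

Unlike a hash join, the predicate may be arbitrary, at the cost of comparing
every pair of rows.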
Change-Id: I238ec7dc0080f661847e5e1b84e30d61c3b0bb5c
Reviewed-on: http://gerrit.cloudera.org:8080/652
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
Implement nested-loop join in Impala with support for multiple join
modes, including inner, outer, semi and anti joins. Null-aware left
anti-join is not currently supported.
Summary of changes:
Introduced the NestedLoopJoinNode class in the FE that represents the nested
loop join. Common functionality between NestedLoopJoinNode and HashJoinNode
(e.g. cardinality estimation) was moved to the JoinNode class.
In the BE, introduced the NestedLoopJoinNode class that implements the nested-loop
join execution strategy.
Change-Id: Id65a1aae84335bba53f06339bdfa64a1b0be079e
Reviewed-on: http://gerrit.cloudera.org:8080/457
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
Additionally, this patch disabled the hbase/none test dimension if the
TARGET_FILESYSTEM environment variable is set to either s3 or isilon.
Change-Id: I63aecaa478d2ba9eb68de729e9640071359a2eeb
Reviewed-on: http://gerrit.cloudera.org:8080/74
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
This also reverts back to using CRC hash since FNV is not codegen'd
yet. The perf is not as good as the original HJ in a microbenchmark; I
haven't run a cluster run yet.
Change-Id: Ie4dc983f31631fbc78720425a0e354dd1d3342a6
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4219
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
Adding the "anti join" keyword in the frontend and the corresponding backend paths for the
partitioned hash join implementation. Adding some basic testing for this new join type (the
other types already have tests).
Also, fixing a bug in the tuple stream when it was handling strings.
Change-Id: Ied8cff96b2bca284a5f66f7d11df5c5b5ec789cc
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3805
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
This patch ensures that all hash-partitioning senders to a hash-partitioned
fragment hash on exprs of identical types. Casts are added as necessary.
Otherwise, the hashes generated for identical partition values may differ
among senders if the partition-expr types are not identical.
The new logic is placed into PlanFragment.finalize() in order to avoid
repeated re-casting of senders during plan generation, since every time
a child fragment is absorbed into a partition-compatible parent we
potentially need to add casts to all senders of that fragment again.
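
Why the casts matter can be shown with a small Python sketch. It assumes a
byte-oriented hash (zlib.crc32 here, chosen only for illustration) over a
hypothetical fixed-width encoding; it is not the hash or the wire format the
senders actually use.

```python
import struct
import zlib

# Hypothetical encodings for two integer types.
FORMATS = {"int32": "<i", "int64": "<q"}

def hash_value(value, type_name):
    """Hash the typed byte representation of a value."""
    return zlib.crc32(struct.pack(FORMATS[type_name], value))

def pick_partition(value, type_name, num_partitions, cast_to=None):
    """Pick a target partition, optionally casting to a common type first."""
    return hash_value(value, cast_to or type_name) % num_partitions

# Two senders hold the same partition value with different types.
sender_a = pick_partition(7, "int32", 16, cast_to="int64")
sender_b = pick_partition(7, "int64", 16)
```

Without the cast, the two senders hash different byte strings for the same
value, so the row may be routed to different partitions.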
Change-Id: Id9f581cc03127f64f0631d9b288fab4cd4dd8a82
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3689
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3708
The following changes are included in this commit:
1. Modified the alltypesagg table to include an additional partition key
that has nulls.
2. Added a number of tests in hdfs.test that exercise the partition
pruning logic (see IMPALA-887).
3. Modified all the tests that are affected by the change in alltypesagg.
Change-Id: I1a769375aaa71273341522eb94490ba5e4c6f00d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2874
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3236
preconditions check
This commit fixes IMPALA-964 where a full outer join between two inline
views followed by a group by (e.g. select 1 FROM (VALUES(1 x, 1 y)) a
FULL OUTER JOIN (VALUES(1 x, 1 y)) b ON (a.x = b.y) GROUP BY a.x;)
hits a preconditions check. This check evaluates if the numNodes
(number of nodes for the purpose of resource estimation) variable
is greater than or equal to zero and is triggered when we try to compute
the resource estimates (number of distinct values) of a plan fragment.
The following changes are included in this commit:
1. Modified the getNumDistinctValues function in PlanFragment class to
consider the special case where the numNodes of a plan fragment is -1.
2. Added a test case in QueryTest/joins.test.
Change-Id: I2962ed5079e174d0e76ad990ab84e1fb1a4607ef
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2466
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2514
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Cross joins should be handled like outer joins in the join order
optimization in that the right table referenced by a cross join may not
be reordered anywhere before tables referenced to the left of the cross
join. If there are inner joins to the right of the cross join, those
tables may be reordered before the cross join.
E.g., if we have A JOIN B CROSS JOIN C JOIN D, then C must come after A
and B, but D may be reordered to come before C.
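
The constraint in the example can be written as a small check. This is a
Python sketch that models only the cross-join rule described here, using a
hypothetical representation of the query as (table, join_op) pairs; it is not
the planner's algorithm.

```python
def is_valid_order(original, candidate):
    """original: list of (table, join_op) pairs in query order.
    candidate: list of table names in the proposed join order."""
    pos = {table: i for i, table in enumerate(candidate)}
    for i, (table, op) in enumerate(original):
        if op == "cross":
            # Every table to the left of the cross join must stay before it.
            for earlier, _ in original[:i]:
                if pos[earlier] > pos[table]:
                    return False
    return True

# A JOIN B CROSS JOIN C JOIN D
query = [("A", None), ("B", "inner"), ("C", "cross"), ("D", "inner")]
```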
Also adds test cases for join order optimization and predicate propagation.
Change-Id: I6b1022dd3e862efbff81e283b43284d846c8eca4
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1096
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Adds a CROSS JOIN (cartesian product). Common join code is moved to
a new abstract base class BlockingJoinNode. We must keep all build RowBatches in
memory in order to iterate over them for every row from the left child. The
TupleRowList provides a convenient way to iterate over all of the rows.
A future change will address codegen for the CrossJoinNode.
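
The reason the build side has to stay in memory is visible in a short Python
sketch. The names and the tuple-based rows are assumptions made for
illustration and do not correspond to CrossJoinNode or TupleRowList.

```python
def cross_join(probe_batches, build_batches):
    """Yield the cartesian product of the probe and build rows."""
    # Materialize the whole build side first; it is rescanned per probe row.
    build = [list(batch) for batch in build_batches]
    for batch in probe_batches:
        for left_row in batch:
            for build_batch in build:
                for right_row in build_batch:
                    yield left_row + right_row

probe = iter([[(1,), (2,)]])
build = iter([[("a",)], [("b",)]])
rows = list(cross_join(probe, build))
```

The probe side can be streamed batch by batch, while the build batches are
consumed once and then iterated repeatedly.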
Change-Id: I5e0caa6fb4ec802a9c87e700f9dd6238cea8cdf2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/970
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Fixed the following stats-related bugs:
- Per-partition row count was not distributed properly via CatalogService
- HBase column stats were not loaded and distributed properly
Enhancements to test framework:
- Allow regex specification of expected row or column values
- Fixed expected results of some tests because the test framework
did not catch that they were incorrect
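
A regex-based comparison of expected rows could look roughly like the Python
sketch below. The "regex:" prefix and the function names are hypothetical and
are not necessarily the syntax used by the test framework.

```python
import re

REGEX_PREFIX = "regex:"  # hypothetical marker for a regex-valued expected row

def row_matches(expected, actual):
    """Compare one expected row with one actual row."""
    if expected.startswith(REGEX_PREFIX):
        pattern = expected[len(REGEX_PREFIX):].strip()
        return re.fullmatch(pattern, actual) is not None
    return expected == actual

def results_match(expected_rows, actual_rows):
    """All rows must match pairwise and the row counts must agree."""
    return len(expected_rows) == len(actual_rows) and all(
        row_matches(e, a) for e, a in zip(expected_rows, actual_rows))

expected = ["1,abc", r"regex: \d+,hdfs://.*"]
actual = ["1,abc", "42,hdfs://host/path"]
```

This lets a test pin down stable values exactly while tolerating values such
as paths or sizes that vary between runs.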
Change-Id: I1fa8e710bbcf0ddb62b961fdd26ecd9ce7b75d51
Reviewed-on: http://gerrit.ent.cloudera.com:8080/813
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
This is the first set of changes required to start getting our functional test
infrastructure moved from JUnit to Python. After investigating a number of
options, I decided to go with a Python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.
As part of this change, we now use our "test vectors" for query test execution.
This will be very nice because it means if you load the "core" dataset you know you
will be able to run the "core" query tests (specified by --exploration_strategy
when running the tests).
You will see that now each combination of table format + query exec options is
treated like an individual test case. This will make it much easier to debug
exactly where something failed.
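
How a set of test vectors expands into individual test cases can be sketched
as a plain cross product. This is a simplified Python illustration with
hypothetical names and option values; the real framework may prune the
combinations according to the exploration strategy.

```python
from itertools import product

def generate_test_vectors(table_formats, exec_options):
    """Return one dict per combination of table format and exec options."""
    keys = sorted(exec_options)
    vectors = []
    for fmt in table_formats:
        for values in product(*(exec_options[k] for k in keys)):
            vector = {"table_format": fmt}
            vector.update(zip(keys, values))
            vectors.append(vector)
    return vectors

vectors = generate_test_vectors(
    ["text", "seq"],
    {"batch_size": [0, 16], "num_nodes": [1]})
```

Each resulting dict would be reported as a separate test case, so a failure
points at one specific format and option combination.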
These new tests can be run using the script at tests/run-tests.sh
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:
./run-benchmark --workloads=hive-benchmark,tpch
We look up the workload in the workloads directory, then read the associated
query .test files and start executing them.
To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/*) to the workloads
directory. You do NOT need to look through all the .test files, I've just moved
them. The one new file is the 'hive-benchmark.test' which contains the hive
benchmark queries.
Also added support for generating schema for different scale factors as well as
executing against these scale factors. For example, let's say we have a dataset
with a scale factor called "SF3". We would first generate the schema using:
./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables with names distinct from those of the other scale factors.
Run the generated .sql file to load the data. Alternatively, the data can be loaded
by running a new Python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: ./bin/load-data.py -w tpch -e core -s SF3
Then run against this:
./run-benchmark --workloads=<workload> --scale_factor=SF3
This changeset also includes a few other minor tweaks to some of the test
scripts.
Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6