This commit adds support for uncorrelated EXISTS subqueries in Impala.
Uncorrelated EXISTS subqueries are rewritten using a CROSS JOIN.
Uncorrelated NOT EXISTS subqueries are not supported.
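As a rough sketch of the rewrite (table and column names hypothetical),
a query such as
SELECT t.id FROM t WHERE EXISTS (SELECT 1 FROM s WHERE s.x = 10)
is rewritten along the lines of
SELECT t.id FROM t CROSS JOIN (SELECT 1 FROM s WHERE s.x = 10 LIMIT 1) v
so that outer rows flow through only if the subquery returns at least
one row.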
Change-Id: I0003dcdc0fa5cc99931b9a9f4deddbcd42572490
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4140
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4186
Row batches contain auxiliary memory that can reside in tuple pools, I/O buffers and
now tuple streams. Like the other resources, these need to be attached to row batches
and transferred up the operator tree to make sure the tuple pointers are always valid.
Fixed a bug in BufferedTupleStream so that blocks are not deleted on read when the
stream is pinned.
Fixed a partitioned hash join (PHJ) bug where row batch boundaries caused
current_probe_row_ to be NULL.
Change-Id: I4c66d9961a117bfe3ed577de6170e875ea1feee7
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3983
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4157
This commit fixes two subquery issues:
1. During the rewrite of aggregate subqueries with count, a new select
list is created for the outer select block to eliminate newly visible
tuples. However, the new select list was not initialized correctly,
causing DISTINCT clauses not to be preserved.
2. Pushing negation down to operands during a query rewrite caused a
StackOverflowError when it encountered predicates for which a negate
function is not implemented. Such predicates fell back to the negate
function of the parent class, causing infinite recursion.
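As an illustration of the first issue (schema hypothetical), a query like
SELECT DISTINCT t.id FROM t WHERE t.cnt > (SELECT COUNT(*) FROM s WHERE s.id = t.id)
is rewritten with a new select list in the outer block; before this fix
the DISTINCT was lost during that rewrite.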
Change-Id: I6f1b8090af40fa55b13661d637f9aaaa00dfcf5c
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4115
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4141
This commit implements nested queries with [NOT] IN, [NOT] EXISTS and
aggregate subquery predicates in Impala. The following cases are
supported:
1. Correlated and uncorrelated [NOT] IN.
2. Correlated [NOT] EXISTS.
3. Correlated and uncorrelated aggregate subqueries.
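Hedged examples of the supported patterns (schema hypothetical):
-- 1. correlated NOT IN
SELECT * FROM t WHERE t.id NOT IN (SELECT s.id FROM s WHERE s.x = t.x)
-- 2. correlated EXISTS
SELECT * FROM t WHERE EXISTS (SELECT 1 FROM s WHERE s.id = t.id)
-- 3. uncorrelated aggregate subquery
SELECT * FROM t WHERE t.cnt > (SELECT AVG(s.cnt) FROM s)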
Change-Id: Ia3f4843c5f07d4e31ef3faedc58a15e623f91a5d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3754
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4109
Semi- or anti-joined table references are now only visible inside the
On-clause of the corresponding join.
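For example (names hypothetical), in
SELECT a.id FROM a LEFT SEMI JOIN b ON (a.id = b.id AND b.x > 0)
the references to b are legal inside the On-clause, but selecting b.x
in the outer select list is now rejected.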
Change-Id: Id93e53ecdf2a74baf9736aa427fa7af15358ca27
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3789
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
The following changes are included in this commit:
1. Modified the alltypesagg table to include an additional partition key
that has nulls.
2. Added a number of tests in hdfs.test that exercise the partition
pruning logic (see IMPALA-887).
3. Modified all the tests that are affected by the change in alltypesagg.
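For instance, pruning on a NULL partition key is exercised by
predicates along the lines of (partition column name assumed):
SELECT COUNT(*) FROM alltypesagg WHERE day IS NULL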
Change-Id: I1a769375aaa71273341522eb94490ba5e4c6f00d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2874
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3236
The select exprs of an inline view may not always be materialised, yet
the output tuple itself may be. This patch fixes a crash in this
situation in the backend aggregation node which assumed its output tuple
would always have at least one materialised slot.
The cause was a couple of too-conservative DCHECKs that failed if the
tuple was NULL. In fact, the code was robust to this possibility without
the checks, so this bug didn't affect release builds of Impala.
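A minimal sketch of the failing shape (names hypothetical):
SELECT 1 FROM (SELECT COUNT(id) FROM t) v
Here none of v's select exprs are referenced by the outer block, so the
aggregation node's output tuple can end up with no materialised slots.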
Change-Id: If0b90809d30fcd196f55197953392452d1ac9c4f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1431
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 8c1c21b66c43e900760ace54d090305f32a85a1f)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1471
Tested-by: Henry Robinson <henry@cloudera.com>
Fixed the following stats-related bugs:
- Per-partition row count was not distributed properly via CatalogService
- HBase column stats were not loaded and distributed properly
Enhancements to test framework:
- Allow regex specification of expected row or column values
- Fixed expected results of some tests because the test framework
did not catch that they were incorrect
Change-Id: I1fa8e710bbcf0ddb62b961fdd26ecd9ce7b75d51
Reviewed-on: http://gerrit.ent.cloudera.com:8080/813
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
This is the first set of changes required to start moving our functional test
infrastructure from JUnit to Python. After investigating a number of
options, I decided to go with a Python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.
As part of this change, we now use our "test vectors" for query test execution.
This will be very nice because it means that if you load the "core" dataset you
know you will be able to run the "core" query tests (specified by
--exploration_strategy when running the tests).
You will see that each combination of table format + query exec options is now
treated as an individual test case. This will make it much easier to debug
exactly where something failed.
These new tests can be run using the script at tests/run-tests.sh
Fixes a bug in Planner.createHashJoinFragment(), which didn't set the left child of the
hash join node to the output of the left child fragment.
Also: the row descriptor was set incorrectly (too wide; it included tuples that weren't
materialized) for roots of plan trees of non-root fragments if those fragments
materialized an aggregate.
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:
./run-benchmark --workloads=hive-benchmark,tpch
We look up the workload in the workloads directory, then read the associated
query .test files and start executing them.
To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/*) to the workloads
directory. You do NOT need to look through all the .test files; I've just moved
them. The one new file is 'hive-benchmark.test', which contains the Hive
benchmark queries.
Also added support for generating schemas for different scale factors, as well as
executing against these scale factors. For example, let's say we have a dataset
with a scale factor called "SF3". We would first generate the schema using:
./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables whose names are unique across scale factors.
Run the generated .sql file to load the data. Alternatively, the data can be loaded
by running a new Python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: ./bin/load-data.py -w tpch -e core -s SF3
Then run against this:
./run-benchmark --workloads=<workload> --scale_factor=SF3
This changeset also includes a few other minor tweaks to some of the test
scripts.
Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6