47 Commits

Author SHA1 Message Date
Tim Armstrong
63f5e8ec00 IMPALA-1270: add distinct aggregation to semi joins
When generating plans with left semi/anti joins (typically
resulting from subquery rewrites), the planner now
considers inserting a distinct aggregation on the inner
side of the join. The decision is based on whether that
aggregation would reduce the number of rows by more than
75%. This is fairly conservative and the optimization
might be beneficial for smaller reductions, but the
conservative threshold is chosen to reduce the number
of potential plan regressions.

The aggregation can both reduce the # of rows and the
width of the rows, by projecting out unneeded slots.
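
The decision rule described above might be sketched as follows. This is an illustrative model only: the function name, its parameters, and the handling of missing statistics are assumptions of this sketch, not Impala's planner code. Only the "more than 75%" threshold comes from the commit message.

```python
def should_add_distinct_agg(inner_rows, estimated_distinct_rows, min_reduction=0.75):
    """Return True if a distinct aggregation on the inner side looks worthwhile.

    inner_rows and estimated_distinct_rows are hypothetical cardinality
    estimates; a non-positive row count or negative distinct count stands
    for missing stats (an assumption of this sketch).
    """
    if inner_rows <= 0 or estimated_distinct_rows < 0:
        # Without usable estimates, stay conservative and skip the optimization.
        return False
    reduction = 1.0 - (estimated_distinct_rows / inner_rows)
    # "More than 75%" is a strict comparison.
    return reduction > min_reduction
```

With these assumptions, an inner side of 1000 rows that collapses to 100 distinct rows qualifies, while one that collapses to 400 does not.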

ENABLE_DISTINCT_SEMI_JOIN_OPTIMIZATION query option is
added to allow toggling the optimization.

Tests:
* Add positive and negative planner tests for various
  cases - including semi/anti joins, missing stats,
  broadcast/shuffle, different numbers of join predicates.
* Add some end-to-end tests to verify plans execute correctly.

Change-Id: Icbb955e805d9e764edf11c57b98f341b88a37fcc
Reviewed-on: http://gerrit.cloudera.org:8080/16180
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-07-15 17:10:50 +00:00
Tim Armstrong
4e2498da6f IMPALA-9949: fix SELECT list subqueries with HAVING/LIMIT
The patch for IMPALA-8954 failed to account for subqueries
that could produce < 1 row. SelectStmt.returnsSingleRow()
is confusing because it actually returns true if the statement
returns *at most* one row.

As a fix I split it into returnsExactlyOneRow() and
returnsAtMostOneRow(), then used returnsExactlyOneRow()
to determine if the subquery should instead be rewritten
into a LEFT OUTER JOIN, which produces the correct result.

CROSS JOIN is still preferred because it can be more freely
reordered during planning.
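
The difference between the two rewrites can be reproduced outside Impala. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in engine; the tables t and s are made up, and the empty subquery uses a WHERE filter in place of the HAVING/LIMIT cases from the commit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER);
    CREATE TABLE s (v INTEGER);
    INSERT INTO t VALUES (1), (2);
    INSERT INTO s VALUES (10);
""")

# Original form: a select-list subquery that produces zero rows yields NULL.
original = conn.execute(
    "SELECT id, (SELECT v FROM s WHERE v > 100) FROM t ORDER BY id"
).fetchall()

# CROSS JOIN rewrite: an empty inner side removes every outer row.
cross = conn.execute(
    "SELECT t.id, q.v FROM t CROSS JOIN (SELECT v FROM s WHERE v > 100) q "
    "ORDER BY t.id"
).fetchall()

# LEFT OUTER JOIN rewrite: each outer row is kept and padded with NULL.
left_outer = conn.execute(
    "SELECT t.id, q.v FROM t LEFT OUTER JOIN (SELECT v FROM s WHERE v > 100) q "
    "ON 1 = 1 ORDER BY t.id"
).fetchall()
```

Here only the LEFT OUTER JOIN form returns the same rows as the original query.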

Testing:
* Added planner tests for a range of scenarios where it can
  be rewritten as a CROSS JOIN and where it needs to be a LEFT
  OUTER JOIN for correctness.
* Added some targeted end-to-end tests where the results were
  previously incorrect. Checked the behaviour against Hive and
  postgres.

Ran exhaustive tests.

Change-Id: I6034aedac776783bdc8cdb3a2df344e2b3662da6
Reviewed-on: http://gerrit.cloudera.org:8080/16171
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-07-13 22:38:36 +00:00
Tim Armstrong
fea5dffec5 IMPALA-9924: handle single subquery in or predicate
This patch supports a subset of cases of subqueries
inside OR inside WHERE and HAVING clauses.

The approach used is to rewrite the subquery into
a many-to-one LEFT OUTER JOIN with the subquery and
then replace the subquery in the expression with a
reference to the single select list expression of
the subquery. This works because:
* A many-to-one LEFT OUTER JOIN returns one output row
  for each left input row, meaning that for every row
  in the original query before the rewrite, we get
  the same row plus a single matched row from the subquery
* Expressions can be rewritten to refer to a slotref from
  the right side of the LEFT OUTER JOIN without affecting
  semantics. E.g. an IN subquery becomes <slot> IS NOT NULL
  or <operator> (<subquery>) becomes <operator> <slot>.
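
The IN case of this rewrite can be checked on a small example. The sketch below runs both forms in SQLite via Python's sqlite3 module as a stand-in engine; the tables, columns, and data are hypothetical, and the DISTINCT in the derived table is what keeps the join many-to-one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, a INTEGER, c INTEGER);
    CREATE TABLE s (b INTEGER);
    INSERT INTO t VALUES (1, 10, 0), (2, 20, 9), (3, 30, 0), (4, NULL, 0);
    INSERT INTO s VALUES (10), (10), (NULL);
""")

# Original query: an IN subquery inside a disjunction.
original = conn.execute(
    "SELECT id FROM t WHERE a IN (SELECT b FROM s) OR c > 5 ORDER BY id"
).fetchall()

# Rewritten query: many-to-one LEFT OUTER JOIN, with the IN predicate
# replaced by an IS NOT NULL test on the joined slot.
rewritten = conn.execute(
    "SELECT t.id FROM t "
    "LEFT OUTER JOIN (SELECT DISTINCT b FROM s) q ON t.a = q.b "
    "WHERE q.b IS NOT NULL OR t.c > 5 ORDER BY t.id"
).fetchall()
```

Both queries select the rows with id 1 (matched by the subquery) and id 2 (matched by the other disjunct).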

This does not affect SELECT list subqueries, which are
rewritten using a different mechanism that can already
support some subqueries in disjuncts.

Correlated and uncorrelated subqueries are both supported, but
various limitations are present.
Limitations:
* Only one subquery per predicate is supported. The rewriting approach
  should generalize to multiple subqueries but other code needs
  refactoring to handle this case.
* EXISTS and NOT EXISTS subqueries are not supported. The rewriting
  approach can generalise to that, but we need to add or pick a
  select list item from the subquery to check for NULL/IS NOT NULL
  and a little more work is required to do that correctly.
* NOT IN is not supported because of the special NULL semantics.
* Subqueries with aggregates + grouping by are not supported because
  we rely on adding distinct to the select list and we don't
  support distinct + aggregations because of IMPALA-5098.

Tests:
* Positive analysis tests for IN and binary predicate operators.
* Negative analysis tests for unsupported subquery operators.
* Negative analysis tests for multiple subqueries.
* Negative analysis tests for runtime scalar subqueries.
* Positive and negative analysis tests for aggregations in subquery.
* TPC-DS Query 45 planner and query tests
* Targeted planner tests for various supported queries.
* Targeted functional tests to confirm plans are executable and
  return correct result. These exercise a mix of the supported
  features - correlated/uncorrelated, aggregate functions,
  EXISTS/comparator, etc.
* Tests for BETWEEN predicate, which is supported as a side-effect
  of being rewritten during analysis.

Change-Id: I64588992901afd7cd885419a0b7f949b0b174976
Reviewed-on: http://gerrit.cloudera.org:8080/16152
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
2020-07-13 16:02:27 +00:00
Shant Hovsepian
2dca55695e IMPALA-9784, IMPALA-9905: Uncorrelated subqueries in HAVING.
Support rewriting subqueries in the HAVING clause by nesting the
aggregation query and pulling up the subquery predicates into the outer
WHERE clause.
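
As a rough illustration of the nesting described above, the following sketch compares a HAVING subquery with its nested form. It runs in SQLite through Python's sqlite3 module only as a convenient stand-in; the sales table and its data are hypothetical, and the rewritten SQL is hand-written rather than produced by Impala.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('east', 10), ('east', 30), ('west', 5), ('north', 50);
""")

# Original query: uncorrelated subquery in the HAVING clause.
original = conn.execute(
    "SELECT region, sum(amount) FROM sales GROUP BY region "
    "HAVING sum(amount) > (SELECT avg(amount) FROM sales) ORDER BY region"
).fetchall()

# Nested form: the aggregation becomes an inline view and the subquery
# predicate moves to the outer WHERE clause.
nested = conn.execute(
    "SELECT region, total FROM "
    "(SELECT region, sum(amount) AS total FROM sales GROUP BY region) agg "
    "WHERE total > (SELECT avg(amount) FROM sales) ORDER BY region"
).fetchall()
```

Once the predicate sits in a WHERE clause, the existing WHERE-clause subquery rewrites can apply to it.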

Testing:
  * New analyzer tests
  * New functional subquery tests
  * Added Q23, Q24 and Q44 to the tpcds workload
  * Ran subquery rewrite tests

Change-Id: I124a58a09a1a47e1222a22d84b54fe7d07844461
Reviewed-on: http://gerrit.cloudera.org:8080/16052
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-07-05 22:03:42 +00:00
Shant Hovsepian
388ad555d7 IMPALA-8954: Uncorrelated scalar subqueries in the select list
Extend StmtRewriter with the ability to rewrite scalar subqueries in the
select list into cross joins. Currently the subquery must pass plan-time
checks to determine that it returns a single row which may miss cases
that may be valid at runtime or with more complex evaluation of the
predicate expressions in the planner. Support for correlated subqueries
will be a follow on change.

Testing:
  * Added new analyzer tests, updated previous subquery tests
  * test_queries.py::TestQueries::test_subquery
  * Added test_tpcds_q9 to e2e and planner tests

Change-Id: Ibcf55d26889aa01d69bb85f18c9241dda095fb66
Reviewed-on: http://gerrit.cloudera.org:8080/16007
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-07-05 22:03:42 +00:00
Attila Jeges
b5805de3e6 IMPALA-7368: Add initial support for DATE type
DATE values describe a particular year/month/day in the form
yyyy-MM-dd. For example: DATE '2019-02-15'. DATE values do not have a
time of day component. The range of values supported for the DATE type
is 0000-01-01 to 9999-12-31.

This initial DATE type support covers TEXT and HBASE fileformats only.
'DateValue' is used as the internal type to represent DATE values.

The changes are as follows:
- Support for DATE literal syntax.

- Explicit casting between DATE and other types (note that invalid
  casts will fail with an error just like invalid DECIMAL_V2 casts,
  while failed casts to other types do not lead to a warning or error):
    - from STRING to DATE. The string value must be formatted as
      yyyy-MM-dd HH:mm:ss.SSSSSSSSS. The date component is mandatory,
      the time component is optional. If the time component is
      present, it will be truncated silently.
    - from DATE to STRING. The resulting string value is formatted as
      yyyy-MM-dd.
    - from TIMESTAMP to DATE. The source timestamp's time of day
      component is ignored.
    - from DATE to TIMESTAMP. The target timestamp's time of day
      component is set to 00:00:00.

- Implicit casting between DATE and other types:
    - from STRING to DATE if the source string value is used in a
      context where a DATE value is expected.
    - from DATE to TIMESTAMP if the source date value is used in a
      context where a TIMESTAMP value is expected.

- Since STRING -> DATE, STRING -> TIMESTAMP and DATE -> TIMESTAMP
  implicit conversions are now all possible, the existing function
  overload resolution logic is not adequate anymore.
  For example, it resolves the
  if(false, '2011-01-01', DATE '1499-02-02') function call to the
  if(BOOLEAN, TIMESTAMP, TIMESTAMP) version of the overloaded
  function, instead of the if(BOOLEAN, DATE, DATE) version.

  This is clearly wrong, so the function overload resolution logic had
  to be changed to resolve function calls to the best-fit overloaded
  function definition if there are multiple applicable candidates.

  An overloaded function definition is an applicable candidate for a
  function call if each actual parameter in the function call either
  matches the corresponding formal parameter's type (without casting)
  or is implicitly castable to that type.

  When looking for the best-fit applicable candidate, a parameter
  match score (i.e. the number of actual parameters in the function
  call that match their corresponding formal parameter's type without
  casting) is calculated and the applicable candidate with the highest
  parameter match score is chosen.

  There's one more issue that the new resolution logic has to address:
  if two applicable candidates have the same parameter match score and
  the only difference between the two is that the first one requires a
  STRING -> TIMESTAMP implicit cast for some of its parameters while
  the second one requires a STRING -> DATE implicit cast for the same
  parameters then the first candidate has to be chosen not to break
  backward compatibility.
  E.g: year('2019-02-15') function call must resolve to
  year(TIMESTAMP) instead of year(DATE). Note that year(DATE) is not
  implemented yet, so this is not an issue at the moment but it will
  be in the future.
  When the resolution algorithm considers overloaded function
  definitions, first it orders them lexicographically by the types in
  their parameter lists. To ensure backward-compatible behavior, the
  PrimitiveType.DATE enum value has to come after
  PrimitiveType.TIMESTAMP.

- Codegen infrastructure changes for expression evaluation.
- 'IS [NOT] NULL' and '[NOT] IN' predicates.
- Common comparison operators (including the 'BETWEEN' operator).
- Infrastructure changes for built-in functions.
- Some built-in functions: conditional, aggregate, analytical and
  math functions.
- C++ UDF/UDA support.
- Support partitioning and grouping by DATE.
- Beeswax, HiveServer2 support.

These items are tightly coupled and it makes sense to implement them
in one change-set.
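
The best-fit overload resolution described in the list above can be sketched in a few lines. This is a simplified model, not Impala's implementation: the IMPLICIT_CASTS table covers only the three casts named in the message, the function names are hypothetical, and ties are broken by the order of the candidate list, which stands in for the lexicographic ordering that places TIMESTAMP before DATE.

```python
# Implicit casts assumed for this sketch (source type -> allowed target types).
IMPLICIT_CASTS = {
    "STRING": {"TIMESTAMP", "DATE"},
    "DATE": {"TIMESTAMP"},
}

def is_applicable(arg_types, param_types):
    """Every argument matches its parameter exactly or is implicitly castable."""
    if len(arg_types) != len(param_types):
        return False
    return all(a == p or p in IMPLICIT_CASTS.get(a, set())
               for a, p in zip(arg_types, param_types))

def match_score(arg_types, param_types):
    """Number of arguments that match their parameter type without casting."""
    return sum(a == p for a, p in zip(arg_types, param_types))

def resolve(arg_types, overloads):
    """Pick the best-fit overload; ties go to the earliest candidate."""
    candidates = [o for o in overloads if is_applicable(arg_types, o)]
    if not candidates:
        return None
    # max() returns the first maximal element, so list order breaks ties.
    return max(candidates, key=lambda o: match_score(arg_types, o))

if_overloads = [("BOOLEAN", "TIMESTAMP", "TIMESTAMP"), ("BOOLEAN", "DATE", "DATE")]
year_overloads = [("TIMESTAMP",), ("DATE",)]

best_if = resolve(("BOOLEAN", "STRING", "DATE"), if_overloads)
best_year = resolve(("STRING",), year_overloads)
```

Under these assumptions the if() call resolves to the DATE version (score 2 versus 1), while year('2019-02-15') ties at score 0 and falls back to the TIMESTAMP version because it is listed first.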

Testing:
- A new partitioned TEXT table 'functional.date_tbl' (and the
  corresponding HBASE table 'functional_hbase.date_tbl') was
  introduced for DATE-related tests.
- BE and FE tests were extended to cover DATE type.
- E2E tests:
    - since DATE type is supported for TEXT and HBASE fileformats
      only, most DATE tests were implemented separately in
      tests/query_test/test_date_queries.py.

Note that this change-set is not a complete DATE type implementation,
but it lays the foundation for future work:
- Add date support to the random query generator.
- Implement a complete set of built-in functions.
- Add Parquet support.
- Add Kudu support.
- Optionally support Avro and ORC.
For further details, see IMPALA-6169.

Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f
Reviewed-on: http://gerrit.cloudera.org:8080/12481
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-23 13:33:57 +00:00
Thomas Tauber-Marshall
6fa35478d5 IMPALA-5847: Fix incorrect use of SET in .test files
The '.test' files are used to run queries for tests. These files are
run with a vector of default query options. They also sometimes
include SET queries that modify query options. If SET is used on a
query option that is included in the vector, the default value from
the vector will override the value from the SET, leading to tests that
don't actually run with the query options they appear to.

This patch asserts that '.test' files don't use SET for values present
in the default vector. It also fixes various tests that already had
this incorrect behavior.
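
A check of this kind could look roughly like the sketch below. It is not the assertion added by the patch: the function name, the regular expression, and the sample text are all assumptions made for illustration.

```python
import re

# Matches statements such as "set batch_size=10;" at the start of a line.
SET_PATTERN = re.compile(r"^\s*set\s+(\w+)\s*=", re.IGNORECASE | re.MULTILINE)

def find_conflicting_sets(test_file_text, vector_options):
    """Return option names that are SET in the text but also fixed by the vector."""
    vector_keys = {name.lower() for name in vector_options}
    found = [m.group(1).lower() for m in SET_PATTERN.finditer(test_file_text)]
    return sorted({name for name in found if name in vector_keys})

# Hypothetical .test content and default vector.
sample_text = (
    "---- QUERY\n"
    "set batch_size=10;\n"
    "select 1;\n"
    "---- QUERY\n"
    "SET mem_limit=1g;\n"
    "select 2;\n"
)
conflicts = find_conflicting_sets(sample_text, {"batch_size": 0, "num_nodes": 0})
```

A test runner could then fail fast whenever the returned list is non-empty.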

Testing:
- Passed a full exhaustive run.

Change-Id: I4e4c0f31bf4850642b624acdb1f6cb8837957990
Reviewed-on: http://gerrit.cloudera.org:8080/12220
Reviewed-by: Thomas Marshall <thomasmarshall@cmu.edu>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-01-22 01:20:31 +00:00
Zoltan Borok-Nagy
e6ca7ca14d IMPALA-7108: IllegalStateException hit during CardinalityCheckNode.<init>
Since IMPALA-6314, we set LIMIT 2 on runtime scalar subqueries
in StmtRewriter.mergeExpr(). We do that because later we add a
CardinalityCheckNode on top of such subqueries and with
LIMIT 2 we can still check if they return more than one row.

In the constructor of CardinalityCheckNode there is a
precondition that checks if the child node has LIMIT 2 to
be certain that we've set the limit for all the necessary
cases.

However, some subqueries will get a LIMIT 1 later, breaking the
precondition in CardinalityCheckNode. An example of such a
subquery is a SELECT statement that selects from an inline view
that returns a single row:

select * from functional.alltypes
where int_col = (select f.id from (
                 select * from functional.alltypes limit 1) f);

Note that we shouldn't add a CardinalityCheckNode to the plan
of this query in the first place. To generate a proper plan I
updated SelectStmt.returnsSingleRow() because this method didn't
handle this case well.

I also changed
the precondition from
Preconditions.checkState(child.getLimit() == 2);
to
Preconditions.checkState(child.getLimit() <= 2);
in order to be more permissive.

I added tests for the aforementioned query.

Change-Id: I82a7a3fe26db3e12131c030c4ad055a9c4955407
Reviewed-on: http://gerrit.cloudera.org:8080/10605
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-08 20:15:50 +00:00
Zoltan Borok-Nagy
fab65d4479 IMPALA-7022: TestQueries.test_subquery: Subquery must not return more than one row
TestQueries.test_subquery sometimes fails during exhaustive
tests.

In the tests we expect to catch an exception that is
prefixed by the "Query aborted:" string. The prefix is
usually added by impala_beeswax.py::wait_for_completion(),
but in rare cases it isn't added.

From the point of view of the test it is irrelevant whether the
exception is prefixed with "Query aborted:" or not, so I removed it
from the expected exception string.

Change-Id: I3b8655ad273b1dd7a601099f617db609e4a4797b
Reviewed-on: http://gerrit.cloudera.org:8080/10407
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-05-15 23:37:06 +00:00
Zoltan Borok-Nagy
1e79f14798 IMPALA-6314: Add run time scalar subquery check for uncorrelated subqueries
If a scalar subquery is used with a binary predicate or
in an arithmetic expression, it must return only one
row/column to be valid. If this cannot be guaranteed at
parse time through a single-row aggregate or LIMIT
clause, Impala fails the query.

E.g., currently the following query is not allowed:
SELECT bigint_col
FROM alltypesagg
WHERE id = (SELECT id FROM alltypesagg WHERE id = 1)

However, it would be allowed if the query contained
a LIMIT 1 clause, or instead of id it was max(id).

This commit makes the example valid by introducing a
runtime check to test if the subquery returns a single
row. If the subquery returns more than one row, it
aborts the query with an error.

I added a new node type, called CardinalityCheckNode. It
is created during planning on top of the subquery when
needed, then during execution it checks if its child
only returns a single row.
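
The behaviour of such a node can be modelled with a small generator. This is only a sketch: the real node is part of Impala's execution engine, and the exception class and function name here are hypothetical. Reading at most two rows from the child is enough to detect a violation.

```python
from itertools import islice

class SubqueryCardinalityError(Exception):
    """Raised when the checked child produces more than one row."""

def cardinality_check(child_rows):
    """Pass through the child's row, failing if the child yields more than one."""
    # Two rows are sufficient to know the single-row requirement is violated.
    rows = list(islice(child_rows, 2))
    if len(rows) > 1:
        raise SubqueryCardinalityError("Subquery must not return more than one row")
    yield from rows
```

For example, list(cardinality_check(iter([7]))) passes the single row through, while a child producing several rows raises the error as soon as the generator is consumed.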

I extended the frontend tests and e2e tests as well.

Change-Id: I0f52b93a60eeacedd242a2f17fa6b99c4fc38e06
Reviewed-on: http://gerrit.cloudera.org:8080/9005
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-27 20:06:56 +00:00
Zoltan Borok-Nagy
25422c74b2 IMPALA-6934: Wrong results with EXISTS subquery containing ORDER BY, LIMIT, and OFFSET
Queries may return wrong results if an EXISTS subquery has
an ORDER BY with a LIMIT and OFFSET clause. The EXISTS
subquery may incorrectly evaluate to TRUE even though it is
FALSE.

The bug was found during the code review of IMPALA-6314
(https://gerrit.cloudera.org/#/c/9005/). It turned out that
QueryStmt.setLimit() wipes the offset. I modified it to
keep the offset expr.
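
The effect of losing the offset can be seen by comparing an EXISTS subquery with and without its OFFSET clause. The sketch below uses SQLite via Python's sqlite3 module as a stand-in engine, with made-up tables; it illustrates the semantics only and does not exercise Impala's QueryStmt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER);
    CREATE TABLE s (id INTEGER);
    INSERT INTO t VALUES (1), (2);
    INSERT INTO s VALUES (1), (2), (3);
""")

# Correct semantics: the offset skips past all three rows of s, so the
# subquery is empty and EXISTS is FALSE for every row of t.
with_offset = conn.execute(
    "SELECT count(*) FROM t WHERE EXISTS "
    "(SELECT id FROM s ORDER BY id LIMIT 1 OFFSET 5)"
).fetchone()[0]

# What the query effectively became when the offset was wiped.
offset_dropped = conn.execute(
    "SELECT count(*) FROM t WHERE EXISTS "
    "(SELECT id FROM s ORDER BY id LIMIT 1)"
).fetchone()[0]
```

The first count is 0 and the second is 2, which is the kind of wrong result the bug produced.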

Added tests to 'PlannerTest/subquery-rewrite.test' and
'QueryTest/subquery.test'

Change-Id: I9693623d3d0a8446913261252f8e4a07935645e0
Reviewed-on: http://gerrit.cloudera.org:8080/10218
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-26 20:12:38 +00:00
Taras Bobrovytsky
46c3e43edb IMPALA-3232: Allow not-exists uncorrelated subqueries
Before this patch, correlated exists and not exists subqueries were
rewritten as left semi and anti joins respectively. Uncorrelated
exists subqueries were rewritten as cross joins, and uncorrelated
not-exists subqueries were not supported at all. This patch takes
advantage of the nested loop join that was recently introduced, which
allows us to rewrite both correlated and uncorrelated exists subqueries
as left semi joins and both correlated and uncorrelated not-exists
subqueries as anti joins.

Change-Id: I52ae12f116d026190f3a2a7575cda855317d11e8
Reviewed-on: http://gerrit.cloudera.org:8080/2792
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 23:06:36 -07:00
Dimitris Tsirogiannis
ccf1f8f73f IMPALA-2734: Correlated EXISTS subqueries with HAVING clause return wrong results
This commit fixes an issue where wrong results are returned if an EXISTS subquery
contains a HAVING clause and non-equality correlated binary predicates. This case does
not have a valid rewrite as the HAVING clause needs to be applied after the correlated
predicates have been evaluated. With this fix, we detect cases like this and throw an
AnalysisException.

Change-Id: I159f956e2b01f408601829b5d2afcf11d76bedcd
Reviewed-on: http://gerrit.cloudera.org:8080/1927
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-02-04 01:06:03 +00:00
Dimitris Tsirogiannis
4eceeacf16 IMPALA-1550: Invalid rewrite when EXISTS subqueries contain aggregate
functions

This commit fixes an issue where a [NOT] EXISTS subquery that contains
an aggregate function will sometimes be incorrectly rewritten into a
join, thereby returning incorrect results.

Change-Id: I18b211d76ee3de77d8061603ff5bb1fbceae2e60
Reviewed-on: http://gerrit.cloudera.org:8080/266
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-04-02 19:11:00 +00:00
Alex Behm
f696861c5c Throw error on unrecognized test sections.
Our .test file parser used to not abort tests when there
is a malformed test/section. This patch changes that behavior
to report an error and treat the test as failed.

Quite a few tests were not well-formed, and were not executed
as a result. This patch fixes those tests.

Arguably, the test file parser should be more flexible in which places
to accept comments, but this patch does not address that problem.

Change-Id: If53358eb0cb958b68e51940b071e64c1d6c3ec6f
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5468
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-12-02 18:08:09 -08:00
casey
516d7483dd IMPALA-1300: Allow subqueries in UNION operands
This enables the existing subquery rewrite rules to rewrite UNION
statements. UNION rewriting is easily done by simply calling the
rewriter for each operand in the UNION. At least one TPC-DS query
requires this functionality (IMPALA-1365).

The more difficult case of a UNION within a subquery is still not
supported.

Change-Id: I7f83eed0eb8ae81565e629f09f6918a4ba86ee13
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4859
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: jenkins
2014-11-17 11:19:09 -08:00
Alex Behm
7b6ecbeea5 Fix exhaustive test run: Modify test to produce identical results on HBase.
Change-Id: I7187f9aca63f61ea1686820b3cbec277240da191
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4866
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
2014-11-17 11:19:01 -08:00
Nong Li
86aebc7f8f IMPALA-1348: Fix NAAJ where the null partitions have streams with multiple blocks.
Change-Id: I892f3435814bd4fcddeb496017dbb60704f13419
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4728
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-10-14 12:01:53 -07:00
Alex Behm
3e7de9f304 IMPALA-1318: Joins should not return semi-joined tuples.
Change-Id: I93f5ddb8317af7794b5977e145805f9ff498d722
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4633
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-10-06 15:17:22 -07:00
Ippokratis Pandis
5c4486a2b2 Proper handling of NULL tuples by buffered-tuple-stream.
Adding a bitstring at the head of each block in the TupleStream that indicates which
tuples of the appended rows in the block are NULLs. When reading the stream, through
GetNext() or GetTupleRow() calls, the NULL tuples are stitched back to their correct
position.
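
The idea of recording NULL tuples in a per-block bitstring and stitching them back on read can be illustrated as follows. This is a toy model in Python with hypothetical function names; the real BufferedTupleStream is C++ and works on raw memory blocks rather than lists.

```python
def write_block(rows):
    """Encode rows into (null_bits, payload).

    Each row is a tuple of per-tuple values, where None marks a NULL tuple.
    Only non-NULL tuples are stored in the payload.
    """
    null_bits = []
    payload = []
    for row in rows:
        for tup in row:
            null_bits.append(tup is None)
            if tup is not None:
                payload.append(tup)
    return null_bits, payload

def read_block(null_bits, payload, tuples_per_row):
    """Decode a block, stitching NULL tuples back into their original positions."""
    values = iter(payload)
    flat = [None if is_null else next(values) for is_null in null_bits]
    return [tuple(flat[i:i + tuples_per_row])
            for i in range(0, len(flat), tuples_per_row)]

rows_in = [("a", None), (None, "b"), ("c", "d")]
bits, data = write_block(rows_in)
rows_out = read_block(bits, data, tuples_per_row=2)
```

The payload stays compact because NULL tuples take no space in it, while the bitstring preserves their positions.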

This fixes crashes in PHJ of bushy plans with NULLs on the build side(s) as well as
similar crashes in PAGG and the analytic node.
For example, it fixes IMPALA-1204, IMPALA-1223, and IMPALA-1249.
Also, adds regression tests for IMPALA-1175, IMPALA-1204, IMPALA-1223, IMPALA-1249
and IMPALA-1306.

Change-Id: I30ad0dbd4dfeabcda8fae444d1c6ec9291f38398
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4596
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
2014-10-06 15:10:58 -07:00
Dimitris Tsirogiannis
5db0f877cb Fix subqueries test for HBase
Change-Id: I8d3c10d29a198135e87ab848ba206c2662166760
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4597
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
2014-10-06 15:09:37 -07:00
Dimitris Tsirogiannis
b201c7a7d1 IMPALA-1299: Analytic should be allowed in correlated EXISTS subquery
With this commit we enable correlated and uncorrelated EXISTS
subqueries with grouping and/or aggregation including analytic
functions. Furthermore, we enable correlated EXISTS subqueries
with a LIMIT clause.

Change-Id: I36c33f80b152b7f175bf803cbe920ce1983d7162
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4583
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
2014-10-06 15:09:25 -07:00
Dimitris Tsirogiannis
5046a47dc3 IMPALA-1297: Results of NOT IN may not be correct if subquery results in
NULL

This commit fixes a bug in the implementation of the null-aware anti
join that resulted in wrong results being returned from NOT IN correlated
subqueries in the presence of nulls.

Change-Id: I6f2eb326ec7e40d80ec8da94ba33946b9ac9b115
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4506
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
2014-09-26 12:02:47 -07:00
Dimitris Tsirogiannis
f21aed16fd Bug fixes in null-aware anti-join
This commit fixes issue IMPALA-1215 where NOT IN subqueries return wrong
results in the presence of null values.

Change-Id: I97e41c8df8ba864d0189595d670b3f0349fcad36
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4467
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-09-23 07:33:23 -07:00
Dimitris Tsirogiannis
3b5f1d3ab5 Rewrite NOT IN subqueries with a null-aware anti-join.
This commit fixes the issue (IMPALA-1215) where NOT IN subqueries return
wrong results in the presence of NULL values. The null-matching equality
operator is introduced in the front-end and the NOT IN subqueries are
rewritten using the null-aware anti-join operator.
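
Why a plain anti-join is not enough for NOT IN can be shown with a small example. The sketch below uses SQLite via Python's sqlite3 module as a stand-in engine and emulates the joins with NOT EXISTS; the tables are hypothetical and the "null-aware" predicate is a hand-written approximation, not Impala's operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (a INTEGER);
    CREATE TABLE s (b INTEGER);
    INSERT INTO t VALUES (1), (2);
    INSERT INTO s VALUES (2), (NULL);
""")

# SQL semantics: 1 NOT IN (2, NULL) is NULL, so no row qualifies.
not_in = conn.execute(
    "SELECT a FROM t WHERE a NOT IN (SELECT b FROM s) ORDER BY a"
).fetchall()

# A plain anti-join on equality ignores the NULL and wrongly keeps a = 1.
plain_anti = conn.execute(
    "SELECT a FROM t WHERE NOT EXISTS "
    "(SELECT 1 FROM s WHERE s.b = t.a) ORDER BY a"
).fetchall()

# A null-aware variant also treats NULL on either side as a potential match.
null_aware = conn.execute(
    "SELECT a FROM t WHERE NOT EXISTS "
    "(SELECT 1 FROM s WHERE s.b = t.a OR s.b IS NULL OR t.a IS NULL) ORDER BY a"
).fetchall()
```

Only the null-aware form agrees with the NOT IN result on this data.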

Change-Id: I5a323357025d77c2143db86e1057999ec8a371c0
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4391
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
2014-09-20 16:13:49 -07:00
Dimitris Tsirogiannis
e1e874a77f IMPALA-1212 Accept subquery as LHS or RHS of between operator
This commit fixes the issue where an error was thrown if a subquery was
used in either side of a between predicate. Between predicates with
subqueries are replaced by their corresponding compound predicates
during query rewrite.

Change-Id: I4315a6e91c9306c6817bf6aa6bc1d0b586a1a067
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4246
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
2014-09-13 00:17:36 -07:00
Dimitris Tsirogiannis
b670d98e40 IMPALA-1195: IllegalStateException in query with agg scalar subquery
This commit fixes IMPALA-1195 in which an exception is thrown when a
scalar subquery is in an IS NULL predicate. With this commit we also add
support for scalar subqueries in functions and other exprs.

Change-Id: Id995e77e6561a6450c4347706e4901fb3e236cfe
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4185
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
2014-09-12 18:17:28 -07:00
Dimitris Tsirogiannis
2ab66c4ca2 Add support for uncorrelated EXISTS subqueries
This commit adds support for uncorrelated EXISTS subqueries in Impala.
Uncorrelated EXISTS subqueries are rewritten using a CROSS JOIN.
Uncorrelated NOT EXISTS subqueries are not supported.

Change-Id: I0003dcdc0fa5cc99931b9a9f4deddbcd42572490
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4140
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4186
2014-09-05 12:36:18 -07:00
Nong Li
8fbd5fe2c9 PHJ memory transfer fixes and misc bug fixes.
Row batches contain auxiliary memory that can reside in tuple pools, io buffers and
now tuple streams. Like the other resources, these need to be attached to row batches
and transferred up the operator tree to make sure the tuple ptrs are always valid.

Fixed bug in BufferedTupleStream to not delete blocks on read if it is pinned.

Fixed PHJ bug with row batch boundaries causing current_probe_row_ to be NULL.

Change-Id: I4c66d9961a117bfe3ed577de6170e875ea1feee7
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3983
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4157
2014-09-03 20:12:24 -07:00
Dimitris Tsirogiannis
d9fa1a2e60 Fix issue where subqueries return wrong results in the presence of
distinct

This commit fixes two subquery issues:
1. During the rewrite of aggregate subqueries with count, a new select
list is created for the outer select block to eliminate new visible
tuples. However, the new select list was not initialized correctly,
causing distinct clauses to not be preserved.
2. Pushing negation to operands during a query rewrite was causing a
StackOverflowError when it was encountering predicates for which a
negate function is not implemented. Consequently, it was using the
negate function from the parent class causing it to recurse infinitely.

Change-Id: I6f1b8090af40fa55b13661d637f9aaaa00dfcf5c
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4115
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4141
2014-09-03 12:25:59 -07:00
Dimitris Tsirogiannis
c2abcd6f3d Query transformation of nested queries.
This commit implements nested queries with [NOT] IN, [NOT] EXISTS and
aggregate subquery predicates in Impala. The following cases are
supported:
1. Correlated and uncorrelated [NOT] IN.
2. Correlated [NOT] EXISTS.
3. Correlated and uncorrelated aggregate subqueries.

Change-Id: Ia3f4843c5f07d4e31ef3faedc58a15e623f91a5d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3754
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4109
2014-08-29 15:35:21 -07:00
Alex Behm
bceeb834f3 IMPALA-677: Fix visibility of semi and anti-joined table references.
Semi or anti-joined table references are now only visible inside the
On-clause of the corresponding join.

Change-Id: Id93e53ecdf2a74baf9736aa427fa7af15358ca27
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3789
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-08-17 12:45:45 -07:00
Dimitris Tsirogiannis
5a6f53db16 Add partition pruning tests
The following changes are included in this commit:
1. Modified the alltypesagg table to include an additional partition key
that has nulls.
2. Added a number of tests in hdfs.test that exercise the partition
pruning logic (see IMPALA-887).
3. Modified all the tests that are affected by the change in alltypesagg.

Change-Id: I1a769375aaa71273341522eb94490ba5e4c6f00d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2874
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3236
2014-06-24 02:14:27 -07:00
Nong Li
b0de4bbe40 IMPALA-812: Fix select node to properly transfer memory ownership.
Change-Id: I83b6d085362726aa080077845d3bef71b184621c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2076
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-03-25 18:38:55 -07:00
Henry Robinson
16af29ea5f IMPALA-770: Fix crash in aggregation node with zero-width tuple
The select exprs of an inline view may not always be materialised, yet
the output tuple itself may be. This patch fixes a crash in this
situation in the backend aggregation node which assumed its output tuple
would always have at least one materialised slot.

The cause was a couple of too-conservative DCHECKs that failed if the
tuple was NULL. In fact, the code was robust to this possibility without
the checks, so this bug didn't affect release builds of Impala.

Change-Id: If0b90809d30fcd196f55197953392452d1ac9c4f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1431
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 8c1c21b66c43e900760ace54d090305f32a85a1f)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1471
Tested-by: Henry Robinson <henry@cloudera.com>
2014-02-05 22:01:35 -08:00
Alex Behm
1497002013 Added SHOW TABLE/COLUMN STATS command.
Fixed the following stats-related bugs:
- Per-partition row count was not distributed properly via CatalogService
- HBase column stats were not loaded and distributed properly

Enhancements to test framework:
- Allow regex specification of expected row or column values
- Fixed expected results of some tests because the test framework
  did not catch that they were incorrect

Change-Id: I1fa8e710bbcf0ddb62b961fdd26ecd9ce7b75d51
Reviewed-on: http://gerrit.ent.cloudera.com:8080/813
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-01-08 10:53:51 -08:00
ishaan
53cd9eadab Treat HBase as a file format for functional tests
Change-Id: Ia01181a1e10eb108419122d347e9d869a69e8922
Reviewed-on: http://gerrit.ent.cloudera.com:8080/102
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:36 -08:00
Alex Behm
937a44f9f8 IMPALA-68: Support Values() statement. 2014-01-08 10:50:31 -08:00
Alex Behm
21685d4f8f Fixed a failed Preconditions check if a join predicate has constants. 2014-01-08 10:49:52 -08:00
Alex Behm
5db3f2cdf5 IMPALA-227: SELECT * on partitioned table returns columns in different order than Hive. 2014-01-08 10:49:48 -08:00
Alex Behm
805fa50d6f IMPALA-67: Constant SELECT clauses do not work in subqueries. 2014-01-08 10:49:48 -08:00
ishaan
09d6d931f4 Change the way data is loaded 2014-01-08 10:48:09 -08:00
Lenni Kuff
837f35eab3 Updated results for more query tests to reflect proper ordering + improved result updating 2014-01-08 10:46:53 -08:00
Lenni Kuff
ef48f65e76 Add test framework for running Impala query tests via Python
This is the first set of changes required to start getting our functional test
infrastructure moved from JUnit to Python. After investigating a number of
options, I decided to go with a python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.

As part of this change, we now use our "test vectors" for query test execution.
This will be very nice because it means if you load the "core" dataset you know you
will be able to run the "core" query tests (specified by --exploration_strategy
when running the tests).

You will see that now each combination of table format + query exec options is
treated like an individual test case. This will make it much easier to debug
exactly where something failed.
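
One way to picture the expansion into individual test cases is a cross product over table formats and exec option values. The sketch below is an assumption about the general shape, not the framework's code; the function name and the option names in the example are illustrative.

```python
from itertools import product

def build_test_vectors(table_formats, exec_options):
    """Expand table formats and exec option values into one dict per test case."""
    names = sorted(exec_options)
    vectors = []
    for fmt in table_formats:
        # Every combination of option values is paired with every table format.
        for values in product(*(exec_options[n] for n in names)):
            vectors.append({"table_format": fmt, **dict(zip(names, values))})
    return vectors

vectors = build_test_vectors(
    ["text", "parquet"],
    {"batch_size": [0, 1], "num_nodes": [0, 1]},
)
```

Each resulting dict can be reported as its own test case, so a failure points at one specific combination.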

These new tests can be run using the script at tests/run-tests.sh
2014-01-08 10:46:50 -08:00
Marcel Kornacker
2fda5d9b99 IMP-491
Fixes bug in Planner.createHashJoinFragment(), which didn't set the left child of the
hj node to the output of the left child fragment.

Also: row descriptor was set incorrectly (too wide; included tuples that weren't materialized)
for roots of plan trees of non-root fragments if those fragments materialized an aggregate
2014-01-08 10:46:33 -08:00
Alan Choi
595edaa9d1 Disable all string to numeric and boolean implicit cast 2014-01-08 10:46:24 -08:00
Lenni Kuff
04edc8f534 Update benchmark tests to run against generic workload, data loading with scale factor, +more
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:

./run-benchmark --workloads=hive-benchmark,tpch

We lookup the workload in the workloads directory, then read the associated
query .test files and start executing them.

To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/*) to the workloads
directory. You do NOT need to look through all the .test files, I've just moved
them. The one new file is the 'hive-benchmark.test' which contains the hive
benchmark queries.

Also added support for generating schema for different scale factors as well as
executing against these scale factors. For example, let's say we have a dataset
with a scale factor called "SF3". We would first generate the schema using:

./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables with names distinct from those of the other scale factors.

Run the generated .sql file to load the data. Alternatively, the data can be loaded
by running a new python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: ./bin/load-data.py -w tpch -e core -s SF3

Then run against this:
./run-benchmark --workloads=<workload> --scale_factor=SF3

This changeset also includes a few other minor tweaks to some of the test
scripts.

Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6
2014-01-08 10:44:22 -08:00