IMPALA-9924: handle single subquery in or predicate

This patch supports a subset of cases of subqueries
inside OR inside WHERE and HAVING clauses.

The approach used is to rewrite the subquery into
a many-to-one LEFT OUTER JOIN with the subquery and
then replace the subquery in the expression with a
reference to the single select list expressions of
the subquery. This works because:
* A many-to-one LEFT OUTER JOIN returns one output row
  for each left input row, meaning that for every row
  in the original query before the rewrite, we get
  the same row plus a single matched row from the subquery
* Expressions can be rewritten to refer to a slotref from
  the right side of the LEFT OUTER JOIN without affecting
  semantics. E.g. an IN subquery becomes <slot> IS NOT NULL
  or <operator> (<subquery>) becomes <operator> <slot>.

This does not affect SELECT list subqueries, which are
rewritten using a different mechanism that can already
support some subqueries in disjuncts.

Correlated and uncorrelated subqueries are both supported, but
various limitations are present.
Limitations:
* Only one subquery per predicate is supported. The rewriting approach
  should generalize to multiple subqueries but other code needs
  refactoring to handle this case.
* EXISTS and NOT EXISTS subqueries are not supported. The rewriting
  approach can generalise to that, but we need to add or pick a
  select list item from the subquery to check for NULL/IS NOT NULL
  and a little more work is required to do that correctly.
* NOT IN is not supported because of the special NULL semantics.
* Subqueries with aggregates + grouping by are not supported because
  we rely on adding distinct to select list and we don't
  support distinct + aggregations because of IMPALA-5098.

Tests:
* Positive analysis tests for IN and binary predicate operators.
* Negative analysis tests for unsupported subquery operators.
* Negative analysis tests for multiple subqueries.
* Negative analysis tests for runtime scalar subqueries.
* Positive and negative analysis tests for aggregations in subquery.
* TPC-DS Query 45 planner and query tests
* Targeted planner tests for various supported queries.
* Targeted functional tests to confirm plans are executable and
  return correct result. These exercise a mix of the supported
  features - correlated/correlated, aggregate functions,
  EXISTS/comparator, etc.
* Tests for BETWEEN predicate, which is supported as a side-effect
  of being rewritten during analysis.

Change-Id: I64588992901afd7cd885419a0b7f949b0b174976
Reviewed-on: http://gerrit.cloudera.org:8080/16152
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
This commit is contained in:
Tim Armstrong
2020-07-06 18:39:34 -07:00
parent 605e301739
commit fea5dffec5
10 changed files with 1436 additions and 61 deletions

View File

@@ -1317,3 +1317,104 @@ select id, sum(int_col) from alltypestiny where
---- CATCH
Subquery must not return more than one row:
====
---- QUERY
# Basic IN subquery in OR predicate.
# Crafted so that each branch of the OR matches a handful of rows.
select id, timestamp_col from alltypes
where (timestamp_col between '2009-04-13' and '2009-04-14') or
id in (select int_col from alltypestiny)
---- TYPES
INT, TIMESTAMP
---- RESULTS
0,2009-01-01 00:00:00
1,2009-01-01 00:01:00
1020,2009-04-13 02:00:05.400000000
1021,2009-04-13 02:01:05.400000000
1022,2009-04-13 02:02:05.410000000
1023,2009-04-13 02:03:05.430000000
1024,2009-04-13 02:04:05.460000000
1025,2009-04-13 02:05:05.500000000
1026,2009-04-13 02:06:05.550000000
1027,2009-04-13 02:07:05.610000000
1028,2009-04-13 02:08:05.680000000
1029,2009-04-13 02:09:05.760000000
---- DBAPI_RESULTS
0,2009-01-01 00:00:00
1,2009-01-01 00:01:00
1020,2009-04-13 02:00:05.400000
1021,2009-04-13 02:01:05.400000
1022,2009-04-13 02:02:05.410000
1023,2009-04-13 02:03:05.430000
1024,2009-04-13 02:04:05.460000
1025,2009-04-13 02:05:05.500000
1026,2009-04-13 02:06:05.550000
1027,2009-04-13 02:07:05.610000
1028,2009-04-13 02:08:05.680000
1029,2009-04-13 02:09:05.760000
====
---- QUERY
# Subquery in OR predicate inside non-trivial expression.
# Crafted so that each branch of the OR matches a few rows.
select year, id, int_col, string_col, date_string_col
from alltypes
where (int_col = 9 and date_string_col > '12/31/00') or
(year = 2010 and date_string_col > '12/28/10' and
string_col in (select min(string_col) from alltypestiny))
---- TYPES
INT, INT, INT, STRING, STRING
---- RESULTS
2009,3649,9,'9','12/31/09'
2010,7270,0,'0','12/29/10'
2010,7280,0,'0','12/30/10'
2010,7290,0,'0','12/31/10'
2010,7299,9,'9','12/31/10'
====
---- QUERY
# Subquery in HAVING clause.
# Crafted so that each branch of the OR matches one row.
select year, month, min(id)
from alltypes
group by year, month
having (year = 2010 and month > 11) or
min(id) in (select int_col from alltypestiny)
---- TYPES
INT, INT, INT
---- RESULTS
2010,12,6990
2009,1,0
====
---- QUERY
# Comparator-based subquery with correlated predicate in disjunction.
select id from alltypes t
where id % 1234 = 0 or
t.id <= (select min(id) from alltypesagg g where t.int_col = g.int_col)
---- TYPES
INT
---- RESULTS
6170
2468
4936
1234
0
1
2
3
4
5
6
7
8
9
3702
====
---- QUERY
# Subquery that only returns NULL values. The IN predicate should always
# evaluate to false in this case, because the hand-picked values have a null 'int_col'.
select id, int_col from functional.alltypes t
where t.id = 42 or t.int_col IN (
select int_col from functional.alltypesagg where id in (1000,2000,3000,4000))
---- TYPES
INT, INT
---- RESULTS
42,2
====