If we unnest an array coming from a UNION ALL, we read invalid memory
and in ASAN builds we crash.
Example:
with v as (select arr1 from complextypes_arrays
union all select arr1 from complextypes_arrays)
select am.item from v, v.arr1 am;
The problem seems to be that in the item tuple of the collections, the
item slots are present twice. This is because both the inline view
analyzer and the main analyzer add slots with the same path to the
tuple. This is possible because
- the target tuple is determined based on the path via
Path.getRootDesc(), so it will be the same both in the inline view
and in the main scope
AND
- the inline view analyzer and the main one do not share
'slotPathMap_', so the analyzer cannot recognise that a slot for the
path has already been added.
This commit solves the problem by checking whether the target tuple
already contains a slot with the same path, and if it does, reusing
that slot. Note, however, that when Analyzer.registerSlotRef() is called
with 'duplicateIfCollections=true', a separate slot is added for
collections which should not be reused. This commit adds a set,
'duplicateCollectionSlots', in Analyzer.GlobalState to keep track of
such collection slots, and these slots are never reused.
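The reuse rule described above can be illustrated with a small Python model. This is only a sketch, not the Java code in the Impala frontend: the class names, the function signature and the 'is_collection' flag are hypothetical, and slot paths are modelled as plain tuples of strings.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    slot_id: int
    path: tuple


@dataclass
class TupleDesc:
    slots: list = field(default_factory=list)


class GlobalState:
    """Hypothetical stand-in for the state shared by all analyzers."""

    def __init__(self):
        self.next_slot_id = 0
        # Ids of slots created with duplicate_if_collections=True.
        self.duplicate_collection_slots = set()


def _add_slot(state, tuple_desc, path):
    slot = Slot(state.next_slot_id, path)
    state.next_slot_id += 1
    tuple_desc.slots.append(slot)
    return slot


def register_slot_ref(state, tuple_desc, path,
                      is_collection=False, duplicate_if_collections=False):
    path = tuple(path)
    if is_collection and duplicate_if_collections:
        # Always create a separate slot and remember that it must not be reused.
        slot = _add_slot(state, tuple_desc, path)
        state.duplicate_collection_slots.add(slot.slot_id)
        return slot
    # Look in the target tuple itself, so the check works even when the
    # calling analyzers do not share a path-to-slot map.
    for slot in tuple_desc.slots:
        if slot.path == path and slot.slot_id not in state.duplicate_collection_slots:
            return slot
    return _add_slot(state, tuple_desc, path)


# The inline view analyzer and the main analyzer register the same path.
state = GlobalState()
item_tuple = TupleDesc()
from_view = register_slot_ref(state, item_tuple, ["arr1", "item"])
from_main = register_slot_ref(state, item_tuple, ["arr1", "item"])
assert from_view is from_main and len(item_tuple.slots) == 1
```

Because the lookup is done on the tuple rather than on a per-analyzer map, the second registration finds the slot added by the first one.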
Note that there is another bug, IMPALA-12753: a predicate on the
collection item in the above query is only enforced on the first child
of the union. Therefore this commit disallows placing a predicate on a
collection item when the unnested collection comes from a union.
Testing:
- added test queries in nested-array-in-select-list.test,
nested-map-in-select-list.test, zipping-unnest-in-from-clause.test
and zipping-unnest-in-select-list.test
Change-Id: I340adc50e6d7cda6f59dacd7a46b6adc31635d46
Reviewed-on: http://gerrit.cloudera.org:8080/20953
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In ASAN builds, if we UNION ALL an array containing a struct of a string
with itself, Impala crashes. This is how to reproduce it:
In Hive:
create table su (arr ARRAY<STRUCT<s: STRING>>) stored as parquet;
insert into su values (array(named_struct("s", "A")));
In Impala:
select 1, arr from su
union all select 2, arr from su;
The ASAN error message indicates a heap-use-after-free.
Normally, UNIONs of structs are not supported yet (see IMPALA-10752),
but if the struct is inside an array it is allowed now. This was
probably not intentional and it leads to the above error, so this change
disables structs in unions completely, including embedded structs.
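A check for embedded structs has to walk the whole type tree. The following Python sketch shows the idea under simplifying assumptions: types are modelled as nested tuples, and the function names and the error text are hypothetical, not taken from Impala.

```python
def contains_struct(type_desc):
    """Return True if the type is a struct or has a struct nested anywhere in it."""
    kind = type_desc[0]
    if kind == "struct":
        return True
    if kind == "array":
        return contains_struct(type_desc[1])
    if kind == "map":
        return contains_struct(type_desc[1]) or contains_struct(type_desc[2])
    return False  # scalar type


def check_union_operand_types(column_types):
    """Reject a union operand if any of its columns contains a struct."""
    for type_desc in column_types:
        if contains_struct(type_desc):
            # Hypothetical message, for illustration only.
            raise ValueError("structs are not allowed in a union: %r" % (type_desc,))


# ARRAY<STRUCT<s: STRING>>, as in the 'su' table above.
arr_of_struct = ("array", ("struct", {"s": ("string",)}))
assert contains_struct(arr_of_struct)
assert not contains_struct(("array", ("int",)))
```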
Testing:
- adjusted existing tests
- added a query that tests that types with embedded structs are not
allowed in a UNION statement, in mixed-collections-and-structs.test
Change-Id: Id728f1254b74636be594a33313a478b0b77c7ae4
Reviewed-on: http://gerrit.cloudera.org:8080/20970
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this change, queries with SELECT DISTINCT on a complex type
failed.
With structs, we got a FE exception:
use functional_parquet;
select distinct(struct_val) from alltypes_structs;
ERROR: IllegalStateException: null
With collections, the BE hits a DCHECK and crashes:
use functional_parquet;
select distinct(arr1) from complextypes_arrays;
Socket error 104: [Errno 104] Connection reset by peer
Aggregate functions with complex DISTINCT parameters also failed without
a clear error message. For example:
select count(distinct struct_val) from alltypes_structs;
select count(distinct arr1) from complextypes_arrays;
To support DISTINCT for complex types we would need to implement
equality and hash for them. We are not planning to do it in the near
future, so this change introduces informative error messages in these
cases.
Testing:
- added test queries for SELECT DISTINCT and SELECT COUNT(DISTINCT ...)
with arrays, maps and structs, expecting the correct error messages.
Change-Id: Ibe2642d1683a10fd05a95e2ad8470d16f0d5242c
Reviewed-on: http://gerrit.cloudera.org:8080/20752
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-12019 implemented support for collections of fixed length types
in the sorting tuple. This change implements it for collections of
variable length types.
Note that the limitation that structs that contain any type of
collection are not allowed in the sorting tuple is still in place (see
IMPALA-12160).
Note that sorting by complex types was not allowed before and is still
not allowed. This change only allows them to be present in the select
list when sorting by some other expression.
This change also allows collections of variable length types to be
non-passthrough children of UNION ALL nodes.
Testing:
- Renamed the 'simple_arrays_big' table to 'arrays_big' and extended it
with collections containing variable length types. This table is
mainly used to test that spilling works during sorting.
- Renamed
test_sort.py::TestArraySort::{test_simple_arrays,
test_simple_arrays_with_limit}
to {test_array_sort,test_array_sort_with_limit}
- Extended the tests run in test_queries.py::TestQueries::{test_sort,
test_top_n,test_partitioned_top_n} with collections containing
var-len types.
- Added tests in sort-complex.test that assert that it is not allowed
to sort by collections. For structs we already have such tests in
struct-in-select-list.test.
Change-Id: Ic15b29393f260b572e11a8dbb9deeb8c02981852
Reviewed-on: http://gerrit.cloudera.org:8080/20108
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-12373 introduces small string optimisation, after which not all
strings will have a var-len part.
IMPALA-12159 adds support for ORDER BY for collections of variable
length types in the select list, but the test tables it uses only/mostly
contain short strings.
This patch has two modifications:
1. It introduces longer strings in 'collection_tbl' and
'collection_struct_mix'. It also adds two more rows to the existing one
in 'collection_tbl' so that it can be used in sorting tests. These
tables are only used by complex types tests, so the impact is limited.
2. It modifies RandomNestedDataGenerator.java, so that now it takes a
parameter for string length. Some variable names are changed to clearer
names. The references to and uses of RandomNestedDataGenerator are
updated.
Change-Id: Ief770d6bc9258fce159a733d5afa34fe594b96f8
Reviewed-on: http://gerrit.cloudera.org:8080/20718
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-12019 implemented support for collections of fixed length types
in the sorting tuple. This was made possible by implementing the
materialisation of these collections.
Building on this, this change allows such collections as non-passthrough
children of UNION ALL operations. Note that plain UNIONs are not
supported for any collections for other reasons and this patch does not
affect them or any other set operation.
Testing:
Tests in nested-array-in-select-list.test and
nested-map-in-select-list.test check that
- the newly allowed cases work correctly and
- the correct error message is given for collections of variable length
types.
Change-Id: I14c13323d587e5eb8a2617ecaab831c059a0fae3
Reviewed-on: http://gerrit.cloudera.org:8080/19903
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As a first stage of IMPALA-10939, this change implements support for
including in the sorting tuple top-level collections that only contain
fixed length types (including fixed length structs). For these types the
implementation is almost the same as the existing handling of strings.
Another limitation is that structs that contain any type of collection
are not yet allowed in the sorting tuple.
Also refactored the RawValue::Write*() functions to have a clearer
interface.
Testing:
- Added a new test table that contains many rows with arrays. This is
queried in a new test added in test_sort.py, to ensure that we handle
spilling correctly.
- Added tests that have arrays and/or maps in the sorting tuple in
test_queries.py::TestQueries::{test_sort,
test_top_n,test_partitioned_top_n}.
Change-Id: Ic7974ef392c1412e8c60231e3420367bd189677a
Reviewed-on: http://gerrit.cloudera.org:8080/19660
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
NULL values are printed as "NULL" if they are top level or in
collections, but as "null" in structs. We should print collections and
structs in JSON form, so it should be "null" in collections, too. Hive
also follows the latter (correct) approach.
This commit changes the printing of NULL values to "null" in
collections.
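The intended output format can be illustrated with Python's json module, which also prints missing values as "null". This is only a model of the output, not the BE printing code; the function names are hypothetical, and the top-level case simply mirrors the "NULL" spelling mentioned above.

```python
import json


def print_collection(value):
    # Compact separators give the JSON form without spaces;
    # None is serialised as null.
    return json.dumps(value, separators=(",", ":"))


def print_top_level(value):
    # A top-level NULL keeps the SQL spelling; only values inside
    # collections follow JSON.
    if value is None:
        return "NULL"
    if isinstance(value, (list, dict)):
        return print_collection(value)
    return str(value)


assert print_top_level(None) == "NULL"
assert print_top_level([1, None, 3]) == "[1,null,3]"
```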
Testing:
- Modified the tests to expect "null" instead of "NULL" in collections.
Change-Id: Ie5e7f98df4014ea417ddf73ac0fb8ec01ef655ba
Reviewed-on: http://gerrit.cloudera.org:8080/19236
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Non-matching rows from the left side will null out all slots from the
right side in left outer joins. If the right side is a subquery, it
is possible that some returned expressions will be non-NULL even if all
slots are NULL (e.g. constants) - these expressions are wrapped as
IF(TupleIsNull(tids), NULL, expr) to null them in the non-matching
case.
The logic above used to hit a precondition for complex types. We can
safely ignore complex types for now, as currently the only possible
expression that returns a complex type is SlotRef, which doesn't
need to be wrapped. We will have to revisit this once functions are
added that return complex types.
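A simplified Python model of the wrapping decision follows. It is a sketch under assumptions: expressions are represented by their SQL text, and the 'Expr' class and the function name are hypothetical. The real analysis of whether an expression can be non-NULL on NULL slots is more involved and is reduced here to an 'is_slot_ref' flag.

```python
from dataclasses import dataclass

COMPLEX_TYPES = {"ARRAY", "MAP", "STRUCT"}


@dataclass(frozen=True)
class Expr:
    sql: str
    type_name: str
    is_slot_ref: bool = False


def wrap_for_outer_join(expr, tids):
    """Return the SQL of 'expr', wrapped so that it is NULL for non-matching rows."""
    if expr.type_name in COMPLEX_TYPES:
        # Currently only a SlotRef can return a complex type, and a SlotRef
        # is already NULL when its tuple is nulled out, so skip the wrapping.
        return expr.sql
    if expr.is_slot_ref:
        return expr.sql
    tid_list = ", ".join(str(tid) for tid in tids)
    return "IF(TupleIsNull(%s), NULL, %s)" % (tid_list, expr.sql)


# A constant from the right-hand subquery must be wrapped ...
assert wrap_for_outer_join(Expr("1", "INT"), [2]) == "IF(TupleIsNull(2), NULL, 1)"
# ... while a complex-typed slot reference is left as it is.
assert wrap_for_outer_join(Expr("v.arr", "ARRAY", True), [2]) == "v.arr"
```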
Testing:
- added a regression test and ran it
Change-Id: Iaa8991cd4448d5c7ef7f44f73ee07e2a2b6f37ce
Reviewed-on: http://gerrit.cloudera.org:8080/18954
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds support for MAP types in the select list.
An example of how maps are printed:
{"k1":2,"k2":null}
Nested collection types (maps and arrays) are supported in any
combination. However, structs in collections and collections in structs
are not supported.
Limitations (other than map support) as described in the commit for
IMPALA-9498 still apply. The following are to be implemented later:
- Unify HS2 / Beeswax logic with the way STRUCTs are handled.
This could be done in a "final" logic that can handle
STRUCTs/ARRAYs nested in each other
- Implement "deep copy" and "deep serialize" for collections in BE.
This would enable all operators, e.g. ORDER BY and UNION.
Testing:
- modified the FE tests that checked that maps were not allowed in the
select list - the tests now expect maps to be allowed there
- added FE and EE tests involving maps based on the array tests
Change-Id: I921c647f1779add36e7f5df4ce6ca237dcfaf001
Reviewed-on: http://gerrit.cloudera.org:8080/18736
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Arrays with more than one dimension in the select list tried to
register a CollectionTableRef with the name "item" for the inner
arrays, leading to a name collision if there was more than one such
array. The logic is changed to always use the full path as the implicit
alias in CollectionTableRefs backing arrays in the select list.
As a side effect this leads to using the fully qualified names
in expressions in the explain plans of queries that use arrays
from views. This is not an intended change, but I don't consider
it to be critical. Created IMPALA-11452 to deal with more
sophisticated alias handling in collections.
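The collision and the fix can be modelled in a few lines of Python. This is an illustration only: the function, the error text and the example paths are hypothetical, and real alias registration happens in the Java analyzer.

```python
def register_aliases(paths, use_full_path):
    """Register one implicit alias per collection path; fail on a collision."""
    aliases = {}
    for path in paths:
        # Old behaviour: the last path element ("item") is the alias.
        # New behaviour: the full dotted path is the alias.
        alias = ".".join(path) if use_full_path else path[-1]
        if alias in aliases:
            raise ValueError("duplicate table alias: " + alias)
        aliases[alias] = path
    return aliases


# Two nested arrays whose inner arrays are both reached via "item".
inner_arrays = [("t", "arr_a", "item"), ("t", "arr_b", "item")]

try:
    register_aliases(inner_arrays, use_full_path=False)
    collided = False
except ValueError:
    collided = True
assert collided

assert sorted(register_aliases(inner_arrays, use_full_path=True)) == [
    "t.arr_a.item", "t.arr_b.item"]
```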
Testing:
- added a new table to testdata and a regression test
Change-Id: I6f2b6cad51fa25a6f6932420eccf1b0a964d5e4e
Reviewed-on: http://gerrit.cloudera.org:8080/18734
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Predicates on unnested arrays are expected to be picked up for
evaluation by either the SCAN node or the UNNEST node: the SCAN node if
only one array is being unnested, the UNNEST node otherwise. However, if
a JOIN node is involved and it is constructed before the UNNEST node is
created, the JOIN node incorrectly picks up the predicates for the
unnested arrays as well. This patch fixes this behaviour.
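The intended assignment rule can be written down as a small Python sketch. It only models the decision described above, not the planner code. The function names are hypothetical, and each predicate is represented as a pair of its SQL text and a flag telling whether it refers to an unnested array.

```python
def owner_of_unnest_predicates(num_unnested_arrays):
    """Node that should evaluate predicates on unnested arrays."""
    return "SCAN" if num_unnested_arrays == 1 else "UNNEST"


def predicates_for_join(candidates):
    """Predicates a JOIN node may pick up: everything except those that
    refer to unnested arrays, which belong to the SCAN or UNNEST node."""
    return [sql for sql, on_unnested_array in candidates if not on_unnested_array]


candidates = [
    ("t1.id = t2.id", False),
    ("arr1.item > 5", True),
]
assert owner_of_unnest_predicates(1) == "SCAN"
assert owner_of_unnest_predicates(2) == "UNNEST"
assert predicates_for_join(candidates) == ["t1.id = t2.id"]
```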
Tests:
- Added E2E tests to cover result correctness.
- Added planner tests to verify that the desired node picks up the
predicates for unnested arrays.
Change-Id: I89fed4eef220ca513b259f0e2649cdfbe43c797a
Reviewed-on: http://gerrit.cloudera.org:8080/18614
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Until now ARRAYs had to be unnested in queries. This patch adds
support to return ARRAYs as STRINGs (JSON arrays) in select list,
for example:
select id, int_array from functional_parquet.complextypestbl where id = 1;
returns: 1, [1,2,3]
Returning ARRAYs from inline or HMS views is also supported -
these arrays can be used either in the select list or as relative
table references. Using them as non-relative table references is
not supported (IMPALA-11052).
Though STRUCTs are already supported, ARRAYs and STRUCTs nested in
each other are not supported yet.
Things intentionally postponed for later commits:
- Add MAP support too - this shouldn't be too tricky after
ARRAY support, but I don't want to make this patch even more
complex.
- Unify HS2 / Beeswax logic with the way STRUCTs are handled.
This could be done in a "final" logic that can handle
STRUCTs/ARRAYs nested in each other
- Implement "deep copy" and "deep serialize" for ARRAYs in BE.
This would enable all operators, e.g. ORDER BY and UNION.
Testing:
- FE tests were added for analyses and authorization
- EE tests were added
- core tests were run
Change-Id: Ibb1e42ffb21c7ddc033aba0f754b0108e46f34d0
Reviewed-on: http://gerrit.cloudera.org:8080/17811
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>