impala

jprdonnelly/impala


mirror of https://github.com/apache/impala.git synced 2026-02-02 06:00:36 -05:00

Commit Graph


Author

SHA1

Message

Date

Daniel Becker

9baf790606

IMPALA-10838: Fix substitution and improve unification of struct slots

The following query fails:
'''
with sub as (
    select id, outer_struct
    from functional_orc_def.complextypes_nested_structs)
select sub.id, sub.outer_struct.inner_struct2 from sub;
'''

with the following error:
'''
ERROR: IllegalStateException: Illegal reference to non-materialized
tuple: debugname=InlineViewRef sub alias=sub tid=6
'''

while if 'outer_struct.inner_struct2' is added to the select list of the
inline view, the query works as expected.

This change fixes the problem with two modifications:
  - if a field of a struct needs to be materialised, also materialise
    all of its enclosing structs (ancestors)
  - in InlineViewRef, struct fields are inserted into the 'smap' and
    'baseTableSmap' with the appropriate inline view prefix

This change also changes the way struct fields are materialised: until
now, if a member of a struct needed to be materialised, the whole
struct, including its other members, was materialised. This behaviour
can use significantly more memory than necessary, for example when we
query a single member of a large struct. This change modifies this
behaviour so that we only materialise the struct members that are
actually needed.
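
The materialisation rule described above can be sketched as follows. This
is a minimal Python illustration, not Impala's planner code: the function
name paths_to_materialise and the representation of struct fields as
dotted paths are assumptions made for this example.

```python
def paths_to_materialise(required_paths):
    """Return the set of dotted paths that must be materialised.

    Hypothetical helper: each required field pulls in all of its
    enclosing structs (ancestors), but none of its siblings.
    """
    result = set()
    for path in required_paths:
        parts = path.split(".")
        # Add the field itself and every enclosing struct above it.
        for end in range(1, len(parts) + 1):
            result.add(".".join(parts[:end]))
    return result


needed = paths_to_materialise(["outer_struct.inner_struct2"])
print(sorted(needed))
```

For the failing query, this yields 'outer_struct' and
'outer_struct.inner_struct2', while siblings such as
'outer_struct.inner_struct1' are left out. The sketch does not expand the
members of the requested struct itself.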

Tests:
  - added queries that are fixed by this change (including the one
    above) in nested-struct-in-select-list.test
  - added a planner test in
    fe/src/test/java/org/apache/impala/planner/PlannerTest.java that
    asserts that only the required parts of structs are materialised

Change-Id: Iadb9233677355b85d424cc3f22b00b5a3bf61c57
Reviewed-on: http://gerrit.cloudera.org:8080/17847
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Daniel Becker <daniel.becker@cloudera.com>

2022-05-02 07:21:37 +00:00

Daniel Becker

c802be42b6

IMPALA-10839: NULL values are displayed on a wrong level for nested structs (ORC)

When querying a non-toplevel nested struct from an ORC file, the NULL
values are displayed at an incorrect level. E.g.:

select id, outer_struct.inner_struct3 from
functional_orc_def.complextypes_nested_structs where id >= 4;
+----+----------------------------+
| id | outer_struct.inner_struct3 |
+----+----------------------------+
| 4  | {"s":{"i":null,"s":null}}  |
| 5  | {"s":null}                 |
+----+----------------------------+

However, in the first row it is expected that 's' should be null and not
its members; in the second row the result should be 'NULL', i.e.
'outer_struct.inner_struct3' is null.
For reference see what is returned when querying 'outer_struct' instead
of 'outer_struct.inner_struct3':

+----+-------------------------------------------------------------------------------------------------------------------------------+
| 4  | {"str":"","inner_struct1":{"str":"somestr2","de":12345.12},"inner_struct2":{"i":1,"str":"string"},"inner_struct3":{"s":null}} |
| 5  | {"str":null,"inner_struct1":null,"inner_struct2":null,"inner_struct3":null}                                                   |
+----+-------------------------------------------------------------------------------------------------------------------------------+

The problem comes from the incorrect handling of the different depths of
the following trees:
 - the ORC type hierarchy (schema)
 - the tuple descriptor / slot descriptor hierarchy
The depths differ because the ORC type hierarchy contains a node for
every level in the schema, but the tuple/slot descriptor hierarchy omits
the levels of structs that are not in the select list themselves (only a
descendant of theirs is), as these structs are not materialised.

In the case of the example query, the two hierarchies are the following:
ORC:
 root --> outer_struct -> inner_struct3 -> s --> i
      |                                      \-> s
      \-> id
Tuple/slot descriptors:
 main_tuple --> inner_struct3 -> s --> i
            |                      \-> s
            \-> id

We create 'OrcColumnReader's for each node in the ORC type tree. Each
OrcColumnReader is assigned an ORC type node and a slot descriptor. The
incorrect behaviour comes from the incorrect pairing of ORC type nodes
with slot descriptors.

The old behaviour is described below:
Starting from the root, going along a path in both trees (for example
the path leading to outer_struct.inner_struct3.s.i), for each step we
consume a level in both trees until no more nodes remain in the
tuple/slot desc tree, and then we pair the last element from that tree
with the remaining ORC type node(s).

In the example, we get the following pairs:
(root, main_tuple) -> (outer_struct, inner_struct3) ->
(inner_struct3, s) -> (s, i) -> (i, i)
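
The old pairing can be reproduced with a small sketch. This is
illustrative Python, not the scanner's C++ code; the function name
pair_old and the representation of both trees as a single root-to-leaf
path of names are assumptions made for this example.

```python
def pair_old(orc_path, slot_path):
    """Pair nodes level by level along one path.

    Once the tuple/slot descriptor path runs out, its last element is
    reused for every remaining ORC type node (the buggy behaviour).
    """
    last = len(slot_path) - 1
    return [(orc_node, slot_path[min(depth, last)])
            for depth, orc_node in enumerate(orc_path)]


orc_path = ["root", "outer_struct", "inner_struct3", "s", "i"]
slot_path = ["main_tuple", "inner_struct3", "s", "i"]
for pair in pair_old(orc_path, slot_path):
    print(pair)
```

Every ORC node below the root ends up paired with a descriptor one level
too deep, which is the mismatch described in this commit message.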

When we run out of structs in the tuple/slot desc tree, we still create
OrcStructReaders (because the ORC type is still a struct, but the slot
descriptor now refers to an INT), but we mark them incorrectly as
non-materialised.

Also, the OrcStructReaders for non-materialised structs do not need to
check for null-ness as they are not present in the select list, only
their descendants, and the ORC batch object stores null information also
for the descendants of null values.

Let's look at the row with id 4 in the example:
Because of the bug, the non-materialising OrcStructReader appears at the
level of the (s, i) pair, so the 's' struct is not checked for
null-ness, although it is actually null. One level lower, for 'i' (and
the inner 's' string field), the ORC batch object tells us that the
values are null (because their parent is). Therefore the nulls appear
one level lower than they should.

The correct behaviour is that ORC type nodes are paired with slot
descriptors if either
 - the ORC type node matches the slot descriptor (they refer to the same
   node in the schema) or
 - the slot descriptor is a descendant of the schema node that the ORC
   type node refers to.

This patch fixes the incorrect pairing of ORC types and slot
descriptors, so we have the following pairs:
(root, main_tuple) -> (outer_struct, main_tuple) ->
(inner_struct3, inner_struct3) -> (s, s) -> (i, i)

In this case the OrcStructReader for the pair (outer_struct, main_tuple)
becomes non-materialising and the one for (s, s) will be materialising,
so the 's' struct will also be null-checked, recognising null-ness at
the correct level.
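
The corrected pairing rule can be sketched in the same style. Again this
is illustrative Python under stated assumptions, not the actual patch:
pair_fixed, the (name, schema_path) tuples and the 'matched' flag are
hypothetical names introduced for this example.

```python
def pair_fixed(orc_names, slots):
    """Pair each ORC type node on a path with a descriptor.

    orc_names: node names from the ORC root down to a leaf.
    slots: (name, schema_path) tuples, outermost first. slots[0] is the
           top-level tuple, whose schema path is the empty tuple.
    Returns (orc_name, descriptor_name, matched) triples. matched is
    False when only descendants of the ORC node have descriptors, which
    for a struct node corresponds to a non-materialising reader.
    """
    pairs = []
    current = -1  # index of the descriptor currently in effect
    for depth, orc_name in enumerate(orc_names):
        node_path = tuple(orc_names[1:depth + 1])  # path below the root
        nxt = current + 1
        matched = nxt < len(slots) and slots[nxt][1] == node_path
        if matched:
            # ORC node and descriptor refer to the same schema node.
            current = nxt
        # Otherwise keep the enclosing descriptor for this level.
        pairs.append((orc_name, slots[current][0], matched))
    return pairs


orc_names = ["root", "outer_struct", "inner_struct3", "s", "i"]
slots = [
    ("main_tuple", ()),
    ("inner_struct3", ("outer_struct", "inner_struct3")),
    ("s", ("outer_struct", "inner_struct3", "s")),
    ("i", ("outer_struct", "inner_struct3", "s", "i")),
]
for triple in pair_fixed(orc_names, slots):
    print(triple)
```

Only 'outer_struct' is reported as unmatched, so it is the only level
that skips the null check, and 's' is paired with its own descriptor.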

This commit also fixes some comments in be/src/exec/orc-column-readers.h
and be/src/exec/hdfs-orc-scanner.h mentioning the field
HdfsOrcScanner::col_id_path_map_, which has been removed by
"IMPALA-10485: part(1): make ORC column reader creation independent of
schema resolution".

Testing:
  - added tests to
    testdata/workloads/functional-query/queries/QueryTest/nested-struct-in-select-list.test
    that query various levels of the struct 'outer_struct' to check that
    NULLs are at the correct level.

Change-Id: Iff5034e7bdf39c036aecc491fbd324e29150f040
Reviewed-on: http://gerrit.cloudera.org:8080/18403
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

2022-04-21 13:59:17 +00:00

Gabor Kaszab

1e21aa6b96

IMPALA-9495: Support struct in select list for ORC tables

This patch allows structs in the select list of inline views and
topmost blocks. When the value of a struct is displayed, it is
formatted as a JSON value and returned as a string. An example of such
a value:

SELECT struct_col FROM some_table;
'{"int_struct_member":12,"string_struct_member":"string value"}'

Another example where we query a nested struct:
SELECT outer_struct_col FROM some_table;
'{"inner_struct":{"string_member":"string value","int_member":12}}'

Note that the conversion from struct to JSON happens on the server side
before the value is sent to the client in HS2. However, HS2 is capable
of handling struct values as well, so in a later change we might want
to add functionality to send the struct in Thrift to the client so that
the client can use the struct directly.
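
The shape of the returned string can be illustrated with a short sketch.
It uses Python's standard json module only to mirror the compact output
shown in the examples; Impala performs the conversion in its own
backend, and the helper name struct_to_json is an assumption made for
this example.

```python
import json


def struct_to_json(value):
    """Render a struct (modelled as a dict) as compact JSON text.

    Nested structs are nested dicts; a NULL member is None.
    """
    return json.dumps(value, separators=(",", ":"))


flat = {"int_struct_member": 12, "string_struct_member": "string value"}
nested = {"inner_struct": {"string_member": "string value", "int_member": 12}}
print(struct_to_json(flat))
print(struct_to_json(nested))
```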

-- Internal representation of a struct:
When scanning a struct the rowbatch will hold the values of the
struct's children as if they were queried one by one directly in the
select list.

E.g. Taking the following table:
CREATE TABLE tbl (id int, s struct<a:int,b:string>) STORED AS ORC

And running the following query:
SELECT id, s FROM tbl;

After scanning, a row in the row batch will hold the following values
(note that the biggest size comes first):
 1: The pointer for the string in s.b
 2: The length for the string in s.b
 3: The int value for s.a
 4: The int value of id
 5: A single null byte for all the slots: id, s, s.a, s.b

The size of a struct affects the order of the memory layout of a row
batch. The struct size is calculated by summing the sizes of its
fields, and the struct is then placed in the row batch before all
slots that are smaller than it. Note that all the fields of a struct
are consecutive in the row batch. Inside a struct, the order of the
fields is also based on their size, as in the regular case for
primitives.
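
The ordering rule can be sketched as below. This is a simplified Python
model, not Impala's layout code: the slot sizes (4 bytes for an int, an
8-byte pointer plus a 4-byte length for a string), the absence of
alignment padding, and the names SIZES, slot_size and layout are all
assumptions made for this example.

```python
# Assumed slot sizes in bytes; padding and alignment are ignored.
SIZES = {"int": 4, "string": 12}


def slot_size(slot):
    """Size of a slot; a struct is the sum of its fields' sizes."""
    if "fields" in slot:
        return sum(slot_size(f) for f in slot["fields"])
    return SIZES[slot["type"]]


def layout(slots, prefix="", offset=0):
    """Return ([(name, offset), ...], next_free_offset).

    Slots are placed largest first. A struct's fields stay consecutive
    and are ordered by size among themselves.
    """
    placed = []
    for slot in sorted(slots, key=slot_size, reverse=True):
        name = prefix + slot["name"]
        if "fields" in slot:
            inner, offset = layout(slot["fields"], name + ".", offset)
            placed.extend(inner)
        else:
            placed.append((name, offset))
            offset += slot_size(slot)
    return placed, offset


# CREATE TABLE tbl (id int, s struct<a:int,b:string>)
table = [
    {"name": "id", "type": "int"},
    {"name": "s", "fields": [
        {"name": "a", "type": "int"},
        {"name": "b", "type": "string"},
    ]},
]
placed, null_byte_offset = layout(table)
print(placed, null_byte_offset)
```

Under these assumptions the order is s.b, s.a, id, with the null
indicator byte following the last value, which matches the listing in
this commit message.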

When evaluating a struct as a SlotRef a newly introduced StructVal will
be used to refer to the actual values of a struct in the row batch.
This StructVal holds a vector of pointers where each pointer represents
a member of the struct. Following the above example the StructVal would
keep two pointers, one to point to an IntVal and one to point to a
StringVal.

-- Changes related to tuple and slot descriptors:
When providing a struct in the select list there is going to be a
SlotDescriptor for the struct slot in the topmost TupleDescriptor.
Additionally, another TupleDescriptor is created to hold
SlotDescriptors for each of the struct's children. The struct
SlotDescriptor points to the newly introduced TupleDescriptor using
'itemTupleId'.
The offsets for the children of the struct are calculated from the
beginning of the topmost TupleDescriptor and not from the
TupleDescriptor that directly holds the struct's children. The null
indicator bytes are also stored on the level of the topmost
TupleDescriptor.

-- Changes related to scalar expressions:
A struct in the select list is translated into an expression tree where
the top of the tree is a SlotRef for the struct itself and its
children in the tree are SlotRefs for the members of the struct. When
a struct SlotRef is evaluated, the evaluation is delegated to the
child SlotRefs after the null checks.
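
The delegation can be sketched as follows. This is a toy Python model
and not Impala's SlotRef class: the constructor, the 'values' dict keyed
by dotted slot path and the 'nulls' set are assumptions made for this
example.

```python
class SlotRef:
    """Toy expression node: a leaf reads a value, a struct delegates."""

    def __init__(self, path, children=()):
        self.path = path                # dotted slot path, e.g. "s.a"
        self.children = list(children)  # member SlotRefs for a struct

    def evaluate(self, values, nulls):
        # Null check first: a null struct hides all of its members.
        if self.path in nulls:
            return None
        if self.children:
            # Struct: delegate the evaluation to the member SlotRefs.
            return {child.path.rsplit(".", 1)[-1]: child.evaluate(values, nulls)
                    for child in self.children}
        return values[self.path]


s = SlotRef("s", [SlotRef("s.a"), SlotRef("s.b")])
row = {"s.a": 12, "s.b": "string value"}
print(s.evaluate(row, nulls=set()))
print(s.evaluate(row, nulls={"s"}))
```

A null struct evaluates to None without consulting its members, while a
null member only affects its own entry.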

-- Restrictions:
  - Codegen support is not included in this patch.
  - Only the ORC file format is supported by this patch.
  - Only the HS2 client supports returning structs. Beeswax support is
    not implemented as it is going to be deprecated anyway. Currently
    we receive an error when trying to query a struct through Beeswax.

-- Tests added:
  - The ORC and Parquet functional databases are extended with 3 new
    tables:
    1: A small table with one-level structs, holding different
    kinds of primitive types as members.
    2: A small table with 2- and 3-level nested structs.
    3: A bigger, partitioned table constructed from alltypes where all
    the columns except the 'id' column are put into a struct.
  - struct-in-select-list.test and nested-struct-in-select-list.test
    use these new tables to query structs directly or through an
    inline view.

Change-Id: I0fbe56bdcd372b72e99c0195d87a818e7fa4bc3a
Reviewed-on: http://gerrit.cloudera.org:8080/17638
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

2021-09-14 21:21:47 +00:00

3 Commits