impala/testdata/datasets
Gabor Kaszab 1e21aa6b96 IMPALA-9495: Support struct in select list for ORC tables
This patch implements the functionality to allow structs in the select
list of inline views and topmost blocks. When displaying the value of a
struct it is formatted into a JSON value and returned as a string. An
example of such a value:

SELECT struct_col FROM some_table;
'{"int_struct_member":12,"string_struct_member":"string value"}'

Another example where we query a nested struct:
SELECT outer_struct_col FROM some_table;
'{"inner_struct":{"string_member":"string value","int_member":12}}'

Note that the conversion from struct to JSON happens on the server side
before the value is sent to the client over HS2. However, HS2 is
capable of handling struct values as well, so a later change might
add functionality to send the struct in Thrift to the client
so that the client can use the struct directly.
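
As an illustration of the output format only (not Impala's server-side code), the two example values above can be reproduced with a minimal Python sketch. The helper name struct_to_json is hypothetical, and a struct is modelled as a plain dict:

```python
import json


def struct_to_json(struct_value):
    """Serialize a struct (modelled as a dict) to compact JSON.

    The separators drop the spaces json.dumps adds by default, which
    matches the compact form shown in the examples above.
    """
    return json.dumps(struct_value, separators=(",", ":"))


# Values taken from the two example queries above.
flat = {"int_struct_member": 12, "string_struct_member": "string value"}
nested = {"inner_struct": {"string_member": "string value", "int_member": 12}}

print(struct_to_json(flat))
print(struct_to_json(nested))
```

Member order in the output follows the insertion order of the dict, so the struct's declared field order is preserved.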

-- Internal representation of a struct:
When scanning a struct the rowbatch will hold the values of the
struct's children as if they were queried one by one directly in the
select list.

E.g. Taking the following table:
CREATE TABLE tbl (id int, s struct<a:int,b:string>) STORED AS ORC

And running the following query:
SELECT id, s FROM tbl;

After scanning, a row in the row batch will hold the following values
(note that the biggest slot comes first):
 1: The pointer for the string in s.b
 2: The length for the string in s.b
 3: The int value for s.a
 4: The int value of id
 5: A single null byte for all the slots: id, s, s.a, s.b

The size of a struct affects the memory layout order of
a row batch. The struct size is calculated by summing the sizes of its
fields, and the struct is then placed in the row batch ahead of all
slots that are smaller than it. Note that all the fields of a struct are
consecutive in the row batch. Inside a struct the fields are also
ordered by size, as in the regular case for primitives.
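
The ordering rule can be sketched in Python as follows. This is a simplified model, not the planner's code: the Slot class and layout function are hypothetical, the byte sizes (4 for an int, 12 for a string slot made of an 8-byte pointer and a 4-byte length) are assumptions, and padding and null indicator bytes are ignored.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Slot:
    name: str
    size: int = 0  # byte size, used for primitive slots only
    children: List["Slot"] = field(default_factory=list)

    def total_size(self) -> int:
        # A struct's size is the sum of the sizes of its fields.
        if self.children:
            return sum(child.total_size() for child in self.children)
        return self.size


def layout(slots: List[Slot], prefix: str = "", offset: int = 0) -> List[Tuple[str, int]]:
    """Return (slot path, byte offset) pairs, biggest slot first."""
    result = []
    for slot in sorted(slots, key=lambda s: s.total_size(), reverse=True):
        path = prefix + slot.name
        if slot.children:
            # Struct fields stay consecutive and are ordered by size too.
            # Their offsets are relative to the start of the whole row.
            result.extend(layout(slot.children, path + ".", offset))
        else:
            result.append((path, offset))
        offset += slot.total_size()
    return result


# CREATE TABLE tbl (id int, s struct<a:int,b:string>)
tbl = [Slot("id", 4), Slot("s", children=[Slot("a", 4), Slot("b", 12)])]
print(layout(tbl))
```

Under these assumed sizes the sketch places s.b first, then s.a, then id, which is the order listed in the example above.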

When evaluating a struct as a SlotRef a newly introduced StructVal will
be used to refer to the actual values of a struct in the row batch.
This StructVal holds a vector of pointers where each pointer represents
a member of the struct. Following the above example the StructVal would
keep two pointers, one to point to an IntVal and one to point to a
StringVal.

-- Changes related to tuple and slot descriptors:
When providing a struct in the select list there is going to be a
SlotDescriptor for the struct slot in the topmost TupleDescriptor.
Additionally, another TupleDescriptor is created to hold SlotDescriptors
for each of the struct's children. The struct SlotDescriptor points to
the newly introduced TupleDescriptor using 'itemTupleId'.
The offsets for the children of the struct are calculated from the
beginning of the topmost TupleDescriptor and not from the
TupleDescriptor that directly holds the struct's children. The null
indicator bytes are also stored at the level of the topmost
TupleDescriptor.

-- Changes related to scalar expressions:
A struct in the select list is translated into an expression tree where
the top of this tree is a SlotRef for the struct itself and its
children in the tree are SlotRefs for the members of the struct. When
evaluating a struct SlotRef, the null checks are performed first and
then the evaluation is delegated to the child SlotRefs.
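
A rough Python model of this evaluation order is shown below. The Row and SlotRef classes are hypothetical stand-ins for the backend types; a Python list plays the role of the StructVal's vector of member values, and a set of slot paths stands in for the null indicator bits.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set


@dataclass
class Row:
    values: Dict[str, Any]                         # slot path -> stored value
    nulls: Set[str] = field(default_factory=set)   # slot paths flagged as NULL


@dataclass
class SlotRef:
    path: str
    children: List["SlotRef"] = field(default_factory=list)

    def evaluate(self, row: Row):
        # The null check comes first, for struct and primitive slots alike.
        if self.path in row.nulls:
            return None
        # A struct SlotRef delegates to the SlotRefs of its members.
        if self.children:
            return [child.evaluate(row) for child in self.children]
        return row.values[self.path]


# Expression tree for the 's' column of the example table.
s_ref = SlotRef("s", [SlotRef("s.a"), SlotRef("s.b")])
row = Row({"id": 1, "s.a": 12, "s.b": "string value"})
print(s_ref.evaluate(row))
```

Marking "s" itself as NULL short-circuits the evaluation, while marking only "s.b" yields a struct with one NULL member.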

-- Restrictions:
  - Codegen support is not included in this patch.
  - Only the ORC file format is supported by this patch.
  - Only the HS2 client supports returning structs. Beeswax support is not
    implemented as it is going to be deprecated anyway. Currently,
    querying a struct through Beeswax returns an error.

-- Tests added:
  - The ORC and Parquet functional databases are extended with 3 new
    tables:
    1: A small table with one level structs, holding different
    kinds of primitive types as members.
    2: A small table with 2 and 3 level nested structs.
    3: A bigger, partitioned table constructed from alltypes where all
    the columns except the 'id' column are put into a struct.
  - struct-in-select-list.test and nested-struct-in-select-list.test
    use these new tables to query structs directly or through an
    inline view.

Change-Id: I0fbe56bdcd372b72e99c0195d87a818e7fa4bc3a
Reviewed-on: http://gerrit.cloudera.org:8080/17638
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-09-14 21:21:47 +00:00

This directory contains Impala test data sets. The directory layout is structured as follows:

datasets/
   <data set>/<data set>_schema_template.sql
   <data set>/<data files SF1>/data files
   <data set>/<data files SF2>/data files

Where SF is the scale factor controlling data size. This allows for scaling the same schema to
different sizes based on the target test environment.

The schema template SQL files have the following format:

  The goal is to provide a single place to define a table + data files
  and have the schema and data load statements generated for each combination of file
  format, compression, etc. The way this works is by specifying how to create a
  'base table'. The base table can be used to generate tables in other file formats
  by performing the defined INSERT / SELECT INTO statement. Each new table using the
  file format/compression combination needs to have a unique name, so all the
  statements are parameterized on table name.
  The template file is read in by the 'generate-schema-statements.py' script
  to generate all the schemas for the Impala benchmark tests.

  Each table is defined as a new section in the file with the following format:

  ====
  ---- SECTION NAME
  section contents
  ...
  ---- ANOTHER SECTION
  ... section contents
  ---- ... more sections...

  Note that tables are delimited by '====' and that even the first table in the
  file must include this header line.
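
To make the layout concrete, here is a minimal Python sketch of a reader for this format. It is not the parser used by the schema generation script; the function name parse_template and the sample table contents are made up for illustration, and only the '====' and '---- SECTION NAME' conventions come from the description above.

```python
def parse_template(text):
    """Split template text into one {section name: contents} dict per table."""
    tables = []
    current = None   # sections of the table currently being read
    section = None   # name of the section currently being read
    for line in text.splitlines():
        if line.strip() == "====":
            # Every table, including the first, starts with this line.
            current = {}
            tables.append(current)
            section = None
        elif line.startswith("---- "):
            if current is None:
                raise ValueError("section found before the first '====' line")
            section = line[len("---- "):].strip()
            current[section] = []
        elif section is not None:
            current[section].append(line)
    # Join section bodies and drop empty tables (e.g. a trailing '====').
    return [
        {name: "\n".join(body).strip() for name, body in table.items()}
        for table in tables
        if table
    ]


# Hypothetical two-table template, for illustration only.
sample = """====
---- DATASET
functional
---- BASE_TABLE_NAME
example_table
---- COLUMNS
id int
name string
====
---- DATASET
functional
---- BASE_TABLE_NAME
other_table
"""

for table in parse_template(sample):
    print(table["BASE_TABLE_NAME"], sorted(table))
```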

  The supported section names are:

  DATASET
      Data set name - Used to group sets of tables together
  BASE_TABLE_NAME
      The name of the table within the database
  CREATE
      Explicit CREATE statement used to create the table (executed by Impala)
  CREATE_HIVE
      Same as the above, but will be executed by Hive instead. If specified,
      'CREATE' must not be specified.
  CREATE_KUDU
      Customized CREATE TABLE statement used to create the table for Kudu-specific
      syntax.

  COLUMNS
  PARTITION_COLUMNS
  ROW_FORMAT
  HBASE_COLUMN_FAMILIES
  TABLE_PROPERTIES
  HBASE_REGION_SPLITS
      If no explicit CREATE statement is provided, a CREATE statement is generated
      from these sections (see 'build_table_template' function in
      'generate-schema-statements.py' for details)

  ALTER
      A set of ALTER statements to be executed after the table is created
      (typically to add partitions, but may also be used for other settings that
      cannot be specified directly in the CREATE TABLE statement).

      These statements are ignored for HBase and Kudu tables.

  LOAD
      The statement used to load the base (text) form of the table. This is
      typically a LOAD DATA statement.

  DEPENDENT_LOAD
  DEPENDENT_LOAD_KUDU
  DEPENDENT_LOAD_HIVE
  DEPENDENT_LOAD_ACID
      Statements to be executed during the "dependent load" phase. These statements
      are run after the initial (base table) load is complete.

  HIVE_MAJOR_VERSION
      The required major version of Hive for this table. If the major version
      of Hive at runtime does not exactly match the version specified in this section,
      the table will be skipped.

      NOTE: this is not a _minimum_ version -- if HIVE_MAJOR_VERSION specifies '2',
            the table will _not_ be loaded/created on Hive 3.
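
The exact-match rule can be expressed as a small Python check. This is a sketch only: the function name should_load_table is hypothetical, the sections are assumed to be available as a dict of section name to text, and loading the table when the section is absent is an assumption rather than something stated above.

```python
def should_load_table(sections, runtime_hive_major):
    """Return True if the table should be loaded on this Hive major version."""
    required = sections.get("HIVE_MAJOR_VERSION")
    if required is None:
        # Assumption: a table without the section has no version requirement.
        return True
    # Exact match, not a minimum: '2' does not allow Hive 3.
    return int(required.strip()) == runtime_hive_major


print(should_load_table({"HIVE_MAJOR_VERSION": "2"}, 3))
print(should_load_table({"HIVE_MAJOR_VERSION": "3"}, 3))
```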