STRUCT Complex Type (<keyword keyref="impala23"/> or higher only)

A complex data type, representing multiple fields of a single item. Frequently used as the element type of an ARRAY or the VALUE part of a MAP.

column_name STRUCT < name : type [COMMENT 'comment_string'], ... > type ::= primitive_type | complex_type

The names and number of fields within the STRUCT are fixed. Each field can be a different type. A field within a STRUCT can also be another STRUCT, or an ARRAY or a MAP, allowing you to create nested data structures with a maximum nesting depth of 100.

A STRUCT can be the top-level type for a column, or can itself be an item within an ARRAY or the value part of the key-value pair in a MAP.

When a STRUCT is used as an ARRAY element or a MAP value, you use a join clause to bring the ARRAY or MAP elements into the result set, and then refer to array_name.ITEM.field or map_name.VALUE.field. In the case of a STRUCT directly inside an ARRAY or MAP, you can omit the .ITEM and .VALUE pseudocolumns and refer directly to array_name.field or map_name.field.

A STRUCT is similar conceptually to a table row: it contains a fixed number of named fields, each with a predefined type. To combine two related tables, while using complex types to minimize repetition, the typical way to represent that data is as an ARRAY of STRUCT elements.

Because a STRUCT has a fixed number of named fields, it typically does not make sense to have a STRUCT as the type of a table column. In such a case, you could just make each field of the STRUCT into a separate column of the table. The STRUCT type is most useful as an item of an ARRAY or the value part of the key-value pair in a MAP. A nested type column with a STRUCT at the lowest level lets you associate a variable number of row-like objects with each row of the table.

The STRUCT type is straightforward to reference within a query. You do not need to include the STRUCT column in a join clause or give it a table alias, as is required for the ARRAY and MAP types. You refer to the individual fields using dot notation, such as struct_column_name.field_name, without any pseudocolumn such as ITEM or VALUE.

Within the Parquet data file, the values for each STRUCT field are stored adjacent to each other, so that they can be encoded and compressed using all the Parquet techniques for storing sets of similar or repeated values. The adjacency applies even when the STRUCT values are part of an ARRAY or MAP. During a query, Impala avoids unnecessary I/O by reading only the portions of the Parquet data file containing the requested STRUCT fields.

The following example shows a table with various kinds of STRUCT columns, both at the top level and nested within other complex types. Practice the CREATE TABLE and query notation for complex type columns using empty tables, until you can visualize a complex data structure and construct corresponding SQL statements reliably.

, -- A STRUCT as the element type of an ARRAY. places_lived ARRAY < STRUCT >, -- A STRUCT as the value portion of the key-value pairs in a MAP. memorable_moments MAP < STRING, STRUCT < year: INT, place: STRING, details: STRING >>, -- A STRUCT where one of the fields is another STRUCT. current_address STRUCT < street_address: STRUCT , country: STRING, postal_code: STRING > ) STORED AS PARQUET; ]]>

The following example shows how to examine the structure of a table containing one or more STRUCT columns by using the DESCRIBE statement. You can visualize each STRUCT as its own table, with columns named the same as each field of the STRUCT. If the STRUCT is nested inside another complex type, such as ARRAY, you can extend the qualified name passed to DESCRIBE until the output shows just the STRUCT fields.

The top-level column EMPLOYEE_INFO is a STRUCT. Describing table_name.struct_name displays the fields of the STRUCT as if they were columns of a table:

Because PLACES_LIVED is a STRUCT inside an ARRAY, the initial DESCRIBE shows the structure of the ARRAY:

| | pos | bigint | +------+------------------+ ]]>

Ask for the details of the ITEM field of the ARRAY to see just the layout of the STRUCT:

Likewise, MEMORABLE_MOMENTS has a STRUCT inside a MAP, which requires an extra level of qualified name to see just the STRUCT part:

| +-------+------------------+ ]]>

For a MAP, ask to see the VALUE field to see the corresponding STRUCT fields in a table-like structure:

For a STRUCT inside a STRUCT, we can see the fields of the outer STRUCT:

Then we can use a further qualified name to see just the fields of the inner STRUCT:

| | | places_lived | array> | | | memorable_moments | map> | | | current_address | struct< | | | | street_address:struct< | | | | street_number:int, | | | | street_name:string, | | | | street_type:string | | | | >, | | | | country:string, | | | | postal_code:string | | | | > | | +-------------------+--------------------------+---------+ SELECT id, employee_info.id FROM struct_demo; SELECT id, employee_info.id AS employee_id FROM struct_demo; SELECT id, employee_info.id AS employee_id, employee_info.employer FROM struct_demo; SELECT id, name, street, city, country FROM struct_demo, struct_demo.places_lived; SELECT id, name, places_lived.pos, places_lived.street, places_lived.city, places_lived.country FROM struct_demo, struct_demo.places_lived; SELECT id, name, pl.pos, pl.street, pl.city, pl.country FROM struct_demo, struct_demo.places_lived AS pl; SELECT id, name, places_lived.pos, places_lived.street, places_lived.city, places_lived.country FROM struct_demo, struct_demo.places_lived; SELECT id, name, pos, street, city, country FROM struct_demo, struct_demo.places_lived; SELECT id, name, memorable_moments.key, memorable_moments.value.year, memorable_moments.value.place, memorable_moments.value.details FROM struct_demo, struct_demo.memorable_moments WHERE memorable_moments.key IN ('Birthday','Anniversary','Graduation'); SELECT id, name, mm.key, mm.value.year, mm.value.place, mm.value.details FROM struct_demo, struct_demo.memorable_moments AS mm WHERE mm.key IN ('Birthday','Anniversary','Graduation'); SELECT id, name, memorable_moments.key, memorable_moments.value.year, memorable_moments.value.place, memorable_moments.value.details FROM struct_demo, struct_demo.memorable_moments WHERE key IN ('Birthday','Anniversary','Graduation'); SELECT id, name, key, value.year, value.place, value.details FROM struct_demo, struct_demo.memorable_moments WHERE key IN ('Birthday','Anniversary','Graduation'); SELECT id, name, key, year, place, details FROM struct_demo, struct_demo.memorable_moments WHERE key IN ('Birthday','Anniversary','Graduation'); SELECT id, name, current_address.street_address.street_number, current_address.street_address.street_name, current_address.street_address.street_type, current_address.country, current_address.postal_code FROM struct_demo; ]]>

For example, this table uses a struct that encodes several data values for each phone number associated with a person. Each person can have a variable-length array of associated phone numbers, and queries can refer to the category field to locate specific home, work, mobile, and so on kinds of phone numbers.

> ) STORED AS PARQUET; ]]>

Because structs are naturally suited to composite values where the fields have different data types, you might use them to decompose things such as addresses:

); ]]>

In a big data context, splitting out data fields such as the number part of the address and the street name could let you do analysis on each field independently. For example, which streets have the largest number range of addresses, what are the statistical properties of the street names, which areas have a higher proportion of Roads, Courts or Boulevards, and so on.

, ,