mirror of
https://github.com/apache/impala.git
synced 2026-01-22 00:01:21 -05:00
Summary of changes: Introduces a new query option PARQUET_ARRAY_RESOLUTION
to control the path-resolution behavior for Parquet files with nested
arrays. The values are:

- THREE_LEVEL
  Assumes arrays are encoded with the 3-level representation. Also
  resolves arrays encoded with a single level. Does not attempt a
  2-level resolution.

- TWO_LEVEL
  Assumes arrays are encoded with the 2-level representation. Also
  resolves arrays encoded with a single level. Does not attempt a
  3-level resolution.

- TWO_LEVEL_THEN_THREE_LEVEL
  First tries to resolve assuming the 2-level representation and, if
  unsuccessful, tries the 3-level representation. Also resolves arrays
  encoded with a single level. This is the current Impala behavior and
  is used as the default value for compatibility.

Note that 'failure' to resolve a schema path with a given
array-resolution policy does not necessarily mean a warning or error is
returned by the query. A mismatch might be treated like a missing field,
which is necessary to support schema evolution. There is no way to
reliably distinguish the 'bad resolution' and 'legitimately missing
field' cases.

The new query option is independent of, and can be combined with, the
existing PARQUET_FALLBACK_SCHEMA_RESOLUTION.

Background: Arrays can be represented in several ways in Parquet:
- Three-Level Encoding (standard)
- Two-Level Encoding (legacy)
- One-Level Encoding (legacy)

More details are in the "Lists" section of the spec:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

Unfortunately, there is no reliable metadata within Parquet files to
indicate which encoding was used. There is even the possibility of
having mixed encodings within the same file if there are multiple
arrays. As a result, Impala currently tries to auto-detect the file
encoding when resolving a schema path in a Parquet file, using the
TWO_LEVEL_THEN_THREE_LEVEL policy.
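To make the fallback concrete, here is a minimal Python sketch of the
TWO_LEVEL_THEN_THREE_LEVEL policy over toy schema trees. This is not
Impala's actual C++ resolver; the node representation and helper
functions are hypothetical, and only the structural shapes of the
2-level and 3-level encodings are taken from the Parquet spec:

```python
# Schema nodes: primitives are (name, repetition, type); groups are
# (name, repetition, [children]).

# 3-level (standard) list of int32:
#   optional group my_list (LIST) { repeated group list { optional int32 element; } }
THREE_LEVEL = ("my_list", "optional",
               [("list", "repeated", [("element", "optional", "int32")])])

# 2-level (legacy) list of int32:
#   optional group my_list (LIST) { repeated int32 element; }
TWO_LEVEL = ("my_list", "optional", [("element", "repeated", "int32")])


def resolve_two_level(list_node, expected_type):
    """Assume the group wraps a repeated field that IS the element."""
    _, _, children = list_node
    repeated = children[0]
    if repeated[1] == "repeated" and repeated[2] == expected_type:
        return repeated
    return None  # mismatch: treated like a missing field, not an error


def resolve_three_level(list_node, expected_type):
    """Assume the group wraps a repeated group that wraps the element."""
    _, _, children = list_node
    repeated = children[0]
    if repeated[1] == "repeated" and isinstance(repeated[2], list):
        element = repeated[2][0]
        if element[2] == expected_type:
            return element
    return None


def resolve(list_node, expected_type, policy="TWO_LEVEL_THEN_THREE_LEVEL"):
    if policy == "TWO_LEVEL":
        return resolve_two_level(list_node, expected_type)
    if policy == "THREE_LEVEL":
        return resolve_three_level(list_node, expected_type)
    # Default policy: try 2-level first, then fall back to 3-level.
    return (resolve_two_level(list_node, expected_type)
            or resolve_three_level(list_node, expected_type))
```

Under the default policy both encodings resolve to their element node,
while forcing the mismatched policy (e.g. TWO_LEVEL on a 3-level file)
yields None, which this sketch, like the description above, treats as a
missing field rather than an error.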
However, regardless of whether a Parquet data file uses the 2-level or
3-level encoding, the index-based resolution may return incorrect
results if the representation in the Parquet file does not exactly match
the attempted array-resolution policy. Intuitively, when attempting a
2-level resolution on a 3-level file, the matched schema node may not be
deep enough in the schema tree, but could still be a scalar node with
the expected type. Similarly, when attempting a 3-level resolution on a
2-level file, a level may be incorrectly skipped. The name-based policy
generally does not have this problem because it avoids traversing
incorrect schema paths. However, the index-based resolution allows a
different set of schema-evolution operations, so using name-based
resolution alone is not an acceptable workaround in all cases.

Testing:
- Added new Parquet data files that show how incorrect results can be
  returned with a mismatched file encoding and resolution policy. Added
  both 2-level and 3-level versions of the data.
- Added a new test in test_nested_types.py that shows the behavior with
  the new PARQUET_ARRAY_RESOLUTION query option.
- Locally ran test_scanners.py and test_nested_types.py on core.

Change-Id: I4f32e19ec542d4d485154c9d65d0f5e3f9f0a907
Reviewed-on: http://gerrit.cloudera.org:8080/6250
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins