Mirror of https://github.com/apache/impala.git (synced 2025-12-30 12:02:10 -05:00)
This patch modifies the Parquet scanner to resolve nested schemas, and read and materialize collection types. The high-level modification is to create a CollectionColumnReader that recursively materializes map- and array-type slots.

This patch also adds many tests, most of which query a new table called complextypestbl. This table contains hand-generated data that is meant to expose edge cases in the scanner. The tests mostly test the scanner, with a few tests of other functionality (e.g. array serialization).

I ran a local benchmark comparing this scanner code to the original scanner code on an expanded version of tpch_parquet.lineitem with 48009720 rows. My benchmark involved selecting different numbers of columns with a single scanner thread, and I looked at the HDFS scan node time in the query profiles. This code introduces a 10%-20% regression in single-threaded scan time.

Change-Id: Id27fb728934e8346444f61752c9278d8010e5f3a
Reviewed-on: http://gerrit.cloudera.org:8080/576
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
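To make the idea of recursively materializing map- and array-type slots more concrete, here is a toy sketch in Python. It is not Impala's C++ CollectionColumnReader and does not touch Parquet at all: the schema dictionaries, the `materialize` helper, and the `item`/`key`/`value` slot names are hypothetical, chosen only for illustration. The sample values are a subset of the test row with id 8 in this file.

```python
# Toy model of recursive collection materialization. NOT Impala's C++ code:
# the schema dicts and the materialize() helper are hypothetical.

def materialize(value, schema):
    """Recursively turn a nested value into lists of tuples (plain dicts)."""
    if value is None:
        return None
    kind = schema["type"]
    if kind == "array":
        # One tuple per element; the element may itself be a collection.
        return [{"item": materialize(v, schema["item"])} for v in value]
    if kind == "map":
        # One tuple per entry, with a key slot and a value slot.
        return [{"key": k, "value": materialize(v, schema["value"])}
                for k, v in value.items()]
    if kind == "struct":
        # Structs are not collections: materialize each field in place.
        return {name: materialize(value.get(name), field)
                for name, field in schema["fields"].items()}
    return value  # scalar slot


INT = {"type": "int"}
SCHEMA = {"type": "struct", "fields": {
    "id": INT,
    "int_array_array": {"type": "array",
                        "item": {"type": "array", "item": INT}},
    "int_map_array": {"type": "array",
                      "item": {"type": "map", "value": INT}},
}}
ROW = {"id": 8,
       "int_array_array": [[-1, -2], []],
       "int_map_array": [{}, {"k1": 1}, {}, {}]}

RESULT = materialize(ROW, SCHEMA)
```

Empty inner collections come out as empty tuple lists rather than being dropped, which is the kind of edge case the hand-generated data appears designed to exercise.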
15 lines
258 B
JSON
[
{"id": 8,
 "int_array": [-1],
 "int_array_array": [[-1,-2],[]],
 "int_map": {"k1": -1},
 "int_map_array": [{}, {"k1": 1}, {}, {}],
 "nested_struct": {
   "a": -1,
   "b": [-1],
   "c": {
     "d": [
       [{"e": -1, "f": "nonnullable"}]]},
   "g": {}}}
]
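As a quick way to see which edge cases this row covers, the following Python sketch parses the record and lists the paths of every empty collection. The `empty_collection_paths` helper and the path notation are assumptions made for this illustration. JSON alone cannot distinguish a map from a struct, so the sketch treats every JSON object the same way.

```python
import json

# The test row from this file, embedded so the example is self-contained.
ROW_JSON = """
[
{"id": 8,
 "int_array": [-1],
 "int_array_array": [[-1,-2],[]],
 "int_map": {"k1": -1},
 "int_map_array": [{}, {"k1": 1}, {}, {}],
 "nested_struct": {
   "a": -1,
   "b": [-1],
   "c": {
     "d": [
       [{"e": -1, "f": "nonnullable"}]]},
   "g": {}}}
]
"""


def empty_collection_paths(value, path=""):
    """Return the paths of all empty lists and objects nested in value."""
    paths = []
    if isinstance(value, dict):
        if not value:
            paths.append(path)
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            paths.extend(empty_collection_paths(child, child_path))
    elif isinstance(value, list):
        if not value:
            paths.append(path)
        for index, child in enumerate(value):
            paths.extend(empty_collection_paths(child, f"{path}[{index}]"))
    return paths


row = json.loads(ROW_JSON)[0]
EMPTY_PATHS = empty_collection_paths(row)
for p in EMPTY_PATHS:
    print(p)
```

The row has no null values, which is consistent with the "nonnullable" marker in field `f`; the empty collections it does contain sit at several nesting depths.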