mirror of
https://github.com/apache/impala.git
synced 2026-01-17 03:00:37 -05:00
A script is added that generates two parquet files with nested data. One file has modern nested types encoding and the other one has legacy encoding. This data will be used for testing nested types support for "create table like file" statement.

Change-Id: I8a4f64c9f7b3228583f3cb0af5507a9dd4d152ef
Reviewed-on: http://gerrit.cloudera.org:8080/610
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
The two Parquet files (legacy_nested.parquet and modern_nested.parquet) were generated
using the kite script located here:
testdata/src/main/java/com/cloudera/impala/datagenerator/JsonToParquetConverter.java

The Parquet files can be regenerated by running the following commands in the testdata
directory:

mvn package

mvn exec:java \
  -Dexec.mainClass="com.cloudera.impala.datagenerator.JsonToParquetConverter" \
  -Dexec.args="--legacy_collection_format
  data/schemas/nested/nested.avsc
  data/schemas/nested/nested.json
  data/schemas/nested/legacy_nested.parquet"

mvn exec:java \
  -Dexec.mainClass="com.cloudera.impala.datagenerator.JsonToParquetConverter" \
  -Dexec.args="
  data/schemas/nested/nested.avsc
  data/schemas/nested/nested.json
  data/schemas/nested/modern_nested.parquet"

The script takes an Avro schema and a JSON file with data and creates a Parquet file.
The --legacy_collection_format flag makes the script output a Parquet file that uses the
legacy two-level format for nested types, rather than the modern three-level format.

More information about the Parquet nested types format can be found here:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
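
As a rough illustration of the difference between the two-level and three-level
formats, the sketch below shows how a nullable list of integers might be declared in
each. The field name my_list is hypothetical and is not taken from nested.avsc, and the
exact group and element names that the converter emits may differ; the general shapes
follow the list rules in the LogicalTypes.md document linked above.

```
// Modern three-level format: the repeated group "list" wraps each element,
// so the list itself and each element can be null independently.
optional group my_list (LIST) {
  repeated group list {
    optional int32 element;
  }
}

// Legacy two-level format: the repeated field is the element itself,
// so individual elements cannot be null.
optional group my_list (LIST) {
  repeated int32 element;
}
```

Having both files lets a test check that a reader resolves either layout to the same
logical array type.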