impala/testdata/data/schemas/nested/README
Taras Bobrovytsky 3c9ceb1a2b Add Parquet nested schemas to testdata
A script is added that generates two parquet files with nested data.
One file has modern nested types encoding and the other one has
legacy encoding. This data will be used for testing nested types
support for "create table like file" statement.

Change-Id: I8a4f64c9f7b3228583f3cb0af5507a9dd4d152ef
Reviewed-on: http://gerrit.cloudera.org:8080/610
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-08-13 10:25:39 +00:00

The two Parquet files (legacy_nested.parquet and modern_nested.parquet) were generated
using the Kite script located here:
  testdata/src/main/java/com/cloudera/impala/datagenerator/JsonToParquetConverter.java

The Parquet files can be regenerated by running the following commands in the testdata
directory:

  mvn package

  mvn exec:java \
    -Dexec.mainClass="com.cloudera.impala.datagenerator.JsonToParquetConverter" \
    -Dexec.args="--legacy_collection_format
    data/schemas/nested/nested.avsc
    data/schemas/nested/nested.json
    data/schemas/nested/legacy_nested.parquet"

  mvn exec:java \
    -Dexec.mainClass="com.cloudera.impala.datagenerator.JsonToParquetConverter" \
    -Dexec.args="
    data/schemas/nested/nested.avsc
    data/schemas/nested/nested.json
    data/schemas/nested/modern_nested.parquet"

The script takes an Avro schema and a JSON file with data and creates a Parquet file.
The --legacy_collection_format flag makes the script output a Parquet file that uses the
legacy two-level format for nested types, rather than the modern three-level format.

More information about the Parquet nested types format can be found here:
  https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
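
As a rough illustration of the difference between the two formats, based on the
backward-compatibility rules in the LogicalTypes document linked above, a nullable list
of 32-bit integers could look like this in each encoding. The field name my_list is
hypothetical, and the name of the repeated field in the legacy form varies by writer
("array" is assumed here, as used by Avro-based writers). These are not the actual
schemas produced from nested.avsc.

```
// Modern three-level encoding: the list, a repeated wrapper group, and the element.
optional group my_list (LIST) {
  repeated group list {
    optional int32 element;
  }
}

// Legacy two-level encoding: the list and a repeated element field
// (assumed repeated-field name: array).
optional group my_list (LIST) {
  repeated int32 array;
}
```

In the two-level form the repeated field is itself the element, so a null element
cannot be represented. The extra wrapper group in the three-level form is what makes
that possible.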