Add a script that generates two Parquet files with nested data. One file uses the modern nested types encoding and the other uses the legacy encoding. This data will be used for testing nested types support in the "create table like file" statement.

Change-Id: I8a4f64c9f7b3228583f3cb0af5507a9dd4d152ef
Reviewed-on: http://gerrit.cloudera.org:8080/610
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
The two Parquet files (legacy_nested.parquet and modern_nested.parquet) were generated using the Kite script located here:
testdata/src/main/java/com/cloudera/impala/datagenerator/JsonToParquetConverter.java

The Parquet files can be regenerated by running the following commands in the testdata directory:

mvn package

mvn exec:java \
  -Dexec.mainClass="com.cloudera.impala.datagenerator.JsonToParquetConverter" \
  -Dexec.args="--legacy_collection_format data/schemas/nested/nested.avsc data/schemas/nested/nested.json data/schemas/nested/legacy_nested.parquet"

mvn exec:java \
  -Dexec.mainClass="com.cloudera.impala.datagenerator.JsonToParquetConverter" \
  -Dexec.args="data/schemas/nested/nested.avsc data/schemas/nested/nested.json data/schemas/nested/modern_nested.parquet"

The script takes an Avro schema and a JSON file with data and creates a Parquet file. The --legacy_collection_format flag makes the script output a Parquet file that uses the legacy two-level format for nested types, rather than the modern three-level format.

More information about the Parquet nested types format can be found here:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
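
To make the difference between the two formats concrete, the sketch below prints how a single list column might appear in each file's Parquet schema. It is an illustration only and is not part of JsonToParquetConverter. The class name NestedListLayouts, the column name int_array, and the element type int32 are hypothetical; the real columns come from data/schemas/nested/nested.avsc. The repeated-field name "array" in the legacy layout is the name parquet-avro conventionally writes and is assumed here.

```java
// Illustration only: builds the Parquet schema text for one list column
// in the legacy two-level layout and in the modern three-level layout.
// Column name and element type are hypothetical examples, not taken from
// nested.avsc.
public class NestedListLayouts {

    // Legacy two-level layout: the repeated field directly holds the elements.
    // "array" is assumed to be the repeated-field name (parquet-avro convention).
    static String legacyList(String column, String elementType) {
        return "optional group " + column + " (LIST) {\n"
             + "  repeated " + elementType + " array;\n"
             + "}";
    }

    // Modern three-level layout: an outer LIST group, a repeated group named
    // "list", and the element field nested inside it.
    static String modernList(String column, String elementType) {
        return "optional group " + column + " (LIST) {\n"
             + "  repeated group list {\n"
             + "    optional " + elementType + " element;\n"
             + "  }\n"
             + "}";
    }

    public static void main(String[] args) {
        System.out.println("Legacy (two-level):");
        System.out.println(legacyList("int_array", "int32"));
        System.out.println();
        System.out.println("Modern (three-level):");
        System.out.println(modernList("int_array", "int32"));
    }
}
```

The extra middle level in the modern layout is what lets a reader distinguish a null list, an empty list, and a list that contains null elements. The legacy layout cannot represent null elements, which is one reason both variants are worth covering in tests.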