impala/testdata/data/schemas/nested/README
Thomas Tauber-Marshall b2c2fe7813 IMPALA-3786: Replace "cloudera" with "apache" (part 2)
As part of the ASF transition, we need to replace references to
Cloudera in Impala with references to Apache. This primarily means
changing Java package names from com.cloudera.impala.* to
org.apache.impala.*

A prior patch renamed all the files as necessary, and this patch
performs the actual code changes. Most of the changes in this patch
were generated with some commands of the form:

find . | grep "\.java\|\.py\|\.h\|\.cc" | \
  xargs sed -i s/'com\(.\)cloudera\(\.\)impala/org\1apache\2impala/g

along with some manual fixes.

After this patch, the remaining references to Cloudera in the repo
mostly fall into the categories:
- External components that have cloudera in their own package names,
  eg. com.cloudera.kudu/llama
- URLs, eg. https://repository.cloudera.com/

Change-Id: I0d35fa6602a7fc0c212b2ef5e2b3322b77dde7e2
Reviewed-on: http://gerrit.cloudera.org:8080/3937
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2016-09-29 21:14:13 +00:00

The two Parquet files (legacy_nested.parquet and modern_nested.parquet) were generated
using the kite script located here:
testdata/src/main/java/org/apache/impala/datagenerator/JsonToParquetConverter.java

The Parquet files can be regenerated by running the following commands in the testdata
directory:

mvn package

mvn exec:java \
  -Dexec.mainClass="org.apache.impala.datagenerator.JsonToParquetConverter" \
  -Dexec.args="--legacy_collection_format
  data/schemas/nested/nested.avsc
  data/schemas/nested/nested.json
  data/schemas/nested/legacy_nested.parquet"

mvn exec:java \
  -Dexec.mainClass="org.apache.impala.datagenerator.JsonToParquetConverter" \
  -Dexec.args="
  data/schemas/nested/nested.avsc
  data/schemas/nested/nested.json
  data/schemas/nested/modern_nested.parquet"

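
The two invocations differ only in the output file name and the optional flag, so
they can be wrapped in a small helper. The following is a minimal sketch, not part of
the Impala tree: the function names (convert, regenerate_all) and the DRY_RUN variable
are hypothetical, and it assumes it is run from the testdata directory.

```shell
# Sketch only: helper functions for regenerating both nested Parquet test files.
# convert, regenerate_all and DRY_RUN are illustrative names, not Impala tooling.

MAIN_CLASS="org.apache.impala.datagenerator.JsonToParquetConverter"
SCHEMA_DIR="data/schemas/nested"

# Run (or, if DRY_RUN is set, only print) one converter invocation.
#   $1: output file name inside $SCHEMA_DIR
#   $2: optional extra flag, e.g. --legacy_collection_format
convert() {
  out="$1"
  flag="${2:-}"
  # Prepend the flag and a separating space only when a flag was given.
  args="${flag:+$flag }$SCHEMA_DIR/nested.avsc $SCHEMA_DIR/nested.json $SCHEMA_DIR/$out"
  if [ -n "${DRY_RUN:-}" ]; then
    echo mvn exec:java "-Dexec.mainClass=$MAIN_CLASS" "-Dexec.args=$args"
  else
    mvn exec:java "-Dexec.mainClass=$MAIN_CLASS" "-Dexec.args=$args"
  fi
}

# Build the converter (skipped in dry-run mode), then write both files.
regenerate_all() {
  [ -n "${DRY_RUN:-}" ] || mvn package || return 1
  convert legacy_nested.parquet --legacy_collection_format &&
    convert modern_nested.parquet
}
```

After sourcing these definitions, running `DRY_RUN=1 regenerate_all` prints the two
mvn commands without executing them; running `regenerate_all` alone builds the
converter and writes both files.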
JsonToParquetConverter takes an Avro schema and a JSON data file and writes a Parquet
file. With the --legacy_collection_format flag, it writes nested types in the legacy
two-level format instead of the modern three-level format.
More information about the Parquet nested types format can be found here:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
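
As a rough illustration of the difference between the two layouts, the fragments
below show how a list of integers might be declared in each one, following the forms
given in the Parquet LogicalTypes document. The field name int_list and the int32
element type are hypothetical and are not taken from nested.avsc; the field names
actually written by the converter may differ.

```text
# Modern three-level format: the outer group is annotated with LIST, a
# repeated middle group holds the entries, and the inner field is the element.
optional group int_list (LIST) {
  repeated group list {
    optional int32 element;
  }
}

# Legacy two-level format: the repeated field sits directly inside the
# LIST-annotated group, so there is no separate level for the element.
optional group int_list (LIST) {
  repeated int32 element;
}
```

Because the two-level layout has no separate element level, it cannot distinguish a
null element from a missing one, which is one reason the three-level layout is the
current recommendation.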