impala/testdata/data/schemas/nested
Thomas Tauber-Marshall b2c2fe7813 IMPALA-3786: Replace "cloudera" with "apache" (part 2)
As part of the ASF transition, we need to replace references to
Cloudera in Impala with references to Apache. This primarily means
changing Java package names from com.cloudera.impala.* to
org.apache.impala.*

A prior patch renamed all the files as necessary, and this patch
performs the actual code changes. Most of the changes in this patch
were generated with some commands of the form:

find . | grep "\.java\|\.py\|\.h\|\.cc" | \
  xargs sed -i 's/com\(.\)cloudera\(\.\)impala/org\1apache\2impala/g'

along with some manual fixes.

After this patch, the remaining references to Cloudera in the repo
mostly fall into the categories:
- External components that have cloudera in their own package names,
  eg. com.cloudera.kudu/llama
- URLs, eg. https://repository.cloudera.com/

Change-Id: I0d35fa6602a7fc0c212b2ef5e2b3322b77dde7e2
Reviewed-on: http://gerrit.cloudera.org:8080/3937
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2016-09-29 21:14:13 +00:00

The two Parquet files in this directory (legacy_nested.parquet and modern_nested.parquet)
were generated using the Kite script located at:
testdata/src/main/java/org/apache/impala/datagenerator/JsonToParquetConverter.java

The Parquet files can be regenerated by running the following commands from the testdata
directory. The first builds the converter; the other two each write one file:

mvn package

mvn exec:java \
  -Dexec.mainClass="org.apache.impala.datagenerator.JsonToParquetConverter" \
  -Dexec.args="--legacy_collection_format
  data/schemas/nested/nested.avsc
  data/schemas/nested/nested.json
  data/schemas/nested/legacy_nested.parquet"

mvn exec:java \
  -Dexec.mainClass="org.apache.impala.datagenerator.JsonToParquetConverter" \
  -Dexec.args="
  data/schemas/nested/nested.avsc
  data/schemas/nested/nested.json
  data/schemas/nested/modern_nested.parquet"
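
The three commands above can also be wrapped in one small shell script. This is only a
sketch under stated assumptions: the DRY_RUN variable and the function names are
hypothetical conveniences, not part of Impala, and it assumes that exec.args is split on
whitespace, so single spaces behave like the line breaks used above. By default it only
prints the Maven commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch: regenerate both nested Parquet test files. Run from the testdata directory.
# DRY_RUN is a hypothetical helper variable (not part of Impala). It defaults to 1,
# which only prints the Maven commands; DRY_RUN=0 executes them.
set -eu

MAIN_CLASS="org.apache.impala.datagenerator.JsonToParquetConverter"
NESTED_DIR="data/schemas/nested"

# Print the converter arguments for one output file.
# $1 is "legacy" or "modern"; only "legacy" adds --legacy_collection_format.
converter_args() {
  flag=""
  if [ "$1" = "legacy" ]; then
    flag="--legacy_collection_format "
  fi
  printf '%s%s %s %s\n' "$flag" \
    "$NESTED_DIR/nested.avsc" \
    "$NESTED_DIR/nested.json" \
    "$NESTED_DIR/${1}_nested.parquet"
}

# Run (or, in dry-run mode, print) the Maven command for one format.
regenerate() {
  args="$(converter_args "$1")"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "mvn exec:java -Dexec.mainClass=\"$MAIN_CLASS\" -Dexec.args=\"$args\""
  else
    mvn exec:java -Dexec.mainClass="$MAIN_CLASS" -Dexec.args="$args"
  fi
}

# Build the converter once (skipped in dry-run mode), then handle both formats.
regenerate_all() {
  if [ "${DRY_RUN:-1}" != "1" ]; then
    mvn package
  fi
  for format in legacy modern; do
    regenerate "$format"
  done
}

regenerate_all
```

Defaulting to dry-run mode lets the printed commands be compared against the ones listed
above before any existing Parquet file is overwritten.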

The script takes an Avro schema, a JSON data file, and an output path, in that order, and
writes a Parquet file. With the --legacy_collection_format flag, the output file uses the
legacy two-level format for nested types rather than the modern three-level format.
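
To make the difference concrete, here is roughly how a single list column would appear in
the Parquet schema under each layout, following the LogicalTypes document linked below.
The column name my_list and the int32 element type are illustrative assumptions; the
actual columns depend on nested.avsc. The repeated field name "array" in the legacy case
is the name Avro-based writers have typically used and should also be read as an
assumption about these particular files.

```
// Legacy two-level layout: the repeated field is the element itself.
optional group my_list (LIST) {
  repeated int32 array;
}

// Modern three-level layout: a repeated group wraps each element,
// so null elements can be distinguished from an empty or null list.
optional group my_list (LIST) {
  repeated group list {
    optional int32 element;
  }
}
```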

More information about the Parquet nested types format can be found here:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md