Mirror of https://github.com/apache/impala.git (synced 2025-12-30 03:01:44 -05:00)
As part of the ASF transition, we need to replace references to
Cloudera in Impala with references to Apache. This primarily means
changing Java package names from com.cloudera.impala.* to
org.apache.impala.*

A prior patch renamed all the files as necessary, and this patch
performs the actual code changes. Most of the changes in this patch
were generated with commands of the form:

  find . | grep "\.java\|\.py\|\.h\|\.cc" | \
    xargs sed -i s/'com\(.\)cloudera\(\.\)impala/org\1apache\2impala'/g

along with some manual fixes.

After this patch, the remaining references to Cloudera in the repo
mostly fall into the following categories:
- External components that have cloudera in their own package names,
  e.g. com.cloudera.kudu/llama
- URLs, e.g. https://repository.cloudera.com/

Change-Id: I0d35fa6602a7fc0c212b2ef5e2b3322b77dde7e2
Reviewed-on: http://gerrit.cloudera.org:8080/3937
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
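To make the effect of the substitution concrete, here is a minimal Python sketch of the same pattern, translated from sed's basic regex syntax. It is an illustration only, not the tooling used for the patch; the function name rename_packages is hypothetical. Note that the first separator matches any single character while the second must be a literal dot.

```python
import re

# Python translation of the sed expression from the commit message.
# Group 1 matches any single character; group 2 matches a literal dot.
PATTERN = re.compile(r"com(.)cloudera(\.)impala")
REPLACEMENT = r"org\1apache\2impala"


def rename_packages(text: str) -> str:
    """Rewrite com.cloudera.impala references to org.apache.impala.

    Hypothetical helper for illustration; the original patch used sed.
    """
    return PATTERN.sub(REPLACEMENT, text)


if __name__ == "__main__":
    samples = [
        "import com.cloudera.impala.catalog.Catalog;",  # rewritten
        "import com.cloudera.kudu.client.KuduClient;",  # left alone
    ]
    for line in samples:
        print(rename_packages(line))
```

References to external components such as com.cloudera.kudu are left untouched because the pattern requires the trailing "impala" component, which is consistent with the remaining-references list in the commit message.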
The two Parquet files (legacy_nested.parquet and modern_nested.parquet) were
generated using the kite script located here:
testdata/src/main/java/org/apache/impala/datagenerator/JsonToParquetConverter.java

The Parquet files can be regenerated by running the following commands in the
testdata directory:

mvn package

mvn exec:java \
  -Dexec.mainClass="org.apache.impala.datagenerator.JsonToParquetConverter" \
  -Dexec.args="--legacy_collection_format data/schemas/nested/nested.avsc data/schemas/nested/nested.json data/schemas/nested/legacy_nested.parquet"

mvn exec:java \
  -Dexec.mainClass="org.apache.impala.datagenerator.JsonToParquetConverter" \
  -Dexec.args="data/schemas/nested/nested.avsc data/schemas/nested/nested.json data/schemas/nested/modern_nested.parquet"

The script takes an Avro schema and a JSON file with data and creates a Parquet
file. The --legacy_collection_format flag makes the script output a Parquet
file that uses the legacy two-level format for nested types, rather than the
modern three-level format.

More information about the Parquet nested types format can be found here:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
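The regeneration steps above could also be scripted. The following is a rough Python sketch, assuming it is run from the testdata directory with mvn on the PATH. The helper names converter_command and regenerate are hypothetical and not part of the repository; the paths, main class, and flag are the ones listed above. By default the script only prints the commands it would run.

```python
import shlex
import subprocess

# Values taken from the commands above.
MAIN_CLASS = "org.apache.impala.datagenerator.JsonToParquetConverter"
SCHEMA = "data/schemas/nested/nested.avsc"
DATA = "data/schemas/nested/nested.json"
LEGACY_OUT = "data/schemas/nested/legacy_nested.parquet"
MODERN_OUT = "data/schemas/nested/modern_nested.parquet"


def converter_command(output_path, legacy=False):
    """Build the mvn exec:java argv for one output file (hypothetical helper)."""
    args = [SCHEMA, DATA, output_path]
    if legacy:
        # Request the legacy two-level encoding for nested types.
        args.insert(0, "--legacy_collection_format")
    return [
        "mvn",
        "exec:java",
        "-Dexec.mainClass=" + MAIN_CLASS,
        "-Dexec.args=" + " ".join(args),
    ]


def regenerate(run=subprocess.run):
    """Build the project, then regenerate both Parquet files."""
    run(["mvn", "package"], check=True)
    run(converter_command(LEGACY_OUT, legacy=True), check=True)
    run(converter_command(MODERN_OUT), check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    print(shlex.join(["mvn", "package"]))
    print(shlex.join(converter_command(LEGACY_OUT, legacy=True)))
    print(shlex.join(converter_command(MODERN_OUT)))
```

Calling regenerate() from the testdata directory would execute the three commands in order, stopping at the first failure because check=True is passed to subprocess.run.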