impala/testdata/TableFlattener
Philip Zeyliger f755910e97 Remove unused deps, centralize some pom versions, upgrade SLF4J and commons-io.
As a follow-on to centralizing into one parent pom, we can now manage
thirdparty dependency versions in Java a little bit more clearly.

Upgrades SLF4J, commons.io:
  slf4j: 1.7.5 -> 1.7.25
  commons.io: 2.4 -> 2.6

  The SLF4J upgrade is nice to be able to run under Java9. The release
  notes at https://www.slf4j.org/news.html are uneventful.

  Commons IO 2.6 supports Java 9 and is source and binary compatible,
  per https://commons.apache.org/proper/commons-io/upgradeto2_6.html and
  https://commons.apache.org/proper/commons-io/upgradeto2_5.html.

Removes the following dependencies:
  htrace-core
  hadoop-mapreduce-client-core
  hive-shims
  com.stumbleupon:async
  commons-dbcp
  jdo-api

  I ran "mvn dependency:analyze" and these were some (but not all)
  of the "Unused declared dependencies found." Spelunking in git logs,
  these dependencies are from 2013 and possibly from an effort
  to run with dependencies from the filesystem. They don't seem
  to be required anymore.

Stops pulling in an old version of hadoop-client and kite-data-core in
testdata/TableFlattener by using the same versions as the Hadoop we use.
Doing so was unnecessarily causing us to download extra, old Hadoop
jars, and the new Hadoop jars seem to work just as well. This is the
kind of divergence that centralizing the versions into variables will
help with.

Creates variables for:
  junit.version
  slf4j.version
  hadoop.version
  commons-io.version
  httpcomponents.core.version
  thrift.version
  kite.version (controlled via $IMPALA_KITE_VERSION in impala-config.sh)

Cleans up unused IMPALA_PARQUET_URL variables in impala-config.sh. We
only download Parquet via Maven, rather than downloading it in the
toolchain, so this variable wasn't doing anything.

I ran the core tests with this change.

Change-Id: I717e0625dfe0fdbf7e9161312e9e80f405a359c5
Reviewed-on: http://gerrit.cloudera.org:8080/8853
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-20 22:04:18 +00:00

This is a tool to convert a nested dataset to an unnested dataset. The source and/or
destination can be the local file system or HDFS.

Structs are converted to columns (with long names). Arrays and maps are converted to
separate tables, which can be joined with the parent table on the id column.
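
To make the mapping concrete, here is a minimal Python sketch of the idea, not the
tool's actual implementation. The underscore-joined column names, the "parent_id" and
"value" column names, and the child table naming are assumptions made for
illustration; the real tool may name things differently. The sketch treats dicts as
structs and handles arrays of scalars only; maps would be analogous, with key and
value columns.

```python
from typing import Any

Row = dict[str, Any]


def flatten(records: list[Row], table: str = "parent") -> dict[str, list[Row]]:
    """Flatten nested records into {table name: list of flat rows}.

    Hypothetical conventions: the parent row gets an integer "id", struct
    fields become "<struct>_<field>" columns, and each array becomes a
    child table named "<table>_<array>" with "parent_id" and "value".
    """
    tables: dict[str, list[Row]] = {table: []}
    for row_id, record in enumerate(records):
        row: Row = {"id": row_id}
        _flatten_into(record, "", row, row_id, table, tables)
        tables[table].append(row)
    return tables


def _flatten_into(value: Row, prefix: str, row: Row, parent_id: int,
                  table: str, tables: dict[str, list[Row]]) -> None:
    for key, field in value.items():
        name = f"{prefix}{key}"
        if isinstance(field, dict):
            # Struct: fold its fields into the parent row under longer names.
            _flatten_into(field, name + "_", row, parent_id, table, tables)
        elif isinstance(field, list):
            # Array: one child row per element, linked back by parent_id.
            child = tables.setdefault(f"{table}_{name}", [])
            for item in field:
                child.append({"parent_id": parent_id, "value": item})
        else:
            row[name] = field


def join_on_id(parents: list[Row], children: list[Row]) -> list[Row]:
    """Join child rows back to their parent rows on the id column."""
    by_id = {p["id"]: p for p in parents}
    return [{**by_id[c["parent_id"]], "value": c["value"]} for c in children]


records = [
    {"name": "a", "address": {"city": "X", "zip": "1"}, "tags": ["t1", "t2"]},
    {"name": "b", "address": {"city": "Y", "zip": "2"}, "tags": []},
]
tables = flatten(records)
joined = join_on_id(tables["parent"], tables["parent_tags"])
```

In this sketch the "address" struct disappears into the "address_city" and
"address_zip" columns of the parent table, while "tags" moves to its own table whose
rows can be joined back to the parent on the id.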

$ mvn exec:java \
    -Dexec.mainClass=org.apache.impala.infra.tableflattener.Main \
    -Dexec.arguments="file:///tmp/in.parquet,file:///tmp/out,-sfile:///tmp/in.avsc"

$ mvn exec:java \
    -Dexec.mainClass=org.apache.impala.infra.tableflattener.Main \
    -Dexec.arguments="hdfs://localhost:20500/nested.avro,file://$PWD/unnested"
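
When the tool is driven from a script, the comma-separated -Dexec.arguments value
shown in the commands above can be assembled programmatically. The following is a
small Python sketch; the helper name build_flattener_command is hypothetical, and
rejecting arguments that contain a comma is an assumption made here because the
comma acts as the separator.

```python
import shlex

MAIN_CLASS = "org.apache.impala.infra.tableflattener.Main"


def build_flattener_command(args: list[str]) -> list[str]:
    """Return the mvn argv that runs the flattener with the given arguments.

    Hypothetical helper: arguments are joined with commas, mirroring the
    -Dexec.arguments form used in the README examples.
    """
    for arg in args:
        if "," in arg:
            # Assumption: a comma inside an argument would be read as a separator.
            raise ValueError(f"argument must not contain a comma: {arg!r}")
    return [
        "mvn",
        "exec:java",
        f"-Dexec.mainClass={MAIN_CLASS}",
        "-Dexec.arguments=" + ",".join(args),
    ]


cmd = build_flattener_command(
    ["file:///tmp/in.parquet", "file:///tmp/out", "-sfile:///tmp/in.avsc"]
)
# Print a shell-ready command line; pass cmd to subprocess.run to execute it.
print(shlex.join(cmd))
```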

There are various options to specify the type of input file, but the output is always
Parquet with Snappy compression.

For additional help, use the following command:
$ mvn exec:java \
    -Dexec.mainClass=org.apache.impala.infra.tableflattener.Main -Dexec.arguments="--help"

This tool is used by testdata/bin/generate-load-nested.sh.