impala/testdata/multi_compression_parquet_data
stiga-huang 192cd96d9e IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Parquet splits with multiple columns are marked as completed by calling
HdfsScanNodeBase::RangeComplete(), which counts the file type once per column
codec. As a result, the reported number of Parquet splits is the real count
multiplied by the number of materialized columns.

Furthermore, the Parquet format allows different columns to use different
compression codecs. This patch handles that case as well. A Parquet file
using both the gzip and snappy compression codecs will be reported as:
	FileFormats: PARQUET/(GZIP,SNAPPY):1

This patch introduces a set of compression types to cover the above cases.
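
The counting idea can be illustrated with a minimal, standalone sketch. It is
not Impala's actual code: the names Codec, SplitCounts, RecordSplit and
FormatCounts are hypothetical, and the un-parenthesized form used for a single
codec is an assumption. The point is only that collapsing the per-column
codecs into a set before counting makes each split count once, regardless of
how many columns it materializes.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical names throughout; this is not Impala's implementation.
enum class Codec { NONE, GZIP, SNAPPY };

std::string CodecName(Codec c) {
  switch (c) {
    case Codec::GZIP: return "GZIP";
    case Codec::SNAPPY: return "SNAPPY";
    default: return "NONE";
  }
}

// Key: set of codecs used by a split's columns. Value: number of splits.
using SplitCounts = std::map<std::set<Codec>, int>;

// Records one completed split. The per-column codecs are collapsed into a
// set first, so the split is counted once however many columns it has.
void RecordSplit(SplitCounts& counts, const std::vector<Codec>& column_codecs) {
  std::set<Codec> codecs(column_codecs.begin(), column_codecs.end());
  ++counts[codecs];
}

// Formats the counters in the style "PARQUET/(GZIP,SNAPPY):1".
// Assumption: a single codec is printed without parentheses.
std::string FormatCounts(const std::string& file_type,
                         const SplitCounts& counts) {
  std::string out;
  for (const auto& entry : counts) {
    const std::set<Codec>& codecs = entry.first;
    if (!out.empty()) out += " ";
    out += file_type + "/";
    if (codecs.size() > 1) out += "(";
    bool first = true;
    for (Codec c : codecs) {
      if (!first) out += ",";
      out += CodecName(c);
      first = false;
    }
    if (codecs.size() > 1) out += ")";
    out += ":" + std::to_string(entry.second);
  }
  return out;
}

// Usage: two splits that each mix gzip and snappy across two columns.
void Demo() {
  SplitCounts counts;
  RecordSplit(counts, {Codec::GZIP, Codec::SNAPPY});
  RecordSplit(counts, {Codec::SNAPPY, Codec::GZIP});
  assert(FormatCounts("PARQUET", counts) == "PARQUET/(GZIP,SNAPPY):2");
}
```

Because the key is an ordered set, the column order within a file does not
matter in this sketch: gzip/snappy and snappy/gzip land in the same bucket.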

Testing:
Added end-to-end tests for Parquet files with all columns compressed with
snappy, and for Parquet files with multiple compression codecs.

Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Reviewed-on: http://gerrit.cloudera.org:8080/8147
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-10-10 01:30:33 +00:00

These Parquet files were created by modifying Impala's HdfsParquetTableWriter.

String Data
-----------
These files have two string columns, 'a' and 'b'. Each column uses a different compression type.

tinytable_0_gzip_snappy.parq: column 'a' is compressed with gzip, column 'b' is compressed with snappy
tinytable_1_gzip_snappy.parq: column 'a' is compressed with snappy, column 'b' is compressed with gzip