IMPALA-8617: Add support for lz4 in parquet

A new enum value LZ4_BLOCKED was added to the THdfsCompression enum, to
distinguish it from the existing LZ4 codec. LZ4_BLOCKED codec represents
the block compression scheme used by Hadoop. It is similar to
SNAPPY_BLOCKED as far as the block format is concerned; the only
difference is the codec used for compression and decompression.

Added Lz4BlockCompressor and Lz4BlockDecompressor classes for
compressing and decompressing parquet data using Hadoop's
lz4 block compression scheme.

The Lz4BlockCompressor treats the input
as a single block and generates a compressed block with the following layout:
  <4 byte big endian uncompressed size>
  <4 byte big endian compressed size>
  <lz4 compressed block>
The HdfsParquetTableWriter calls the Lz4BlockCompressor
with the ideal input size (the unit of compression in parquet is a page),
so the Lz4BlockCompressor does not further break the input down
into smaller blocks.
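The framing above can be sketched as follows. This is a minimal Python
illustration of the block layout only, with zlib standing in for lz4
(Python's standard library has no lz4 bindings); the actual implementation
is C++ in Impala's codec classes:

```python
import struct
import zlib


def block_compress(data: bytes) -> bytes:
    """Frame `data` as a single Hadoop-style compressed block:
         <4-byte big-endian uncompressed size>
         <4-byte big-endian compressed size>
         <compressed payload>
    zlib is a stand-in for lz4 here; only the framing is the point."""
    payload = zlib.compress(data)
    return struct.pack(">II", len(data), len(payload)) + payload
```

Because the whole input is framed as one block, the writer controls the
block size simply by choosing how much data it hands to the compressor.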

The Lz4BlockDecompressor, on the other hand, should be compatible with
blocks written by Impala and other engines in the Hadoop ecosystem. It can
decompress data in the following format:
  <4 byte big endian uncompressed size>
  <4 byte big endian compressed size>
  <lz4 compressed block>
  ...
  <4 byte big endian compressed size>
  <lz4 compressed block>
  ...
  <repeated until the uncompressed size from the outer block is consumed>
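That decompression loop can be sketched in Python, again with zlib
standing in for lz4 and with hypothetical helper names (not Impala's
actual C++ API); the error messages mirror the error codes this commit
adds:

```python
import struct
import zlib


def block_decompress(block: bytes) -> bytes:
    """Decode a Hadoop-style compressed block: one outer 4-byte
    big-endian uncompressed size, then one or more
    <4-byte big-endian compressed size><compressed payload> pairs,
    repeated until the outer uncompressed size is consumed."""
    total, = struct.unpack(">I", block[:4])
    pos, out = 4, bytearray()
    while len(out) < total:
        if pos + 4 > len(block):
            raise ValueError("LZ4Block: Invalid input length.")
        clen, = struct.unpack(">I", block[pos:pos + 4])
        pos += 4
        if pos + clen > len(block):
            raise ValueError(
                "LZ4Block: Invalid compressed length. Data is likely corrupt.")
        out += zlib.decompress(block[pos:pos + clen])  # lz4 in the real codec
        pos += clen
    if len(out) != total:
        raise ValueError("LZ4Block: Decompressed size is not correct.")
    return bytes(out)
```

Note the loop keys off the outer uncompressed size, so a block written by
Impala (one inner block) and a block written by Hadoop (several inner
blocks) are both handled by the same code path.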

Externally, users can now set the lz4 codec for parquet using:
  set COMPRESSION_CODEC=lz4
This is translated to the LZ4_BLOCKED codec for the
HdfsParquetTableWriter. Similarly, when reading lz4-compressed parquet
data, the LZ4_BLOCKED codec is used.

Testing:
 - Added unit tests for LZ4_BLOCKED in decompress-test.cc
 - Added unit tests for Hadoop compatibility in decompress-test.cc,
   i.e., decompressing an outer block containing multiple inner
   blocks (the Lz4BlockDecompressor format described above)
 - Added interoperability tests for Hive and Impala for all parquet
   codecs. New test added to
   tests/custom_cluster/test_hive_parquet_codec_interop.py

Change-Id: Ia6850a39ef3f1e0e7ba48e08eef1d4f7cbb74d0c
Reviewed-on: http://gerrit.cloudera.org:8080/13582
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Author: Abhishek
Date: 2019-06-10 09:55:24 -07:00
Committed by: Impala Public Jenkins
parent 2b76da027d
commit 97a6a3c807
18 changed files with 440 additions and 59 deletions


@@ -416,6 +416,19 @@ error_codes = (
     "The user authorized on the connection '$0' does not match the session username '$1'"),
   ("ZSTD_ERROR", 137, "$0 failed with error: $1"),
+  ("LZ4_BLOCK_DECOMPRESS_DECOMPRESS_SIZE_INCORRECT", 138,
+   "LZ4Block: Decompressed size is not correct."),
+  ("LZ4_BLOCK_DECOMPRESS_INVALID_INPUT_LENGTH", 139,
+   "LZ4Block: Invalid input length."),
+  ("LZ4_BLOCK_DECOMPRESS_INVALID_COMPRESSED_LENGTH", 140,
+   "LZ4Block: Invalid compressed length. Data is likely corrupt."),
+  ("LZ4_DECOMPRESS_SAFE_FAILED", 141, "LZ4: LZ4_decompress_safe failed"),
+  ("LZ4_COMPRESS_DEFAULT_FAILED", 142, "LZ4: LZ4_compress_default failed"),
 )
 import sys