Files
impala/testdata/data/decimal_padded_fixed_len_byte_array.parquet
Tim Armstrong 2c76ff5e6e IMPALA-2515: support parquet decimal with extra padding
This adds support for reading Parquet files where the
DECIMAL is encoded as a FIXED_LEN_BYTE_ARRAY field
with extra padding. This requires loosening file
validation and fixing up the decoding so that it
no longer assumes that the in-memory value is at
least as large as the encoded representation.

The decimal decoding logic was reworked so that we
could add the extra condition handling without regressing
performance of the decoding logic in the common case. In
the end I was able to significantly speed up the decoding
logic. The bottleneck, revealed by perf record while running
the below benchmark, was CPU stalls on the bitshift used for
sign extension instruction waiting on loading the result of
ByteSwap(). I worked around this by doing the sign-extension
before the ByteSwap(),

Perf:
Ran a microbenchmark to check that scanning perf didn't regress
as a result of the change. The query scans a DECIMAL column
that is mostly plain-encoded, so to maximally stress the
FIXED_LEN_BYTE_ARRAY decoding performance.

  set mt_dop=1;
  set num_nodes=1;
  select min(l_extendedprice)
  from tpch_parquet.lineitem

The SCAN time in the summary averaged out to 94ms before
the change and is reduced to 74ms after the change. The
actual speedup of the DECIMAL decoding is greater - it
went from ~20% of time in to ~6% of time as measured by
perf.

Testing:
Added a couple of parquet files that were generated with a
hacked version of Impala to have extra padding. Sanity-checked
that hacked tables returned the same results on Hive. The
tests failed before this code change.

Ran exhaustive tests with the hacked version of Impala
(so that all decimal tables got extra padding).

Change-Id: I2700652eab8ba7f23ffa75800a1712d310d4e1ec
Reviewed-on: http://gerrit.cloudera.org:8080/16090
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-18 21:00:27 +00:00

1.8 KiB