mirror of
https://github.com/apache/impala.git
synced 2025-12-26 14:02:53 -05:00
This adds support for reading Parquet files where a DECIMAL is encoded as a FIXED_LEN_BYTE_ARRAY field with extra padding. This requires loosening file validation and fixing up the decoding so that it no longer assumes that the in-memory value is at least as large as the encoded representation.

The decimal decoding logic was reworked so that we could add the extra condition handling without regressing the performance of the decoding logic in the common case. In the end I was able to significantly speed up the decoding logic. The bottleneck, revealed by perf record while running the benchmark below, was CPU stalls on the bit-shift instruction used for sign extension, which was waiting on the result of ByteSwap(). I worked around this by doing the sign extension before the ByteSwap().

Perf:
Ran a microbenchmark to check that scanning perf didn't regress as a result of the change. The query scans a DECIMAL column that is mostly plain-encoded, so as to maximally stress the FIXED_LEN_BYTE_ARRAY decoding performance.

  set mt_dop=1;
  set num_nodes=1;
  select min(l_extendedprice) from tpch_parquet.lineitem

The SCAN time in the summary averaged 94ms before the change and 74ms after the change. The actual speedup of the DECIMAL decoding is greater: it went from ~20% of time to ~6% of time as measured by perf.

Testing:
Added a couple of Parquet files that were generated with a hacked version of Impala to have extra padding. Sanity-checked that the hacked tables returned the same results on Hive. The tests failed before this code change.

Ran exhaustive tests with the hacked version of Impala (so that all decimal tables got extra padding).

Change-Id: I2700652eab8ba7f23ffa75800a1712d310d4e1ec
Reviewed-on: http://gerrit.cloudera.org:8080/16090
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
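
Illustration (not part of the change itself): a minimal, portable sketch of decoding a padded FIXED_LEN_BYTE_ARRAY decimal, assuming the Parquet convention of a big-endian two's-complement unscaled value and an 8-byte in-memory target. The function name DecodeBigEndianDecimal and its signature are hypothetical and are not Impala's actual API. The sketch fills the buffer with the sign byte first and only then reorders bytes, which mirrors the "sign-extend before swapping" ordering in spirit but makes no attempt to reproduce the optimized code path or its performance.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not Impala's API): decodes a big-endian
// two's-complement integer of `len` bytes into an int64_t.
// Returns false if the value does not fit into 8 bytes.
bool DecodeBigEndianDecimal(const uint8_t* buf, int len, int64_t* out) {
  assert(len > 0);
  // 0xFF for negative values, 0x00 for non-negative ones.
  const uint8_t sign_byte = (buf[0] & 0x80) ? 0xFF : 0x00;

  if (len > 8) {
    // Extra padding: every leading byte beyond the 8 we keep must be
    // pure sign extension, otherwise the value overflows the target.
    const int pad = len - 8;
    for (int i = 0; i < pad; ++i) {
      if (buf[i] != sign_byte) return false;
    }
    // The top bit of the first retained byte must agree with the sign.
    const uint8_t kept_sign = (buf[pad] & 0x80) ? 0xFF : 0x00;
    if (kept_sign != sign_byte) return false;
    buf += pad;
    len = 8;
  }

  // Sign-extend first: pre-fill with the sign byte, then place the
  // encoded bytes at the low-order (right-hand) end of the buffer.
  uint8_t be[8];
  std::memset(be, sign_byte, sizeof(be));
  std::memcpy(be + (8 - len), buf, len);

  // Then convert from big-endian to the native integer representation.
  uint64_t v = 0;
  for (int i = 0; i < 8; ++i) v = (v << 8) | be[i];
  std::memcpy(out, &v, sizeof(v));
  return true;
}
```

With this sketch, a 10-byte encoding consisting of nine 0xFF bytes followed by 0xFE decodes to -2, the same value as the unpadded 1-byte encoding 0xFE, while a 9-byte encoding whose leading byte is 0x01 is rejected because it does not fit.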