mirror of
https://github.com/apache/impala.git
synced 2026-02-03 09:00:39 -05:00
Before this fix Impala did not check whether a timestamp's time part is out of the valid [0, 24 hour) range when reading Parquet files, so these timestamps were memcopied as they were to slots, leading to results like: 1970-01-01 -00:00:00.000000001 1970-01-01 24:00:00 Different parts of Impala treat these timestamp differently: - string conversion leads to invalid representation that cannot be converted back to timestamp - timezone conversions handle the overflowing time part and give a valid timestamp result (at least since CCTZ, I did not check older versions of Impala) - Parquet writing inserts these timestamp as they are, so the resulting Parquet file will also contain corrupt timestamps The fix adds a check that converts these corrupt timestamps to NULL, similarly to the handling of timestamp outside the [1400..10000) range. A new error code is added for this case. If both the date and the time part is corrupt, then error about corrupt time is returned. Testing: - added a new scanner test that reads a corrupted Parquet file with edge values Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058 Reviewed-on: http://gerrit.cloudera.org:8080/11521 Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>