IMPALA-11134: Impala returns "Couldn't skip rows in file" error for old Parquet file

Impala returns "Couldn't skip rows in file" error for old Parquet
file written by an old Impala (e.g. Impala 2.5, 2.6) In DEBUG build
Impala crashes by a DCHECK:

 Check failed: num_buffered_values_ > 0 (-1 vs. 0)

The problem is that in some old Parquet files there can be a mismatch
between 'num_values' in a page and the encoded def/rep levels.
There is usually one more def/rep levels encoded in these files.

In SkipTopLevelRows() we skipped values based on how many def levels are
92ce6fe48e/be/src/exec/parquet/parquet-column-readers.cc (L1308-L1314)

Since there are more def levels than values in some old files,
num_buferred_values_ could become negative.

This patch also takes the value of num_buferred_values_ into account
when calculating 'read_count', so we can deal with such files. With
this patch we also include the column name in the "Couldn't skip rows"
error message, so in the future it'll be easier to identify the
problematic columns.

Testing:
 * added Parquet file written by Impala 2.5 and e2e test for it

Change-Id: I568fe59df720ea040be4926812412ba4c1510a26
Reviewed-on: http://gerrit.cloudera.org:8080/18257
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit is contained in:
Zoltan Borok-Nagy
2022-02-19 18:21:49 +01:00
committed by Impala Public Jenkins
parent f0d4afaff1
commit b60ccabd5b
6 changed files with 28 additions and 4 deletions

View File

@@ -475,6 +475,8 @@ error_codes = (
("JWKS_PARSE_ERROR", 153, "Error parsing JWKS: $0."),
("JWT_VERIFY_FAILED", 154, "Error verifying JWT Token: $0."),
("PARQUET_ROWS_SKIPPING", 155, "Couldn't skip rows in column '$0' in file '$1'."),
)
import sys