impala/testdata/data/customer_nested_multiblock_multipage.parquet
Zoltan Borok-Nagy c56cd7b214 IMPALA-11780: Wrong FILE__POSITION values for multi row group Parquet files when page filtering is used
Impala generated wrong values for the FILE__POSITION column when the
Parquet file contained multiple row groups and page filtering was
used as well.

We use the value of 'current_row_' in the Parquet column readers to
populate the file position slot. The problem is that 'current_row_'
denotes the index of the row within the row group, not within the
file. We cannot change 'current_row_' because page filtering depends
on its value: the page index also uses row group-based row indexes,
not file-level indexes.
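
To illustrate the distinction, here is a minimal sketch of deriving a file-level
position from a row group-relative index. It assumes the per-row-group row counts
are known from the file footer. FirstRowIndexes() and FilePosition() are
hypothetical helpers for illustration, not functions from the Impala code base,
and this is not necessarily how the patch implements the fix.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not Impala code): given the number of rows in each row
// group, returns the file-level index of the first row of every row group.
std::vector<int64_t> FirstRowIndexes(const std::vector<int64_t>& row_group_num_rows) {
  std::vector<int64_t> first_row_idx;
  first_row_idx.reserve(row_group_num_rows.size());
  int64_t rows_so_far = 0;
  for (int64_t num_rows : row_group_num_rows) {
    first_row_idx.push_back(rows_so_far);
    rows_so_far += num_rows;
  }
  return first_row_idx;
}

// Hypothetical helper: 'current_row' is relative to the row group, so the file
// position is that index plus the number of rows in all preceding row groups.
int64_t FilePosition(int64_t row_group_first_row, int64_t current_row) {
  assert(row_group_first_row >= 0 && current_row >= 0);
  return row_group_first_row + current_row;
}
```

With row groups of 100, 50 and 75 rows, row 7 of the second row group is at file
position 107, while the row group-relative index stays 7, which is what the page
index expects.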

It also turned out that FILE__POSITION was not set correctly in the
Parquet late materialization code, because
BaseScalarColumnReader::SkipRowsInternal() didn't update 'current_row_'
in some code paths.
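
The invariant behind this can be sketched with a toy reader: every path that
consumes rows, whether it materializes them or skips them, has to advance the row
counter. ToyColumnReader below is a hypothetical class for illustration only. It
does not mirror the real BaseScalarColumnReader interface or its SkipRowsInternal()
logic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical toy reader (not Impala's BaseScalarColumnReader).
class ToyColumnReader {
 public:
  explicit ToyColumnReader(int64_t num_rows) : num_rows_(num_rows) {}

  // Materializes the next row. Returns false at the end of the row group.
  bool ReadRow() {
    if (current_row_ + 1 >= num_rows_) return false;
    ++current_row_;
    return true;
  }

  // Skips up to 'n' rows without materializing them. Returns the number of
  // rows actually skipped.
  int64_t SkipRows(int64_t n) {
    assert(n >= 0);
    int64_t remaining = num_rows_ - (current_row_ + 1);
    int64_t skipped = std::min(n, remaining);
    // The counter must advance on the skip path too, otherwise any position
    // derived from it is stale for the rows that follow.
    current_row_ += skipped;
    return skipped;
  }

  int64_t current_row() const { return current_row_; }

 private:
  int64_t num_rows_;
  // Index of the last row read or skipped within the row group; -1 before the
  // first row.
  int64_t current_row_ = -1;
};
```

If SkipRows() left the counter untouched, a row read after skipping three rows
would report index 1 instead of 4.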

The value of FILE__POSITION is critical for Iceberg V2 tables as
position delete files store file positions of the deleted rows.
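
A minimal sketch of why this matters: applying position deletes amounts to
dropping the rows whose file position appears in the delete set, so a wrong
position deletes the wrong row. ApplyPositionDeletes() is a hypothetical helper
that assumes the deleted positions for a single data file are already loaded into
a set. It is not Impala's actual delete handling.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not Impala code): returns the file positions that
// survive after applying the position deletes of one data file.
std::vector<int64_t> ApplyPositionDeletes(
    const std::vector<int64_t>& file_positions,
    const std::unordered_set<int64_t>& deleted_positions) {
  std::vector<int64_t> kept;
  for (int64_t pos : file_positions) {
    assert(pos >= 0);
    // A row is dropped only if its file-level position was recorded as deleted.
    if (deleted_positions.count(pos) == 0) kept.push_back(pos);
  }
  return kept;
}
```

For example, with deletes recorded at positions 1 and 100, a scan that reported
row group-relative indexes for a second row group starting at file position 100
would present that row group's second row as position 1 and drop it by mistake.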

Testing:
 * added e2e tests
 * the tests now run without PARQUET_READ_STATISTICS to exercise
   more code paths

Change-Id: I5ef37a1aa731eb54930d6689621cd6169fed6605
Reviewed-on: http://gerrit.cloudera.org:8080/19328
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-12-08 23:07:08 +00:00

788 KiB