Mirror of https://github.com/apache/impala.git (synced 2026-02-03 18:00:39 -05:00)
Impala generated wrong values for the FILE__POSITION column when the
Parquet file contained multiple row groups and page filtering was
used as well.

We are using the value of 'current_row_' in the Parquet column readers
to populate the file position slot. The problem is that 'current_row_'
denotes the index of the row within the row group and not within the
file. We cannot change 'current_row_' as page filtering depends on its
value, as the page index also uses the row group-based indexes of the
rows, not the file indexes.

In the meantime it turned out FILE__POSITION was also not set correctly
in the Parquet late materialization code, as
BaseScalarColumnReader::SkipRowsInternal() didn't update 'current_row_'
in some code paths.

The value of FILE__POSITION is critical for Iceberg V2 tables as
position delete files store file positions of the deleted rows.

Testing:
 * added e2e tests
 * the tests are now running w/o PARQUET_READ_STATISTICS to exercise
   more code paths

Change-Id: I5ef37a1aa731eb54930d6689621cd6169fed6605
Reviewed-on: http://gerrit.cloudera.org:8080/19328
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
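
The distinction between a row-group-relative index and a file-level position can be illustrated with a small sketch. This is not the code from the change itself: `RowGroupInfo`, `FirstRowIndexOfRowGroup()` and `FilePosition()` are hypothetical names introduced only for illustration, and the sketch assumes the file position of a row equals the number of rows in all preceding row groups plus the row's index within its own row group (the value that 'current_row_' is described as holding).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a Parquet row group; only the row count matters here.
struct RowGroupInfo {
  int64_t num_rows;
};

// Returns the file-level index of the first row of row group 'row_group_idx',
// i.e. the total number of rows in all preceding row groups.
int64_t FirstRowIndexOfRowGroup(
    const std::vector<RowGroupInfo>& row_groups, size_t row_group_idx) {
  int64_t first_row = 0;
  for (size_t i = 0; i < row_group_idx; ++i) {
    first_row += row_groups[i].num_rows;
  }
  return first_row;
}

// Converts a row-group-relative row index into a file-level position.
// The row-group-relative index is left untouched, so anything that relies on
// it (such as page-index based filtering) keeps working with the same values.
int64_t FilePosition(const std::vector<RowGroupInfo>& row_groups,
    size_t row_group_idx, int64_t row_in_row_group) {
  assert(row_group_idx < row_groups.size());
  assert(row_in_row_group >= 0);
  return FirstRowIndexOfRowGroup(row_groups, row_group_idx) + row_in_row_group;
}
```

With row groups of 100, 50 and 25 rows, row 3 of the third row group sits at file position 153 under this assumption, whereas using the row-group-relative index alone would report 3.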
788 KiB