IMPALA-14619: Reset levels_readahead_ for late materialization

Previously, `BaseScalarColumnReader::levels_readahead_` was not reset
when the reader did not do page filtering. If a query selected the last
row containing a collection value in a row group, `levels_readahead_`
would be set and would not be reset when advancing to the next row
group without page filtering. As a result, trying to skip collection
values at the start of the next row group would cause a check failure.
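
As an illustration of the failure mode, here is a minimal hypothetical
sketch (simplified names and structure, not the actual reader code) of a
per-row-group flag that survives a reset:

    #include <cassert>

    // Hypothetical, simplified illustration of the stale-flag bug;
    // not Impala's actual reader code.
    struct ColumnReaderSketch {
      bool levels_readahead_ = false;  // true once def/rep levels were read ahead

      // Called when advancing to the next row group.
      void Reset() {
        levels_readahead_ = false;  // the fix: clear any readahead state
      }

      // Late materialization skips collection values at the start of a row
      // group; the skip path assumes no levels have been read ahead yet.
      void SkipCollectionValues() {
        assert(!levels_readahead_);  // fails if Reset() forgot to clear the flag
      }
    };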

This patch fixes the failure by resetting `levels_readahead_` in
`BaseScalarColumnReader::Reset()`, which is always called when advancing
to the next row group.

`levels_readahead_` is also moved out of the "Members used for page
filtering" section, since the variable is used in late materialization as
well.

Testing:
- Added an E2E test for the fix.

Change-Id: Idac138ffe4e1a9260f9080a97a1090b467781d00
Reviewed-on: http://gerrit.cloudera.org:8080/23779
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Author: Xuebin Su
Date: 2025-12-11 17:18:59 +08:00
Committer: Impala Public Jenkins
Commit: d54b75ccf1 (parent: 2ebdc05c1d)
4 changed files with 29 additions and 13 deletions

be/src/exec/parquet/parquet-column-readers.cc

@@ -1067,6 +1067,7 @@ Status BaseScalarColumnReader::Reset(const HdfsFileDesc& file_desc,
   pos_current_value_ = ParquetLevel::INVALID_POS;
   row_group_first_row_ = row_group_first_row;
   current_row_ = -1;
+  levels_readahead_ = false;
   vector<ScanRange::SubRange> sub_ranges;
   CreateSubRanges(&sub_ranges);

be/src/exec/parquet/parquet-column-readers.h

@@ -452,6 +452,19 @@ class BaseScalarColumnReader : public ParquetColumnReader {
   /// processed the first (zeroeth) row.
   int64_t current_row_ = -1;
+  /// This flag is needed for the proper tracking of the last processed row.
+  /// The batched and non-batched interfaces behave differently. E.g. when using the
+  /// batched interface you don't need to invoke NextLevels() in advance, while you need
+  /// to do that for the non-batched interface. In fact, the batched interface doesn't
+  /// call NextLevels() at all. It directly reads the levels then the corresponding value
+  /// in a loop. On the other hand, the non-batched interface (ReadValue()) expects that
+  /// the levels for the next value are already read via NextLevels(). And after reading
+  /// the value it calls NextLevels() to read the levels of the next value. Hence, the
+  /// levels are always read ahead in this case.
+  /// Returns true, if we read ahead def and rep levels. In this case 'current_row_'
+  /// points to the row we'll process next, not to the row we already processed.
+  bool levels_readahead_ = false;
   /////////////////////////////////////////
   /// BEGIN: Members used for page filtering
   /// They are not set when we don't filter out pages at all.
@@ -475,19 +488,6 @@ class BaseScalarColumnReader : public ParquetColumnReader {
   /// rows and increment this field.
   int current_row_range_ = 0;
-  /// This flag is needed for the proper tracking of the last processed row.
-  /// The batched and non-batched interfaces behave differently. E.g. when using the
-  /// batched interface you don't need to invoke NextLevels() in advance, while you need
-  /// to do that for the non-batched interface. In fact, the batched interface doesn't
-  /// call NextLevels() at all. It directly reads the levels then the corresponding value
-  /// in a loop. On the other hand, the non-batched interface (ReadValue()) expects that
-  /// the levels for the next value are already read via NextLevels(). And after reading
-  /// the value it calls NextLevels() to read the levels of the next value. Hence, the
-  /// levels are always read ahead in this case.
-  /// Returns true, if we read ahead def and rep levels. In this case 'current_row_'
-  /// points to the row we'll process next, not to the row we already processed.
-  bool levels_readahead_ = false;
   /// END: Members used for page filtering
   /////////////////////////////////////////
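
For context, the comment block moved above contrasts the batched and
non-batched reader interfaces. A rough hypothetical sketch of the two call
patterns (invented stand-in type, not Impala's actual column reader API):

    // Hypothetical stand-in; not Impala's actual column reader API.
    struct ReaderSketch {
      bool levels_readahead_ = false;

      // Reads the def/rep levels of the next value ahead of consuming it.
      void NextLevels() { levels_readahead_ = true; }

      // Non-batched: consumes the value whose levels were already read, then
      // immediately reads the levels of the next value, so the levels stay
      // read ahead between calls.
      void ReadValue() { NextLevels(); }

      // Batched: reads levels and the corresponding values together in a
      // loop, never leaving levels read ahead between calls.
      void ReadValueBatch(int /*num_values*/) { levels_readahead_ = false; }
    };

    // Non-batched usage: NextLevels() is invoked once up front.
    //   ReaderSketch r;
    //   r.NextLevels();
    //   r.ReadValue();           // levels_readahead_ stays true
    //
    // Batched usage: no up-front NextLevels() call at all.
    //   ReaderSketch b;
    //   b.ReadValueBatch(1024);  // levels_readahead_ stays false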