Adds the following new stats:
* ParquetCompressedPageSize - a summary (average, min, max) counter
  that tracks the size of compressed pages read; if no compressed
  pages are read, this counter is empty
* ParquetUncompressedPageSize - a summary counter that tracks the size
  of uncompressed pages read; it is updated in two places: (1) when a
  compressed page is decompressed, and (2) when a page that is not
  compressed is read
* ParquetCompressedDataReadPerColumn - a summary counter that tracks
  the amount of compressed data read per column for a scan node
* ParquetUncompressedDataReadPerColumn - a summary counter that tracks
  the amount of uncompressed data read per column for a scan node

The PerColumn counters are calculated by aggregating the number of
bytes read for each column across all scan ranges processed by a scan
node. Each sample in the counter is the total number of bytes read for
a single column.

Here is an example of what the updated HDFS scan profile looks like:

  - ParquetCompressedDataReadPerColumn: (Avg: 227.56 KB (233018) ; Min: 225.14 KB (230540) ; Max: 229.98 KB (235496) ; Number of samples: 2)
  - ParquetUncompressedDataReadPerColumn: (Avg: 227.96 KB (233426) ; Min: 224.91 KB (230306) ; Max: 231.00 KB (236547) ; Number of samples: 2)
  - ParquetCompressedPageSize: (Avg: 4.46 KB (4568) ; Min: 3.86 KB (3955) ; Max: 5.19 KB (5315) ; Number of samples: 102)
  - ParquetDecompressedPageSize: (Avg: 4.47 KB (4576) ; Min: 3.86 KB (3950) ; Max: 5.22 KB (5349) ; Number of samples: 102)

Testing:
* Added new tests to test_scanners.py that do some basic validation of
  the new counters above

Change-Id: I322f9b324b6828df28e5caf79529085c43d7c817
Reviewed-on: http://gerrit.cloudera.org:8080/11575
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
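
For illustration only, the per-column aggregation described above can be sketched in Python. This is not the Impala implementation: the function name `summarize_per_column`, the column names, and the split of bytes across scan ranges are all hypothetical. Only the per-column totals are chosen to match the ParquetCompressedDataReadPerColumn example in the profile above.

```python
from collections import defaultdict


def summarize_per_column(scan_ranges):
    """Aggregate bytes read per column across scan ranges, then summarize.

    scan_ranges: iterable of dicts mapping column name -> bytes read for
    that column within one scan range (hypothetical input shape).
    Returns None when nothing was read, mirroring an empty counter.
    """
    totals = defaultdict(int)
    for range_bytes in scan_ranges:
        for column, nbytes in range_bytes.items():
            totals[column] += nbytes

    # Each sample in the summary is the total for one column.
    samples = list(totals.values())
    if not samples:
        return None
    return {
        "avg": sum(samples) / len(samples),
        "min": min(samples),
        "max": max(samples),
        "num_samples": len(samples),
    }


# Hypothetical split of two columns across two scan ranges.
ranges = [
    {"id": 100000, "name": 120000},
    {"id": 130540, "name": 115496},
]
summary = summarize_per_column(ranges)
print(summary)
```

With two columns the summary has two samples regardless of how many scan ranges were processed, which is why the example profile reports "Number of samples: 2" for the PerColumn counters.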