Impala detects the HDFS version by reading the Namenode web UI and run
the corresponding check.
On 4.1, Impala tries to check the datanode (server side) config by reading
the datanode web UI.
the average number of active scanner thread (i.e. those that are not blocked by
IO).
In the hdfs-scan-node, whenever a thread is started, it will increment the
active_scanner_thread_counter_. When a scanner thread enter the
scan-range-context's GetRawBytes or GetBytes, the counter will be decremented.
A new sampling thread is created to sample the value of
active_scanner_thread_counter_ and compute the average.
A bucket couting of HdfsReadThreadConcurrent is also added.
The output of the hdfs-scan-node profile is also updated. Here's the new output
for hdfs-scan-node after running count(*) from tpch.lineitem.
HDFS_SCAN_NODE (id=0):(10s254ms 99.75%)
File Formats: TEXT/NONE:12
Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:6/351.21M
(351208888) 1:6/402.65M (402653184)
- AverageHdfsReadThreadConcurrency: 1.95
- HdfsReadThreadConcurrencyCountPercentage=0: 0.00
- HdfsReadThreadConcurrencyCountPercentage=1: 5.00
- HdfsReadThreadConcurrencyCountPercentage=2: 95.00
- HdfsReadThreadConcurrencyCountPercentage=3: 0.00
- AverageScannerThreadConcurrency: 0.15
- BytesRead: 718.94 MB
- MemoryUsed: 0.00
- NumDisksAccessed: 2
- PerReadThreadRawHdfsThroughput: 36.75 MB/sec
- RowsReturned: 6.00M (6001215)
- RowsReturnedRate: 585.25 K/sec
- ScanRangesComplete: 12
- ScannerThreadsInvoluntaryContextSwitches: 168
- ScannerThreadsTotalWallClockTime: 1m40s
- DelimiterParseTime: 2s128ms
- MaterializeTupleTime: 723.0us
- ScannerThreadsSysTime: 10.0ms
- ScannerThreadsUserTime: 2s090ms
- ScannerThreadsVoluntaryContextSwitches: 99
- TotalRawHdfsReadTime: 19s561ms
- TotalReadThroughput: 68.69 MB/sec