Impala detects the HDFS version by reading the Namenode web UI and runs
the corresponding check.
On 4.1, Impala tries to check the datanode (server side) config by reading
the datanode web UI.
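The version detection described above can be sketched as a small parser over the version string that the Namenode web UI exposes. This is a hypothetical illustration, not Impala's actual code: the struct, the function name, and the "2.0.0-cdh4.1.2" version format are all assumptions, and the HTTP fetch of the web UI page is omitted.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: extract the major.minor components from a version
// string such as "2.0.0-cdh4.1.2", as might be scraped from the Namenode
// web UI. The caller would then dispatch to the matching config check.
struct HdfsVersion {
  int major;
  int minor;
};

HdfsVersion ParseHdfsVersion(const std::string& version_str) {
  HdfsVersion v{0, 0};
  std::istringstream in(version_str);
  char dot;
  // Reads "<major>.<minor>"; anything after the minor component is ignored.
  in >> v.major >> dot >> v.minor;
  return v;
}
```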
This change adds AverageScannerThreadConcurrency: the average number of
active scanner threads (i.e. those that are not blocked by IO).
In the hdfs-scan-node, whenever a scanner thread is started, it increments the
active_scanner_thread_counter_. When a scanner thread enters the
scan-range-context's GetRawBytes or GetBytes, the counter is decremented.
A new sampling thread is created to sample the value of
active_scanner_thread_counter_ and compute the average.
A bucket counting of HdfsReadThreadConcurrency is also added.
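The counter mechanism above can be sketched as follows. The class and method names here are illustrative, not the actual hdfs-scan-node code, and it assumes the counter is re-incremented when GetRawBytes/GetBytes returns (the natural counterpart of the decrement on entry); the sampling thread would call Sample() periodically.

```cpp
#include <atomic>

// Sketch of the concurrency accounting described above. active_ plays the
// role of active_scanner_thread_counter_: it tracks scanner threads doing
// useful work, i.e. not blocked inside GetRawBytes()/GetBytes() waiting on IO.
class ScannerConcurrencySampler {
 public:
  // Called when a scanner thread is started.
  void OnScannerThreadStarted() { ++active_; }
  // Called on entry to GetRawBytes()/GetBytes(), i.e. about to block on IO.
  void OnEnterIoWait() { --active_; }
  // Assumed counterpart: called when the IO call returns.
  void OnExitIoWait() { ++active_; }

  // Called periodically by the dedicated sampling thread.
  void Sample() {
    sum_ += active_.load();
    ++num_samples_;
  }

  // The value reported as AverageScannerThreadConcurrency.
  double AverageConcurrency() const {
    return num_samples_ == 0 ? 0.0
                             : static_cast<double>(sum_) / num_samples_;
  }

 private:
  std::atomic<int> active_{0};
  long sum_ = 0;       // written only by the sampling thread
  int num_samples_ = 0;
};
```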
The output of the hdfs-scan-node profile is also updated. Here's the new output
for hdfs-scan-node after running count(*) from tpch.lineitem.
HDFS_SCAN_NODE (id=0):(10s254ms 99.75%)
File Formats: TEXT/NONE:12
Hdfs split stats (<volume id>:<# splits>/<split lengths>): 0:6/351.21M
(351208888) 1:6/402.65M (402653184)
- AverageHdfsReadThreadConcurrency: 1.95
- HdfsReadThreadConcurrencyCountPercentage=0: 0.00
- HdfsReadThreadConcurrencyCountPercentage=1: 5.00
- HdfsReadThreadConcurrencyCountPercentage=2: 95.00
- HdfsReadThreadConcurrencyCountPercentage=3: 0.00
- AverageScannerThreadConcurrency: 0.15
- BytesRead: 718.94 MB
- MemoryUsed: 0.00
- NumDisksAccessed: 2
- PerReadThreadRawHdfsThroughput: 36.75 MB/sec
- RowsReturned: 6.00M (6001215)
- RowsReturnedRate: 585.25 K/sec
- ScanRangesComplete: 12
- ScannerThreadsInvoluntaryContextSwitches: 168
- ScannerThreadsTotalWallClockTime: 1m40s
- DelimiterParseTime: 2s128ms
- MaterializeTupleTime: 723.0us
- ScannerThreadsSysTime: 10.0ms
- ScannerThreadsUserTime: 2s090ms
- ScannerThreadsVoluntaryContextSwitches: 99
- TotalRawHdfsReadTime: 19s561ms
- TotalReadThroughput: 68.69 MB/sec
- added PlanNode.numNodes, PlanNode.avgRowSize and PlanNode.computeStats()
- fixing up some cardinality estimates
- Planner now tries to do a cost-based decision between broadcast join and join with full repartitioning (both inputs)
- ExchangeNode now distinguishes between its input and output row descriptor: the output potentially contains more tuples
- fixed a problem related to cancellation and concurrent hash table builds.
Not included:
- partitioned joins that take advantage of existing partitions of the inputs; those will have to wait for a follow-on change
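The cost-based choice between the two join strategies can be sketched with a simple model; the actual Planner formulas may differ, so treat this as an assumption: broadcasting ships the entire right-hand (build) input to every node, while full repartitioning ships each input across the network once.

```cpp
#include <cstdint>

enum class JoinStrategy { BROADCAST, PARTITIONED };

// Hedged sketch of the cost comparison: pick whichever strategy moves
// fewer bytes over the network. Sizes come from the cardinality and
// avgRowSize estimates mentioned above.
JoinStrategy ChooseJoinStrategy(int64_t lhs_bytes, int64_t rhs_bytes,
                                int num_nodes) {
  // Broadcast: a full copy of the rhs is sent to each of the num_nodes nodes.
  int64_t broadcast_cost = rhs_bytes * num_nodes;
  // Full repartitioning: both inputs are hash-partitioned across the network.
  int64_t partition_cost = lhs_bytes + rhs_bytes;
  return broadcast_cost <= partition_cost ? JoinStrategy::BROADCAST
                                          : JoinStrategy::PARTITIONED;
}
```

With a small build side, broadcasting wins even at high node counts; once the build side approaches the probe side in size, repartitioning becomes cheaper.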
* API simplified to deal only with 'topics', not services and objects
* Scalability improved: heartbeat loop is now multi-threaded
* State-store can store arbitrary objects
* State-store may send either deltas or complete topic state (delta computation to come)
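The topic model above can be sketched as a versioned key/value map; a subscriber that last saw version V receives either the complete topic state or only the entries newer than V (the delta). The types and names here are illustrative assumptions, not the state-store's actual API, and the source notes that the real delta computation is still to come.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative topic entry: an opaque value plus the topic version at which
// it was last updated.
struct TopicEntry {
  std::string value;
  int64_t version;
};

using Topic = std::map<std::string, TopicEntry>;

// Sketch of delta computation: return only the entries updated after the
// subscriber's last seen version. Sending the whole map instead would be
// the "complete topic state" case.
std::vector<std::pair<std::string, TopicEntry>> ComputeDelta(
    const Topic& topic, int64_t last_seen_version) {
  std::vector<std::pair<std::string, TopicEntry>> delta;
  for (const auto& kv : topic) {
    if (kv.second.version > last_seen_version) delta.push_back(kv);
  }
  return delta;
}
```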