Mirror of https://github.com/apache/impala.git, synced 2025-12-22 03:18:15 -05:00
7273cfdfb901b9ef564c2737cf00c7a8abb57f07
10 Commits

374783c55e
IMPALA-10898: Add runtime IN-list filters for ORC tables
ORC files have optional bloom filter indexes for each column. Since ORC-1.7.0, the C++ reader supports pushing down predicates to skip unrelated RowGroups. The pushed-down predicates are evaluated against file indexes (i.e. statistics and bloom filter indexes). Note that only EQUALS and IN-list predicates can leverage bloom filter indexes.

Currently Impala has two kinds of runtime filters: bloom filters and min-max filters. Unfortunately, they can't be converted into EQUALS or IN-list predicates, so they can't leverage the file-level bloom filter indexes. This patch adds runtime IN-list filters for this purpose.

Currently they are generated for the build side of a broadcast join. They are only applied on ORC tables and are pushed down to the ORC reader (i.e. the ORC lib). To avoid exploding the IN-list, if the number of distinct values on the build side exceeds a threshold (default 1024), we set the filter to ALWAYS_TRUE and clear its entries. The threshold can be configured by a new query option, RUNTIME_IN_LIST_FILTER_ENTRY_LIMIT.

Evaluating runtime IN-list filters is much slower than evaluating runtime bloom filters due to the current simple implementation (i.e. std::unordered_set) and the lack of codegen, so we disable them at the row level.

For visibility, this patch adds two counters to the HdfsScanNode:
- NumPushedDownPredicates
- NumPushedDownRuntimeFilters
They reflect the predicates and runtime filters that are pushed down to the ORC reader.

Currently, runtime IN-list filters are disabled by default. This patch extends the query option ENABLED_RUNTIME_FILTER_TYPES to support a comma-separated list of filter types. It defaults to "BLOOM,MIN_MAX". Add "IN_LIST" to it to enable runtime IN-list filters.

Ran perf tests on a 3-instance cluster on my desktop using TPC-DS with scale factor 20. It shows significant improvements in some queries:

| Workload | Query | File Format | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TPCDS(20) | TPCDS-Q67A | orc / snap / block | 35.07 | 44.01 | I -20.32% | 0.38% | 1.38% | 10 | I -25.69% | -3.58 | -45.33 |
| TPCDS(20) | TPCDS-Q37 | orc / snap / block | 1.08 | 1.45 | I -25.23% | 7.14% | 3.09% | 10 | I -34.09% | -3.58 | -12.94 |
| TPCDS(20) | TPCDS-Q70A | orc / snap / block | 6.30 | 8.60 | I -26.81% | 5.24% | 4.21% | 10 | I -36.67% | -3.58 | -14.88 |
| TPCDS(20) | TPCDS-Q16 | orc / snap / block | 1.33 | 1.85 | I -28.28% | 4.98% | 5.92% | 10 | I -39.38% | -3.58 | -12.93 |
| TPCDS(20) | TPCDS-Q18A | orc / snap / block | 5.70 | 8.06 | I -29.25% | 3.00% | 4.12% | 10 | I -40.30% | -3.58 | -19.95 |
| TPCDS(20) | TPCDS-Q22A | orc / snap / block | 2.01 | 2.97 | I -32.21% | 6.12% | 5.94% | 10 | I -47.68% | -3.58 | -14.05 |
| TPCDS(20) | TPCDS-Q77A | orc / snap / block | 8.49 | 12.44 | I -31.75% | 6.44% | 3.96% | 10 | I -49.71% | -3.58 | -16.97 |
| TPCDS(20) | TPCDS-Q75 | orc / snap / block | 7.76 | 12.27 | I -36.76% | 5.01% | 3.87% | 10 | I -59.56% | -3.58 | -23.26 |
| TPCDS(20) | TPCDS-Q21 | orc / snap / block | 0.71 | 1.27 | I -44.26% | 4.56% | 4.24% | 10 | I -77.31% | -3.58 | -28.31 |
| TPCDS(20) | TPCDS-Q80A | orc / snap / block | 9.24 | 20.42 | I -54.77% | 4.03% | 3.82% | 10 | I -123.12% | -3.58 | -40.90 |
| TPCDS(20) | TPCDS-Q39-1 | orc / snap / block | 1.07 | 2.26 | I -52.74% | * 23.83% * | 2.60% | 10 | I -149.68% | -3.58 | -14.43 |
| TPCDS(20) | TPCDS-Q39-2 | orc / snap / block | 1.00 | 2.33 | I -56.95% | * 19.53% * | 2.07% | 10 | I -151.89% | -3.58 | -20.81 |

"Base Avg" is the average of the original time. "Avg" is the current time.

However, we also see some regressions due to the suboptimal implementation. Follow-up JIRAs will focus on improvements:
- IMPALA-11140: Codegen InListFilter::Insert() and InListFilter::Find()
- IMPALA-11141: Use exact data types in IN-list filters instead of casting data to a set of int64_t or a set of string.
- IMPALA-11142: Consider IN-list filters in partitioned joins.

Tests:
- Test IN-list filter on string, date and all integer types
- Test IN-list filter with NULL
- Test IN-list filter on complex expr targets

Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Reviewed-on: http://gerrit.cloudera.org:8080/18141
Reviewed-by: Qifan Chen <qchen@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
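To make the entry-limit behaviour above concrete, here is a minimal C++ sketch of the idea, using std::unordered_set as the commit describes. The class and method names are hypothetical stand-ins, not Impala's actual InListFilter:

```cpp
// Minimal sketch of a runtime IN-list filter with an entry limit.
// Hypothetical code for illustration only; not Impala's InListFilter.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <unordered_set>

class InListFilterSketch {
 public:
  explicit InListFilterSketch(size_t entry_limit) : entry_limit_(entry_limit) {}

  // Insert a build-side value. If the number of distinct values exceeds the
  // limit, give up and make the filter pass everything (ALWAYS_TRUE).
  void Insert(int64_t v) {
    if (always_true_) return;
    values_.insert(v);
    if (values_.size() > entry_limit_) {
      always_true_ = true;
      values_.clear();  // drop the entries, as the commit message describes
    }
  }

  // Probe-side check; an ALWAYS_TRUE filter rejects nothing.
  bool Find(int64_t v) const { return always_true_ || values_.count(v) > 0; }

  bool always_true() const { return always_true_; }

 private:
  size_t entry_limit_;
  bool always_true_ = false;
  std::unordered_set<int64_t> values_;  // simple implementation, as noted above
};

int main() {
  InListFilterSketch f(/*entry_limit=*/1024);  // RUNTIME_IN_LIST_FILTER_ENTRY_LIMIT default
  for (int64_t v = 0; v < 100; ++v) f.Insert(v);
  std::cout << f.Find(42) << " " << f.Find(4242) << "\n";  // 1 0
  for (int64_t v = 0; v < 2000; ++v) f.Insert(v);          // exceed the limit
  std::cout << f.always_true() << " " << f.Find(4242) << "\n";  // 1 1
  return 0;
}
```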

f38da0df8e
IMPALA-4400: aggregate runtime filters locally
Move RuntimeFilterBank to QueryState. Implement fine-grained locking for each filter to mitigate any increased lock contention from the change.

Make RuntimeFilterBank handle multiple producers of the same filter, e.g. multiple instances of a partitioned join. It computes the expected number of filters upfront, then sends the filter to the coordinator once all the local instances have been merged together. The merging can be done in parallel locally to improve the latency of filter propagation.

Add Or() methods to MinMaxFilter and BloomFilter, since we now need to merge those, not just the Thrift versions.

Update coordinator filter routing to expect only one instance of a filter from each producer backend and to only send one instance to each consumer backend (instead of sending one per fragment).

Update memory reservations and estimates to be lower to account for sharing of filters between fragment instances. mt_dop plans are modified to show these shared and non-shared resources separately. Enable waiting for runtime filters for the Kudu scanner with mt_dop. Make min/max filters const-correct.

Testing:
* Added unit tests for Or() methods.
* Added some additional e2e test coverage for mt_dop queries.
* Updated planner tests with new estimates and reservations.
* Ran a single-node 3-impalad stress test with TPC-H kudu and TPC-DS parquet.
* Ran exhaustive tests.
* Ran core tests with ASAN.

Perf:
* Did a single-node perf run on TPC-H with default settings. No perf change.
* A single-node perf run with mt_dop=8 showed significant speedups:

| Workload | File Format | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
|---|---|---|---|---|---|
| TPCH(30) | parquet / none / none | 10.14 | -7.29% | 5.05 | -11.68% |

| Workload | Query | File Format | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TPCH(30) | TPCH-Q7 | parquet / none / none | 38.87 | 38.44 | +1.13% | 7.17% | * 10.92% * | 20 | +0.72% | 0.72 | 0.39 |
| TPCH(30) | TPCH-Q1 | parquet / none / none | 4.28 | 4.26 | +0.50% | 1.92% | 1.09% | 20 | +0.03% | 0.31 | 1.01 |
| TPCH(30) | TPCH-Q22 | parquet / none / none | 2.32 | 2.32 | +0.05% | 2.01% | 1.89% | 20 | -0.03% | -0.36 | 0.08 |
| TPCH(30) | TPCH-Q15 | parquet / none / none | 3.73 | 3.75 | -0.42% | 0.84% | 1.05% | 20 | -0.25% | -0.77 | -1.40 |
| TPCH(30) | TPCH-Q13 | parquet / none / none | 9.80 | 9.83 | -0.38% | 0.51% | 0.80% | 20 | -0.32% | -1.30 | -1.81 |
| TPCH(30) | TPCH-Q2 | parquet / none / none | 1.98 | 2.00 | -1.32% | 1.74% | 2.81% | 20 | -0.64% | -1.71 | -1.79 |
| TPCH(30) | TPCH-Q6 | parquet / none / none | 1.22 | 1.25 | -2.14% | 2.66% | 4.15% | 20 | -0.96% | -2.00 | -1.95 |
| TPCH(30) | TPCH-Q19 | parquet / none / none | 5.13 | 5.22 | -1.65% | 1.20% | 1.40% | 20 | -1.76% | -3.34 | -4.02 |
| TPCH(30) | TPCH-Q16 | parquet / none / none | 2.46 | 2.56 | -4.13% | 2.49% | 1.99% | 20 | -4.31% | -4.04 | -5.94 |
| TPCH(30) | TPCH-Q9 | parquet / none / none | 81.63 | 85.07 | -4.05% | 4.94% | 3.06% | 20 | -5.46% | -3.28 | -3.21 |
| TPCH(30) | TPCH-Q10 | parquet / none / none | 5.07 | 5.50 | I -7.92% | 0.96% | 1.33% | 20 | I -8.51% | -5.27 | -22.14 |
| TPCH(30) | TPCH-Q21 | parquet / none / none | 24.00 | 26.24 | I -8.57% | 0.46% | 0.38% | 20 | I -9.34% | -5.27 | -67.47 |
| TPCH(30) | TPCH-Q18 | parquet / none / none | 8.66 | 9.50 | I -8.86% | 0.62% | 0.44% | 20 | I -9.75% | -5.27 | -55.17 |
| TPCH(30) | TPCH-Q3 | parquet / none / none | 6.01 | 6.70 | I -10.19% | 1.01% | 0.90% | 20 | I -11.25% | -5.27 | -35.76 |
| TPCH(30) | TPCH-Q12 | parquet / none / none | 2.98 | 3.39 | I -12.23% | 1.48% | 1.48% | 20 | I -13.56% | -5.27 | -27.75 |
| TPCH(30) | TPCH-Q11 | parquet / none / none | 1.69 | 2.00 | I -15.55% | 1.63% | 1.47% | 20 | I -18.09% | -5.27 | -34.60 |
| TPCH(30) | TPCH-Q4 | parquet / none / none | 2.42 | 2.87 | I -15.69% | 1.48% | 1.26% | 20 | I -18.61% | -5.27 | -39.50 |
| TPCH(30) | TPCH-Q14 | parquet / none / none | 4.64 | 6.27 | I -26.02% | 1.35% | 0.73% | 20 | I -35.37% | -5.27 | -94.07 |
| TPCH(30) | TPCH-Q20 | parquet / none / none | 3.19 | 4.37 | I -27.01% | 1.54% | 0.99% | 20 | I -36.85% | -5.27 | -80.74 |
| TPCH(30) | TPCH-Q5 | parquet / none / none | 4.57 | 6.39 | I -28.36% | 1.04% | 0.75% | 20 | I -39.56% | -5.27 | -120.02 |
| TPCH(30) | TPCH-Q17 | parquet / none / none | 3.15 | 4.71 | I -33.06% | 1.59% | 1.31% | 20 | I -49.43% | -5.27 | -87.64 |
| TPCH(30) | TPCH-Q8 | parquet / none / none | 5.25 | 7.95 | I -33.95% | 0.95% | 0.53% | 20 | I -51.11% | -5.27 | -185.02 |

Change-Id: Iabeeab5eec869ff2197250ad41c1eb5551704acc
Reviewed-on: http://gerrit.cloudera.org:8080/14538
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
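The "merge locally, send once" flow above can be sketched in a few lines. The following is a hypothetical illustration (not Impala's RuntimeFilterBank), assuming all bloom filters for a given filter ID are built with identical parameters so that merging is a bitwise OR:

```cpp
// Sketch of local runtime-filter aggregation: each fragment instance produces
// a filter, and the last producer to arrive triggers a single update to the
// coordinator. Hypothetical illustration only; not Impala's RuntimeFilterBank.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <vector>

struct BloomFilterSketch {
  std::vector<uint64_t> directory;  // bit directory
  // Merging two bloom filters built with identical parameters is a bitwise OR.
  void Or(const BloomFilterSketch& other) {
    for (size_t i = 0; i < directory.size(); ++i) directory[i] |= other.directory[i];
  }
};

class LocalFilterAggregator {
 public:
  LocalFilterAggregator(int expected_producers, size_t words)
      : pending_(expected_producers) {
    merged_.directory.assign(words, 0);
  }

  // Called once per producing fragment instance; per-filter (fine-grained) lock.
  void AddProducedFilter(const BloomFilterSketch& f) {
    bool send = false;
    {
      std::lock_guard<std::mutex> l(lock_);
      merged_.Or(f);
      send = (--pending_ == 0);
    }
    if (send) SendToCoordinator();  // only one update per backend
  }

 private:
  void SendToCoordinator() { std::puts("sending merged filter to coordinator"); }

  std::mutex lock_;
  int pending_;
  BloomFilterSketch merged_;
};

int main() {
  LocalFilterAggregator agg(/*expected_producers=*/3, /*words=*/4);
  BloomFilterSketch f{{1, 0, 0, 8}};
  agg.AddProducedFilter(f);
  agg.AddProducedFilter(f);
  agg.AddProducedFilter(f);  // third producer triggers the send
  return 0;
}
```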

702e6c4fa8
IMPALA-7984: Port runtime filter from Thrift RPC to KRPC
Previously, the aggregation and propagation of a runtime filter in Impala was implemented using Thrift RPC, which suffers from the disadvantage that the number of connections in a cluster grows with both the number of queries and the cluster size. This patch ports the functions that implement the aggregation and propagation of a runtime filter, i.e. UpdateFilter() and PublishFilter() respectively, to KRPC, which requires only one connection per direction between every pair of hosts, thus reducing the number of connections in a cluster.

In addition, this patch also incorporates a KRPC sidecar when the runtime filter is a Bloom filter. The KRPC sidecar eliminates the need for an extra copy of the Bloom filter contents when a Bloom filter is serialized for transmission, and hence reduces the serialization overhead. Due to the incorporation of the KRPC sidecar, a SpinLock is also added to prevent a BloomFilter from being deallocated before its associated KRPC call finishes.

Two related BE tests, bloom-filter-test.cc and bloom-filter-benchmark.cc, are also modified accordingly because of the changes to the signatures of some functions in BloomFilter.

Testing: This patch has passed the exhaustive tests.

Change-Id: I11a2f92a91750c2470fba082c30f97529524b9c8
Reviewed-on: http://gerrit.cloudera.org:8080/13882
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-on: http://gerrit.cloudera.org:8080/14974
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
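The reason the commit needs a lock around the Bloom filter's buffer can be shown with a small, hypothetical sketch. It does not use the real Kudu RPC API, and it uses std::mutex where Impala uses its own SpinLock: a buffer handed to an asynchronous, zero-copy send (as a sidecar would be) must not be deallocated until the completion callback has run.

```cpp
// Sketch of the lifetime issue the SpinLock addresses: a buffer lent to an
// in-flight asynchronous RPC must stay alive until that RPC completes.
// Hypothetical illustration; not Impala's or Kudu RPC's actual API.
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class BloomFilterBuffer {
 public:
  explicit BloomFilterBuffer(size_t bytes) : data_(bytes) {}

  // Start an "RPC" that reads data_ without copying it. Mark the buffer busy.
  void SendAsync(std::function<void(const uint8_t*, size_t)> rpc) {
    {
      std::lock_guard<std::mutex> l(lock_);
      in_flight_ = true;
    }
    std::thread([this, rpc] {
      rpc(data_.data(), data_.size());  // zero-copy read, like a KRPC sidecar
      std::lock_guard<std::mutex> l(lock_);
      in_flight_ = false;               // completion callback
      cv_.notify_all();
    }).detach();
  }

  // Deallocation must wait until no RPC still references the buffer.
  void Close() {
    std::unique_lock<std::mutex> l(lock_);
    cv_.wait(l, [this] { return !in_flight_; });
    data_.clear();
  }

 private:
  std::mutex lock_;  // Impala uses a SpinLock here; std::mutex in this sketch
  std::condition_variable cv_;
  bool in_flight_ = false;
  std::vector<uint8_t> data_;
};

int main() {
  BloomFilterBuffer buf(1 << 20);
  buf.SendAsync([](const uint8_t*, size_t) { /* pretend to transmit the bytes */ });
  buf.Close();  // safe: blocks until the in-flight send has completed
  return 0;
}
```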

e716e76ccc
IMPALA-9154: Revert "IMPALA-7984: Port runtime filter from Thrift RPC to KRPC"
The previous patch porting the runtime filter code from Thrift RPC to KRPC
introduces a deadlock if there is a very limited number of threads on the
Impala cluster.

Specifically, in that patch a Coordinator used a synchronous KRPC to
propagate an aggregated filter to other hosts. A deadlock can happen if
there is no thread available on the receiving side to answer that KRPC,
especially when the calling and receiving threads come from the same
thread pool. One possible way to address this issue is to make the call
that propagates a runtime filter asynchronous, freeing the calling
thread. Until that issue is resolved, we revert this patch for now.
This reverts commit ec11c18884.
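To make the failure mode concrete, here is a tiny, self-contained illustration (hypothetical code, not Impala's): every thread of a small pool issues a blocking call that only another thread of the same pool could answer, so once the pool is saturated nothing can make progress. A timed wait is used so the demo terminates instead of hanging.

```cpp
// Toy illustration of the deadlock described above: all handler threads of a
// fixed-size pool block on a synchronous call whose reply could only be
// produced by another thread from the same pool.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
  const int kPoolSize = 2;  // a very limited number of service threads
  std::mutex mu;
  std::condition_variable cv;
  int replies = 0;          // no thread is ever free to produce a reply

  auto handler = [&](int id) {
    // Synchronous "PublishFilter": wait for a reply that would have to be
    // produced by some other thread of this same pool.
    std::unique_lock<std::mutex> l(mu);
    if (!cv.wait_for(l, std::chrono::seconds(1), [&] { return replies > 0; })) {
      std::cout << "thread " << id << " timed out: would deadlock\n";
    }
  };

  std::vector<std::thread> pool;
  for (int i = 0; i < kPoolSize; ++i) pool.emplace_back(handler, i);
  for (auto& t : pool) t.join();
  return 0;
}
```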

ec11c18884
IMPALA-7984: Port runtime filter from Thrift RPC to KRPC
Previously, the aggregation and propagation of a runtime filter in Impala was implemented using Thrift RPC, which suffers from the disadvantage that the number of connections in a cluster grows with both the number of queries and the cluster size. This patch ports the functions that implement the aggregation and propagation of a runtime filter, i.e. UpdateFilter() and PublishFilter() respectively, to KRPC, which requires only one connection per direction between every pair of hosts, thus reducing the number of connections in a cluster.

In addition, this patch also incorporates a KRPC sidecar when the runtime filter is a Bloom filter. The KRPC sidecar eliminates the need for an extra copy of the Bloom filter contents when a Bloom filter is serialized for transmission, and hence reduces the serialization overhead. Due to the incorporation of the KRPC sidecar, a SpinLock is also added to prevent a BloomFilter from being deallocated before its associated KRPC call finishes.

Two related BE tests, bloom-filter-test.cc and bloom-filter-benchmark.cc, are also modified accordingly because of the changes to the signatures of some functions in BloomFilter.

Testing: This patch has passed the exhaustive tests.

Change-Id: I6b394796d250286510e157ae326882bfc01d387a
Reviewed-on: http://gerrit.cloudera.org:8080/13882
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
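As a rough, illustrative calculation of the connection-count argument (the 100-host figure is an assumption, not from the commit): with one connection per direction between every pair of hosts, a 100-node cluster needs at most 100 × 99 = 9,900 backend connections for this traffic regardless of how many queries run concurrently, whereas a scheme whose connection count also grows with the number of concurrent queries keeps adding connections on top of that.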

5391100c7e
IMPALA-7213, IMPALA-7241: Port ReportExecStatus() RPC to use KRPC
This change converts the ReportExecStatus() RPC from Thrift-based RPC to KRPC. This is done as part of the preparation for fixing IMPALA-2990, as we can take advantage of TCP connection multiplexing in KRPC to avoid overwhelming the coordinator with too many connections, reducing the number of TCP connections to one per executor.

This patch also introduces a new service pool for all query execution control related RPCs in the future, so that control commands from coordinators aren't blocked by long-running DataStream service RPCs. To avoid unnecessary delays due to sharing network connections between the DataStream service and the Control service, this change adds the service name as part of the user credentials for the ConnectionId, so each service uses a separate connection.

The majority of this patch is a mechanical conversion of some Thrift structures used in the ReportExecStatus() RPC to Protobuf. Note that the runtime profile is still retained as a Thrift structure, as Impala clients will still fetch query profiles using Thrift RPCs. This also avoids duplicating the serialization implementation in both Thrift and Protobuf for the runtime profile. The Thrift runtime profiles are serialized and sent as a sidecar in the ReportExecStatus() RPC.

This patch also fixes IMPALA-7241, which may lead to duplicated dml stats being applied. The fix adds a monotonically increasing version number to fragment instances' reports. The coordinator will ignore any report whose version is smaller than or equal to the version of the last report.

Testing done:
1. Exhaustive build.
2. Added some targeted test cases for profile serialization failure and RPC retries/timeout.

Change-Id: I7638583b433dcac066b87198e448743d90415ebe
Reviewed-on: http://gerrit.cloudera.org:8080/10855
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
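The duplicate-DML-stats guard is easy to illustrate. Below is a minimal, hypothetical sketch of the version check (not Impala's actual Coordinator code):

```cpp
// Sketch of the duplicate-report guard described above: each fragment instance
// stamps its status reports with a monotonically increasing version, and the
// coordinator ignores any report whose version is not newer than the last one
// it applied for that instance.
#include <cstdint>
#include <iostream>
#include <unordered_map>

struct StatusReport {
  int64_t instance_id;
  int64_t version;        // monotonically increasing per fragment instance
  int64_t rows_modified;  // stand-in for DML stats carried by the report
};

class CoordinatorSketch {
 public:
  // Returns true if the report was applied, false if it was stale/duplicated.
  bool ApplyReport(const StatusReport& r) {
    int64_t& last = last_applied_version_[r.instance_id];  // defaults to 0
    if (r.version <= last) return false;  // already applied: skip the DML stats
    last = r.version;
    total_rows_modified_ += r.rows_modified;
    return true;
  }

  int64_t total_rows_modified() const { return total_rows_modified_; }

 private:
  std::unordered_map<int64_t, int64_t> last_applied_version_;
  int64_t total_rows_modified_ = 0;
};

int main() {
  CoordinatorSketch coord;
  coord.ApplyReport({/*instance_id=*/1, /*version=*/1, /*rows_modified=*/10});
  coord.ApplyReport({1, 1, 10});  // retried duplicate: ignored
  coord.ApplyReport({1, 2, 5});
  std::cout << coord.total_rows_modified() << "\n";  // 15, not 25
  return 0;
}
```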

5c541b9604
Add missing authorization in KRPC
In 2.12.0, Impala adopted the Kudu RPC library for certain backend services (TransmitData(), EndDataStream()). While the implementation uses Kerberos for authenticating users connecting to the backend services, there is no authorization implemented. This is a regression from the Thrift-based implementation, because that implementation registered a SASL callback (SaslAuthorizeInternal) to be invoked during connection negotiation. With this regression, an unauthorized but authenticated user may invoke RPC calls to Impala backend services.

This change fixes the issue above by overriding the default authorization method for the DataStreamService. The authorization method only lets an authenticated principal that matches FLAGS_principal / FLAGS_be_principal access the service. Also added a new startup flag, --krb5_ccname, to allow users to customize the location of the Kerberos credentials cache.

Testing done:
1. Added a new test case in rpc-mgr-kerberized-test.cc to confirm an unauthorized user is not allowed to access the service.
2. Ran some queries in a Kerberos-enabled cluster to make sure there is no error.
3. Exhaustive builds.

Thanks to Todd Lipcon for pointing out the problem and his guidance on the fix.

Change-Id: I2f82dee5e721f2ed23e75fd91abbc6ab7addd4c5
Reviewed-on: http://gerrit.cloudera.org:8080/11331
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
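The authorization rule can be sketched as follows. The types and the Authorize() hook below are hypothetical stand-ins for illustration; the real change hooks into the Kudu RPC service's authorization method, and its exact matching logic may differ:

```cpp
// Sketch of the rule described above: an authenticated caller is only
// authorized if its Kerberos principal matches the service's configured
// principal (FLAGS_principal / FLAGS_be_principal).
#include <iostream>
#include <string>
#include <utility>

struct AuthenticatedCall {
  std::string remote_principal;  // e.g. "impala/host1.example.com@EXAMPLE.COM"
};

class DataStreamServiceSketch {
 public:
  explicit DataStreamServiceSketch(std::string expected_principal)
      : expected_principal_(std::move(expected_principal)) {}

  // Override of the default "authenticated implies authorized" behaviour:
  // reject callers whose principal does not match the expected one.
  bool Authorize(const AuthenticatedCall& call) const {
    return call.remote_principal == expected_principal_;
  }

 private:
  std::string expected_principal_;
};

int main() {
  DataStreamServiceSketch svc("impala/host1.example.com@EXAMPLE.COM");
  std::cout << svc.Authorize({"impala/host1.example.com@EXAMPLE.COM"}) << "\n";  // 1
  std::cout << svc.Authorize({"mallory@EXAMPLE.COM"}) << "\n";                   // 0
  return 0;
}
```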

421af4e40a
IMPALA-6685: Improve profiles in KrpcDataStreamRecvr and KrpcDataStreamSender
This change implements a couple of improvements to the profiles of
KrpcDataStreamRecvr and KrpcDataStreamSender:
- track pending number of deferred row batches over time in KrpcDataStreamRecvr
- track the number of bytes dequeued over time in KrpcDataStreamRecvr
- track the total time the deferred RPC queues are not empty
- track the number of bytes sent from KrpcDataStreamSender over time
- track the total amount of time spent in KrpcDataStreamSender, including time
spent waiting for RPC completion.
Sample profile of an Exchange node instance:
EXCHANGE_NODE (id=21):(Total: 2s284ms, non-child: 64.926ms, % non-child: 2.84%)
- ConvertRowBatchTime: 44.380ms
- PeakMemoryUsage: 124.04 KB (127021)
- RowsReturned: 287.51K (287514)
- RowsReturnedRate: 125.88 K/sec
Buffer pool:
- AllocTime: 1.109ms
- CumulativeAllocationBytes: 10.96 MB (11493376)
- CumulativeAllocations: 562 (562)
- PeakReservation: 112.00 KB (114688)
- PeakUnpinnedBytes: 0
- PeakUsedReservation: 112.00 KB (114688)
- ReadIoBytes: 0
- ReadIoOps: 0 (0)
- ReadIoWaitTime: 0.000ns
- WriteIoBytes: 0
- WriteIoOps: 0 (0)
- WriteIoWaitTime: 0.000ns
Dequeue:
BytesDequeued(500.000ms): 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 700.00 KB, 2.00 MB, 3.49 MB, 4.39 MB, 5.86 MB, 6.85 MB
- FirstBatchWaitTime: 0.000ns
- TotalBytesDequeued: 6.85 MB (7187850)
- TotalGetBatchTime: 2s237ms
- DataWaitTime: 2s219ms
Enqueue:
BytesReceived(500.000ms): 0, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 328.73 KB, 963.79 KB, 1.64 MB, 2.09 MB, 2.76 MB, 3.23 MB
DeferredQueueSize(500.000ms): 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0
- DispatchTime: (Avg: 108.593us ; Min: 30.525us ; Max: 1.524ms ; Number of samples: 281)
- DeserializeRowBatchTime: 8.395ms
- TotalBatchesEnqueued: 281 (281)
- TotalBatchesReceived: 281 (281)
- TotalBytesReceived: 3.23 MB (3387144)
- TotalEarlySenders: 0 (0)
- TotalEosReceived: 1 (1)
- TotalHasDeferredRPCsTime: 15s446ms
- TotalRPCsDeferred: 38 (38)
Sample sender's profile:
KrpcDataStreamSender (dst_id=21):(Total: 17s923ms, non-child: 604.494ms, % non-child: 3.37%)
BytesSent(500.000ms): 0, 0, 0, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 46.54 KB, 46.54 KB, 46.54 KB, 58.31 KB, 58.31 KB, 58.31 KB, 58.31 KB, 58.31 KB, 58.31 KB, 58.31 KB, 974.44 KB, 2.82 MB, 4.93 MB, 6.27 MB, 8.28 MB, 9.69 MB
- EosSent: 3 (3)
- NetworkThroughput: 4.61 MB/sec
- PeakMemoryUsage: 22.57 KB (23112)
- RowsSent: 287.51K (287514)
- RpcFailure: 0 (0)
- RpcRetry: 0 (0)
- SerializeBatchTime: 329.162ms
- TotalBytesSent: 9.69 MB (10161432)
- UncompressedRowBatchSize: 20.56 MB (21563550)
Change-Id: I8ba405921b3df920c1e85b940ce9c8d02fc647cd
Reviewed-on: http://gerrit.cloudera.org:8080/9690
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
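The time-series lines in the profile above (e.g. `BytesDequeued(500.000ms): ...`) come from counters whose cumulative value is sampled at a fixed period. A minimal, hypothetical sketch of such a counter (not Impala's RuntimeProfile classes):

```cpp
// Sketch of a periodically sampled counter: a cumulative byte count that is
// recorded into a list of samples once per period (500ms in the profile above).
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

class SampledCounter {
 public:
  void Add(int64_t bytes) { value_.fetch_add(bytes, std::memory_order_relaxed); }

  // Called once per sample period (e.g. every 500ms) by a maintenance thread.
  void TakeSample() { samples_.push_back(value_.load(std::memory_order_relaxed)); }

  const std::vector<int64_t>& samples() const { return samples_; }

 private:
  std::atomic<int64_t> value_{0};
  std::vector<int64_t> samples_;
};

int main() {
  SampledCounter bytes_dequeued;
  // In Impala the sampling is driven by a periodic timer thread; here we just
  // interleave "work" and samples by hand.
  for (int i = 0; i < 3; ++i) {
    bytes_dequeued.Add(700 * 1024);  // bytes dequeued during this period
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    bytes_dequeued.TakeSample();
  }
  for (int64_t s : bytes_dequeued.samples()) std::cout << s << " ";
  std::cout << "\n";  // 716800 1433600 2150400
  return 0;
}
```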

3bfda33487
IMPALA-6193: Track memory of incoming data streams
This change adds memory tracking to incoming transmit data RPCs when using KRPC. We track memory against a global tracker called "Data Stream Service" until it is handed over to the stream manager. There we track it in a global tracker called "Data Stream Queued RPC Calls" until a receiver registers and takes over the early sender RPCs. Inside the receiver, memory for deferred RPCs is tracked against the fragment instance's memtracker until we unpack the batches and add them to the row batch queue.

The DCHECK in MemTracker::Close() covers that all memory consumed by a tracker gets released eventually. In addition to that, this change adds a custom cluster test that makes sure that queued memory gets tracked, by inspecting the peak consumption of the new memtrackers.

Change-Id: I2df1204d2483313a8a18e5e3be6cec9e402614c4
Reviewed-on: http://gerrit.cloudera.org:8080/8914
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
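The hand-over chain above (service tracker, then queued-RPC tracker, then the fragment instance's memtracker) boils down to accounting each byte against exactly one tracker at a time and transferring the accounting when ownership moves. A minimal, hypothetical sketch (not Impala's MemTracker):

```cpp
// Sketch of the hand-over pattern described above: moving an RPC between
// stages releases from the old tracker and consumes against the new one.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <utility>

class TrackerSketch {
 public:
  explicit TrackerSketch(std::string name) : name_(std::move(name)) {}
  void Consume(int64_t bytes) {
    consumption_ += bytes;
    peak_ = std::max(peak_, consumption_);
  }
  void Release(int64_t bytes) { consumption_ -= bytes; }
  int64_t consumption() const { return consumption_; }
  int64_t peak() const { return peak_; }

 private:
  std::string name_;
  int64_t consumption_ = 0;
  int64_t peak_ = 0;
};

// Move accounting for `bytes` from one stage's tracker to the next.
void Transfer(TrackerSketch& from, TrackerSketch& to, int64_t bytes) {
  to.Consume(bytes);
  from.Release(bytes);
}

int main() {
  TrackerSketch service("Data Stream Service");
  TrackerSketch queued("Data Stream Queued RPC Calls");
  TrackerSketch instance("Fragment Instance");

  const int64_t kBatchBytes = 1 << 20;
  service.Consume(kBatchBytes);            // RPC arrives at the service
  Transfer(service, queued, kBatchBytes);  // queued as an early sender
  Transfer(queued, instance, kBatchBytes); // receiver registers, takes ownership
  instance.Release(kBatchBytes);           // batch unpacked into the row batch queue

  std::cout << service.consumption() << " " << queued.consumption() << " "
            << instance.consumption() << "\n";  // 0 0 0
  return 0;
}
```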

b4ea57a7e3
IMPALA-4856: Port data stream service to KRPC
This patch implements a new data stream service which utilizes KRPC.
Similar to the thrift RPC implementation, there are 3 major components
to the data stream services: KrpcDataStreamSender serializes and sends
row batches materialized by a fragment instance to a KrpcDataStreamRecvr.
KrpcDataStreamMgr is responsible for routing an incoming row batch to
the appropriate receiver. The data stream service runs on the port
FLAGS_krpc_port which is 29000 by default.
Unlike the implementation with thrift RPC, KRPC provides an asynchronous
interface for invoking remote methods. As a result, KrpcDataStreamSender
doesn't need to create a thread per connection. There is one connection
between two Impalad nodes for each direction (i.e. client and server).
Multiple queries can multiplex on the same connection for transmitting
row batches between two Impalad nodes. The asynchronous interface also
avoids the possibility that a thread is stuck in the RPC code for an
extended amount of time without checking for cancellation. A TransmitData()
call with KRPC is in essence a trio of RpcController, a serialized protobuf
request buffer and a protobuf response buffer. The call is invoked via a
DataStreamService proxy object. The serialized tuple offsets and row batches
are sent via "sidecars" in KRPC to avoid an extra copy into the serialized
request buffer.
Each impalad node creates a singleton DataStreamService object at start-up
time. All incoming calls are served by a service thread pool created as part
of DataStreamService. By default, the number of service threads equals the
number of logical cores. The service threads are shared across all queries so
the RPC handler should avoid blocking as much as possible. In thrift RPC
implementation, we make a thrift thread handling a TransmitData() RPC to block
for extended period of time when the receiver is not yet created when the call
arrives. In KRPC implementation, we store TransmitData() or EndDataStream()
requests which arrive before the receiver is ready in a per-receiver early
sender list stored in KrpcDataStreamMgr. These RPC calls will be processed
and responded to when the receiver is created or when timeout occurs.
Similarly, there is limited space in the sender queues in KrpcDataStreamRecvr.
If adding a row batch to a queue in KrpcDataStreamRecvr would cause the buffer
limit to be exceeded, the request is stashed in a queue for deferred processing.
The stashed RPC requests are not responded to until they are processed,
so as to exert back pressure on the senders. An alternative would be to reply with
an error, in which case the request / row batches would need to be sent again.
This may end up consuming more network bandwidth than the Thrift RPC
implementation. This change
adopts the behavior of allowing one stashed request per sender.
All rpc requests and responses are serialized using protobuf. The equivalent of
TRowBatch would be ProtoRowBatch which contains a serialized header about the
meta-data of the row batch and two Kudu Slice objects which contain pointers to
the actual data (i.e. tuple offsets and tuple data).
This patch is based on an abandoned patch by Henry Robinson.
TESTING
-------
* Builds {exhaustive/debug, core/release, asan} passed with FLAGS_use_krpc=true.
TO DO
-----
* Port some BE tests to KRPC services.
Change-Id: Ic0b8c1e50678da66ab1547d16530f88b323ed8c1
Reviewed-on: http://gerrit.cloudera.org:8080/8023
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
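The early-sender handling described above can be sketched as follows; the names are hypothetical stand-ins, not Impala's KrpcDataStreamMgr, and the timeout path is omitted:

```cpp
// Sketch of early-sender handling: TransmitData() calls that arrive before
// their receiver exists are stashed per receiver and only responded to once
// the receiver registers.
#include <cstdint>
#include <functional>
#include <iostream>
#include <unordered_map>
#include <utility>
#include <vector>

struct TransmitDataCtx {
  std::vector<uint8_t> serialized_batch;  // what a sidecar would carry
  std::function<void()> respond;          // completes the pending RPC
};

class StreamMgrSketch {
 public:
  // Called from an RPC handler; must not block the shared service threads.
  void AddData(int64_t recvr_id, TransmitDataCtx ctx) {
    auto it = receivers_.find(recvr_id);
    if (it == receivers_.end()) {
      early_senders_[recvr_id].push_back(std::move(ctx));  // defer, don't reply yet
      return;
    }
    it->second(ctx.serialized_batch);  // enqueue into the receiver
    ctx.respond();
  }

  // Called when the fragment instance creates its receiver.
  void RegisterRecvr(int64_t recvr_id,
                     std::function<void(const std::vector<uint8_t>&)> enqueue) {
    receivers_[recvr_id] = std::move(enqueue);
    for (auto& ctx : early_senders_[recvr_id]) {
      receivers_[recvr_id](ctx.serialized_batch);
      ctx.respond();  // replying releases back pressure on the sender
    }
    early_senders_.erase(recvr_id);
  }

 private:
  std::unordered_map<int64_t, std::function<void(const std::vector<uint8_t>&)>> receivers_;
  std::unordered_map<int64_t, std::vector<TransmitDataCtx>> early_senders_;
};

int main() {
  StreamMgrSketch mgr;
  mgr.AddData(42, {{1, 2, 3}, [] { std::cout << "responded to early sender\n"; }});
  mgr.RegisterRecvr(42, [](const std::vector<uint8_t>& b) {
    std::cout << "enqueued batch of " << b.size() << " bytes\n";
  });
  return 0;
}
```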