Commit Graph

10 Commits

stiga-huang
374783c55e IMPALA-10898: Add runtime IN-list filters for ORC tables
ORC files have optional bloom filter indexes for each column. Since
ORC-1.7.0, the C++ reader supports pushing down predicates to skip
unrelated RowGroups. The pushed-down predicates will be evaluated on
file indexes (i.e. statistics and bloom filter indexes). Note that only
EQUALS and IN-list predicates can leverage bloom filter indexes.

Currently Impala has two kinds of runtime filters: bloom filters and
min-max filters. Unfortunately they can't be converted into EQUALS or
IN-list predicates, so they can't leverage the file-level bloom filter
indexes.

This patch adds runtime IN-list filters for this purpose. Currently they
are generated for the build side of a broadcast join. They will only be
applied on ORC tables and pushed down to the ORC reader (i.e. the ORC
library). To avoid exploding the IN-list, if the number of distinct values
on the build side exceeds a threshold (default 1024), we set the filter to
ALWAYS_TRUE and clear its entries. The threshold can be configured by a
new query option, RUNTIME_IN_LIST_FILTER_ENTRY_LIMIT.
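
For illustration, here is a minimal sketch of the entry-limit behaviour
described above. The class and member names are made up for this sketch and
are not the actual InListFilter implementation:

    #include <cstdint>
    #include <unordered_set>

    // Hypothetical, simplified stand-in for the IN-list filter described above.
    class InListFilterSketch {
     public:
      // entry_limit corresponds to RUNTIME_IN_LIST_FILTER_ENTRY_LIMIT (default 1024).
      explicit InListFilterSketch(size_t entry_limit) : entry_limit_(entry_limit) {}

      void Insert(int64_t val) {
        if (always_true_) return;
        values_.insert(val);
        if (values_.size() > entry_limit_) {
          // Too many distinct build-side values: degrade to ALWAYS_TRUE and drop
          // the entries so the filter is effectively a no-op.
          always_true_ = true;
          values_.clear();
        }
      }

      bool Find(int64_t val) const {
        return always_true_ || values_.count(val) > 0;
      }

     private:
      const size_t entry_limit_;
      bool always_true_ = false;
      std::unordered_set<int64_t> values_;
    };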

Evaluating runtime IN-list filters is much slower than evaluating
runtime bloom filters due to the current simple implementation (i.e.
std::unordered_set) and the lack of codegen, so we disable them at the
row level.

For visibility, this patch adds two counters to the HdfsScanNode:
 - NumPushedDownPredicates
 - NumPushedDownRuntimeFilters
They reflect the predicates and runtime filters that are pushed down to
the ORC reader.

Currently, runtime IN-list filters are disabled by default. This patch
extends the query option ENABLED_RUNTIME_FILTER_TYPES to accept a
comma-separated list of filter types. It defaults to "BLOOM,MIN_MAX".
Add "IN_LIST" to it to enable runtime IN-list filters.

Ran perf tests on a 3-instance cluster on my desktop using TPC-DS with
scale factor 20. They show significant improvements in some queries:

+-----------+-------------+--------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+
| Workload  | Query       | File Format        | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval   |
+-----------+-------------+--------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+
| TPCDS(20) | TPCDS-Q67A  | orc / snap / block | 35.07  | 44.01       | I -20.32%  |   0.38%    |   1.38%        | 10    | I -25.69%      | -3.58   | -45.33 |
| TPCDS(20) | TPCDS-Q37   | orc / snap / block | 1.08   | 1.45        | I -25.23%  |   7.14%    |   3.09%        | 10    | I -34.09%      | -3.58   | -12.94 |
| TPCDS(20) | TPCDS-Q70A  | orc / snap / block | 6.30   | 8.60        | I -26.81%  |   5.24%    |   4.21%        | 10    | I -36.67%      | -3.58   | -14.88 |
| TPCDS(20) | TPCDS-Q16   | orc / snap / block | 1.33   | 1.85        | I -28.28%  |   4.98%    |   5.92%        | 10    | I -39.38%      | -3.58   | -12.93 |
| TPCDS(20) | TPCDS-Q18A  | orc / snap / block | 5.70   | 8.06        | I -29.25%  |   3.00%    |   4.12%        | 10    | I -40.30%      | -3.58   | -19.95 |
| TPCDS(20) | TPCDS-Q22A  | orc / snap / block | 2.01   | 2.97        | I -32.21%  |   6.12%    |   5.94%        | 10    | I -47.68%      | -3.58   | -14.05 |
| TPCDS(20) | TPCDS-Q77A  | orc / snap / block | 8.49   | 12.44       | I -31.75%  |   6.44%    |   3.96%        | 10    | I -49.71%      | -3.58   | -16.97 |
| TPCDS(20) | TPCDS-Q75   | orc / snap / block | 7.76   | 12.27       | I -36.76%  |   5.01%    |   3.87%        | 10    | I -59.56%      | -3.58   | -23.26 |
| TPCDS(20) | TPCDS-Q21   | orc / snap / block | 0.71   | 1.27        | I -44.26%  |   4.56%    |   4.24%        | 10    | I -77.31%      | -3.58   | -28.31 |
| TPCDS(20) | TPCDS-Q80A  | orc / snap / block | 9.24   | 20.42       | I -54.77%  |   4.03%    |   3.82%        | 10    | I -123.12%     | -3.58   | -40.90 |
| TPCDS(20) | TPCDS-Q39-1 | orc / snap / block | 1.07   | 2.26        | I -52.74%  | * 23.83% * |   2.60%        | 10    | I -149.68%     | -3.58   | -14.43 |
| TPCDS(20) | TPCDS-Q39-2 | orc / snap / block | 1.00   | 2.33        | I -56.95%  | * 19.53% * |   2.07%        | 10    | I -151.89%     | -3.58   | -20.81 |
+-----------+-------------+--------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+
"Base Avg" is the avg of the original time. "Avg" is the current time.

However, we also see some regressions due to the suboptimal
implementation. The follow-up JIRAs will focus on improvements:
 - IMPALA-11140: Codegen InListFilter::Insert() and InListFilter::Find()
 - IMPALA-11141: Use exact data types in IN-list filters instead of
   casting data to a set of int64_t or a set of string.
 - IMPALA-11142: Consider IN-list filters in partitioned joins.

Tests:
 - Test IN-list filter on string, date and all integer types
 - Test IN-list filter with NULL
 - Test IN-list filters on targets that are complex exprs

Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Reviewed-on: http://gerrit.cloudera.org:8080/18141
Reviewed-by: Qifan Chen <qchen@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-03-03 00:21:06 +00:00
Tim Armstrong
f38da0df8e IMPALA-4400: aggregate runtime filters locally
Move RuntimeFilterBank to QueryState. Implement fine-grained
locking for each filter to mitigate any increased lock
contention from the change.

Make RuntimeFilterBank handle multiple producers of the
same filter, e.g. multiple instances of a partitioned
join. It computes the expected number of filters upfront,
then sends the filter to the coordinator once all the
local instances have been merged together. The merging
can be done locally in parallel to improve the latency
of filter propagation.
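
A minimal sketch of the per-filter producer accounting described above
(names are illustrative, not the actual RuntimeFilterBank members):

    #include <mutex>

    // Hypothetical per-filter state; one instance per filter id in the query.
    struct PerFilterStateSketch {
      std::mutex lock;            // fine-grained: one lock per filter, not per bank
      int pending_producers = 0;  // set upfront to the number of local producer instances
      bool sent = false;

      // Each local producer instance calls this after merging ("ORing") its
      // filter into the shared copy. Returns true exactly once, when the last
      // local producer has reported and the merged filter should be sent to
      // the coordinator.
      bool ProducerDone() {
        std::lock_guard<std::mutex> l(lock);
        if (sent || --pending_producers > 0) return false;
        sent = true;
        return true;
      }
    };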

Add Or() methods to MinMaxFilter and BloomFilter, since
we now need to merge those, not just the Thrift versions.
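
For intuition, a sketch of what Or() means for the two filter kinds
(simplified, hypothetical types; the real classes operate on Impala's
internal filter representations):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct BloomFilterSketch {
      std::vector<uint64_t> directory;  // bit vector, same length across instances
      // ORing two bloom filters keeps every element inserted into either one.
      void Or(const BloomFilterSketch& other) {
        for (size_t i = 0; i < directory.size(); ++i) directory[i] |= other.directory[i];
      }
    };

    struct MinMaxFilterSketch {
      int64_t min, max;
      // The merged range must cover both input ranges.
      void Or(const MinMaxFilterSketch& other) {
        min = std::min(min, other.min);
        max = std::max(max, other.max);
      }
    };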

Update coordinator filter routing to expect only one
instance of a filter from each producer backend and
to only send one instance to each consumer backend
(instead of sending one per fragment).

Update memory reservations and estimates to be lower
to account for sharing of filters between fragment
instances. mt_dop plans are modified to show these
shared and non-shared resources separately.

Enable waiting for runtime filters for the Kudu scanner
with mt_dop.

Made min/max filters const-correct.

Testing
* Added unit tests for Or() methods.
* Added some additional e2e test coverage for mt_dop queries
* Updated planner tests with new estimates and reservation.
* Ran a single node 3-impalad stress test with TPC-H kudu and
  TPC-DS parquet.
* Ran exhaustive tests.
* Ran core tests with ASAN.

Perf
* Did a single-node perf run on TPC-H with default settings. No perf change.
* Single-node perf run with mt_dop=8 showed significant speedups:

+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(30) | parquet / none / none | 10.14   | -7.29%     | 5.05       | -11.68%        |
+----------+-----------------------+---------+------------+------------+----------------+

+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+---------+
| Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval    |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+---------+
| TPCH(30) | TPCH-Q7  | parquet / none / none | 38.87  | 38.44       |   +1.13%   |   7.17%   | * 10.92% *     | 20    |   +0.72%       | 0.72    | 0.39    |
| TPCH(30) | TPCH-Q1  | parquet / none / none | 4.28   | 4.26        |   +0.50%   |   1.92%   |   1.09%        | 20    |   +0.03%       | 0.31    | 1.01    |
| TPCH(30) | TPCH-Q22 | parquet / none / none | 2.32   | 2.32        |   +0.05%   |   2.01%   |   1.89%        | 20    |   -0.03%       | -0.36   | 0.08    |
| TPCH(30) | TPCH-Q15 | parquet / none / none | 3.73   | 3.75        |   -0.42%   |   0.84%   |   1.05%        | 20    |   -0.25%       | -0.77   | -1.40   |
| TPCH(30) | TPCH-Q13 | parquet / none / none | 9.80   | 9.83        |   -0.38%   |   0.51%   |   0.80%        | 20    |   -0.32%       | -1.30   | -1.81   |
| TPCH(30) | TPCH-Q2  | parquet / none / none | 1.98   | 2.00        |   -1.32%   |   1.74%   |   2.81%        | 20    |   -0.64%       | -1.71   | -1.79   |
| TPCH(30) | TPCH-Q6  | parquet / none / none | 1.22   | 1.25        |   -2.14%   |   2.66%   |   4.15%        | 20    |   -0.96%       | -2.00   | -1.95   |
| TPCH(30) | TPCH-Q19 | parquet / none / none | 5.13   | 5.22        |   -1.65%   |   1.20%   |   1.40%        | 20    |   -1.76%       | -3.34   | -4.02   |
| TPCH(30) | TPCH-Q16 | parquet / none / none | 2.46   | 2.56        |   -4.13%   |   2.49%   |   1.99%        | 20    |   -4.31%       | -4.04   | -5.94   |
| TPCH(30) | TPCH-Q9  | parquet / none / none | 81.63  | 85.07       |   -4.05%   |   4.94%   |   3.06%        | 20    |   -5.46%       | -3.28   | -3.21   |
| TPCH(30) | TPCH-Q10 | parquet / none / none | 5.07   | 5.50        | I -7.92%   |   0.96%   |   1.33%        | 20    | I -8.51%       | -5.27   | -22.14  |
| TPCH(30) | TPCH-Q21 | parquet / none / none | 24.00  | 26.24       | I -8.57%   |   0.46%   |   0.38%        | 20    | I -9.34%       | -5.27   | -67.47  |
| TPCH(30) | TPCH-Q18 | parquet / none / none | 8.66   | 9.50        | I -8.86%   |   0.62%   |   0.44%        | 20    | I -9.75%       | -5.27   | -55.17  |
| TPCH(30) | TPCH-Q3  | parquet / none / none | 6.01   | 6.70        | I -10.19%  |   1.01%   |   0.90%        | 20    | I -11.25%      | -5.27   | -35.76  |
| TPCH(30) | TPCH-Q12 | parquet / none / none | 2.98   | 3.39        | I -12.23%  |   1.48%   |   1.48%        | 20    | I -13.56%      | -5.27   | -27.75  |
| TPCH(30) | TPCH-Q11 | parquet / none / none | 1.69   | 2.00        | I -15.55%  |   1.63%   |   1.47%        | 20    | I -18.09%      | -5.27   | -34.60  |
| TPCH(30) | TPCH-Q4  | parquet / none / none | 2.42   | 2.87        | I -15.69%  |   1.48%   |   1.26%        | 20    | I -18.61%      | -5.27   | -39.50  |
| TPCH(30) | TPCH-Q14 | parquet / none / none | 4.64   | 6.27        | I -26.02%  |   1.35%   |   0.73%        | 20    | I -35.37%      | -5.27   | -94.07  |
| TPCH(30) | TPCH-Q20 | parquet / none / none | 3.19   | 4.37        | I -27.01%  |   1.54%   |   0.99%        | 20    | I -36.85%      | -5.27   | -80.74  |
| TPCH(30) | TPCH-Q5  | parquet / none / none | 4.57   | 6.39        | I -28.36%  |   1.04%   |   0.75%        | 20    | I -39.56%      | -5.27   | -120.02 |
| TPCH(30) | TPCH-Q17 | parquet / none / none | 3.15   | 4.71        | I -33.06%  |   1.59%   |   1.31%        | 20    | I -49.43%      | -5.27   | -87.64  |
| TPCH(30) | TPCH-Q8  | parquet / none / none | 5.25   | 7.95        | I -33.95%  |   0.95%   |   0.53%        | 20    | I -51.11%      | -5.27   | -185.02 |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+---------+

Change-Id: Iabeeab5eec869ff2197250ad41c1eb5551704acc
Reviewed-on: http://gerrit.cloudera.org:8080/14538
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-01-29 00:58:24 +00:00
Fang-Yu Rao
702e6c4fa8 IMPALA-7984: Port runtime filter from Thrift RPC to KRPC
Previously the aggregation and propagation of a runtime filter in Impala was
implemented using Thrift RPC, which suffers from the disadvantage that the number
of connections in a cluster grows with both the number of queries and the cluster
size. This patch ports the functions that implement the aggregation and
propagation of a runtime filter, i.e., UpdateFilter() and PublishFilter()
respectively, to KRPC, which requires only one connection per direction between
every pair of hosts, thus reducing the number of connections in a cluster.

In addition, this patch also incorporates a KRPC sidecar when the runtime filter
is a Bloom filter. The KRPC sidecar eliminates the need for an extra copy of the
Bloom filter contents when a Bloom filter is serialized to be transmitted and
hence reduces the serialization overhead. Due to the incorporation of the KRPC
sidecar, a SpinLock is also added to prevent a BloomFilter from being
deallocated before its associated KRPC call finishes.
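
A rough sketch of attaching the Bloom filter payload as an outbound sidecar;
it assumes the Kudu RPC sidecar API (RpcSidecar::FromSlice,
RpcController::AddOutboundSidecar) and is not the exact code in this patch:

    #include <memory>
    #include "kudu/rpc/rpc_controller.h"
    #include "kudu/rpc/rpc_sidecar.h"
    #include "kudu/util/slice.h"
    #include "kudu/util/status.h"

    // The sidecar references the existing buffer, so no extra copy is made. The
    // buffer must stay alive until the RPC completes, which is why a SpinLock is
    // added to guard against premature deallocation of the BloomFilter.
    kudu::Status AttachBloomFilterSidecar(const uint8_t* directory, size_t len,
                                          kudu::rpc::RpcController* controller,
                                          int* sidecar_idx) {
      std::unique_ptr<kudu::rpc::RpcSidecar> sidecar =
          kudu::rpc::RpcSidecar::FromSlice(kudu::Slice(directory, len));
      return controller->AddOutboundSidecar(std::move(sidecar), sidecar_idx);
    }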

Two related BE tests, bloom-filter-test.cc and bloom-filter-benchmark.cc, are
also modified accordingly because of the changes to the signatures of some
functions in BloomFilter.

Testing:
This patch has passed the exhaustive tests.

Change-Id: I11a2f92a91750c2470fba082c30f97529524b9c8
Reviewed-on: http://gerrit.cloudera.org:8080/13882
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-on: http://gerrit.cloudera.org:8080/14974
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-01-21 00:49:05 +00:00
Fang-Yu Rao
e716e76ccc IMPALA-9154: Revert "IMPALA-7984: Port runtime filter from Thrift RPC to KRPC"
The previous patch porting runtime filters from Thrift RPC to KRPC
introduced a deadlock if there is a very limited number of threads on
the Impala cluster.

Specifically, in that patch a Coordinator used a synchronous KRPC to
propagate an aggregated filter to other hosts. A deadlock would happen
if there is no thread available on the receiving side to answer that
KRPC, especially when the calling and receiving threads are drawn from
the same thread pool. One possible way to address this issue is to make
the call that propagates a runtime filter asynchronous, freeing the
calling thread. Until that issue is resolved, we revert this patch for
now.

This reverts commit ec11c18884.

Change-Id: I32371a515fb607da396914502da8c7fb071406bc
Reviewed-on: http://gerrit.cloudera.org:8080/14780
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-11-22 23:10:46 +00:00
Fang-Yu Rao
ec11c18884 IMPALA-7984: Port runtime filter from Thrift RPC to KRPC
Previously the aggregation and propagation of a runtime filter in Impala was
implemented using Thrift RPC, which suffers from the disadvantage that the number
of connections in a cluster grows with both the number of queries and the cluster
size. This patch ports the functions that implement the aggregation and
propagation of a runtime filter, i.e., UpdateFilter() and PublishFilter()
respectively, to KRPC, which requires only one connection per direction between
every pair of hosts, thus reducing the number of connections in a cluster.

In addition, this patch also incorporates a KRPC sidecar when the runtime filter
is a Bloom filter. The KRPC sidecar eliminates the need for an extra copy of the
Bloom filter contents when a Bloom filter is serialized to be transmitted and
hence reduces the serialization overhead. Due to the incorporation of the KRPC
sidecar, a SpinLock is also added to prevent a BloomFilter from being
deallocated before its associated KRPC call finishes.

Two related BE tests, bloom-filter-test.cc and bloom-filter-benchmark.cc, are
also modified accordingly because of the changes to the signatures of some
functions in BloomFilter.

Testing:
This patch has passed the exhaustive tests.

Change-Id: I6b394796d250286510e157ae326882bfc01d387a
Reviewed-on: http://gerrit.cloudera.org:8080/13882
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-11-09 01:54:51 +00:00
Michael Ho
5391100c7e IMPALA-7213, IMPALA-7241: Port ReportExecStatus() RPC to use KRPC
This change converts the ReportExecStatus() RPC from Thrift-based
RPC to KRPC. This is done as part of the preparation
for fixing IMPALA-2990, as we can take advantage of TCP connection
multiplexing in KRPC to avoid overwhelming the coordinator
with too many connections by reducing the number of TCP connections
to one per executor.

This patch also introduces a new service pool for all current and future
query execution control RPCs, so that control commands from
coordinators aren't blocked by long-running DataStream service RPCs.
To avoid unnecessary delays due to sharing network connections
between the DataStream service and the Control service, this change adds
the service name as part of the user credentials for the ConnectionId,
so each service will use a separate connection.

The majority of this patch is a mechanical conversion of some Thrift
structures used in the ReportExecStatus() RPC to Protobuf. Note that the
runtime profile is still retained as a Thrift structure, as Impala
clients still fetch query profiles using Thrift RPCs. This also
avoids duplicating the serialization implementation in both Thrift
and Protobuf for the runtime profile. The Thrift runtime profiles
are serialized and sent as a sidecar in the ReportExecStatus() RPC.

This patch also fixes IMPALA-7241, which may lead to duplicated
DML stats being applied. The fix is to add a monotonically
increasing version number to fragment instances' reports. The
coordinator will ignore any report whose version is smaller than
or equal to that of the last report.
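
A minimal sketch of the report-versioning check described above (names are
hypothetical; the real check lives in the coordinator's per-instance state):

    #include <cstdint>

    struct InstanceReportStateSketch {
      int64_t last_applied_version = -1;

      // Returns true if the report should be applied. Duplicated or out-of-order
      // reports (version <= last applied) are ignored so DML stats are not
      // double-counted.
      bool ShouldApply(int64_t report_version) {
        if (report_version <= last_applied_version) return false;
        last_applied_version = report_version;
        return true;
      }
    };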

Testing done:
1. Exhaustive build.
2. Added some targeted test cases for profile serialization failure
   and RPC retries/timeout.

Change-Id: I7638583b433dcac066b87198e448743d90415ebe
Reviewed-on: http://gerrit.cloudera.org:8080/10855
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-01 21:12:12 +00:00
Michael Ho
5c541b9604 Add missing authorization in KRPC
In 2.12.0, Impala adopted the Kudu RPC library for certain backend services
(TransmitData(), EndDataStream()). While the implementation uses Kerberos
for authenticating users connecting to the backend services, there is no
authorization implemented. This is a regression from the Thrift-based
implementation, which registered a SASL callback (SaslAuthorizeInternal)
to be invoked during connection negotiation. With this regression,
an unauthorized but authenticated user may invoke RPC calls to Impala backend
services.

This change fixes the issue above by overriding the default authorization method
for the DataStreamService. The authorization method only lets an authenticated
principal that matches FLAGS_principal / FLAGS_be_principal access the service.
Also added a new startup flag --krb5_ccname to allow users to customize the location
of the Kerberos credentials cache.
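
A simplified sketch of the authorization policy (the flag plumbing and the
hook into the DataStreamService are omitted; the function name is illustrative):

    #include <string>

    // Only the configured Impala service principal may call the service; an
    // authenticated but unauthorized principal is rejected.
    bool AuthorizePrincipalSketch(const std::string& authenticated_principal,
                                  const std::string& flags_principal,
                                  const std::string& flags_be_principal) {
      // Assumption: the backend principal falls back to FLAGS_principal if unset.
      const std::string& expected =
          flags_be_principal.empty() ? flags_principal : flags_be_principal;
      return !expected.empty() && authenticated_principal == expected;
    }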

Testing done:
1. Added a new test case in rpc-mgr-kerberized-test.cc to confirm an unauthorized
user is not allowed to access the service.
2. Ran some queries in a Kerberos enabled cluster to make sure there is no error.
3. Exhaustive builds.

Thanks to Todd Lipcon for pointing out the problem and his guidance on the fix.

Change-Id: I2f82dee5e721f2ed23e75fd91abbc6ab7addd4c5
Reviewed-on: http://gerrit.cloudera.org:8080/11331
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-08-30 04:06:09 +00:00
Michael Ho
421af4e40a IMPALA-6685: Improve profiles in KrpcDataStreamRecvr and KrpcDataStreamSender
This change implements a couple of improvements to the profiles of
KrpcDataStreamRecvr and KrpcDataStreamSender:

- track the pending number of deferred row batches over time in KrpcDataStreamRecvr
- track the number of bytes dequeued over time in KrpcDataStreamRecvr
- track the total time the deferred RPC queues are not empty
- track the number of bytes sent from KrpcDataStreamSender over time
- track the total amount of time spent in KrpcDataStreamSender, including time
  spent waiting for RPC completion.

Sample profile of an Exchange node instance:

          EXCHANGE_NODE (id=21):(Total: 2s284ms, non-child: 64.926ms, % non-child: 2.84%)
             - ConvertRowBatchTime: 44.380ms
             - PeakMemoryUsage: 124.04 KB (127021)
             - RowsReturned: 287.51K (287514)
             - RowsReturnedRate: 125.88 K/sec
            Buffer pool:
               - AllocTime: 1.109ms
               - CumulativeAllocationBytes: 10.96 MB (11493376)
               - CumulativeAllocations: 562 (562)
               - PeakReservation: 112.00 KB (114688)
               - PeakUnpinnedBytes: 0
               - PeakUsedReservation: 112.00 KB (114688)
               - ReadIoBytes: 0
               - ReadIoOps: 0 (0)
               - ReadIoWaitTime: 0.000ns
               - WriteIoBytes: 0
               - WriteIoOps: 0 (0)
               - WriteIoWaitTime: 0.000ns
            Dequeue:
              BytesDequeued(500.000ms): 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 700.00 KB, 2.00 MB, 3.49 MB, 4.39 MB, 5.86 MB, 6.85 MB
               - FirstBatchWaitTime: 0.000ns
               - TotalBytesDequeued: 6.85 MB (7187850)
               - TotalGetBatchTime: 2s237ms
                 - DataWaitTime: 2s219ms
            Enqueue:
              BytesReceived(500.000ms): 0, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 23.36 KB, 328.73 KB, 963.79 KB, 1.64 MB, 2.09 MB, 2.76 MB, 3.23 MB
              DeferredQueueSize(500.000ms): 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0
               - DispatchTime: (Avg: 108.593us ; Min: 30.525us ; Max: 1.524ms ; Number of samples: 281)
               - DeserializeRowBatchTime: 8.395ms
               - TotalBatchesEnqueued: 281 (281)
               - TotalBatchesReceived: 281 (281)
               - TotalBytesReceived: 3.23 MB (3387144)
               - TotalEarlySenders: 0 (0)
               - TotalEosReceived: 1 (1)
               - TotalHasDeferredRPCsTime: 15s446ms
               - TotalRPCsDeferred: 38 (38)

Sample sender's profile:

        KrpcDataStreamSender (dst_id=21):(Total: 17s923ms, non-child: 604.494ms, % non-child: 3.37%)
          BytesSent(500.000ms): 0, 0, 0, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 34.78 KB, 46.54 KB, 46.54 KB, 46.54 KB, 58.31 KB, 58.31 KB, 58.31 KB, 58.31 KB, 58.31 KB, 58.31 KB, 58.31 KB, 974.44 KB, 2.82 MB, 4.93 MB, 6.27 MB, 8.28 MB, 9.69 MB
           - EosSent: 3 (3)
           - NetworkThroughput: 4.61 MB/sec
           - PeakMemoryUsage: 22.57 KB (23112)
           - RowsSent: 287.51K (287514)
           - RpcFailure: 0 (0)
           - RpcRetry: 0 (0)
           - SerializeBatchTime: 329.162ms
           - TotalBytesSent: 9.69 MB (10161432)
           - UncompressedRowBatchSize: 20.56 MB (21563550)

Change-Id: I8ba405921b3df920c1e85b940ce9c8d02fc647cd
Reviewed-on: http://gerrit.cloudera.org:8080/9690
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-28 19:21:46 +00:00
Lars Volker
3bfda33487 IMPALA-6193: Track memory of incoming data streams
This change adds memory tracking to incoming transmit data RPCs when
using KRPC. We track memory against a global tracker called "Data Stream
Service" until it is handed over to the stream manager. There we track
it in a global tracker called "Data Stream Queued RPC Calls" until a
receiver registers and takes over the early sender RPCs. Inside the
receiver, memory for deferred RPCs is tracked against the fragment
instance's memtracker until we unpack the batches and add them to the
row batch queue.
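
A simplified sketch of one hand-off step between trackers (a hypothetical
tracker type; Impala's MemTracker also handles limits and parent trackers):

    #include <atomic>
    #include <cstdint>

    struct TrackerSketch {
      std::atomic<int64_t> consumption{0};
      void Consume(int64_t bytes) { consumption += bytes; }
      void Release(int64_t bytes) { consumption -= bytes; }
    };

    // Moving a deferred RPC's payload from one tracker to the next (e.g. from
    // "Data Stream Queued RPC Calls" to the fragment instance's tracker): consume
    // on the destination before releasing from the source so the bytes are never
    // unaccounted for.
    void TransferPayload(TrackerSketch* from, TrackerSketch* to, int64_t bytes) {
      to->Consume(bytes);
      from->Release(bytes);
    }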

The DCHECK in MemTracker::Close() ensures that all memory consumed by a
tracker is eventually released. In addition to that, this change adds a
custom cluster test that makes sure queued memory gets tracked by
inspecting the peak consumption of the new memtrackers.

Change-Id: I2df1204d2483313a8a18e5e3be6cec9e402614c4
Reviewed-on: http://gerrit.cloudera.org:8080/8914
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-01 08:53:36 +00:00
Michael Ho
b4ea57a7e3 IMPALA-4856: Port data stream service to KRPC
This patch implements a new data stream service which utilizes KRPC.
Similar to the Thrift RPC implementation, there are 3 major components
to the data stream service: KrpcDataStreamSender serializes and sends
row batches materialized by a fragment instance to a KrpcDataStreamRecvr,
and KrpcDataStreamMgr is responsible for routing an incoming row batch to
the appropriate receiver. The data stream service runs on the port
FLAGS_krpc_port, which is 29000 by default.

Unlike the implementation with Thrift RPC, KRPC provides an asynchronous
interface for invoking remote methods. As a result, KrpcDataStreamSender
doesn't need to create a thread per connection. There is one connection
between two Impalad nodes for each direction (i.e. client and server).
Multiple queries can multiplex on the same connection for transmitting
row batches between two Impalad nodes. The asynchronous interface also
avoids the possibility that a thread is stuck in the RPC code for an
extended amount of time without checking for cancellation. A TransmitData()
call with KRPC is in essence a trio of an RpcController, a serialized protobuf
request buffer and a protobuf response buffer. The call is invoked via a
DataStreamService proxy object. The serialized tuple offsets and row batches
are sent via "sidecars" in KRPC to avoid an extra copy into the serialized
request buffer.

Each Impalad node creates a singleton DataStreamService object at start-up
time. All incoming calls are served by a service thread pool created as part
of the DataStreamService. By default, the number of service threads equals the
number of logical cores. The service threads are shared across all queries, so
the RPC handler should avoid blocking as much as possible. In the Thrift RPC
implementation, a Thrift thread handling a TransmitData() RPC blocks for an
extended period of time when the receiver has not yet been created when the
call arrives. In the KRPC implementation, we store TransmitData() or
EndDataStream() requests which arrive before the receiver is ready in a
per-receiver early sender list stored in KrpcDataStreamMgr. These RPC calls
will be processed and responded to when the receiver is created or when a
timeout occurs.

Similarly, there is limited space in the sender queues in KrpcDataStreamRecvr.
If adding a row batch to a queue in KrpcDataStreamRecvr would cause the buffer
limit to be exceeded, the request will be stashed in a queue for deferred
processing. The stashed RPC requests will not be responded to until they are
processed, so as to exert back pressure on the senders. An alternative would be
to reply with an error so that the request / row batches have to be sent again,
but this may end up consuming more network bandwidth than the Thrift RPC
implementation. This change adopts the behavior of allowing one stashed request
per sender.
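
A simplified sketch of the deferred-RPC back pressure described above
(hypothetical types; the real receiver also deals with early senders, EOS,
cancellation and memory tracking):

    #include <cstdint>
    #include <functional>
    #include <queue>
    #include <utility>

    struct DeferredRpcSketch {
      int64_t batch_size;
      std::function<void()> respond;  // responding lets the sender proceed
    };

    // Locking is omitted for brevity.
    class RecvQueueSketch {
     public:
      explicit RecvQueueSketch(int64_t limit) : limit_(limit) {}

      // Called from the RPC handler: either accept the batch and respond right
      // away, or stash the request. A stashed request is not responded to until
      // the queue drains, which is what exerts back pressure on the sender.
      void AddBatch(int64_t batch_size, std::function<void()> respond) {
        if (queued_bytes_ + batch_size > limit_) {
          deferred_.push({batch_size, std::move(respond)});
          return;
        }
        queued_bytes_ += batch_size;
        respond();
      }

      // Called after the consumer dequeues a batch from the row batch queue.
      void BatchDequeued(int64_t batch_size) {
        queued_bytes_ -= batch_size;
        while (!deferred_.empty() &&
               queued_bytes_ + deferred_.front().batch_size <= limit_) {
          DeferredRpcSketch rpc = std::move(deferred_.front());
          deferred_.pop();
          queued_bytes_ += rpc.batch_size;
          rpc.respond();  // now the sender may transmit its next batch
        }
      }

     private:
      const int64_t limit_;
      int64_t queued_bytes_ = 0;
      std::queue<DeferredRpcSketch> deferred_;
    };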

All RPC requests and responses are serialized using protobuf. The equivalent of
TRowBatch is ProtoRowBatch, which contains a serialized header with the
metadata of the row batch and two Kudu Slice objects which contain pointers to
the actual data (i.e. tuple offsets and tuple data).

This patch is based on an abandoned patch by Henry Robinson.

TESTING
-------

* Builds {exhaustive/debug, core/release, asan} passed with FLAGS_use_krpc=true.

TO DO
-----

* Port some BE tests to KRPC services.

Change-Id: Ic0b8c1e50678da66ab1547d16530f88b323ed8c1
Reviewed-on: http://gerrit.cloudera.org:8080/8023
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-09 20:05:08 +00:00