impala

mirror of https://github.com/apache/impala.git synced 2026-01-08 21:03:01 -05:00

Files

Henry Robinson 433a86263c Separate failure detection from topic updates in statestore

As originally envisaged, the statestore was not intended to transmit GBs
of data to large clusters. Now that it is, some changes to the design
are required to improve stability and responsiveness.

One of the most acute problems is the way that the topic updates, which
may be extremely large, were responsible for conveying failure detection
information. The delay in sending topic heartbeats (which were often
blocked behind a queue of other updates) meant that large clusters
needed to set large timeouts for every subscriber, and to reduce the
frequency at which topic updates were sent.

This change adds a second heartbeat type to the statestore which is
responsibie only for coordinating failure information between the
subscriber and the statestore. These 'keep-alive' heartbeats are sent
much more frequently, as they have a tiny payload and do very little
work. The statestore now does not use the result of topic updates to
maintain its view of which subscribers are alive or dead, but only the
results of the keep-alive heartbeats.

I've tested these changes with a 500MB catalog on a 73 node
cluster. Keep-alive frequencies are nice and stable (at 500ms out of the
box) even during the initial large topic-update distribution phase, or
after a statestore failure.

Internally, this patch changes the nomenclature for the statestore: we
stay away from 'heartbeat' (except for the failure detector, which is
heartbeat-based), and instead use 'message' as a generic term for both
keep-alive and topic update. Externally, we need to rationalise the flag
names to control both update types. The benefit of this change is that
only the keep-alive frequency matters for cluster stability, and that
depends only on the cluster size, not on the size of the catalog and the
cluster size. So we can set it to, e.g., 1s out of the box with some
confidence that that will work for clusters up to ~200 nodes. The topic
update frequency may actually be set aggressively, because the
statestore will naturally throttle itself during times of heavy traffic
and nothing will time out.

Change-Id: Ia447e4ebefda890a5b810a213e97f00cca499989
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4003
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5138

2014-11-05 14:59:25 -08:00