I’ve been looking at Paxos, Raft, Zab, and other algorithms that can loosely be called “consensus” algorithms, to see where precision time synchronization can improve distributed system operation and where precision timing can enable novel features. Google’s Spanner database is an interesting example in this area, but there is also nice work in a number of other places. One obvious question is how to account for the complexity of Paxos, which seems like it should be simple but, as several people have pointed out, is remarkably elusive and complex in practice.
The original Paxos paper is unreadably over-cute. Later versions tried, without a lot of success, to make things clear. Part of the problem is that time and timeouts are fundamental to the operation of distributed algorithms, but in Paxos particularly there has been a laborious attempt to sweep these things under the rug so that it looks as much as possible like a purely “asynchronous” algorithm (see this alternative explanation of Paxos).
Suppose network elements \(A\) and \(B\) can only communicate by sending/receiving messages over some communications medium that can lose or delay messages, but never corrupt them. That is an optimistic model of a network, but it can be made reliable to a high probability. \(A\) sends a message \(m\) to \(B\) and waits for a reply message \(r\). \(B\) can fail. The original transmission might have failed. It may be that \(A\) can deduce \(B\) has failed or is unreachable if \(A\) sends \(k\) messages to \(B\) and has not received any reply by the time the \(k\)th message has been transmitted. But it’s more likely that \(A\) will use a combination of the count of messages sent and the time that has passed since the last reply to conclude that \(B\) is dead or unreachable (at least for the moment). This obvious fact of life in distributed systems is something that, for some reason, academic researchers in distributed systems don’t like, but it’s actually really interesting. The current time is information that is shared between nodes on a network with zero communication delay, as long as clocks are synchronized properly. Even without synchronized clocks, syntonized clocks (same or similar frequency) permit the use of expiring “leases” with zero information transfer time.

The Paxos Made Simple paper begins with the usual completely asynchronous model – where network elements cannot rely on timeouts because there is no common time base – but after a strenuous attempt to get the algorithm to work in this model, gives up by page seven (my bold):
If enough of the system (proposer, acceptors, and communication network) is working properly, liveness can therefore be achieved by electing a single distinguished proposer. The famous result of Fischer, Lynch, and Patterson [1] implies that a reliable algorithm for electing a proposer must use either randomness or real time—for example, by using timeouts. However, safety is ensured regardless of the success or failure of the election.
Paxos, like any other consensus algorithm that can tolerate failures, then has to rely on real-time measurements, but these have been marginalized and pushed into leader election and caveats about liveness in the Paxos papers. And that, I believe, accounts for a great deal of the obscurity of the presentations. The Paxos Made Simple paper even cites the academically beloved FLP “theorem”, which gives the not-so-shocking result that on a completely asynchronous network there is no way for one network site to distinguish between a network/site failure and a slow response. Equipped with that trivial fact, it seems perverse to embark on the development of asynchronous algorithms.
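To see how little machinery is involved, here is a sketch in Go of the count-plus-elapsed-time failure detection described a few paragraphs back. Everything here – the function name, the ping format, the parameters – is my invention, not from any of the papers; the point is only that the verdict “\(B\) is dead or unreachable” is manufactured entirely out of local counts and local clock readings.

```go
package detect

import (
	"fmt"
	"time"
)

// suspectFailure pings B and waits for any reply. It declares B dead or
// unreachable (at least for the moment) from two purely local facts: how
// many messages have gone unanswered, and how much wall-clock time has
// passed since anything was heard from B.
func suspectFailure(send func(string), replies <-chan string,
	maxUnanswered int, maxSilence time.Duration) bool {

	lastHeard := time.Now()
	retryEvery := maxSilence / time.Duration(maxUnanswered)
	unanswered := 0

	for {
		send(fmt.Sprintf("ping %d", unanswered))
		unanswered++
		select {
		case <-replies:
			return false // B answered: not dead after all
		case <-time.After(retryEvery):
			// silence this round; reconsider below
		}
		if unanswered >= maxUnanswered && time.Since(lastHeard) >= maxSilence {
			return true // k messages out, nothing heard back: suspect B
		}
	}
}
```

No message from the network ever says “failed”; the local clock is doing all of the work.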
Curiously, when it comes time to build a working Paxos implementation, the necessity of time-based algorithms becomes clear. The Google developers note in Paxos Made Live:
In our implementation, all replicas implicitly grant a lease to the master of the previous Paxos instance and refuse to process Paxos messages from any other replica while the lease is held. The master maintains a shorter timeout for the lease than the replicas – this protects the system against clock drift. The master periodically submits a dummy “heartbeat” value to Paxos to refresh its lease.
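That middle sentence is doing real work, and it is easy to make concrete. Below is a minimal sketch in Go of the two sides of such a lease; the durations, names, and types are mine, not the paper’s. Both sides time the same lease with their own clocks, and the master deliberately uses the shorter interval, so that (for bounded drift) it stops acting as master before any replica stops honoring the grant.

```go
package lease

import "time"

const (
	replicaLease = 10 * time.Second // how long a replica honors its grant
	driftMargin  = 1 * time.Second  // master's safety margin for clock drift
)

// Replica side: after granting, refuse Paxos messages from any other
// replica until this replica's own clock says the lease has expired.
type ReplicaGrant struct{ grantedAt time.Time }

func (g ReplicaGrant) StillHeld() bool {
	return time.Since(g.grantedAt) < replicaLease
}

// Master side: the master times the *same* lease with its own clock, but
// against a shorter interval, so even if its clock runs somewhat slow
// relative to a replica's, it stops acting as master before any replica
// stops honoring the grant.
type MasterLease struct{ renewedAt time.Time }

func (l MasterLease) StillMaster() bool {
	return time.Since(l.renewedAt) < replicaLease-driftMargin
}
```

Note that the fixed margin only protects against bounded clock drift over one lease interval – exactly the kind of quiet real-time assumption the asynchronous presentations leave out.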
The Raft paper also begins with a claim about the asynchronous operation of consensus algorithms, which supposedly:
[…] do not depend on timing to ensure the consistency of the logs: faulty clocks and extreme message delays can, at worst, cause availability problems.
but later explains:
• Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required for any consensus algorithm, while resolving conflicts simply and rapidly
Ok, maybe it needs clocks to elect leaders, but not for consistency of logs:
Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected). Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests. Alternatively, the leader could rely on the heartbeat mechanism to provide a form of lease [9], but this would rely on timing for safety (it assumes bounded clock skew).
But:
If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.
If there is unlimited skew, I don’t see how this works at all: a follower whose clock runs fast will keep triggering spurious elections, while one whose clock runs slow may never notice that the leader is gone.
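For concreteness, here is roughly what the follower side looks like – a sketch in Go under my own naming, not the Raft authors’ code, though the 150–300ms timeout range is the one the paper suggests. Both the randomized timeout and the conclusion that “there is no viable leader” come from the follower’s local clock, which is where the bounded-drift assumption hides.

```go
package election

import (
	"math/rand"
	"time"
)

// The Raft paper suggests election timeouts in the 150-300ms range.
const (
	timeoutLo = 150 * time.Millisecond
	timeoutHi = 300 * time.Millisecond
)

// electionTimeout draws a fresh randomized timeout; the randomization is
// what makes split votes rare.
func electionTimeout() time.Duration {
	return timeoutLo + time.Duration(rand.Int63n(int64(timeoutHi-timeoutLo)))
}

// runFollower waits for heartbeats. If the follower's local clock sees
// silence for one whole election timeout, it assumes there is no viable
// leader and starts an election. A clock running fast fires spurious
// elections; one running slow may never fire at all.
func runFollower(heartbeats <-chan struct{}, startElection func()) {
	for {
		select {
		case <-heartbeats:
			// leader is alive; loop and draw a new timeout
		case <-time.After(electionTimeout()):
			startElection()
			return
		}
	}
}
```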
Which naturally brings up the question: if you are going to use time anyway, as in Raft and Paxos, and on top of that select a single coordinator, why go to all the trouble and overhead of a generic, mostly asynchronous algorithm?
Another interesting question is the relationship of Paxos to the well-known (though not, apparently, in distributed systems) Chang-Maxemchuk protocol (1983). CM is basically a reliable broadcast protocol – designed to have a set of receiver sites commit messages from a single transmitter in order. The reformation phase essentially solves the same problem Paxos is attempting to solve – forcing a consensus on a new list of receivers after some failure.
Any site that detects a failure or recovery initiates a reformation and is called an originator. It invites other sites in the broadcast group, the slaves, to form a new list. The reformation process can be described in terms of the activities of sites joining and committing a valid list. A valid list satisfies a set of specific requirements, as explained below. When the reformation starts, a site is invited to join a new list and eventually commits to a valid list. When all of the sites in a valid list are committed to this list, the list will be authorized with a token and the reformation terminates. This list becomes the new token list. Multiple originators can exist if more than one site discovers the failure or recovery. During the reformation, it is possible that acknowledged messages from the old token list have been missed by all sites that join a new list.
To guarantee that there is only one new list and that this list has all of the committed messages, the list must be tested before it can be considered a valid list. Specifically, a list becomes valid if it passes the majority test, the sequence test, and the resiliency test.
Majority Test. The majority test requires that a valid list has a majority of the sites in the broadcast group. During the reformation, a site can join only one list. The majority test is necessary to ensure that only one valid list can be formed.
Sequence Test. The sequence test requires that a site only join a list with a higher version number than the list it previously belonged to. The version number of a token list is in the form of (version #, site number). Each site has a unique site number. When a new list is formed, the originator chooses the new version # to be the version # of the last list it has joined plus one. Therefore, token lists have unique version numbers.
The originator always passes the sequence test. If any of the slaves fail the sequence test, it tells the originator its version number. The originator increments the higher version # the next time it tries to form a new list. The combination of the majority and the sequence test ensures that all valid lists have increasing version numbers. This is true because any two valid lists must have at least one site in common, and a site can join a second list only if the second list has a higher version number. Therefore, the version numbers indicate the sequence in which token lists were formed.
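The majority and sequence tests are simple enough to write down. Here is a sketch in Go with made-up types (the paper, of course, specifies no such interface, and I omit the resiliency test):

```go
package reformation

// A version number is a (version #, site number) pair; site numbers are
// unique, so version numbers are unique, and they order lexicographically.
type Version struct {
	Num  int
	Site int
}

func (a Version) Less(b Version) bool {
	if a.Num != b.Num {
		return a.Num < b.Num
	}
	return a.Site < b.Site
}

// A proposed token list: its version number and its member sites.
type List struct {
	Version Version
	Members map[int]bool
}

// MajorityTest: a valid list must contain a majority of the broadcast
// group. Any two majorities share a site, and a site may join only one
// list per reformation, so at most one list can pass.
func MajorityTest(l List, groupSize int) bool {
	return 2*len(l.Members) > groupSize
}

// SequenceTest: a site joins only a list whose version number is higher
// than that of the last list it belonged to.
func SequenceTest(l List, lastJoined Version) bool {
	return lastJoined.Less(l.Version)
}
```

The argument quoted above falls out directly: two lists passing MajorityTest would have to share a site, a site joins only one list per reformation, and SequenceTest forces version numbers to increase across successive valid lists.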