Impossibility of Distributed Consensus with One Faulty Process - Paper Review

This is a review of Impossibility of Distributed Consensus with One Faulty Process by M. J. Fischer, N. A. Lynch, and M. S. Patterson (referred to as the FLP result). I restate the paper's proof no consensus algorithm running in an asynchronous network can be totally correct. I've filled in the use of "by easy induction" with induction, use slightly different notation, and you can hover or click underlined terms to show their definition. I likely guild the lily.


Introduction

We are presented with a set of processes, PP, which implement a consensus algorithm. Each pPp \in P may communicate with other processes with messages. Pending messages are stored in a message buffer. When pp attempts to receive a message, it receives a random pending message addressed to it, or \varnothing. A process receiving infinitely many times is guaranteed to receive all pending messages. Though, for a single receive, it is valid for the system to return \varnothing despite a pending message being available.

At any time, pp may take a step where it attempts to receive a message. A step by pp is an event, denoted epEe_p \in E. A run, denoted σΣ\sigma \in \Sigma, is a sequence of events. We assume processes don't have clocks (no timeouts), and thus only change their state upon receiving a message.

The set of possible configurations of the system is denoted CC. Each configuration contains a possible system state. The configuration after applying an event is denoted e(C)e(C), and after applying a run σ(C)\sigma(C). A configuration CC' is reachable from CC if σΣ:C=σ(C)\exists \sigma \in \Sigma: C'=\sigma(C). Accessible configurations are reachable from an initial configuration.

Lemma 0: If an event is applicable to CC it is applicable to all CC' reachable from CC.

We say PP correctly implements a consensus algorithm, insofar as all messages sent can be received. Messages may be delated infinitely and are ordered randomly; thus, messages are applicable to all configurations reachable from the one in which they are sent.

Lemma 1: If the set of processes taking steps in σ1\sigma_1 and σ2\sigma_2 are disjoint, σ1(σ2(C))=σ2(σ1(C))\sigma_1(\sigma_2(C)) = \sigma_2(\sigma_1(C)).

This follows, as the runs modify disjoint parts of the system's state.

Consensus

Each pPp \in P has a decision value yp{undecided,0,1}y_p \in \{\text{undecided}, 0, 1\}. As PP is correct, if any pPp\in P has a decision value, all pPp\in P will eventually have the same one. The goal of consensus is for all processes to agree on a decision value.

An initial configuration is a configuration with an empty message buffer, for which no process has a decision value, i.e. pP:yp=undecided\forall p \in P: y_p = \text{undecided}.

Partially correct consensus algorithms satisfy two requirements. (1) No accessible configuration has more than one decision value. (2) Both 0 and 1 are accessible decision values.

These correctness requirements seem reasonable. (1) requires processes come to agreement, and (2) they can come to agreement on all options.

A configuration is deciding if a process has a decision value. A consensus algorithm is totally correct if it is partially correct, and every admissible run is deciding. The difference between totally and partially correct consensus algorithms is, partially correct ones may never make a decision.

A configuration with a decision value of 1 is 1-valent, and 0 is 0-valent. If there are 1-valent and 0-valent configurations reachable from CC, CC is bivalent. If there are only 1-valent or only 0-valent configurations reachable from CC, CC is univalent.

Lemma 2: All totally correct consensus algorithms have a bivalent initial configuration.

Imagine each initial configuration arranged in a graph. Two configurations are connected if they differ by the state of only one process. As we've placed no constraints on initial configurations, other than the message buffer be empty and no decision value be reached, each configuration is connected to every other configuration by a chain of configurations.

We prove Lemma 2 by contradiction. Assume there is no bivalent initial configuration. For a totally correct consensus algorithm, each configuration must be deciding, and, as there are no bivalent configurations, must be one of 0-valent or 1-valent. Thus, our graph must contain a 0-valent configuration connected to a 1-valent one. Call these two configurations C0C_0 and C1C_1 respectively.

Let pp be the process who's state differs between C0C_0 and C1C_1, and σ\sigma be a run from C0C_0 in which pp takes no steps. As C0C_0 and C1C_1 differ only by the state of pp, and pp takes no steps in σ\sigma, Cσ=σ(C0)=σ(C1)C_{\sigma} = \sigma(C_0) = \sigma(C_1). As CσC_{\sigma} is reachable from C0C_0, it must be 0-valent. As CσC_{\sigma} is reachable from C1C_1, it must be 1-valent. Thus, we have arrived at a contradiction and one of C1C_1 or C0C_0 is bivalent, proving Lemma 2.

Lemma 3: Let CC be a bivalent configuration of a partially correct consensus algorithm, and epe_p be an event applicable to CC. Let Ψ\Psi denote the set of configurations reachable from CC without applying epe_p, and Ω\Omega the set of configurations reachable from Ψ\Psi by applying epe_p: Ω={e(ψ)ψΨ}\Omega = \{e_(\psi) \mid \psi \in \Psi\}. Ω\Omega contains a bivalent configuration.

By Lemma 0, as epe_p is applicable to CC, epe_p is applicable to all configurations in Ψ\Psi. This validates our construction of Ω\Omega.

We prove Lemma 3 by contradiction. Assume there is no bivalent configuration in Ω\Omega. Let EiE_i be an ii-valent configuration reachable from CC, as CC is bivalent i{0,1}i \in \{0, 1\}. If EiΨE_i \in \Psi, ep(Ei)e_p(E_i) is in Ω\Omega and ii-valent, as EiE_i is ii-valent. Otherwise, EiΩE_i \in \Omega. In either case ii can be 00 or 11, so Ω\Omega contains both 0-valent and 1-valent configurations.

Call two configurations neighbors if one is reachable from the other via a single event. Let ep(C)e_p(C) be an ii-valent configuration Ω\in \Omega. As Ω\Omega contains both 0 and 1 valent configurations, CΨ:\exists C' \in \Psi: ep(C)e_p(C') is a (1i)(1-i)-valent configuration. Let σ\sigma be a run where C=σ(C)C'=\sigma(C). As the valence of ep(C)e_p(C) and ep(C)e_p(C') is different, there must exist neighbor configurations CiC_i and C1iC_{1-i} and an event epσe_{p'} \in \sigma, such that C1i=ep(Ci)C_{1-i} = e_{p'}(C_i), ep(C1i)e_p(C_{1-i}) is (1i)(1-i)-valent, and ep(Ci)e_p(C_i) is ii-valent.

Case 1, ppp' \neq p. By Lemma 1 ep(ep(Ci))=ep(ep(Ci))e_p(e_{p'}(C_i)) = e_{p'}(e_p(C_i)); however, this is a contradiction, as ep(ep(Ci))=ep(C1i)e_p(e_{p'}(C_i)) = e_p(C_{1-i}) is (1i)(1-i)-valent, and ep(ep(Ci))e_{p'}(e_p(C_i)) is ii-valent (as Ω\Omega is assumed to contain only univalent configurations, and ep(Ci)Ωe_p(C_i) \in \Omega is ii-valent).

Case 2, p=pp = p'. Consider any deciding run from CiC_i in which pp takes no steps, denoted σ\sigma. By Lemma 1, ep(σ(Ci))=σ(ep(Ci))e_p(\sigma(C_i)) = \sigma(e_p(C_i)), which is ii valent per the definition of CiC_i. Note also, ep(ep(σ(Ci)))=σ(ep(ep(Ci)))e_p(e_{p'}(\sigma(C_i))) = \sigma(e_p(e_{p'}(C_i))), which is (1i)(1-i)-valent, by the definition of epe_{p'} and CiC_i. Thus, σ(Ci)\sigma(C_i) is bivalent; however, this is a contradiction, as σ\sigma is deciding (and thus univalent).

Consensus Is Impossible

Lemma 3 tells us, for any event epe_p, and bivalent configuration CC, there is an admissible run, σ\sigma, such that ep(σ(C))e_p(\sigma(C)) is bivalent. Lemma 1 tells us there is always a bivalent initial configuration. We can use these to construct an algorithm for advancing a partially correct consensus algorithm, such that it never reaches a decision value.

Starting from a bivalent configuration CC, when a process sends a message to be received in epe_p, apply σ\sigma, then apply epe_p. ep(σ(C))e_p(\sigma(C)) is bivalent, so this can be repeated indefinitely.

A previous essay described RAFT (a contemporary consensus algorithm), even RAFT is not totally correct. If the network is partitioned, and no single partition contains the majority of nodes, RAFT will stop advancing.