RAFT Elections

This essay describes the RAFT consensus algorithm's leader election process as an asyncronous message passing state machine. I then breifly walk through my implementation of the state machine described.


in RAFT consensus a node can either be a follower, a candidate, or a leader. leaders are responsible for sending periodic heartbeats to their followers. if a follower doesn't receive a heartbeat in time, it will run for election and enter the candidate state.

if a candidate wins an election, it becomes a leader and starts sending heartbeats to other nodes. if a different node wins the election that round, or if there has been a network partition and the candidate discovers a leader while running for election, the candidate becomes a follower.

conceptually, what this all means is that if a node doesn't hear from the leader for a while, it will assume the leader has died and run for election in order to fill its shoes.

Terms

time in RAFT is discretized into terms. when an election is started, a new term begins. nodes serve single terms as leaders. so, a node starting an election begins a new term and usurps the old leader.

Understanding the state machine

two events may trigger a state change: a timeout, or an incoming message from another node.

for timeouts, our state machine logic is pretty straightforward:

on timeout:
 set_state(candidate)

there are two messages that may be sent by another node: `AppendEntries`, and `RequestVote`. the former is used for heartbeats and to update the node's log (the log is what is distributed and synced in RAFT). the later is used to request a vote during an election.

for leader elections, both messages can be treated identically. when a node sends a message, it includes its current term. if our term is out of date, we know a new round has begun and we become a follower so that we may cast votes or run for leader in the new round. this makes our message handling logic:

on message receive:
 heartbeat()
 if message.term > current_term:
 set_term(message.term)
 set_state(follower)

external inputs understood, we can now write out the rest of our state machine:

on -> candidate:
 stop_listening_for_timeouts()
 run_election()
 
on -> follower:
 stop_sending_heartbeats()
 listen_for_timeouts()
 
on -> leader:
 send_heartbeats()

from this, we see that there are four actors in our state machine:

  1. timeouts
  2. transitions
  3. messages
  4. heartbeats

the timeouts handler is responsible for listening for heartbeats and causing a transition to the candidate state when a timeout occurs. it needs to be toggable on/off. it receives heartbeats and outputs state changes. the transition actor listens for state updates and executes transition logic. per our earlier transition rules, a state change can cause timeouts and heartbeats to turn on and off. the messages actor takes messages from the RPC which can cause state changes and results in heartbeats being sent: finally, the heartbeats actor can be turned on and off and sends heartbeats to peers.

all together:

Implementing the state machine

it turns out that this state machine is highly parallizable. each of those inputs to an actor can be represented as a multi-producer, single-consumer queue. using async rust, we can run each actor in its own green thread and send messages to other actors across those threads.

i've written such an implementation here.

for example, here is the signature for the heartbeats actor located in `src/heartbeats.rs`:

pub(crate) async fn heartbeats(
 logger: Logger,
 consensus_state: Arc<Mutex<ConsensusState>>,
 rpc: mpsc::Sender<RpcRequest>,
 state_tx: mpsc::Sender<StateUpdateMsg>,
 mut toggles: mpsc::Receiver<ToggleMsg>,
) {
 // ...
}

`conenesus_state` contains information shared across each thread. specifically: the current set of peers, the current term, and the vote cast during the current term (if any). the other arguments should be familiar as they are the same channels mocked out above.

State machine in action

to see the state machine in action, clone the respository and run `cargo t test_election -- --nocapture`. here's an example run split up and annotated:

4: ⏲: -> candidate (timeout)
4: Follower -> Candidate
1: ⏲: -> candidate (timeout)
1: Follower -> Candidate
0: 📩: grant vote for 4
1: 📩: deny vote for 4
2: 📩: grant vote for 4
4: 🗳: -> leader (won election)
4: Candidate -> Leader 👑

every node starts in the follower state. here we see nodes four and one time out around the same time and attempt to run an election. nodes zero and two vote for node 4 giving node 4 a 3/5 majority and allowing it to become leader.

3: 📩: grant vote for 4
0: 📩: deny vote for 1
1: Candidate -> Follower

shortly after, node three gets in its vote for node 4 late and node 1 learns it has lost the election causing it to return to the follower state.

2: ⏲: -> candidate (timeout)
2: Follower -> Candidate
4: 💗: -> follower (out of sync)
4: Leader -> Follower
0: 📩: grant vote for 2
1: 📩: grant vote for 2
2: 🗳: -> leader (won election)

later, node four fails to send heartbeats fast enough and node two times out and begins an election. four learns of the election from node two, and realizing its mistake prints a heart emoji and becomes a follower. soon after, nodes zero and one vote for node two and, having a 3/5 majority, node two becomes leader.