Time, Clocks, and the Ordering of Events in a Distributed System - Paper Review

This is a review of Leslie Lamport's paper Time, Clocks, and the Ordering of Events in a Distributed System. I focus on the paper's proof of conditions under which a distributed system of physical clocks can produce an ordering of events, which does not violate causality. The proof lies in the appendix of the paper, and is extremely brief: Not unlike reading a mystery novel. I expand on the paper by presenting the proof, all the math worked out, explained with a less mysterious level of brevity.


We are presented with a set of processes, PP, each PiPP_i \in P has a clock CiC_i. CiaC_i\langle a \rangle returns the timestamp of event aa occurring on PiP_i; Ci(t)C_i(t) returns the timestamp of PiP_i's clock at physical time tt. We assume all processes are moving at the same speed, so there's no need to worry about relativity.

We define a relation \rightarrow, or "happens before", as a relation on the set of events such that:

  1. If aa and bb are events occurring on PiP_i, and Cia<CibC_i\langle a \rangle < C_i \langle b \rangle, then aba \rightarrow b.

  2. If aa is the sending of a message, and bb the receiving, then aba \rightarrow b.

A nice property of \rightarrow is, aa can causally effect bb if and only if aba\rightarrow b. Because, for events aa and bb occurring on PiP_i and PjP_j, aa can only effect bb if PjP_j has learned about aa by receiving a message from PiP_i.

To break ties, we order events according to \prec.

ab={i<j,if Cia=CjbCia<Cjb,otherwise\begin{equation} a \prec b = \begin{cases} i < j, & \text{if \( C_i\langle a\rangle = C_j\langle b\rangle \)} \\ C_i\langle a\rangle < C_j\langle b\rangle, & \text{otherwise} \end{cases} \end{equation}

\prec will never violate causality, as two events occurring at the same time can't cause each other.

For our system of clocks to create an ordering of events satisfying \rightarrow, we define two clock conditions.

C1: If aa and bb are events on PiP_i, and aa comes before bb, then Cia<CibC_i\langle a\rangle < C_i\langle b\rangle.

C2: If aa is the sending of a message by process PiP_i, and bb the message's receipt by process PjP_j, then Cia<CjbC_i\langle a\rangle < C_j\langle b\rangle.

C1 ensures the first rule of \rightarrow is satisfied, and C2 the second. Thus, a system of clocks satisfying the clock conditions will order events with a \rightarrow relation.

Logical Clocks

Before considering physical clocks, we consider a simpler clock. Logical clocks satisfy the clock conditions, but don't order events in a way compatible with our notion of time.

To order events with a logical clock, processes follow two rules.

IR1: Each process PiP_i increments CiC_i between any two successive events.

IR2: (a) If an event aa is the sending of a message mm by process PiP_i, then the message contains a timestamp Tm=CiaT_m = C_i\langle a\rangle. (b) Upon receiving a message mm in event bb, a process PjP_j sets CjC_j greater than or equal to its present value, and greater than TmT_m.

Processes implementing these rules satisfy the clock conditions C1 and C2, by IR1 and IR2 respectively.

Time, and Logical Clock Time

In a sense, logical clock time and the time we are used to line up. \prec guarantees events happening before a message is sent, are ordered before events on the receiving process after the message is received. This preserves causality, as in order for an event on PiP_i to cause an event on PjP_j a message must be sent between the processes.

When events can not cause eachother, our logical clocks order events according to the number of events preceding them. This has no basis in our typical notion of time. To address this, we turn to physical clocks.

Physical Clocks

Physical clocks track time as we perceive it. A watch is a physical clock, and like a watch, physical clocks can become out of sync. This can cause causality to break; If one clock is so far ahead of another the receiving timestamp is smaller than the sending one, then receive \rightarrow send.

This section models a physical clock, derives the conditions under which causality can break, and proves what conditions will ensure it does not.

Physical Clock Conditions

To model a physical clock, we put two conditions on its behavior.

PC1: There exists a constant k1k \ll 1, such that for all ii: dCi(t)dt1<k|\frac{dC_i(t)}{dt} - 1| < k.

PC2: There exists a constant ϵ\epsilon such that, for all PiP_i, PjP_j: Ci(t)Cj(t)<ϵ|C_i(t) - C_j(t)| < \epsilon.

There are two important things to note about these conditions. First, physical clocks are not naturally imbued with them. In a real system, a process might turn off indefinitely, in which case the clock ceases advancing, and ϵ\epsilon is no longer bounded. In this sense, the math in this paper isn't easily mapped to reality. Second, while physical clocks will generally run at rates close to correct (making PC1 reasonable), they will not automatically synchronize (making PC2 unreasonable, without a synchronization algorithm).

Anomalous Clocks

Clocks become out of sync when they run at the wrong rate. Mathematically speaking we look to the derivative. For a clock running at the correct speed dC(t)dt=1\frac{dC(t)}{dt} = 1, too fast dC(t)dt>1\frac{dC(t)}{dt} > 1, and too slow dC(t)dt<1\frac{dC(t)}{dt} < 1.

Breaking Causality

For any two processes communicating, there is a lower bound on how quickly they may exchange messages. At the limit, this is the distance between the processes, divided by the speed of light. We call this lower bound μ\mu.

Our goal is to find values of ϵ\epsilon for which, for any message, sent by PiP_i at tt, and received by PjP_j at tt', Ci(t)<Cj(t)C_i(t) < C_j(t'). If this is true, \prec will satisfy C2. As μ\mu is a lower bound on message send time, t>t+μt' > t + \mu. Thus, to satisfy C2:

Ci(t)Cj(t+μ)\begin{equation} C_i(t) \leq C_j(t + \mu) \end{equation}

By PC1, the slowest possible clock advances at a rate dCj(t)dt>(1k)\frac{dC_j(t)}{dt}>(1-k). Thus, Cj(t+μ)>Cj(t)+(1k)μC_j(t+\mu) > C_j(t) + (1-k)\mu. As we are interested in solving the innequality for all possible values of Cj(t+μ)C_j(t+\mu), we plug in the lower bound into the right side of (2), yielding a bound on ϵ\epsilon.

Ci(t)Cj(t+μ)Ci(t)Cj(t)+(1k)μCi(t)Cj(t)(1k)μϵμ(1k)\begin{equation} \begin{aligned} C_i(t) &\leq C_j(t + \mu) \\ C_i(t) &\leq C_j(t) + (1-k)\mu \\ C_i(t) - C_j(t) &\leq (1-k)\mu \\ \epsilon &\leq \mu(1-k) \end{aligned} \end{equation}

If this inequality is true, \prec will satisfy C2. And thus, for software satisfying C1, \prec will preserve causality. If the inequality is false, it will be possible to order the receiving of a message before its sending, causing \prec to break causality.

Preserving Causality

Having discovered conditions under which causality breaks, we set about describing an algorithm which preserves causality, and prove it works. Our algorithm is the same one used for logical clocks, slightly modified for physical time.

IRP1: The physical clocks for each process are always advancing: dCi(t)dt>0\frac{dC_i(t)}{dt} > 0.

IRP2: (a) For each ii, if PiP_i sends a message mm at physical time tt, then mm contains a timestamp Tm=Ci(t)T_m = C_i(t). (b) Upon receiving a message mm at tt', process PjP_j sets Cj(t)max(Cj(t),Tm+μ)C_j(t') \geq \max(C_j(t'), T_m + \mu).

We model our processes as being arranged in a graph, the edges of which represent communication links. We make three assumptions about our graph and the behavior of its processes.

  1. The diameter, dd, of the graph is the shortest number of links which connects any two processes.

  2. For a message sent at physical time tt, and received at tt', we define the total delay vm=ttv_m = t' - t. vm<ξm+μmv_m < \xi_m + \mu_m where ξm\xi_m is the maximum unpredictable delay of the message, and μm\mu_m the minimum predictable delay for the message. For speaking about the entire graph, we define v=max(vm)v=\max(v_m), ξ=max(ξm)\xi = \max(\xi_m), and μ=max(μm)\mu = \max(\mu_m), for all possible mm.

  3. Every τ\tau seconds a message is sent along every communication link.

With these assumptions, we set about finding a value for ϵ\epsilon. To do so, we will find lower and upper bounds on Ci(t)C_i(t), then subtract them to find a value for ϵ\epsilon.

The Slowest Clock

Say process P1P_1 sends a message to P2P_2 at time t1t_1, which is received at time t2t_2. Per PC1, C2(t)C_2(t) must be at least as fast as the slowest clock.

C2(t)C2(t2)+(1k)(tt2)for tt2\begin{equation} \begin{aligned} C_2(t) \geq C_2(t_2) + (1-k)(t-t_2) \\ \quad \text{for } t \geq t_2 \end{aligned} \end{equation}

Per IRP2, after the message has been received, the clock of the receiving process must satisfy C2(t2)C1(t1)+μmC_2(t_2) \geq C_1(t_1) + \mu_m. Combining this with (4).

C2(t)C1(t1)+μm+(1k)(tt2)\begin{equation} C_2(t) \geq C_1(t_1) + \mu_m + (1-k)(t-t_2) \end{equation}

Turning our attention to the second term on the right side of (5).

(1k)(tt2)=(1k)(tt1+t1t2)=(1k)(tt1)+(1k)(t1t2)=(1k)(tt1)+(k1)(t2t1)\begin{equation} \begin{aligned} (1-k)(t-t_2) &= (1-k)(t-t_1+t_1-t_2) \\ &= (1-k)(t-t_1) + (1-k)(t_1-t_2) \\ &= (1-k)(t-t_1) + (k-1)(t_2-t_1) \end{aligned} \end{equation}

Combining (5) and (6).

C1(t1)+μm+(1k)(tt2)=C1(t1)+μm+(1k)(tt1)+(k1)(t2t1)=C1(t1)+(1k)(tt1)[(t2t1)μm]+k(t2t1)\begin{equation} \begin{aligned} & C_1(t_1) + \mu_m + (1-k)(t-t_2) \\ &= C_1(t_1) + \mu_m + (1-k)(t-t_1) + (k-1)(t_2-t_1) \\ &= C_1(t_1) + (1-k)(t-t_1) - [(t_2 - t_1) - \mu_m] + k(t_2-t_1) \end{aligned} \end{equation}

We now briefly digress into delay.

ξm=vmμm=(t2t1)μmξ\begin{equation} \begin{aligned} \xi_m &= v_m - \mu_m \\ &= (t_2-t_1) - \mu_m \\ &\leq \xi \end{aligned} \end{equation}

Combining (7) and (8).

C2(t)C1(t1)+(1k)(tt1)[(t2t1)μm]+k(t2t1)C1(t1)+(1k)(tt1)ξ+k(t2t1)C1(t1)+(1k)(tt1)ξ\begin{equation} \begin{aligned} C_2(t) &\geq C_1(t_1) + (1-k)(t-t_1) - [(t_2 - t_1) - \mu_m] + k(t_2-t_1) \\ &\geq C_1(t_1) + (1-k)(t-t_1) - \xi + k(t_2-t_1) \\ &\geq C_1(t_1) + (1-k)(t-t_1) - \xi \end{aligned} \end{equation}

(9) sets a lower bound on the receiving clock's time, in terms of the sending clock. Now, imagine a series of processes [P1,,PN][P_1,\dots,P_N]. At time tit_i, process PiP_i receives a message from process Pi1P_{i-1}, sent at time ti1t_{i-1}', where ti1titit_{i-1}' \leq t_i \leq t_i'. Consider what happens when we repeatedly apply (9) to C3C_3.

C3(t)C2(t2)+(1k)(tt2)[C1(t1)+(1k)(t2t1)ξ]+(1k)(tt2)ξ=C1(t1)+(1k)(tt1)2ξ\begin{equation} \begin{aligned} C_3(t) &\geq C_2(t_2') + (1-k)(t-t_2') \\ &\geq [C_1(t_1') + (1-k)(t_2'-t_1') - \xi] +(1-k)(t-t_2') -\xi \\ &= C_1(t_1') + (1-k)(t-t_1') - 2\xi \end{aligned} \end{equation}

Every time we apply the inequality, it subtracts ξ\xi, and decreases the time and clock subscript by one. For a process PiP_i, there is a chain of (i1)(i-1) processes leading up to it, so (9) can be applied (i1)(i-1) times to find a lower bound on PiP_i's clock.

Cn+1(t)C1(t1)+(1k)(tt1)nξfor t>tn\begin{equation} \begin{aligned} C_{n+1}(t) \geq C_1(t_1')+(1-k)(t-t_1')-n\xi \\ \quad \text{for } t > t_n \end{aligned} \end{equation}

Finally, as a cleanup step, as C1(t1)(1k)(t1t1)C_1(t_1') \geq (1-k)(t_1'-t_1), we can write (11) in terms of t1t_1.

Cn+1(t)C1(t1)+(1k)(tt1)nξfor t>tn+1\begin{equation} \begin{aligned} C_{n+1}(t) \geq C_1(t_1)+(1-k)(t-t_1)-n\xi \\ \quad \text{for } t > t_{n+1} \end{aligned} \end{equation}

You may want to take a bong rip at this point.

Now imagine this scenario, applied to our graph of processes. Recall assumptions (1) and (3), which say our graph has diameter dd, and every process sends a message to every other process every τ\tau units of time. If we imagine timestamps propagating through the graph as messages are sent, every d(τ+v)d(\tau + v) units of time, a timestamp from every process will have had the opportunity to propagate to every other process.

We can thus apply our earlier inequality for the minimum value of a clock in a series of processes, for a series of processes of length dd. For any two processes, when tt1+d(τ+v)t\geq t_1+d(\tau + v):

Ci(t)Cj(t1)+(1k)(tt1)dξfor tt1+d(τ+v)\begin{equation} \begin{aligned} C_i(t) \geq C_j(t_1)+(1-k)(t-t_1)-d\xi \\ \quad \text{for } t \geq t_1 + d(\tau + v) \end{aligned} \end{equation}

This sets a lower bound on the value of a clock, in terms of some known value of another clock and the amount of time that has passed. We are now half way to finding ϵ\epsilon. In the next section, we look for an upper bound.

The Fastest Clock

Say at time txt_x, CxC_x is the clock with the highest value. By PC1:

Ci(t)Cx(tx)+(1+k)(ttx)for ttx\begin{equation} \begin{aligned} C_i(t) \leq C_x(t_x) + (1+k)(t - t_x) \\ \quad \text{for } t \geq t_x \end{aligned} \end{equation}

However, this argument is subtly incomplete. When a process receives a message, it sets its clock's value to be Tm+μm\geq T_m + \mu_m (IRP2). Imagine PxP_x sends a message at txt_x, setting Tm=Cx(tx)T_m = C_x(t_x). Upon receiving the message at time tt', the receiver sets their clock to Tm+μmT_m + \mu_m. Is it possible for Tm+μm>Cx(tx)+(1+k)(ttx)T_m + \mu_m > C_x(t_x) + (1+k)(t'-t_x)? If so, (14) would be incorrect.

Let's examine the inequality.

Cx(tx)+(1+k)(ttx)<Tm+μmCx(tx)+(1+k)(ttx)<Cx(tx)+μm(1+k)(ttx)<μm\begin{equation} \begin{aligned} C_x(t_x) + (1+k)(t'-t_x) &< T_m + \mu_m \\ C_x(t_x) + (1+k)(t'-t_x) &< C_x(t_x) + \mu_m \\ (1+k)(t'-t_x) &< \mu_m \end{aligned} \end{equation}

The smallest value of kk is 00 which occurs if a clock is running at exactly the correct rate. We can plug this into the inequality, and say it is satisfied if, ttx<μmt'-t_x < \mu_m. This is false, as by definition μm\mu_m is less than the smallest possible transmission time for the message. Hence, (14) remains true, even when considering clocks settings themselves ahead upon receiving messages from the fastest clock.


Equation (13) holds for any i,ji, j, so setting j=xj=x, t1=txt_1=t_x, and combining (13) and (14).

Cx(tx)+(1k)(ttx)dξCi(t)Cx(tx)+(1+k)(ttx)for ttx+d(τ+v)\begin{equation} \begin{aligned} C_x(t_x) + (1-k)(t-t_x)-d\xi \leq \\ C_i(t) \leq C_x(t_x) + (1+k)(t - t_x) \\ \quad \text{for } t \geq t_x + d(\tau + v) \end{aligned} \end{equation}

This holds for all ii, and thus bounds the value of a clock. The farthest off two clocks can be, is the gap between the fastest and slowest possible clocks. That is, the difference between the upper and lower bound of (16).

Cx(tx)+(1+k)(ttx)[Cx(tx)+(1k)(ttx)dξ]=(1+k)(ttx)(1k)(ttx)+dξ=2k(ttx)+dξ\begin{equation} \begin{aligned} &C_x(t_x) + (1+k)(t - t_x) - [C_x(t_x) + (1-k)(t-t_x)-d\xi] \\ &= (1+k)(t-t_x) - (1-k)(t-t_x) + d\xi \\ &= 2k(t-t_x) + d\xi \end{aligned} \end{equation}

Thus, the maximum difference between clocks in our system is:

Ci(t)Cj(t)2k(ttx)+dξfor ttx+d(τ+v)\begin{equation} \begin{aligned} |C_i(t) - C_j(t)| \leq 2k(t-t_x) + d\xi \\ \quad \text{for } t \geq t_x + d(\tau + v) \end{aligned} \end{equation}

Notice how (18) applies to all tt, txt_x, where ttx+d(τ+v)t \geq t_x + d(\tau + v). This means, for any tt, we can find a txt_x, such that ttx=d(τ+v)t-t_x = d(\tau + v). This is the lowest value for ttxt-t_x, so we can replace the ttxt-t_x term in (18).

Ci(t)Cj(t)2k(ttx)+dξ2kd(τ+v)+dξ=d(2k(τ+v)+ξ)for tt0+d(τ+v)\begin{equation} \begin{aligned} |C_i(t) - C_j(t)| &\leq 2k(t-t_x) + d\xi \\ &\leq 2kd(\tau + v) + d\xi \\ &= d(2k(\tau + v) + \xi) \\ &\text{for } t \geq t_0 + d(\tau + v) \end{aligned} \end{equation}

This concludes our proof for a bound on ϵ\epsilon. Our algorithm takes d(τ+v)d(\tau + v) units of time to synchronize, and then the amount two clocks may be out of sync is bounded.

ϵd(2k(τ+v)+ξ)\begin{equation} \begin{aligned} \epsilon \leq d(2k(\tau + v) + \xi) \\ \end{aligned} \end{equation}


In equation (3), we found the maximum ϵ\epsilon that preserves causality. We can combine (3) and (20) investigate this further.

ϵ(1k)μd(2k(τ+v)+ξ)(1k)μ\begin{aligned} \epsilon &\leq (1-k)\mu \\ d(2k(\tau + v) + \xi) &\leq (1-k)\mu \\ \end{aligned}

All of these variables, except τ\tau describe physical properties of the system. τ\tau, the rate at which messages are sent, is a property of the software. So, we rearrange the equation in terms of τ\tau.

τ(1k)μ/dξ2kv\tau \leq \frac{(1-k)\mu/d - \xi}{2k} - v

Selecting a value for τ\tau which satisfied this inequality, would guarantee ordering events such that, if aa caused bb, aba \prec b.

Thanks to M for reviewing.