Distributed systems are an increasingly important topic in Computer Science. Their difficulty and immediate applicability are what make distributed systems rewarding to study and build.
The goal of this post (and future posts on this topic) is to help the reader develop a basic toolkit they can use to reason about distributed systems. The toolkit should help the reader see the well-known patterns in the specific problems they’re solving, identify the cases where others have already solved the problems they’re facing, and understand the cases where solving one hundred percent of the problem may not be worth the effort.
For the most part, a single machine is a Newtonian universe: that is, we have a single frame of reference. As a result, we can impose a total Happened-Before order on events, i.e., we can always tell that one event happened before another. Communication can happen over shared memory, access to which can be synchronized through locks and memory barriers [1].
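To make the single-machine case concrete, here is a minimal Java sketch (the class and names are purely illustrative) in which a lock both guards a shared value and establishes the Happened-Before edges that let us totally order the write and the read:

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch: on a single machine a lock gives us a total order.
// The unlock in increment() happens-before a later lock() in read(), so a
// reader can never observe the counter "out of order".
public class SharedCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private long value = 0;

    public void increment() {
        lock.lock();
        try {
            value++;          // event A
        } finally {
            lock.unlock();    // release happens-before the next acquire
        }
    }

    public long read() {
        lock.lock();
        try {
            return value;     // event B: guaranteed to see A if A came first
        } finally {
            lock.unlock();
        }
    }
}
```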
When we move to a client-server architecture, message passing is required. In the case of a single server (with one or more clients), we can still maintain the illusion of a Newtonian universe: TCP (the transport layer used by popular application protocols) guarantees that packets will be delivered to the server in the order sent by the client. As we’ll see later, this guarantee can be used as a powerful primitive upon which more complex guarantees can be built.
However, there are reasons why we no longer want to run an application on a single server: in recent times it has become the consensus that reliability, availability and scalability are best obtained using multiple machines. Mission-critical applications must at least maintain reliability and availability; in the case of consumer (and even many enterprise) web applications, with success often come scalability challenges. Thus, it’s inevitable that we leave Newton’s universe and enter Einstein’s [2].
[1] This is not to belittle the fascinating challenges of building parallel shared-memory systems: the topic is merely very well covered and outside the scope of this post. I highly recommend The Art of Multiprocessor Programming (by Maurice Herlihy) and Java Concurrency in Practice (Goetz, Lea et al.) to those interested in shared memory concurrency.
[2] The comparison with the theory of relativity is not original: Leslie Lamport and Pat Helland have both used it. Several concepts in distributed systems, such as vector clocks and Lamport timestamps, are explicitly inspired by relativity.
Suppose we have a group of (physical or logical) nodes: perhaps replicas of a partition (a.k.a. a shard) of a shared-nothing database, a group of workstations collaborating on a document, or a set of servers running a stateful business application for one specific customer. Another group of nodes (which may or may not overlap with the first) is sending messages to the first group. In the case of a collaborative editor, a sample message could be “insert this line into paragraph three of the document”. Naturally, we would like these messages delivered to all available machines in the first group.
The question is: how do we ensure that, after the messages are delivered to all machines, the machines end up in the same state? In the case of our collaborative editor application, suppose Bob is watching Alice type over her shoulder, sees her type “The”, and then types “quick brown fox” after it: we’d like all instances of the collaborative editor to say “The quick brown fox” and not “quick brown fox The”; nor do we want messages delivered multiple times, e.g., not “The The quick brown fox” and especially not “The quick brown fox The”!
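To see how easily replicas can diverge, here is a deliberately naive Java sketch (not the protocol of any real editor): each replica simply appends the fragments it receives, so two replicas that receive the same messages in different orders end up with different documents.

```java
import java.util.ArrayList;
import java.util.List;

// Toy replica: appends incoming text fragments to its local copy of the
// document. With no agreement on delivery order, replicas diverge.
public class EditorReplica {
    private final List<String> fragments = new ArrayList<>();

    public void apply(String fragment) {
        fragments.add(fragment);
    }

    public String document() {
        return String.join(" ", fragments);
    }

    public static void main(String[] args) {
        EditorReplica a = new EditorReplica();
        EditorReplica b = new EditorReplica();

        a.apply("The");             a.apply("quick brown fox");
        b.apply("quick brown fox"); b.apply("The");

        System.out.println(a.document()); // "The quick brown fox"
        System.out.println(b.document()); // "quick brown fox The" -- divergence
    }
}
```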
We’d like (or, in many cases, require) that if one of the servers goes down, the accumulated state is not lost (reliability). We’d also like to be able to view the state in the case of server failures (read availability) as well as continue sending messages (write availability). When a node fails, we’d also like to be able to add a new node to take its place (restoring its state from other replicas). Ideally, we’d like the latter process to be as dynamic as possible.
All of this should come with reasonable performance guarantees. In the case of the collaborative editor, we’d like characters to appear on the screen seemingly immediately after they are typed; in the case of the shared-nothing database, we’d like to reason about performance not too differently from how we reason about single-node database performance, i.e., determined (in terms of both throughput and latency) primarily by the CPU, memory, disks and Ethernet. In many cases we’d like our distributed systems to perform even better than analogous single-node systems (by allowing operations to be spread across multiple nodes), especially under high load.
The problem, however, is that these goals are often contradictory.
An approach commonly used to implement this sort of behavior is state machine replication, first proposed by Leslie Lamport (also known as the author of LaTeX) in the paper Time, Clocks, and the Ordering of Events in a Distributed System. The idea is that if we model each node in a distributed system as a deterministic state machine, and send the same input (messages) in the same order to each state machine, they will all end up in the same final state.
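A minimal sketch of the idea in Java, with a hypothetical key-value store standing in for the state machine (the Command and ReplicatedKv names are illustrative, not from any library): as long as every replica’s deliver method is called with the same commands in the same order, every replica holds an identical state map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Each command is a deterministic transition of the state machine.
interface Command { void applyTo(TreeMap<String, String> state); }

class Put implements Command {
    final String key, value;
    Put(String key, String value) { this.key = key; this.value = value; }
    public void applyTo(TreeMap<String, String> state) { state.put(key, value); }
}

// One replica: a log of commands plus the state produced by applying them.
class ReplicatedKv {
    private final TreeMap<String, String> state = new TreeMap<>();
    private final List<Command> log = new ArrayList<>();

    // In a real system the commands would arrive via atomic broadcast;
    // here we simply append to the local log and apply in order.
    void deliver(Command cmd) {
        log.add(cmd);
        cmd.applyTo(state);
    }

    TreeMap<String, String> snapshot() { return new TreeMap<>(state); }
}
```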
This leads to our next question: how do we ensure that the same messages are sent to each machine, in the same order? This problem is known as atomic broadcast or, more generally, atomic multicast. We should take special care to distinguish this from the IP multicast protocol, which makes no guarantees about order or reliability of messages: UDP, rather than TCP, is layered on top of it.
A better way to view atomic multicast is as a special case of the publish-subscribe pattern (used by message queuing systems such as ActiveMQ, RabbitMQ and Kafka, and by Virtual Synchrony based systems such as JGroups and Spread [3]).
A generalization of this problem is the distributed transaction problem: how do we ensure that either all the nodes execute the exact same transaction (executing all operations in the same order), or none do?
Traditionally, the two-phase commit (2PC) algorithm has been used for distributed transactions. The problem with two-phase commit is that it isn’t fault tolerant: if the coordinator node fails, the process is blocked until the coordinator is repaired (see Consensus on Transaction Commit).
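A compressed sketch of a 2PC coordinator in Java (the Participant interface is hypothetical, and persistence, timeouts and recovery are omitted) makes the weakness visible: if the coordinator crashes after collecting the votes but before announcing the outcome, participants that voted “yes” cannot unilaterally decide and must wait.

```java
import java.util.List;

// Phase 1: every participant votes; Phase 2: coordinator announces the outcome.
interface Participant {
    boolean prepare(long txId);   // phase 1: vote yes/no and hold locks
    void commit(long txId);       // phase 2a
    void abort(long txId);        // phase 2b
}

class TwoPhaseCommitCoordinator {
    boolean runTransaction(long txId, List<Participant> participants) {
        boolean allYes = true;
        for (Participant p : participants) {
            if (!p.prepare(txId)) { allYes = false; break; }
        }
        // If the coordinator dies right here, prepared participants are
        // blocked: they can neither commit nor abort on their own.
        for (Participant p : participants) {
            if (allYes) p.commit(txId); else p.abort(txId);
        }
        return allYes;
    }
}
```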
Consensus algorithms solve the problem of how multiple nodes can arrive at a commonly accepted value in the presence of failures. We can use a consensus algorithm to build fault-tolerant distributed commit protocols by (this is somewhat of an over-simplification) having the nodes “decide” whether a transaction has been committed or aborted.
[3] Virtual synchrony (making asynchronous systems appear synchronous) is itself a research topic that is closely related to, and at times complemented by, consensus work. Ken Birman’s group at Cornell has done a great deal of work on it. Unfortunately, it was difficult to work much of this fascinating research into a high level blog post.
The problem is that it’s impossible to construct a fault-tolerant consensus algorithm that will terminate within a guaranteed time bound in an asynchronous system lacking a common clock: this is known (after Fischer, Lynch and Paterson) as the FLP impossibility result. Eric Brewer’s CAP theorem (a well covered topic) can be argued to be an elegant and intuitive restatement of FLP.
In practice, consensus algorithms can be constructed with reasonable liveness properties. The FLP result does imply, however, that consensus should be limited in its applications.
One thing to note is that consensus protocols can typically handle simple or clean failures (failures of a minority of nodes), at the cost of greater latency: handling more complex scenarios (e.g., a split brain where no quorum can be reached) is more difficult.
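The standard way to tolerate a minority of failures is a majority quorum: any two sets of floor(n/2) + 1 nodes out of n share at least one member, so two conflicting decisions can never both gather a quorum. A one-method Java sketch:

```java
public class Quorum {
    // Majority quorum size for a cluster of n nodes: floor(n/2) + 1.
    // A minority of failures still leaves a quorum available; losing a
    // majority (or a split brain where no side holds a majority) blocks
    // progress rather than risking inconsistency.
    public static int quorumSize(int clusterSize) {
        return clusterSize / 2 + 1; // e.g. 2 of 3, 3 of 5
    }
}
```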
The Paxos Consensus and Commit protocols are well known and are seeing greater production use. A detailed discussion of these algorithms is outside the scope of this post, but it should be mentioned that practical Paxos implementations have somewhat modified the algorithms to allow for greater liveness and performance.
Google’s Chubby service is a practical example of a Paxos based system. Chubby provides a file-system-like interface and is meant to be used for locks, leases and leader elections. One example use of Chubby (which will be discussed in further detail in the next post) is assigning mastership of partitions in a distributed database to individual nodes.
Apache ZooKeeper is another practical example of a system built on a Paxos-like distributed commit protocol. In this case, the consensus problem is slightly modified: rather than assuming a purely asynchronous network, the TCP ordering guarantee is taken advantage of. Like Chubby, ZooKeeper exposes a file-system-like API and is frequently used for leader election, cluster membership services, service discovery and assigning ownership of partitions in shared-nothing stateful distributed systems.
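As an example, here is a condensed version of ZooKeeper’s well-known leader-election recipe, assuming a connected ZooKeeper handle and an existing /election parent znode (the path is a placeholder; error handling and the usual watch on the predecessor znode are omitted):

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElection {
    // Each candidate creates an ephemeral sequential znode; the owner of the
    // lowest sequence number is the leader. If the leader's session dies, its
    // ephemeral znode disappears and the next candidate takes over.
    public static boolean amILeader(ZooKeeper zk) throws Exception {
        String myNode = zk.create("/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        // A full recipe would also set a watch on the next-lowest znode
        // to be notified when leadership changes.
        return myNode.endsWith(children.get(0));
    }
}
```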
A question arises: why is transactional replication only used for applications such as cluster membership, leader elections and lock managers? Why aren’t these algorithms used for building distributed applications themselves, e.g., databases? Wouldn’t we all like a fully transactional, fault tolerant, multi-master distributed database? Wouldn’t we like message queues that promise to deliver exactly the same messages, to exactly the same nodes, in exactly the same order, delivering each message exactly once at the exact same time?
The above-mentioned FLP impossibility result points to one limitation of these systems: many practical systems require tight latency guarantees even in the face of machine and network failures. The Dangers of Replication and a Solution also discusses scalability issues, such as increases in network traffic and potential deadlocks, in what the authors call “anywhere-anytime-anyway transactional replication”.
In the case of Chubby and ZooKeeper, this is less of an issue: in a well designed distributed system, cluster membership and partition ownership changes are far less frequent than updates themselves (much lower throughput, less of a scalability challenge) and are less sensitive to latency. Finally, by limiting our interaction with consensus based systems, we are able to limit the impact of scenarios where consensus can’t be reached due to machine, software or network failures.
The next post will look at common alternatives to total transactional replication as well as several (relatively recent) papers and systems that do apply some transactional replication techniques at scale.