
Reliability, availability and scale – an interlude

An interlude

My last post on distributed systems was dense with concepts. Before continuing with much more discussion, let’s take a quick detour and define several frequently used, but often confused, terms in distributed computing.

The term scalability is often conflated with other related, important concepts. See for example an article by 37Signals, “Don’t scale: 99.999% uptime is for Wal-Mart” — in the article, the notions of scalability and an availability SLA (which is typically stated as a percentage) are used as if they were interchangeable.

However, as we’ll see in this post, meeting one or more of these related non-functional requirements (i.e., requirements that often come after the core functionality has been implemented) does not imply meeting the others.

The non-functional requirements (or “ilities”) will be separated into three “buckets”: reliability, availability and scalability. It’s very difficult to agree on what these terms mean, but based on systems engineering practice, here’s the way that I approach it.

Reliability

In the previous post, the term “reliability” was used informally and the term “fault tolerance” was used more formally, e.g., in the discussion of the fault tolerance properties of algorithms. Rigorously speaking, fault tolerance is only a part of the reliability story: in a fault tolerant multi-component system, it is sufficient that the failure of one component doesn’t cause the failure of other components. A system that continues to function in a degraded state is fault tolerant, but unless the full functionality of the previous state can be restored, it’s not fully reliable. In other words, a reliable system must be fault tolerant, but a fault tolerant system is not necessarily reliable.

Recovery

“Recovery” refers to restoring full functionality (in this context, the state prior to the failure) when a failure occurs. Recovery is not often an explicitly stated goal, and is sometimes not included in formal definitions of reliability. However, recovery is an important consideration in the discipline of deploying and maintaining production systems. Certain design choices (e.g., not maintaining a transaction log) can hurt a system’s recovery profile despite helping scalability and improving the availability of a system.

“MTTR” stands for Mean Time To Recovery: the average time from when a failure is encountered to when the previous state is restored, i.e., a system’s recovery time.

Availability

In Tanenbaum and van Steen’s Distributed Systems: Principles and Paradigms, availability is defined as

[The] property that a system is ready to be used immediately. In general, it refers to the probability that the system is operating correctly at any given moment and is available to perform its functions on behalf of its users. In other words, a highly available system is one that will most likely be working at a given instant in time.

Here we see two definitions — the first sentence defines availability at a specific point in time, while the rest of the paragraph gives a way to characterize the overall availability of a system. Enterprise vendors frequently talk about the high availability of their solutions; however, this can mean different things.

For example, a system that goes down for a minute in the case of failure and then recovers can still be marketed as “highly available”: this could be honest marketing if the system is designed such that failures are rare, i.e., the MTBF (Mean Time Between Failures) is particularly high in relation to the MTTR.
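As a back-of-the-envelope sketch (not from the original post; the numbers are made up), steady-state availability is commonly estimated as MTBF / (MTBF + MTTR), which makes the relationship between the two concrete:

```python
# Rough illustration: estimating steady-state availability from MTBF and MTTR.
# availability = MTBF / (MTBF + MTTR) is a standard approximation; the example
# numbers below are hypothetical.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A system that fails roughly once a month (~720 hours) but recovers within a
# minute still exceeds "four nines" of availability.
print(f"{availability(720, 1 / 60):.6f}")  # ~0.999977
```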

Recently the trend has been to build systems that either maintain availability in the face of failure or recover it quickly, rather than systems with especially high MTBF. This systems engineering view is well summarized by John Allspaw in “MTTR is more important than MTBF (for most types of F)”.

For the purpose of this blog, a “median” definition will be used: a system is highly available if, in the case of failure, it can still respond within a reasonable (acceptable to the end-user) timeout.

Scalability

Scalability is a property of systems that are able to handle an increase in requests without performance degradation, e.g., in terms of latency and/or throughput. In the context of a distributed system, scalability requires that requests are handled in parallel by multiple nodes.

Note that there are multiple ways to distribute load across nodes. With a stateless system (or a system whose state can fit within a single machine’s main memory), a simple way to increase scalability is to use a high degree of replication (replicating the full instance of the service, allowing each replica to serve both reads and writes) and round-robin requests across multiple machines. In a system where state does not fit in a single machine’s main memory, scalability generally requires partitioning the data, i.e., a shared-nothing architecture.
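As a minimal sketch of the two approaches (the node names and hashing scheme are my own, purely for illustration), the key difference is whether a request can go to any replica or must be routed to the partition that owns its key:

```python
import hashlib
from itertools import cycle

nodes = ["node-a", "node-b", "node-c"]  # hypothetical service nodes

# Stateless (or fully replicated) service: any node can handle any request,
# so simple round-robin spreads the load.
round_robin = cycle(nodes)

def route_stateless(_request) -> str:
    return next(round_robin)

# Shared-nothing partitioned service: the key determines which node owns the
# data, so the router hashes the key to pick a partition.
def route_partitioned(key: str) -> str:
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

print(route_stateless(None), route_partitioned("user:42"))
```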

Soft state

In addition to stateful and stateless services, there are services that maintain soft state. “Soft state” is loosely defined as state that has several properties, including relaxed consistency semantics, and is not critical to the core of the service (although soft state may often be required for optional functionality) (Chiappa, “Soft and Hard State”). In this case, there are several options for where the soft state can be stored: in the memory of local machines (which frequently implies using sticky sessions) or in a separate system, e.g., a distributed cache. The former may imply certain scalability and availability characteristics, e.g., the possibility of hot spots in the load balancer and the need for sessions to be restarted when service nodes fail; in the latter case, the availability and scalability properties of the separate stateful system carry over to the service itself.
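A toy sketch of the two placements (illustrative only; an in-process dict stands in for both the node-local store and the distributed cache):

```python
# Option 1: soft state kept in a service node's own memory. The load balancer
# must "stick" a session to that node, and the session is lost if the node fails.
class NodeLocalSessions:
    def __init__(self):
        self._sessions = {}  # lives and dies with this process

    def get(self, session_id):
        return self._sessions.get(session_id)

    def put(self, session_id, data):
        self._sessions[session_id] = data

# Option 2: soft state kept in a separate shared store (here just a dict
# standing in for a distributed cache). Any service node can serve any request,
# and the cache's availability/scalability properties carry over to the service.
shared_cache = {}

def get_session(session_id):
    return shared_cache.get(session_id)

def put_session(session_id, data):
    shared_cache[session_id] = data
```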

Elasticity

Elasticity is a concern closely related to scalability: the ability to add or remove resources (in our case, nodes) to change a system’s capacity without downtime. A scalable system may not always be elastic, e.g., if adding a node requires taking the system down, manually moving data around, reconfiguring the system, and then starting the system up again. In other words, a scalable system without elasticity takes a hit to its availability whenever nodes need to be added or removed.

Case study: a shared-nothing database

Now that we’ve looked at these concepts in the abstract, let’s use an example: a shared-nothing database. A shared-nothing architecture means the nodes in the system don’t share memory or disk: data resides independently on the nodes, which communicate over a network (Stonebraker, The Case for Shared Nothing). The space of all possible primary keys is partitioned (a frequently used synonym for partitioning, especially when done at the application level, is “sharding”) using either hashing or range-based partitioning, such that one or more partitions can be assigned to a primary physical location.
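A minimal sketch of the two partitioning schemes (the partition count and range boundaries are invented for illustration):

```python
import hashlib

NUM_PARTITIONS = 8  # hypothetical

def hash_partition(key: str) -> int:
    """Hash partitioning: keys are spread (roughly) uniformly across partitions."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % NUM_PARTITIONS

# Range partitioning: each partition owns a contiguous slice of the key space,
# which preserves key order (useful for range scans) but can create hot spots.
RANGE_BOUNDARIES = ["g", "n", "t"]  # partitions: [..g), [g..n), [n..t), [t..]

def range_partition(key: str) -> int:
    for i, upper in enumerate(RANGE_BOUNDARIES):
        if key < upper:
            return i
    return len(RANGE_BOUNDARIES)

print(hash_partition("user:42"), range_partition("smith"))
```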

Since data is spread across several nodes, assuming a uniform key and request distribution, the system scales linearly with the number of nodes. It can also be made elastic by using consistent hashing and/or virtual partitions. For availability and reliability, different types of replication can be used, placing the data at multiple physical locations.
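A sketch of consistent hashing with virtual partitions, which is one way to get the elasticity described above — adding a node moves only a small fraction of the keys. The implementation details (MD5, 64 virtual points per node) are my own choices, not from the post:

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each physical node owns many points ("virtual partitions") on a ring;
    a key is served by the first point clockwise from the key's hash."""

    def __init__(self, nodes, vnodes_per_node=64):
        self._ring = []  # sorted list of (point, node)
        for node in nodes:
            self.add_node(node, vnodes_per_node)

    def add_node(self, node, vnodes_per_node=64):
        for i in range(vnodes_per_node):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def lookup(self, key: str) -> str:
        idx = bisect.bisect(self._ring, (_hash(key),)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
before = ring.lookup("user:42")
ring.add_node("node-d")  # elastic growth: most keys keep their previous owner
print(before, ring.lookup("user:42"))
```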

In the case of independent failures, partitioning also provides fault isolation: provided the system knows how to serve results from a partial dataset, only the partitions held by the failed nodes are affected.

What’s next?

We’re now left with an important series of questions, related to maintenance or recovery of availability (including maintaining latency) for the affected partitions in case of various failure and high-load scenarios.

Various approaches, and the systems that take them, will be discussed in the next post: “Alternatives to total transactional replication”. As this detour ends and the journey continues, pay attention to how the various theoretical approaches and real-world systems behave in the failure and high-load scenarios mentioned above.

The next post will also look at the impact (or, at times, non-impact) of scalability, availability and reliability (non-functional requirements) upon functional requirements such as support for ordered operations and atomicity.

Contributions

Thanks to Ted Nyman (@tnm), Jeff Hodges (@jmhodges), Justin Sheehy (@justinsheehy), Daniel Weinreb, Peter Alvaro, Dave Fayram (@KirinDave), Anil Madhavapeddy, Neil Conway and C. Scott Andreas (@cscotta) for proof-reading and editing this post.
