High-Availability 101

What is Availability?

Availability is the proportion of total time that the application is up and running. Its counterpart, unavailability, describes the opposite. Mathematically, availability equals (100% - unavailability). In practice, we talk about availability far more often than unavailability.

Generally, when we say up and running, we expect the application to fulfil the following requirements.

  • Every short-lived request gets a response within a short time.
  • Every time-consuming task is guaranteed to complete within a given time period.

In other words, the application remains usable, even if it has some internal errors.

It's essential to differentiate between being available and being functional. Being functional means the application provides its functions as expected. For example, if a customer tries to log in but receives an error after submitting a username and password, the application is not functional, but it's still up and running.

Sometimes people use availability and reliability interchangeably. Both are expressed as percentages. Availability is about uptime, while reliability describes the distribution of random failures. If there is no downtime and no failures, the two are the same. Failures are strongly related to downtime: if you accumulate the duration of failures over time, the sum is the downtime. As reliability increases, downtime goes down and availability goes up. However, availability sometimes covers more than reliability does. A reliable piece of software can still incur downtime, for example when a human accidentally shuts down the application.

What Are High-Availability Applications?

Today, most server-side applications need to run 24 hours a day, 7 days a week. Meanwhile, they need to handle massive numbers of concurrent visitors and huge amounts of traffic. Service providers commit to their customers that the service will stay up and running to a high standard, which is a great engineering challenge.

We specifically refer to this kind of application as a high-availability application, that is, an application that follows a set of engineering principles and practices that keep it at a high level of operational performance for a given period of time.

High-availability applications share some characteristics. They must be easy to start, stop, force-stop, and monitor. These requirements make the applications fast to recover from potential failures. It also implies that high-availability applications must persist their state to shared storage. When restarted, they can recover from the last state saved before they were taken down. The persisted data must not be corrupted and must be reliably re-readable as-is next time. We'll cover more in later discussions.
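To make that concrete, here is a minimal sketch of checkpoint-style state persistence, assuming a simple JSON file on shared storage; the path and the shape of the state are hypothetical.

```python
import json
import os

STATE_PATH = "/var/lib/myapp/state.json"   # hypothetical shared-storage location

def save_state(state: dict) -> None:
    """Persist state so that a restart can resume from it.

    Write to a temporary file first, then atomically rename it over the
    old checkpoint, so a crash mid-write never leaves corrupted data.
    """
    tmp_path = STATE_PATH + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())           # make sure the bytes reach the disk
    os.replace(tmp_path, STATE_PATH)   # atomic rename on POSIX filesystems

def load_state() -> dict:
    """Recover the last persisted state on startup, or start fresh."""
    try:
        with open(STATE_PATH) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}
```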

Why Does High-Availability Matter?

Being highly available means the application maximizes its uptime for users. There are many reasons why high availability is crucial to business owners.

A high-availability application is highly reliable. It exposes fewer failures to customers and provides a better user experience. Customers are happier, and happy customers contribute more revenue to the business.

Being highly available increases customers' confidence that the services they bought are trustworthy. It strengthens your customers' loyalty.

Being highly available reduces revenue loss. Imagine a customer shopping online when the application crashes just before they pay. For the next few hours they cannot access the website, so they buy from other websites instead.

Being highly available enhances the ability to self-heal. You don't need to fully shut down the application for maintenance. Instead, you can replace malfunctioning components one by one while keeping the application available.

However, there is another side to this: improving availability increases development and infrastructure costs.

You might want the application to provide a 100% reliable service to customers, but I doubt you will ever make it. Improving availability from 90% to 99% is relatively easy, but improving it from 99% to 99.999% is very hard. It's still doable; however, the cost increases drastically.

Besides, can a user really tell the difference between 99.9% and 99.999%? Don't forget that most people have around 8 hours of downtime in bed every day (zzzZZZ).

For certain systems, a high bar like 99.999% or 99.9999% is needed, such as air traffic control and health monitoring. But not every application requires that much availability. It isn't worthwhile if the criteria are not strict. Before setting an availability number as a goal, consider your budget, your human resources, your system complexity, and so on.

To sum up, we aim for an availability number that is both high and reasonable.

How Do We Characterize Availability?

You might be confused about our goal: what number is high enough but still reasonable? Before answering this question, let's get familiar with some basic concepts: SLI, SLO, and SLA.

A Service Level Indicator (SLI) is a measurement the service provider uses to track the availability goal. Typical SLI metrics include ping up/down, latency, throughput, error rates, etc. These metrics are often aggregated over time windows, either fixed or sliding. The aggregation can be an average, a median, a percentile, etc.
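As an illustration, here is a minimal sketch of computing an error-rate SLI over a fixed window and checking it against a target; the request log and the 1% threshold are made up for the example.

```python
from datetime import datetime, timedelta

# Hypothetical request log: (timestamp, succeeded) pairs collected by the
# monitoring pipeline.
requests = [
    (datetime(2023, 1, 1, 0, 0, 5), True),
    (datetime(2023, 1, 1, 0, 0, 9), False),
    (datetime(2023, 1, 1, 0, 1, 2), True),
]

def error_rate_sli(requests, window_start, window_size=timedelta(minutes=5)):
    """Aggregate an error-rate SLI over one fixed time window."""
    window_end = window_start + window_size
    in_window = [ok for ts, ok in requests if window_start <= ts < window_end]
    if not in_window:
        return 0.0
    failures = sum(1 for ok in in_window if not ok)
    return failures / len(in_window)

SLO_TARGET = 0.01  # example SLO: error rate <= 1% per window
sli = error_rate_sli(requests, datetime(2023, 1, 1, 0, 0, 0))
print(f"error-rate SLI = {sli:.2%}, SLO met: {sli <= SLO_TARGET}")
```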

A Service Level Objective (SLO) is an availability goal that the service provider wants to reach. It takes a form like SLI <= Target, or Min <= SLI <= Max, etc.

A Service Level Agreement (SLA) is a contract in which the service provider makes promises to customers about service availability, performance, etc. The contract includes the consequences of violating its terms, typically financial compensation but not limited to that. In short, an SLA is a public-facing SLO with explicit consequences.

Here are the relationships among the three S-concepts.

  • The service provider collects metrics based on the SLIs.
  • Then, they define thresholds on those measurements based on the SLOs.
  • Finally, they monitor the thresholds so that they won't break the SLA.

In practice, the SLIs correspond to the time-series data points produced by the monitoring system. The SLOs correspond to the alerting rules (most monitoring systems ship with alerting functionality by default). And the SLA corresponds to the amount of uptime during which the monitored metrics stay within the thresholds.

Usually the SLO and the SLA are similar, but the SLO is tighter than the SLA. SLOs are generally used for internal purposes only, while the SLA is used externally. Making an SLA requires thinking about your business and the legal impact. Picking an appropriate availability number keeps the business at a high standard yet incurs fewer penalties for an SLA breach. If the service availability violates the SLO, site operators need to react quickly to restore the application. Otherwise, the service provider will violate the SLA and might need to refund the customers.

The SLA, SLO, and SLI also imply that the application will not be available all the time. In other words, availability will not be 100%. Instead, people tend to guarantee that the system will be available above a certain number, for example 99.5%.

To sum up, we observe the application with SLIs; we monitor it with SLOs; and we promise customers a high level of availability with SLAs.

Why Do Failures Reduce The Availability?

Before answering this question, let's get familiar with yet another three concepts.

  • Fault: A fault is an event. It can be hardware malfunctioning, a software bug, or even unintended data input.
  • Error: An error is a fault that surfaces within an application's runtime. Sometimes it's called an exception. Errors can be caught and handled in code, for example with a try-catch statement.
  • Failure: A failure is an unexpected error that is not caught or handled. If an error reaches the service boundary, it prevents the application from responding normally and becomes a failure.

In short, a fault is first introduced. Then a chain of errors pops up within the application. If the errors propagate up to the application's API or UI, they become a failure. If the failure takes the application down, the availability is reduced.
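The following sketch illustrates the chain with a hypothetical request handler: the division by zero is the fault, the raised exception is the error, and an exception that escaped the handler would surface at the API boundary as a failure.

```python
def average_order_value(total: float, order_count: int) -> float:
    return total / order_count        # fault: order_count may be 0

def handle_request(total: float, order_count: int) -> dict:
    try:
        return {"status": 200, "value": average_order_value(total, order_count)}
    except ZeroDivisionError:
        # Error caught and handled: it never becomes a failure.
        return {"status": 200, "value": 0.0}

# The error is handled, so the service still responds normally.
print(handle_request(120.0, 0))
```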

Therefore, to avoid that, we must monitor failures. Below is a summary of several commonly seen kinds of failures.

  • Hardware crash or power outage.
  • Networking problem.
  • Problematic config.
  • Software bug.
  • System overload.
  • Human error in operations.

How Is Availability Calculated?

Uptime

Usually, we measure it by calculating the percentage of uptime (or downtime) over a given time period. Below is the formula.

Availability = Uptime / (Uptime + Downtime)
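As a quick sketch, the formula translates directly into code; the 30-day month and 3.6 hours of downtime are example numbers.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability = Uptime / (Uptime + Downtime)."""
    return uptime_hours / (uptime_hours + downtime_hours)

# Example: a 30-day month (720 hours) with 3.6 hours of downtime.
print(f"{availability(716.4, 3.6):.3%}")   # -> 99.500%
```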

Mean Time Between Failures

Another way to measure it is with Mean Time Between Failures (MTBF). MTBF is the predicted elapsed time between failures. It's calculated as the arithmetic mean (average) time between failures of a system. Below is the formula.

Availability = Mean Time Between Failures / (Mean Time Between Failures + Mean Time To Recover)

In general, it's much the same as the uptime-based calculation. The minor difference is that MTBF is a statistical value. The formula below converts uptime figures into MTBF.

MTBF = Sum(start of downtime - start of uptime) / Count(failures)
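Below is a minimal sketch that applies both formulas to a hypothetical incident history; the timestamps are made-up hour offsets, and MTTR is taken as the mean downtime duration.

```python
# Each tuple is (start_of_uptime, start_of_downtime, end_of_downtime) in hours.
incidents = [
    (0.0, 200.0, 201.0),
    (201.0, 450.0, 452.0),
    (452.0, 700.0, 700.5),
]

# MTBF: mean length of the uptime periods between failures.
mtbf = sum(down_start - up_start for up_start, down_start, _ in incidents) / len(incidents)
# MTTR: mean length of the downtime periods.
mttr = sum(down_end - down_start for _, down_start, down_end in incidents) / len(incidents)

availability = mtbf / (mtbf + mttr)
print(f"MTBF = {mtbf:.1f} h, MTTR = {mttr:.2f} h, availability = {availability:.3%}")
```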

Nines

Availability is generally described by the number of nines. For example, a service with 99.99% availability is described as having four nines.
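To see what each extra nine buys, here is a small sketch that converts an availability target into the downtime it allows per (non-leap) year.

```python
from datetime import timedelta

# Allowed downtime per year for common "nines" targets.
for target in (0.99, 0.999, 0.9999, 0.99999):
    allowed = timedelta(days=365) * (1 - target)
    print(f"{target * 100:g}% availability allows ~{allowed} of downtime per year")
```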

Reliability Theory

Reliability theory studies the probability that a system, which may consist of many components, will function. Different component structures influence the reliability and availability of the system.

I will not go into the scary mathematical details here, only the overall introduction. If you are interested in learning more, I recommend reading Chapter 9, Reliability Theory, from the book "Introduction to Probability Models" by Sheldon M. Ross.

An application composed of several internal components follows some simple rules that determine its availability. Different structures of these components follow different rules.

Series Structure

If the components are in sequence, then the overall availability becomes:

Availability (Total) = Availability (a) * Availability (b) * ... * Availability (z)

For example, if a web application runs on a single piece of hardware, the overall availability becomes availability(web process) * availability(hardware).
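A minimal sketch of the series rule; the 99.9% and 99.5% figures are example numbers.

```python
import math

def series_availability(*component_availabilities: float) -> float:
    """Overall availability when every component must be up."""
    return math.prod(component_availabilities)

# Example: a web process at 99.9% running on hardware at 99.5%.
print(f"{series_availability(0.999, 0.995):.3%}")   # roughly 99.4%
```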

Parallel Structure

If the components are in parallel, then the overall availability becomes:

Availability (Total) = 1 - (1-Availability(a))*(1-Availability(b))*...*(1-Availability(z))

For example, if a web application runs as two replicas on two separate pieces of hardware, the overall availability becomes 1 - (1 - availability(instance 1)) * (1 - availability(instance 2)).
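A corresponding sketch of the parallel rule, again with made-up per-replica numbers.

```python
def parallel_availability(*component_availabilities: float) -> float:
    """Overall availability when at least one component must be up."""
    unavailability = 1.0
    for a in component_availabilities:
        unavailability *= (1.0 - a)
    return 1.0 - unavailability

# Example: two replicas, each 99% available -> roughly 99.99% overall.
print(f"{parallel_availability(0.99, 0.99):.4%}")
```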

K-Out-Of-N Structure

The series and parallel structures are both special cases of a k-out-of-n structure. In a k-out-of-n structure, the system is functional if and only if at least k of the n components are functioning.

For example, etcd is a distributed key-value store that uses the Raft consensus algorithm. A minimum setup requires 3 nodes and can tolerate at most 1 node failure, meaning at least 2 of the 3 nodes must be functioning; therefore, it's a 2-out-of-3 structure. Similarly, a 5-node setup can tolerate at most 2 node failures, so it's a 3-out-of-5 structure.
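Assuming the n components are identical and fail independently, the k-out-of-n availability is a binomial tail probability; the sketch below uses that assumption with example numbers.

```python
from math import comb

def k_out_of_n_availability(n: int, k: int, p: float) -> float:
    """Probability that at least k of n identical, independent components
    (each available with probability p) are functioning."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Example: a 3-node cluster needing a 2-node quorum, 99% availability per node.
print(f"{k_out_of_n_availability(3, 2, 0.99):.4%}")   # roughly 99.97%
```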

Compound Structure

A compound structure is when a system mixes the above structures.
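For example, two web replicas in parallel placed in series with a single database form a compound structure; the sketch below combines the two rules, with made-up availability numbers.

```python
# Two web replicas in parallel, in series with one database.
web_replica = 0.99
database = 0.999

web_tier = 1 - (1 - web_replica) * (1 - web_replica)   # parallel part
overall = web_tier * database                           # series part

print(f"web tier = {web_tier:.4%}, overall = {overall:.4%}")
```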

Why Does Reliability Theory Matter?

When we design and build a high-availability application, reliability theory provides some useful guidance.

  • Components running in parallel are robust.
  • A chain of components in sequence is fragile. If we are to deliver a high-availability application with a series structure, then every component in the chain must have even higher availability. For example, an application targeting 99% will require a frontend with 99.5% availability and a backend with 99.5% availability (0.995 * 0.995 ≈ 0.99).
  • It's not crazy at all to build a high-availability application on unreliable infrastructure. It's possible to build reliable software even if the foundation layer is unreliable. We can make the underlying layer highly available even if the vendor doesn't provide us with high-availability devices or services (it never will).

References

Availability Reliability, http://www.barringer1.com/ar.htm

Scalability, Availability & Stability Patterns, https://www.slideshare.net/jboner/scalability-availability-stability-patterns/

Post mortems, https://github.com/danluu/post-mortems

Reactive manifesto, https://www.reactivemanifesto.org/glossary#Failure

Resilient Software Design in a Nutshell Presentation, https://cdn.oreillystatic.com/en/assets/1/event/263/Resilient%20software%20design%20in%20a%20nutshell%20Presentation.pdf

Towards resilient software design, https://www.infoq.com/articles/towards-resilient-software-design/

The Secret to Building Highly Available Systems, http://www.wmrichards.com/ha.pdf

An introduction to High Availability Architecture, https://www.getfilecloud.com/blog/an-introduction-to-high-availability-architecture/

How to design a high-availability application, https://softwareengineering.stackexchange.com/questions/338680/how-to-design-a-high-availability-application

High Availability Concepts and Best Practices, https://docs.oracle.com/cd/A91202_01/901_doc/rac.901/a89867/pshavdtl.htm

What is High Availability?, https://www.digitalocean.com/community/tutorials/what-is-high-availability

Four golden rules of high availability, https://medium.com/@jaredwray/four-golden-rules-of-high-availability-6ca64de57b57

High-Availability Cluster, https://en.wikipedia.org/wiki/High-availability_cluster

High-Availability, https://en.wikipedia.org/wiki/High_availability

Availability, https://en.wikipedia.org/wiki/Availability

Mean time between failures, https://en.wikipedia.org/wiki/Mean_time_between_failures