The decay of clusters
Posted: March 6, 2011
This post follows an earlier post motivating a statistical comparison of the performance of MapReduce and parallel databases in the face of failures.
I rarely paid attention in 10th grade Chem, but radioactivity was too cool to sleep through, and so I still remember this: a radioactive Carbon-14 nucleus is unstable and unpredictable. Eventually it disintegrates into a more stable nucleus, but always with a bang (it emits radiation). We can’t predict when any one atom in a lump of Carbon-14 will decay, but we can predict the collective decay rate of the lump: in about 5,700 years, half of the Carbon-14 atoms in it will have disintegrated. Ten years later, I see the relevance of high-school Chem to cluster computing: a cluster of machines is not too different from a lump of Carbon-14. Sys-admins can’t really predict which machine will fail in the next second. With experience, though, they can say how many machines might fail in a day; they can estimate the cluster’s decay rate.
At the heart of machine decay is an exponential random process that decides, every second, whether a machine dies or lives for another second, independently of the other machines. This independence assumption might hold for a single machine whose failures are caused by hardware faults (disk crashes, memory corruption, etc.) or software bugs. There are, however, larger, correlated failures that affect more than one machine in a cluster: an intern trips over a wire and disconnects an entire rack from the rest of the cluster; a cooling fan breaks down and a whole warehouse section is switched off; it’s time to upgrade the software and several machines go down for scheduled maintenance. For now, let’s keep things simple and radioactive and ignore these correlated failures.
A machine i has a failure probability at every second†: λ_i. Sys-admins know that some brands are more likely to fail than others, so each machine has a different λ_i. Older models have, surprisingly, lower failure probabilities (survival of the fittest works here). Every two to five years, however, machines are upgraded to newer models. Out-of-the-box faulty machines die right away in the pre-integration testing phase and so never make it to the cluster. So overall, λ_i falls in a tight range, and we can assume all machines have an identical per-second failure probability of λ_m. What is the probability distribution of the time to failure? Here we use the exponential decay distribution that models radioactive decay:
f(t) = λ_m e^(−λ_m t)    (1)
This equation says that the probability density of failure right now is λ_m, while at exactly 10 seconds from now it is λ_m e^(−λ_m·10). The further we peer into the future, the less likely failure at that exact instant becomes, but every second that passes we wipe the slate clean and start over. Exponential distributions have no memory: each second is independent of the previous one, and the probability of failing in the next second stays constant.
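The memoryless property is easy to check empirically. Below is a small Python sketch (the rate λ, the seed, and the sample count are arbitrary choices for illustration, not values from the post) that draws exponential lifetimes and compares the conditional survival probability P(T > s + t | T > s) with the unconditional P(T > t):

```python
import math
import random

# Hypothetical per-second failure rate; any small value works.
LAM = 1e-3

random.seed(42)
samples = [random.expovariate(LAM) for _ in range(200_000)]

def survival(t):
    """Empirical P(T > t) estimated from the samples."""
    return sum(1 for x in samples if x > t) / len(samples)

# Memorylessness: P(T > s + t | T > s) should equal P(T > t).
s, t = 500.0, 300.0
conditional = survival(s + t) / survival(s)
unconditional = survival(t)
print(conditional, unconditional, math.exp(-LAM * t))
```

Both empirical estimates should land near e^(−λt): conditioning on having already survived s seconds does not change the distribution of the remaining lifetime.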
The probability of a machine failing before t seconds from now:
F(t) = P(T ≤ t) = ∫₀^t λ_m e^(−λ_m x) dx = 1 − e^(−λ_m t)    (2)
and the probability the machine stays alive for t seconds from now is:
R(t) = P(T > t) = 1 − F(t) = e^(−λ_m t)    (3)
From these equations, we can derive properties like the expected time to failure of each machine, τ_m:
τ_m = E[T_f] = ∫₀^∞ x λ_m e^(−λ_m x) dx = 1/λ_m    (4)
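Equations (2)–(4) are easy to sanity-check by simulation. A minimal Python sketch, assuming a hypothetical rate of one expected failure per day (the rate and sample count are illustrative):

```python
import math
import random

# Hypothetical per-second rate: one expected failure per day.
LAM_M = 1.0 / 86_400

random.seed(0)
lifetimes = [random.expovariate(LAM_M) for _ in range(100_000)]

# Equation (4): the mean time to failure should be close to 1/λ_m.
mean_ttf = sum(lifetimes) / len(lifetimes)

# Equation (3): the empirical survival fraction at t should match e^(−λ_m t).
t = 43_200  # half a day, in seconds
empirical_R = sum(1 for x in lifetimes if x > t) / len(lifetimes)

print(mean_ttf, 1.0 / LAM_M)
print(empirical_R, math.exp(-LAM_M * t))
```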
Generally, we estimate τ_m by measuring the average interval between failures, or the mean time between failures (MTBF). Disks have an MTBF of 60 years. There is a direct relation between τ_m and half-life: in a population of 1000 disks, after about 41 years half the disks will be non-functional. Disk failure is just one of the many sources of machine failure. The beauty of exponential distributions is that if different failure sources (disk, memory, software, etc.) each have an exponential time-to-failure distribution, then the minimum time to failure across all these sources is also exponentially distributed. We use this property to extend the failure model from machines to clusters.
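The half-life figure follows directly from equation (3): solving e^(−t/τ_m) = 1/2 gives t = τ_m ln 2. With the 60-year disk MTBF quoted above:

```python
import math

MTBF_YEARS = 60  # disk MTBF from the text
half_life = MTBF_YEARS * math.log(2)  # the t at which R(t) = 1/2
print(round(half_life, 1))  # → 41.6 years
```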
What is the probability that no machine in a cluster of N machines fails in the next t seconds?
R_c(t) = ∏ᵢ₌₁..ᴺ P_i(T > t) = ∏ᵢ₌₁..ᴺ e^(−λ_m t) = e^(−N λ_m t)    (5)
From (5), we see that the per-second failure rate of the cluster, λ_c, is given by λ_c = Σ λ_m = N λ_m. The expected time to failure of the cluster, τ_c, is then 1/λ_c. We can reverse-engineer an exponential failure model for Google’s clusters from their 2004 MapReduce paper. They estimate 1.2 machine failures per query, with each query running for 632 seconds over 157 machines.
N = 157
τ_c = 632/1.2 ≈ 527 s ≈ 9 minutes
λ_c = 1/τ_c = 1.2/632 s⁻¹
λ_m = λ_c/N ≈ 1.20 × 10⁻⁵ s⁻¹
τ_m = 1/λ_m ≈ 23 hours
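The derivation above is short enough to script. A Python sketch using the numbers quoted from the MapReduce paper:

```python
N = 157                    # machines per query
FAILURES_PER_QUERY = 1.2   # observed machine failures per query
QUERY_SECONDS = 632        # query duration

tau_c = QUERY_SECONDS / FAILURES_PER_QUERY  # expected seconds between cluster failures
lam_c = 1.0 / tau_c                         # cluster failure rate, s^-1
lam_m = lam_c / N                           # per-machine failure rate, s^-1
tau_m = 1.0 / lam_m                         # per-machine expected time to failure, s

print(tau_c / 60)     # → ≈ 8.8 minutes
print(tau_m / 3600)   # → ≈ 23 hours
print(tau_m / 1000)   # τ_c for a 1000-machine cluster → ≈ 83 seconds
```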
With Google’s clusters, a machine fails on average once a day! If there are 1000 machines in the cluster, N = 1000, τ_c drops to around 80 seconds: a failure every minute and a half. It is important to realize that most of these failures are transient (e.g. the machine becomes unreachable because of an overload, a reboot, or a loose connection), and the machine usually recovers within seconds to hours‡.
Another cluster property we can derive is the probability that v failures occur within t seconds. We use a discrete Poisson distribution to count the number of failures:
P(v failures in t seconds) = (t λ_c)^v e^(−t λ_c) / v!
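The Poisson count takes a few lines of Python. Here it is evaluated with the cluster rate reverse-engineered above (the choice of v values is illustrative):

```python
import math

def p_failures(v, t, lam_c):
    """P(exactly v failures in t seconds) for a Poisson count with rate lam_c."""
    mu = t * lam_c  # expected number of failures in t seconds
    return (mu ** v) * math.exp(-mu) / math.factorial(v)

lam_c = 1.2 / 632  # cluster failure rate from the Google numbers, s^-1

# Failure counts over one 632-second query (expected count: 1.2):
print(round(p_failures(0, 632, lam_c), 3))  # → 0.301, query sees no failures
print(round(p_failures(1, 632, lam_c), 3))  # → 0.361
print(round(p_failures(2, 632, lam_c), 3))  # → 0.217
```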
Coming up next, I put this simple cluster failure model to good use: I compare the expected query execution times of parallel databases and MapReduce in the face of failures.
[†] You could use any time unit here. Seconds are practical in that many systems detect failures within a second. Strictly speaking, the exponential distribution is continuous, so it is more appropriate to speak of the probability of failure in an infinitesimally small interval of time.
[‡] This is where the radioactive substance analogy no longer helps us. We will assume that once a machine fails it is instantaneously replaced by a working one. This assumption doesn’t hold for non-transient failures in small clusters (100 machines or less), but somewhat holds in large clusters with an excess of backup machines.