Data processing in a failing world

Posted: March 5, 2011
With petabytes of data to process, we are limited to using clusters of shared-nothing parallel machines. No single machine has the memory or processing capacity to handle such amounts of data. So we divide and conquer: we divide the processing work and data across many machines. The more data we have, the more machines we throw in: we scale horizontally instead of vertically (i.e. we add more commodity machines instead of using a single more powerful machine with more memory, more CPU power, more disk, etc). Database systems have done this since the mid-1980s, when the first horizontally-scalable parallel database, Gamma, was built. Many commercial systems followed. Database systems, however, never scaled past a hundred machines. They didn’t need to …
… until 2004. In a paper by Dean and Ghemawat, Google discussed its vision of data processing: massive clusters with thousands of machines that process petabytes of data using a simple two-step programming paradigm, MapReduce. This simple idea became today’s major parallel database competitor. MapReduce’s simplicity, charm and Google-appeal are not the reasons for its rise to fame. Rather, it’s the naive fault-tolerance model of databases.
Machines fail. Their hardware fails, their software fails, and connections to them fail. The hardware of an average commodity machine fails once every two to five years. If you have a hundred well-maintained machines, one may fail every month. With a thousand machines, one fails every hour. To see this, think about taking a sample every hour of all machines in the universe. What are the chances you find a dead (failed) machine in a sample of 100 machines? What are the chances you find a dead one in a sample of 1000 machines? The larger the sample, the larger your chances of finding a fault.
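To put rough numbers on this intuition: if each machine is independently down with some small probability p at the moment you look, the chance that a sample of n machines contains at least one dead machine is 1 − (1 − p)^n. A minimal sketch (the value of p here is purely illustrative, not a measured failure rate):

```python
def p_at_least_one_failure(n: int, p: float = 0.001) -> float:
    """Probability that a sample of n machines contains at least one
    failed machine, assuming each is independently down with probability p.
    p = 0.001 is an illustrative value, not a measured rate."""
    return 1.0 - (1.0 - p) ** n

for n in (1, 100, 1000):
    print(n, round(p_at_least_one_failure(n), 3))
# 1     -> 0.001
# 100   -> ~0.095
# 1000  -> ~0.632
```

At p = 0.001 the jump from 100 to 1000 machines takes you from roughly a 10% chance of catching a failure to a better-than-even chance.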
When a parallel database executes a query, all the machines in the cluster must process their data to completion. If even one machine fails, the entire query is aborted and restarted from scratch on the remaining machines (so long as one of the remaining machines has a complete replica of the failed machine’s data). A parallel database plays a game of Russian roulette whenever it executes a query. Every second, there is a probability of failure, a probability that the revolver fires a bullet. If unlucky, the query is killed and starts all over, the gun still pointed at it. With a hundred machines this probability is small enough that we are willing to watch the database play this game. At 1000 machines we shouldn’t be.
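The cost of this game can be sketched with a small Monte Carlo simulation. The model below is an assumption for illustration: cluster-wide failures arrive as a Poisson process with rate n/MTBF, any failure restarts the query from scratch, and the per-machine MTBF is back-solved from the post's "one failure per hour at 1000 machines" figure (about 3,600,000 seconds):

```python
import random

def expected_runtime(n_machines: int, query_secs: float, mtbf_secs: float,
                     trials: int = 20_000, seed: int = 1) -> float:
    """Monte Carlo estimate of mean wall-clock time to finish a query
    when ANY machine failure aborts it and restarts it from scratch.
    Failures are modeled as a Poisson process with cluster-wide rate
    n_machines / mtbf_secs -- an illustrative assumption."""
    rng = random.Random(seed)
    rate = n_machines / mtbf_secs
    total = 0.0
    for _ in range(trials):
        elapsed = 0.0
        while True:
            gap = rng.expovariate(rate)   # time until the next failure
            if gap >= query_secs:         # query finishes before a failure
                elapsed += query_secs
                break
            elapsed += gap                # all work so far is wasted; restart
        total += elapsed
    return total / trials

MTBF = 3_600_000  # per-machine MTBF implied by "one failure/hour at 1000 machines"
print(expected_runtime(100, 3600, MTBF))   # a one-hour query: mild penalty
print(expected_runtime(1000, 3600, MTBF))  # same query: penalty balloons
```

Under these assumptions, a one-hour query on 100 machines finishes in a little under an hour and five minutes on average, while on 1000 machines the expected time grows to well over an hour and a half, with every restart starting the clock over.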
In a way, Google created a problem — data processing on massive clusters — and had to come up with a solution: MapReduce. It took four years for the database community to realize that this solution is a threat to the parallel database industry and retaliate. In 2009, the community argued: “MapReduce needs thousands of machines because its fault-tolerance model leads to poor performance; poor performance means more machines are needed for good throughput; more machines mean more faults; and more faults justify the fault-tolerance model. We (parallel databases), however, have far better performance: we only need 100 machines to do the work of MapReduce’s 1000.” The community never bluntly said this, but it is implied.
MapReduce, unlike parallel databases, breaks down data processing into many small processing tasks over many small blocks of data. The blocks of data are replicated and distributed uniformly over the cluster: no machine holds an entire replica of all the blocks on another machine. When a machine fails, the remaining machines that have a replica of one or more of the failed machine’s data blocks take over. No complete query restart occurs; only the failed machine’s blocks are reprocessed, and the work is distributed over many machines. Machines in MapReduce have a mutual aid agreement between them, similar to agreements that exist between fire stations or utility companies: when a power station fails, surrounding power stations take over. We rarely black out and restart the entire grid. Nor do we overwhelm one backup station with all of the failed station’s load. Mutual aid against risk is an age-old concept. Islamic societies have based insurance and welfare on it for centuries. Prince Kropotkin considered mutual aid an important factor in human evolution (Mutual Aid: A Factor of Evolution, 1902). There is at least some sociological, evolutionary evidence for favoring MapReduce’s mutual-aid based fault-tolerance scheme.
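The difference between the two recovery schemes can be made concrete with back-of-the-envelope arithmetic (a sketch under simplifying assumptions: perfectly uniform block placement, no stragglers, and a failure arriving just before the job completes, which is the worst case for a full restart):

```python
def restart_penalty(n_machines: int, job_secs: float) -> tuple[float, float]:
    """Worst-case extra wall-clock time caused by a single failure.

    Full restart (parallel database): the whole job runs again, so a
    failure just before completion costs up to the full job duration.

    Mutual aid (MapReduce-style): only the failed machine's share of
    work is redone, spread evenly over the surviving machines --
    assuming its blocks are replicated uniformly across the cluster.
    """
    full_restart = job_secs
    mutual_aid = job_secs / (n_machines - 1)
    return full_restart, mutual_aid

full, aid = restart_penalty(1000, 3600)
print(full, round(aid, 2))  # 3600 seconds redone vs ~3.6 seconds per survivor
```

On a 1000-machine cluster, a full restart of a one-hour job costs another hour; mutual aid costs each surviving machine a few extra seconds.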
Russian roulette or mutual aid? The answer hinges on (i) the probability of failure and (ii) the performance, or response time, of each technology for a given data processing task in the face of failures. What’s next in this series is a lot of probability theory. I will model failures in a cluster. I will compare the performance of MapReduce and parallel databases when computing aggregates over entire datasets, using expected job completion time (i.e. on average, how long does a job take to finish when failures slow it down?) as the metric to judge the technologies. Which technology wins out? Until I answer that question, take the poll on which technology you think will eventually win out.