Parallel databases or MapReduce: your technology of choice for data processing on clusters will depend on performance. The technology that delivers a response to your information quests first wins. We could argue about the relative importance of performance vs. ease of use, but at the end of the day, faster is stronger. This is the zeitgeist of the 21st century. As Daft Punk sings it: “Work it harder, make it better; Do it faster, makes us stronger; More than ever, Hour after; Our work is never over.” So, having put this debate to rest by quoting lyrics from a pop album, let’s settle which technology is faster.
Who knew that coming up with a statistical model that fits your problem requires a bit of daydreaming, coffee, and people-watching? I got my inspiration for modeling MapReduce behavior in the face of failures from my grocer, Raj. I was trying to figure out how much free time Raj gets between customers, and whether that idle time is enough to, say, read a few pages of a book.
This post follows an earlier post motivating a statistical comparison of the performance of MapReduce and parallel databases in the face of failures.
I rarely paid attention in 10th-grade Chem, but radioactivity was too cool to sleep through, and so I still remember this: a radioactive Carbon-14 nucleus is unstable and unpredictable. Eventually it disintegrates (into a more stable nucleus), but always with a bang (it emits radiation). We can’t predict when a particular atom in a lump of Carbon-14 will decay, but we can predict the collective decay rate of that lump: in about 5,700 years, half of the Carbon-14 atoms in the lump will have disintegrated. Ten years later, I see the relevance of high-school Chem to cluster computing: a cluster of machines is not too different from a Carbon-14 lump. System admins can’t really predict which machine will fail in the next second. With experience, though, they can say how many machines might fail in a day; they can estimate the cluster’s decay rate.
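To make the analogy concrete, here is a minimal sketch in Python. It assumes what the Carbon-14 picture suggests: each machine fails independently, with an exponentially distributed lifetime. The function names and the one-year machine half-life in the usage note are hypothetical, chosen only for illustration, and are not taken from any measured cluster.

```python
import math

# Commonly cited half-life of Carbon-14, in years (roughly "57 hundred").
C14_HALF_LIFE_YEARS = 5730.0


def decay_rate(half_life):
    """Per-unit-time decay rate (lambda) implied by a half-life."""
    return math.log(2) / half_life


def surviving_fraction(rate, elapsed):
    """Fraction of the lump (or cluster) still intact after `elapsed` time."""
    return math.exp(-rate * elapsed)


def expected_failures(n_machines, rate, elapsed):
    """Expected number of machines that have failed after `elapsed` time.

    Assumes independent machines with exponential lifetimes: we cannot say
    which machine fails, only how many we expect to lose.
    """
    return n_machines * (1.0 - surviving_fraction(rate, elapsed))
```

For example, with a hypothetical machine half-life of 365 days, `expected_failures(1000, decay_rate(365.0), 1.0)` estimates how many of 1,000 machines fail in a single day, which is the kind of number an experienced admin carries around in their head.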
With petabytes of data to process, we are limited to using clusters of shared-nothing parallel machines. No single machine has the memory or processing capacity to handle such amounts of data. So we divide and conquer: we divide the processing work and data across many machines. The more data we have, the more machines we throw in: we scale horizontally instead of scaling vertically (i.e., we add more commodity machines instead of using a more powerful machine with more memory, more CPU power, more disk, etc.). Database systems have done this since 1990, when the first horizontally scalable parallel database, Gamma, was created. Many commercial systems followed. Database systems, however, never scaled past 100 machines. They didn’t need to …
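As a rough illustration of the divide-and-conquer idea, the sketch below splits records across a number of "machines" (plain Python lists here) by hashing a key, lets each one aggregate its own share, and then merges the partial results. Hash partitioning is an assumption made for the example; it is one common way to spread data in a shared-nothing setup, not necessarily what Gamma or any particular system does. The names `partition` and `count_by_key` are hypothetical.

```python
import zlib
from collections import Counter, defaultdict


def partition(records, n_machines, key=lambda record: record[0]):
    """Assign each record to one of `n_machines` shards by hashing its key.

    Records with the same key always land on the same shard, so each
    machine can work on its share without talking to the others.
    """
    shards = defaultdict(list)
    for record in records:
        key_bytes = str(key(record)).encode("utf-8")
        shards[zlib.crc32(key_bytes) % n_machines].append(record)
    return shards


def count_by_key(records, n_machines):
    """Count records per key: local counts on each shard, then a merge."""
    shards = partition(records, n_machines)
    # Each shard is processed independently (sequentially here, for brevity).
    partials = [Counter(r[0] for r in shard) for shard in shards.values()]
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

Scaling horizontally in this picture just means raising `n_machines`: the shards get smaller while the per-shard logic stays the same.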