You've probably heard that MapReduce, the programming model for processing large data sets with a parallel, distributed algorithm on a cluster and the cornerstone of the Big Data explosion, was invented by Google.
What if I told you it was really invented by Julius Caesar? The famous Roman military strategist and politician is also known for several quotes attributed to him, such as "alea iacta est" or "veni, vidi, vici". Another famous one is "divide et impera" (divide and conquer), a military and political strategy based on breaking the opponent into smaller, isolated groups in order to weaken them. Divide and conquer is not limited to warfare, though: it can be applied to any field where a big problem is easier to handle once it has been split into several smaller ones. That is the case in the Big Data world as well.
So, how does it work? Or, better put, how would Julius Caesar solve a Big Data problem using divide and conquer/MapReduce?
Let's imagine Caesar arriving triumphantly in Alexandria after defeating the Egyptian army, and entering the Ancient Library. Caesar was also a great historian and writer, so he's interested in knowing how many pages of the books in the library are written in Latin, to get an idea of how deeply Latin culture is rooted in Egypt.
Now imagine Julius's face when he realizes there are thousands, even millions of copies in the library, and that a single man would need an entire lifetime to inspect all the books and add up the Latin pages in them.
Quickly, applying his military knowledge, Caesar assembles a Centuria (a unit of 80 soldiers) and instructs each member to inspect a batch of books and to report to their Centurion (the officer in charge of the Centuria), for each book, the number of pages written in Latin.
As you can see, the original big problem has been converted into several smaller ones, not only because each soldier receives a number of books significantly smaller than the whole, but also because the soldiers don't even have to perform all the required computation; they don't have to sum their counts! This task has been delegated to the Centurion, who writes in stone the counts his subordinates report to him. Only when all the soldiers have finished their batches does the officer perform the sum.
Technically speaking, what have they done? On the one hand, all 80 soldiers have performed the same Map operation, i.e. they have applied the same function to the original data, taking "books" as input and outputting "number of Latin pages" (which can be 0 if the book is not in Latin). On the other hand, the Centurion has performed a Reduce operation, which consists of aggregating the partial results of the mappers into a single one; in the case of the library, by summing the map results. Note that in this example the reduction is done by storing the counts in an "array" and adding everything up at the end of the process, but it could just as well be done by accumulating the counts in a variable as the mappers provide their outputs.
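To make the roles concrete, here is a minimal Python sketch of the library story. Everything in it is made up for illustration: the tiny `library` list, the (language, pages) book format, the three soldiers instead of 80, and all the function names. It also runs the soldiers one after another, whereas a real MapReduce cluster would run them in parallel on different machines; treat it as a toy model of the idea, not as how Hadoop or any other framework is implemented.

```python
from functools import reduce

# Hypothetical library: each book is a (language, page_count) pair.
library = [
    ("latin", 120),
    ("greek", 300),
    ("latin", 80),
    ("egyptian", 45),
    ("latin", 200),
    ("greek", 150),
]


def split_into_batches(books, n_soldiers):
    """Caesar's order: deal the books out to the soldiers, round-robin."""
    return [books[i::n_soldiers] for i in range(n_soldiers)]


def count_latin_pages(book):
    """Map function: one book in, its number of Latin pages out (0 if not Latin)."""
    language, pages = book
    return pages if language == "latin" else 0


def soldier(batch):
    """Each soldier applies the same map function to every book in the batch."""
    return [count_latin_pages(book) for book in batch]


def centurion_sum_at_end(reports):
    """Reduce, variant 1: write every count on the stone, sum once all are in."""
    stone = [count for report in reports for count in report]
    return sum(stone)


def centurion_running_total(reports):
    """Reduce, variant 2: keep a running total as the counts arrive."""
    counts = (count for report in reports for count in report)
    return reduce(lambda total, count: total + count, counts, 0)


batches = split_into_batches(library, n_soldiers=3)
reports = [soldier(batch) for batch in batches]  # sequential here, parallel on a cluster

total_at_end = centurion_sum_at_end(reports)
total_running = centurion_running_total(reports)
print(total_at_end, total_running)  # prints "400 400"
```

Both variants of the Centurion give the same answer because addition does not care about the order or grouping of the counts; that property is what lets the reducer either wait for everything or accumulate on the fly.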
Inspired by Julius Caesar? Want to put into practice the lessons this brilliant historical figure gave us? You can learn more about MapReduce and how to create your own Roman Centuria by looking for Cosmos Big Data in the FI-WARE Catalogue!
Francisco Romero Bueno
Responsible for FI-WARE Big Data Enabler