The Twelve Days of NoSQL: Day Ten: Big Data in a Nutshell
On the tenth day of Christmas, my true love gave to me
Ten lords a-leaping.
The topic of Big Data is often brought up in NoSQL discussions so let’s give it a nod. In 1998, Sergey Brin and Larry Page invented the PageRank algorithm for ranking web pages (The Anatomy of a Large-Scale Hypertextual Web Search Engine by Brin and Page) and founded Google. The PageRank algorithm required very large matrix-vector multiplications (Mining of Massive Datasets Ch. 5 by Rajaraman and Ullman) so the MapReduce technique was invented to handle such large computations (MapReduce: Simplified Data Processing on Large Clusters). Smart people then realized that the MapReduce technique could be used for other classes of problems and an open-source project called Hadoop was created to popularize the MapReduce technique (The history of Hadoop: From 4 nodes to the future of data). Other smart people realized that MapReduce could handle the operations of relational algebra such as join, anti-join, semi-join, union, difference, and intersection (Mining of Massive Datasets Ch. 2 by Rajaraman and Ullman) and began looking at the possibility of processing large volumes of business data (a.k.a. “Big Data”) better and cheaper than mainstream database management systems. Initially programmers had to write Java code for the “mappers” and “reducers” used by MapReduce. However, smart people soon realized that SQL queries could be automatically translated into the necessary Java code and “SQL-on-Hadoop” was born. Big Data thus became about processing large volumes of business data with SQL but better and cheaper than mainstream database management systems. However, the smart people have now realized that MapReduce is not the best solution for low-latency queries (Facebook open sources its SQL-on-Hadoop engine, and the web rejoices). Big Data has finally become about processing large volumes of business data with SQL but better and cheaper than mainstream database management systems and with or without MapReduce.
That’s the fast-moving story of Big Data in a nutshell.