History of Big Data: A Technical Comedy
In this post, we take a look at the history of data science and the coming of age of the big data field. Read on to learn more!
Act 1: Google Doesn’t Like Databases
All analytics were done with databases. All data was stored in databases. It was a well-known anti-pattern to store data worthy of a database in a flat file.
Google did not like databases. Maybe they tried them, and they did not work. Maybe they did work, but the company that could make them work asked for too much money. Maybe they knew databases wouldn't work. At the end of the day, they used files. Being Google, they built a big (distributed) file system. Of course, they wrote a new shiny distributed file system.
Google wanted to query their big files. Many geniuses got together and, for once, came up with the simplest solution. Where the solution came from is not clear.
Maybe they took the map and reduce operations from functional programming, and threw away other operators. Then mashed these two together and created a new operator. Maybe they took MPI, threw away all 100+ dominant operators, and kept two. Maybe one of the geniuses dreamt it up. We do not know.
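For the curious, here is roughly what those two surviving operators look like when counting words. This is a minimal single-machine sketch in Python, not Google's or Hadoop's implementation; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real framework would spread each phase across many machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework runs between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Fold all values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents (illustrative only)."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Calling `word_count(["big data", "Big files"])` groups the lowercased words and sums their counts. The only parts a user writes are the map and reduce functions; everything else is plumbing.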
They created the most straightforward solution to a complicated problem. This is unheard of and not worthy of such geniuses. Then they did another thing no one had done.
They told everyone about it. While making a lot of money, Google told everyone about part of the secret sauce. Why? We do not know. Maybe they were thinking about increasing human knowledge. Maybe they wanted others to think the way they did so they could hire them easily. Maybe they wanted the world to know that they did serious stuff. Maybe they were so far ahead that they were sure others couldn't catch up. Maybe they knew MapReduce would not work in the long run, so they wanted to send their competition on a wild goose chase. In this world of bluff and double-bluff, who knows?
Act 2: New Dreams
However, all hell broke loose. It was like when Prometheus brought fire from heaven. Well, okay, we do not know how it felt then, but I am sure it was something like this. A few people got together and implemented their solution in open source. Yahoo, the competitor to Google, helped fund some of it - and thus Hadoop was born.
People did not have use cases like Google. They did not have data like Google. Most did not have enough data to even fill in a MySQL database. Yet everybody loved MapReduce. They dreamed about a lot of data and created Big Data.
They dreamed of how one could collect data about the world, make sense of it, and change the world. Then they counted words with it. Some dug in and found some data that was a few gigabytes in size, but others couldn't even do that. So they dreamed of the day they would have a lot of data. Others figured out that scientists had been handling big data all along; for a long time, it was called scientific computing. Everyone marveled at what scientists were doing. Now it was much easier to get research grants, so scientists did not mind either. Now we really have big data (which we had all along).
Act 3: I Hate You... Sorry I Love You, SQL
Since Google had a beef with databases, someone figured the problem was SQL. They created a new kind of storage and called it NoSQL. Soon they figured out they needed a way to query their storage. Whatever they did, the queries looked like SQL. So they changed the name to Not Only SQL.
Mike Stonebraker, ten years before receiving his Turing Award, spoke out. He said, in his humble and spear-like prose, “Guys, all you do is counting and grouping. SQL can do all this and more. Just make SQL work with your glorified big files.” Of course, nobody listened.
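To see what counting and grouping looks like in SQL, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table name `words` and the function `word_count_sql` are hypothetical, and this only illustrates the SQL side of the argument on one machine, not a distributed system.

```python
import sqlite3


def word_count_sql(words):
    """Count occurrences of each word with a single GROUP BY query."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE words (word TEXT)")
    conn.executemany(
        "INSERT INTO words (word) VALUES (?)",
        [(w,) for w in words],
    )
    rows = conn.execute(
        "SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY word"
    ).fetchall()
    conn.close()
    return rows  # list of (word, count) tuples, sorted by word
```

The whole job is one declarative statement: `GROUP BY` does the grouping and `COUNT(*)` does the counting.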
Academics and investors went crazy and threw their brains and money into Hadoop.
Act 4: Continuing the Legacy
Soon came Spark, which beat Hadoop in performance by 10–20X. Bye, bye Hadoop. Wait, wait, what happens to all the investor money spent building Hadoop companies? They got together and integrated Spark. Now it is hard to tell where Hadoop ends and Spark starts. They explained how both MapReduce technologies (although MapReduce does less than 1/10 of what Spark does) and Spark are the future. Everyone was happy.
Meanwhile, Google dropped MapReduce but did not bother to tell us. To be fair, they talked about all the technologies they built instead but did not help us put 2 and 2 together. It is not like many were paying attention.
Thanks to Spark, Machine Learning (ML) took off big time. Soon data science was born. Years of old ML research came back, improvements were made, and new techniques were discovered or rediscovered. Soon it turned out Spark did not work that well with deep learning. Google and others had to create new techniques. This did not matter. Most data is small. So we can do the data science with R and Python on a single machine and be mysterious about how we can run it at scale. GAFA (the big four tech companies) kept running machine learning at large scale and told us about it. That is enough to keep the mystery and aura going. GAFA hiring everyone who could do machine learning also helped.
Act 5: Show Me the Money
With all that said, the money was in the enterprise. They already had data warehouses and BI. Big data went there and was welcomed into the fold. Well, I did not say replaced. Sometimes BI and data warehouses were just folded in and counted as analytics. Sometimes upgrading the current product took you from old technology to new technology. Sometimes old technology was replaced.
Meanwhile, SQL and NoSQL databases were merging. NoSQL databases were supporting full SQL or coming close to it. SQL databases were supporting NoSQL features. Someone should have listened to Mike, but he had grown tired of saying “I told you so.” He didn't say anything.
Act 6: Big Data Has It All
Now big data/analytics/AI has it all. Huge markets, use cases, customers, investors. All dreams have come to fruition.
But all is not well. It is tough to find people who can build these systems. It is even harder to find architects who can think them through and make them usable. Almost no one thinks about usability yet, and we are just waking up to problems like data bias. However, some systems are up, somebody must be making use of them, and somebody is getting some benefits. Who knows?
Big companies that were reframed as big data companies are not growing. All that promised growth must have gone to blockchain. Open source companies are growing by 50%, but they are too small. At the current rate, they might catch up in about 10–20 years.
Act 7: Onward Ho
Everyone is busy. There is AI, the singularity is coming, and robots are coming. What will happen when they take over our jobs? Right now, though, most are not usable, and it is much easier to use a UI to get the same thing done. However, these are details, and who has time to read the details? Who cares whether big data works or not?
There might still be time to make it work. Maybe the hard work is already done. Programmers have been trained. The new generation is being taught. More and more is being asked of analytics. We can get the end-to-end stories right, get usability right, and get the tools right. It can work.
I hope this was useful. If you enjoyed this post, you might also like my other posts, “Mastering the 4 Balancing Acts in Microservices Architecture” and “Can Middleware survive the Serverless enabled Cloud?”
I write at https://medium.com/@srinathperera.
Published at DZone with permission of Srinath Perera, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.