More data + Better models + More accurate metrics + Better approaches & architectures = Lots of room for improvement!!
It’s amazing to watch how quickly the data engineering / analytics / reporting / modeling / visualization toolset is evolving in the BI ecosystem. There are clearly massive foundational shifts taking place around big data. I am not sure how large conventional Fortune 500 firms can innovate and keep up with what’s going on. In some cases, I have run into CIOs who have not even heard of Hadoop.
It’s also fascinating to see how data-driven “bleeding edge” firms like Netflix are pushing the envelope. Netflix is clearly reinventing television. Binge-watching and cord-cutting are now part of our everyday lingo. What most people don’t realize is how data-driven Netflix is, from “giving viewers what they want” to “leveraging data mining to boost the subscriber base”. Viewing -> Improved Personalization -> Better Experience is the virtuous circle.
Here is a glimpse at how their BI landscape has evolved over the past five years. The figures are from a presentation by Blake Irvine, Manager of Data Science and Engineering.
BI tools @ Netflix pre-Hadoop
ETL -> Data Warehouse -> BI platform for reporting is the mainstream model today; it is typically the state of the art in most large enterprises.
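To make the classic pipeline concrete, here is a minimal, illustrative sketch of ETL -> Data Warehouse -> reporting in pure Python, with SQLite standing in for the warehouse. The table name, column names, and sample rows are all hypothetical, not from any Netflix system.

```python
import sqlite3

# Hypothetical raw source rows: (customer_id, title, minutes_watched as text)
raw_events = [
    ("c1", "Show A", "42"),
    ("c2", "Show B", "17"),
    ("c1", "Show A", "35"),
]

# Extract + Transform: cast types and pre-aggregate per (customer, title)
transformed = {}
for cust, title, minutes in raw_events:
    key = (cust, title)
    transformed[key] = transformed.get(key, 0) + int(minutes)

# Load into a warehouse fact table (SQLite standing in for the warehouse)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE viewing_facts (customer_id TEXT, title TEXT, total_minutes INTEGER)"
)
conn.executemany(
    "INSERT INTO viewing_facts VALUES (?, ?, ?)",
    [(c, t, m) for (c, t), m in transformed.items()],
)

# The BI/reporting layer then queries the warehouse
report = conn.execute(
    "SELECT title, SUM(total_minutes) FROM viewing_facts GROUP BY title ORDER BY title"
).fetchall()
print(report)  # -> [('Show A', 77), ('Show B', 17)]
```

The key property of this model: the warehouse is the single curated copy of the data, and every change to the pipeline upstream of it typically runs through IT, which is exactly the bottleneck discussed below.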
BI tools incorporating Hadoop
What was the catalyst for this shift? Complaints about slowness, growing data volume and variety, and dependence on IT for incorporating changes have made it necessary for business users to adopt new analysis and reporting tools.
This architectural change also allows multiple ways of interacting with data: interactive SQL, real-time processing, and online data processing, alongside traditional batch processing. Hadoop makes it possible to keep all data together for shared use and analysis. Apache Hive and Presto are used for running ad-hoc, lightweight aggregation and interactive analytic queries against Hadoop data sources.
To give you a sense of what bleeding-edge firms are doing with Hadoop: at Hadoop Summit 2015, Twitter claimed to have 300+ petabytes in its Hadoop clusters, including multiple clusters of 1,000+ machines each. A Hadoop cluster combines commodity servers with local storage and an open source software distribution to create a reliable distributed compute and storage platform for large data sets, scalable up to petabytes (PBs) across thousands of servers, or nodes.
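The divide-and-aggregate model behind those clusters can be sketched in a few lines of pure Python. This is the MapReduce pattern in miniature, not the Hadoop API: each “node” counts its own partition independently (map), and the partial counts are then merged (reduce). The partitions and titles are made up for illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical viewing logs, split across three "nodes" of a cluster
partitions = [
    ["Show A", "Show B", "Show A"],
    ["Show B", "Show C"],
    ["Show A"],
]

# Map phase: each node counts its local partition independently
local_counts = [Counter(p) for p in partitions]

# Reduce phase: partial results are merged into a global count
totals = reduce(lambda a, b: a + b, local_counts)
print(dict(totals))  # -> {'Show A': 3, 'Show B': 2, 'Show C': 1}
```

The design point is that the map phase needs no coordination between nodes, which is why the pattern scales out to thousands of commodity servers.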
Post Hadoop – The Future
The future increasingly seems to involve some variant of Spark complementing Hadoop. Firms like Netflix are clearly moving toward applying machine learning to big data in Hadoop. This means grabbing data, analyzing it, creating a model, and using it for predictions.
However, the toolset by itself is not enough if you don’t have a culture of analytics. Blake Irvine said their cultural values are:
- High Performance
- Freedom & Responsibility
- Highly Aligned, Loosely Coupled
Makes sense. Being a data leader requires a mix of cultural bias, team structure, innovative tools and leadership to make it happen.
Additional Notes and References
- Netflix metrics: 53+ million subscribers; 50+ countries; 2+ billion hours watched each month
- Analyzing viewing data: who, what, when, where, how long. Since 2008, Netflix streaming has expanded from thousands of customers watching occasionally to millions of customers watching billions of hours every month. Each time a customer views a title, Netflix gathers events describing that view: user-driven events like pause, resume, fast forward, and rewind, and device-driven events like network throughput traces and video quality selections. To organize, understand, and create value from these events, Netflix has built a data architecture to process them all.
- Hadoop has crossed the chasm, to use Geoffrey Moore’s term, from early adopters to mainstream adopters. Every major corporation is building some variant of a Hadoop Managed Services (HMS) platform as a service.
- According to a study by Allied Market Research, the Hadoop market is expected to grow from $2B in 2013 to more than $50B by 2020, a CAGR of 58.2 percent over that period.
- Spark was originally developed at UC Berkeley in 2009; Databricks is the firm commercializing it. Spark was initially designed for interactive queries and iterative algorithms, two major use cases not well served by batch frameworks like MapReduce. Consequently, Spark excels in scenarios that require fast performance, such as iterative processing, interactive querying, large-scale batch computation, streaming, and graph computation.
- Interesting figure from MapR that helped me understand the Apache Hadoop project chaos…
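The iterative niche mentioned in the Spark note above is worth illustrating: the working set stays in memory across passes instead of being re-read from disk each pass, as MapReduce-style batch jobs would. Below is a pure-Python sketch of a simplified PageRank-style iteration (no damping factor) over a tiny, made-up link graph; this is the algorithmic pattern, not the Spark API.

```python
# Cached "dataset": outgoing links between three hypothetical pages.
# In Spark this would be an in-memory dataset reused on every pass.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

# Simplified PageRank-style iteration (no damping factor)
ranks = {page: 1.0 for page in links}
for _ in range(50):
    contribs = {page: 0.0 for page in links}
    for page, outlinks in links.items():
        share = ranks[page] / len(outlinks)  # split rank across outlinks
        for target in outlinks:
            contribs[target] += share
    ranks = contribs  # feed this pass's result into the next pass

total = sum(ranks.values())
print({p: round(r / total, 2) for p, r in sorted(ranks.items())})
# -> {'a': 0.4, 'b': 0.2, 'c': 0.4}
```

Each of the 50 passes reuses the previous pass's output, which is exactly the access pattern that makes in-memory frameworks like Spark so much faster than disk-oriented batch frameworks for this class of algorithm.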