Originally written by Scott Jarr
The last post defined what the Corporate Data Architecture of the future will look like and how “Fast” and “Big” will work together. This one will delve into the details of how to do Fast Data right.
Many solutions are popping onto the scene from some serious tech companies, a testament to the fact that a huge problem is looming. Unfortunately, these solutions miss a huge part of the value you can get from Fast Data. If you go down these paths, you will be re-writing your systems far sooner than you thought.
I am fully convinced that Fast Data is a new frontier. It is an inevitable step when we start to deeply integrate analytics into an organization’s data management architecture.
Here’s my rationale: Applications used to be written with an operational database component. App developers rarely worried about how analytics would be performed – that was someone else’s job. They wrote the operational application.
But data has become the new gold, and application developers have realized that applications now need to interact with fast streams of data and analytics to take advantage of the data available to them. This is where Fast Data originates and why I say it is inevitable. For a refresher on data growth trends, take a look at the EMC Digital Universe report, which includes IDC research and analysis, as well as Mary Meeker’s 2013 Internet Trends report.
So, if you are going to build one of these data-driven applications that runs on streams of data, what do you need? In my work with people building these applications, it comes down to five general requirements to get it right. Sure, you can give on some, and people do. But let that decision be driven by the application’s needs, not by a limitation of the data management technology you choose.
The five requirements of Fast Data Applications are:
1. Ingest/interact with the data feed
Much of the interesting data coming into organizations today is coming fast, from more sources and at greater frequency. These data sources are often the core of any data pipeline being built. However, ingesting this data alone isn’t enough. Remember, there is an application facing the stream of data, and the ‘thing’ at the other end is usually looking for some form of interaction.
Example: VoltDB is powering a number of smart utility grid applications, including a planned rollout of 53 million meters in the UK. When you have this many meters outputting multiple sensor readings per second, you have a serious data ingestion challenge. Moreover, each reading needs to be looked at to determine the status of the sensor and whether interaction is required.
2. Make decisions on each event in the feed
Using other pieces of data to make decisions on how to respond enhances the interaction described above – it provides much-needed context to your decision. Some amount of stored data is required to make these decisions. If an event is taken only at face value, you are missing the context in which that event occurred. The ability to make better decisions because of things you may know about the entire application is lost.
Example: Our utility sensor reading becomes much more informative and valuable when I can compare a reading from one meter to 10 others connected to the same transformer to determine there is a problem with that transformer, rather than the single meter located at a home.
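To make the meter-versus-transformer comparison concrete, here is a minimal Python sketch. It is not VoltDB code; the `GridState` class, the `LOW_VOLTAGE` threshold and the majority rule are all assumptions chosen for illustration. The point is only that the decision on each reading draws on stored state about the other meters on the same transformer.

```python
from collections import defaultdict

LOW_VOLTAGE = 220.0  # hypothetical threshold in volts, for illustration only


class GridState:
    """Keeps the latest reading per meter, grouped by transformer."""

    def __init__(self):
        # transformer_id -> {meter_id: latest voltage}
        self.latest = defaultdict(dict)

    def on_reading(self, transformer_id, meter_id, voltage):
        """Store the reading, then classify it using its peers as context."""
        peers = self.latest[transformer_id]
        peers[meter_id] = voltage
        if voltage >= LOW_VOLTAGE:
            return "ok"
        # The reading is low: look at the other meters on the same transformer.
        others = [v for m, v in peers.items() if m != meter_id]
        low = sum(1 for v in others if v < LOW_VOLTAGE)
        # Assumed rule: if most peers are also low, suspect the transformer.
        if others and low * 2 > len(others):
            return "transformer"
        return "meter"
```

With ten healthy peers, a single low reading is attributed to the meter; when most peers on the transformer are also low, the same reading points at the transformer instead.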
Here’s another example that may strike closer to home. A woman is in the store shopping for bananas. If we present her with recommendations for what other shoppers purchased when they bought bananas, the recommendation would be timely, but not necessarily relevant; i.e., we don’t know if she’s buying bananas to make banana bread, or simply to serve with cereal. Likewise, if we provide her with recommendations based on aggregated purchase data, those recommendations will be relevant, but may not be personalized. Our recommendations need context to be relevant, they need to be timely to be useful, and they need to be personalized to the shopper’s needs. To accomplish all three – to do it without tradeoffs – we need to act on each event, with the benefit of context, i.e., stored data. The ability to interact with the ingest/data feed means we can know exactly what the customer wants, at the exact moment of her need.
3. Provide visibility into fast-moving data with real-time analytics
The best way to articulate what I mean by this is with a story. I remember being at the first-ever JasperWorld conference in 2011. I described to someone how you could use VoltDB to look at aggregates and dashboards of fast-moving data. He said something as simple as it was profound: “Of course, how else are you going to make any sense of data moving that fast?”
But the ability to make sense of fast-moving data extends beyond a human looking at a dashboard. One thing that makes Fast Data applications distinguishable from old-school OLTP is that real-time analytics are used in the decision-making process. By running these analytics within the Fast Data engine, operational decisions are informed by the analytics. The ability to take more than just the single event into context when making a decision makes that decision much more informed. In big data, as in life, context is everything.
Example: Keeping with our smart meter example, I am told that transformers show a particular trend prior to failure. And failure of that type of electrical componentry can be rather, um, spectacular. So, if at all possible we’d like to identify these impending failures before they actually happen. This is a classic example of a real-time analytic that is injected into a decision-making process. IF a transformer’s 30 minutes of historical data indicate it is TRENDing like THIS, THEN shut it down and re-route power.
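The IF/THEN rule above can be sketched in a few lines of Python. The post does not say what the pre-failure trend looks like, so this sketch assumes, purely for illustration, that it is a sustained rise in some measured value, detected as a least-squares slope over the last 30 minutes. `TrendWatch`, `MAX_SLOPE` and the returned action strings are hypothetical names.

```python
from collections import deque

WINDOW_SECONDS = 30 * 60  # the 30 minutes of history from the rule
MAX_SLOPE = 0.01          # hypothetical limit, in units per second


class TrendWatch:
    """Sliding 30-minute window of readings for one transformer."""

    def __init__(self):
        self.window = deque()  # (timestamp_seconds, value), oldest first

    def add(self, ts, value):
        """Record a reading, evict stale ones, and return the decision."""
        self.window.append((ts, value))
        while self.window[0][0] < ts - WINDOW_SECONDS:
            self.window.popleft()
        if self.slope() > MAX_SLOPE:
            return "shut_down_and_reroute"
        return "ok"

    def slope(self):
        """Least-squares slope of value over time within the window."""
        n = len(self.window)
        if n < 2:
            return 0.0
        mean_t = sum(t for t, _ in self.window) / n
        mean_v = sum(v for _, v in self.window) / n
        num = sum((t - mean_t) * (v - mean_v) for t, v in self.window)
        den = sum((t - mean_t) ** 2 for t, _ in self.window)
        return num / den if den else 0.0
```

The analytic (the slope) and the decision (shut down or not) run in the same place on every incoming event, which is the property the example is meant to show.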
4. Seamlessly integrate Fast Data systems into systems designed to store Big Data
We have established our belief that one size does not fit all when it comes to database technology in the 21st century. So, while a fast operational database is the correct tool for the job of managing Fast Data, other tools are best optimized for storing and deep analytic processing of the Big Data (see my previous post for details). Moving data between these systems is an absolute requirement.
However, this is much more than just data movement. In addition to the pure movement of data, the integration between Big Data and Fast Data needs to allow for:
- Dealing with the impedance mismatch between the Big system’s import capabilities and the Fast Data arrival rate;
- Reliable transfer between systems, including persistence and buffering; and
- Pre-processing of data so when it hits the Data Lake it is ready to be used (aggregating, cleaning, enriching).
Example: Fast Data coming from smart meters across an entire country accumulates quickly. This historical data has obvious value in showing seasonal trends, year-over-year grid efficiencies and the like. Moving this data to the Data Lake is critical. But validations, security checks and data cleansing can all be done before the data arrives in the Data Lake. The more this integration is baked into data management products, the less code the application architect needs to figure out (“How do I persist data if one system fails?” “Where can I overflow data if my Data Lake can’t keep up ingesting?” ...).
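As a rough illustration of the three integration needs listed above, the Python sketch below cleans events as they arrive, buffers them, and hands them to the Big Data side in batches, keeping anything unsent if the downstream system fails. Everything here is an assumption made for the example: `ExportBuffer`, `clean`, the event fields `meter_id` and `kwh`, and the `sink` callable standing in for the Data Lake loader. A real deployment would also need durable storage behind the buffer, which an in-memory queue does not provide.

```python
from collections import deque
from itertools import islice


def clean(event):
    """Return a normalized row, or None if the event fails validation."""
    meter_id = str(event.get("meter_id", "")).strip()
    kwh = event.get("kwh")
    if not meter_id or not isinstance(kwh, (int, float)) or kwh < 0:
        return None
    return {"meter_id": meter_id.upper(), "kwh": round(float(kwh), 3)}


class ExportBuffer:
    """Absorbs the fast arrival rate and exports in batches."""

    def __init__(self, sink, batch_size=3):
        self.sink = sink              # callable taking a list of rows
        self.batch_size = batch_size  # what the slower system accepts at once
        self.pending = deque()        # in-memory stand-in for a durable queue

    def offer(self, event):
        """Pre-process on arrival so only usable rows are buffered."""
        row = clean(event)
        if row is not None:
            self.pending.append(row)

    def flush(self):
        """Send batches until empty or the sink fails; return rows sent."""
        sent = 0
        while self.pending:
            batch = list(islice(self.pending, self.batch_size))
            try:
                self.sink(batch)
            except OSError:
                break  # downstream is unavailable: keep rows for a retry
            for _ in batch:
                self.pending.popleft()
            sent += len(batch)
        return sent
```

Rows leave the buffer only after the sink accepts them, so a failed export loses nothing that is still held in memory.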
5. Ability to serve analytic results and knowledge from Big Data systems quickly to users and applications, closing the data loop
The deep, insightful analytics generated by your BI reports and analyzed by data scientists need to be operationalized. This can be achieved in two ways:
- Make the BI reports consumable by more people/devices than the analytics system can support, and
- Take the intelligence from the analytics and move it into the operational system.
Number one is easy to describe. Reporting systems (e.g., data warehouses and Hadoop) do a great job generating and calculating reports. They are not designed to serve those reports to thousands of concurrent users with millisecond latencies. To meet this need, many customers are moving the results of these analytics stores to an in-memory operational component that can serve these results at Fast Data’s frequency/speed. Frankly, I suspect we will see in-memory acceleration of these analytics stores for just such a purpose in the future.
The second item is far more powerful. The knowledge we gain from all the Big Data Processing we do should inform decisions. Moving that knowledge to the operational store allows these decisions, driven by deep analytical understanding, to be operationalized for every event entering the system.
Example: If our system is working as described up to this point, we are making operational decisions on smart meter and grid-based readings. We are using data from the current month to assess trending of components, determine billing and provide grid management. We are exporting that data back to Big Data systems where scientists can explore seasonality trends, informed by data gathered about certain events.
Let’s say these exploratory analytics have discovered that, given current grid scale, if a heat wave of +10 degrees occurs during the late summer months, electricity will need to be diverted or augmented from other providers. This knowledge can now be used within our operational system so that if/when we get that +10 degree heat wave, the grid will dynamically adjust based on current data and informed by history. We have closed the loop on the data intelligence within the power grid.
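Closing the loop can be pictured as a rule that is produced offline and then evaluated on every incoming event. The Python sketch below assumes the heat-wave finding is exported as a small record; `HeatWaveRule`, `decide`, the choice of months 8 and 9 for "late summer", and the action names are hypothetical and only illustrate the shape of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeatWaveRule:
    """Knowledge exported from the Big Data side to the operational store."""
    months: frozenset     # months in which the rule applies
    delta_degrees: float  # rise above the seasonal norm that triggers it
    action: str           # what the operational system should do


def decide(rule, month, temperature, seasonal_norm):
    """Apply the historical rule to a single current reading."""
    in_season = month in rule.months
    is_heat_wave = temperature - seasonal_norm >= rule.delta_degrees
    if in_season and is_heat_wave:
        return rule.action
    return "normal_operation"


# Hypothetical rule matching the +10 degree, late-summer scenario.
rule = HeatWaveRule(
    months=frozenset({8, 9}),
    delta_degrees=10.0,
    action="divert_or_augment_power",
)
```

The rule itself comes from history, while the inputs to `decide` come from current data, so each decision combines both.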
Finally, I have seen these requirements in real deployments. No, not every customer is looking to solve all five at once. But through the course of almost every conversation I have, most points are included in the ultimate requirements document. It’s risky to gloss over these requirements; I warn people to not make a tactical decision on the Fast Data component because they think, “I only have to worry about ingesting right now”. This is a sure-fire path to refactoring the architecture, and far sooner than might otherwise be the case.
In the next post, I will address the idea of evaluating technology for the Fast Data challenge and take a specific look at why stream processing-type solutions will not solve the problem for 90% of Fast Data use cases.