I've spent much of my time recently in conferences, talking to customers and analysts, and realized they all were saying many of the same things about the challenges of productized, modern analytics solutions.
Digital businesses, mobility, and IoT depend on real-time actionable insights and machine learning. At the same time, traditional big data, data warehousing (DW), and business intelligence (BI) solutions have mostly worked on batch and interactive data queries. Streaming solutions and machine learning logic have been added on top of legacy architecture (and are not well-integrated), leading to complexity and sub-optimal performance.
The time is ripe for re-architecting analytics to maximize the value of machine learning and real-time streaming, drive actionable insights, and enable continuous operations.
Current Approaches Fail to Deliver Actionable Insights
Many data scientists and engineers start by collecting data from various sources, then query it to aggregate and denormalize data (group by, joins, etc.) and try to find common patterns or anomalies using various visualization tools. One you have denormalized and aggregate result set you can also use machine learning to identify patterns for you (assuming you selected the right features). This is why most big data solutions are designed for the data science exploration or reporting phase. They mostly expose SQL semantics (Hive, Spark, Impala, RedShift, and others) and their key focus is on maximizing SQL query performance and achieving interactive results by compressing data to immutable columns. Many queries still take minutes or hours to run.
The phenomenal growth in machine learning and deep learning, and the quickly growing demand for real-time actions rather than batch means the data collection and query part takes a smaller role in the overall solution. There needs to be more focus on continuous and event-driven processing; anytime an event arrives in the system, it needs to be evaluated against a fresh data model in real-time and drive an action or serve dashboards. In many production applications, data is constantly modified and searched, meaning immutable data warehousing and BI techniques (where data is updated once and queried multiple times) are not practical.
We need solutions with fast update, ingest, enrichment, and indexing support.
A common short-term solution involves creating complex big data “stacks” or “pipelines.” But these are slow and are nearly impossible to operate or govern. In many cases, projects are further complicated and slowed down due to different stacks for data research and production environments.
Figure 1: Common analytics architecture
Prevalent architecture also has data copied to a data lake periodically, using batch ETL tasks. With this process, data is compressed to immutable columnar formats such as Parquet to accelerate query performance. Events and logs are pushed through streaming storage and processed immediately with a copy stored in the data lake — they are loosely coupled with the data lake to avoid a speed mismatch (Lambda or Kappa architectures).
Data in the lake is usually unstructured, forcing long data “cleaning” and wrangling. This all leads to inaccurate results and consumes many computational resources every time data is processed — not to mention that without proper metadata, it is impossible to find relevant datasets and to secure the data.
While there’s a great deal of chatter about real-time streaming and machine learning, the dominant method for data exploration is querying data through batch or interactive queries and applying statistical modeling to the results. Stream processing is still mostly used to address faster data ingestion and transformation, or for simple tasks of evaluating events against an (old) data model to drive some immediate actions.
Machine learning usually is applied to old datasets extracted from the data warehouse or in data lakes; it is simply too difficult to learn from fresh data. AI decisions and prediction need to work at the speed of the event, limiting the decision and feature vectors to smaller datasets or short history that fit in-memory databases. This is part of the reason why many recommendation engines provide irrelevant results, mistargeted ads, or false detections of fraud.
What I noticed is that once data scientists finish the exploration and modeling phase, different teams come in to redesign for production, starting the project from scratch and using new dev tools and languages that address better error handling or higher performance.
With so many of us accustomed to running interactive queries on old data, the key challenge now is in initiating sub-second actions driven from fresh data models.
Designing Continuous Analytics From Scratch
With continuous analytics, actions and insights are delivered from fresh data in real-time for production use, as opposed to just work on interactive queries for data exploration and reports. This means eliminating complex and slow data wrangling and parsing as soon as possible so that data entering the system is machine-optimized, clean, and curated — meaning no more schema on read, dubious data, and data swamps.
Instead of periodic ETLs from external databases, this method streams data using continuous data capture (CDC) tools. That means no more time gaps and unknowns because data is synchronized between operational databases and the analytics system in real-time.
Figure 2: Continuous Analytics Architecture
When all of this happens, the focus can shift from traditional BI and DW queries to parallelism and continuous operations that address both real-time and interactive responses while streamlining operations.
With continuous analytics solution, tasks are broken into independent microservices or serverless functions which run concurrently, such as the following.
Enrichment and Denormalization
Data entering the system often requires additional context. The system may accept sensor information keyed by the sensor ID but also want manufacturer, model, or other related historical information. The same goes for correlating a user profile with their ID to serve custom content. Real-time analytics uses in-memory caching and fast random access to a real-time database to address real-time requirements. This is a huge improvement over current enrichment processes and SQL
JOIN queries, which consume much more time and resources.
Merge and Aggregate
Many analytics processes refer to a historical state, such as how many times a user performed a certain operation, the average temperature of a sensor, or the number of cars in a specific geo-location. With continuous analytics, some statistics can be aggregated and stored with the existing data, ready for immediate use or validation against predefined thresholds instead of running an expensive
GROUP BY database query.
The event data, along with the previous state, aggregate results, and enriched data together form a wide feature vector. This vector is input to an AI prediction logic using algorithms for regression, classification, etc. The prediction output results can drive an action or are updated in a real-time database which serves dashboards.
Real-Time Event Processing
Once detected, a threshold, anomaly, or an event can trigger microservices that address it by alerting a user, conducting repair operations, modifying a data model, or performing another action.
Analytics and Machine Learning
Multiple analytics and ML processes such as data exploration, model training, and reinforced learning can be fed with fresh data in the form of streams, images, or structured tables and generate new AI models frequently or continuously. Those new models are immediately shared with real-time inferencing services. Detected features can update/tag individual data records and images immediately. They can also trigger an event. Real-time analytics tasks consume enriched and aggregated results directly and don’t need to wait for query processing.
With continuous analytics, since data and its derivatives are always up-to-date, dashboards only need to read and present relevant datasets. API services map complex dashboard elements to several real-time data requests and serve them to clients through ready-to-use JSON responses.
When application microservices and analytics engines are stateless and containerized, they allow for greater elasticity, simpler deployment, and continuous development and operation. Tools such as Spark handle many of those tasks, but it is also possible to leverage emerging stream processing tools, new ML and AI libraries (such as TensorFlow), or implement event-driven tasks using “serverless” functions such as Amazon Lambda or a real-time open-source serverless platform.
A key attribute of the new architecture is that it uses a real-time unified data store to update, manipulate and query a variety of data objects simultaneously. Rather than working with multiple store classes and complicating application logic and operations, a unified store provides tiering across memory, flash and cloud storage, balancing cost vs performance requirements.
A preferred approach is to use structured/semi-structured records or streaming, where possible, and avoid unstructured data. This approach eliminates errors, provides better performance. and enables more granular security. Using transactional or atomic data update semantics allows for a consistent state across failures and software version upgrades.
Using modern orchestration platforms such as Mesos or Kubernetes, coupled with a CI/CD pipeline or an alternative public cloud offering, allows for continuous development and operations, drives agility and minimizes gaps between research, development, and production.
What’s needed are continuous applications which respond in real-time, using modern agile and cloud methodologies. Modern solutions are the ones that move away from the notion of long-running batch and interactive queries. They’re also restructuring the organizational separation between research, dev, and ops.