Streaming Data Analytics is an approach to Big Data analysis that shifts focus from systems of record (e.g., “what were last quarter's sales of product X?”) to real-time insight and action (e.g., “what is an individual customer likely to buy, and what form of engagement will best influence their behavior?”).
The field is evolving rapidly, with the Apache Software Foundation, Google, Amazon Web Services (AWS), and Microsoft Azure launching new projects and services for real-time streaming data. In this article we’ll review the architecture of Streaming Analytics solutions, outline a framework for evaluation, and suggest resources for further research.
Streaming Data and Real Time Event Processing Architecture
Streaming Analytics systems are designed to process events and deliver corresponding actions within 50 to 100 milliseconds. To meet that target they run in memory and avoid the batch queries typical of Hadoop. Streaming Analytics is widespread in financial services for fraud management, and use is growing for personalized online offers, customer engagement, and automated network services or Internet of Things (IoT) devices.
Events can begin with a social media post, a web browsing session, a wireless call or text, or a database update. Events are formatted by “event listeners” and routed to an ingestion service such as Amazon Kinesis, or directly to the event processing system. Real-time events are then combined and processed according to predefined scenarios.
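As a rough illustration of what an event listener does, the sketch below formats two differently shaped raw payloads as one common event record before routing. The `Event` fields, the listener names, and the payload keys are all hypothetical, chosen for illustration rather than taken from any particular product.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Event:
    """Common shape emitted by every listener (hypothetical fields)."""
    event_type: str
    customer_id: str
    timestamp: float  # seconds since the epoch
    attributes: Dict[str, Any] = field(default_factory=dict)


def listen_call_record(raw: Dict[str, Any]) -> Event:
    """Format a raw wireless call record as a common Event."""
    return Event(
        event_type="call_dropped" if raw["dropped"] else "call_completed",
        customer_id=raw["subscriber"],
        timestamp=float(raw["ended_at"]),
        attributes={"cell_id": raw["cell_id"]},
    )


def listen_web_session(raw: Dict[str, Any]) -> Event:
    """Format a raw web browsing record as a common Event."""
    return Event(
        event_type="page_view",
        customer_id=raw["user"],
        timestamp=float(raw["ts"]),
        attributes={"url": raw["url"]},
    )


# Two sources with different payload shapes yield the same record type.
call_event = listen_call_record(
    {"subscriber": "c-17", "dropped": True, "ended_at": 1700000000, "cell_id": "A12"}
)
web_event = listen_web_session({"user": "c-17", "ts": 1700000030, "url": "/plans"})
```

Because both listeners emit the same record type, downstream scenarios can combine a dropped call and a page view for the same customer without knowing where either one came from.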
Scenarios combine multiple events to trigger a corresponding action. Financial transactions attempted in distant locations trigger a fraud scenario, with the transactions blocked and notices sent to the account owner. A series of dropped calls can be handled according to customer segmentation and order history. Subscribers approaching contract renewal are another common use case. In all cases, scenarios combine to form a strategy for real-time customer engagement.
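The fraud scenario can be sketched in a few lines. This is a minimal illustration, not how any particular product implements it: the 500 km and one-hour thresholds, the `Transaction` fields, and the `block_and_notify` action name are assumptions, and the sketch assumes transactions arrive in timestamp order.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Transaction:
    account_id: str
    timestamp: float  # seconds since the epoch
    lat: float
    lon: float


def distance_km(a: Transaction, b: Transaction) -> float:
    """Great-circle distance between two transactions (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a.lat, a.lon, b.lat, b.lon))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


class DistantLocationScenario:
    """Fires when two transactions on one account are close in time but far apart."""

    def __init__(self, max_km: float = 500.0, window_s: float = 3600.0):
        self.max_km = max_km        # assumed threshold
        self.window_s = window_s    # assumed threshold
        self._last: Dict[str, Transaction] = {}  # in-memory state per account

    def on_event(self, tx: Transaction) -> Optional[str]:
        previous = self._last.get(tx.account_id)
        self._last[tx.account_id] = tx
        if previous is None:
            return None
        close_in_time = tx.timestamp - previous.timestamp <= self.window_s
        if close_in_time and distance_km(previous, tx) > self.max_km:
            return "block_and_notify"  # hypothetical action name
        return None


scenario = DistantLocationScenario()
first = scenario.on_event(Transaction("acct-1", 0.0, 40.71, -74.01))    # New York
second = scenario.on_event(Transaction("acct-1", 600.0, 51.51, -0.13))  # London
```

Here `first` is `None` because there is nothing to compare against, while `second` returns the action name because the two attempts are ten minutes and an ocean apart.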
Streaming Analytics Applied
Streaming analytics adds value through customer engagement, improving revenue, renewals, and brand loyalty. With a branded application we can look forward to a travel experience that includes:
- Being informed in real time of flight schedule changes. On the day of the flight, traffic patterns are used to recommend departure times and routes to the airport.
- Parking options are presented with directions to the selected parking lot.
- Travelers who miss the flight are placed on an outbound call queue for rebooking.
- Maintenance is notified when check-in kiosks are not functioning.
- Travelers are updated when baggage is loaded. In the event that baggage misses a connection, the traveler is notified and instructions for delivery are solicited without requiring a visit to “customer service.”
- On-ground maintenance is expedited, based on in-flight maintenance alerts.
- On arrival, travelers are provided with connecting flight information and directions to Uber pickup locations and other services.
Evaluating Streaming Analytics Solutions
Streaming Analytics systems are complex and depend on a broad set of critical capabilities. The Gartner Group recognizes the following vendors as leaders in Streaming Analytics: Apache Software Foundation, EVAM, Microsoft, Oracle, SAP, SAS, Software AG, and TIBCO Software (Source: Gartner Hype Cycle for Data Science, July 25, 2016).
In this short article we’ll summarize the key capabilities. All of them are important, but in my experience the approach to Scenarios is the most critical, so we’ll focus on Scenario design and management accordingly.
Business considerations include the time and cost to implement a solution. Vendors should have proven deployments with customer references, and the domain expertise to add value to your business.
Architecture and performance include support for public cloud or on-premises deployment, plug-and-play support for open source projects, and a resilient clustered architecture. The system should be capable of scaling down to departmental use or up to enterprise-wide use.
Event capture and data integration are achieved with a library of event listeners, off-the-shelf integration with legacy systems and relational data stores, and support for data enrichment (for example, with customer profile data). Integration should be available for Flume, Logstash, Kafka, and RabbitMQ.
Event processing includes sub-100-millisecond response times, delivered by a scalable, distributed in-memory engine. The system should provide a persistent event queue, support third-party query and analytics systems, and accept dynamic configuration updates without interruption. It should support flexible time windows with counts, sums, and averages, and both asynchronous and synchronous event models. Throughput should be easy to monitor with a system-wide dashboard.
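To make the time-window requirement concrete, here is a minimal sketch of a sliding window that keeps a running count, sum, and average in memory. The `SlidingWindow` class is hypothetical, and it assumes events arrive in timestamp order on a single node; a production engine would also have to handle late events and distribution across a cluster.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindow:
    """Count, sum, and average over the most recent `window_s` seconds."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self._items: Deque[Tuple[float, float]] = deque()  # (timestamp, value)
        self._total = 0.0

    def add(self, timestamp: float, value: float) -> None:
        self._items.append((timestamp, value))
        self._total += value
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop entries that are `window_s` seconds old or older.
        while self._items and self._items[0][0] <= now - self.window_s:
            _, old_value = self._items.popleft()
            self._total -= old_value

    def count(self) -> int:
        return len(self._items)

    def total(self) -> float:
        return self._total

    def average(self) -> float:
        return self._total / len(self._items) if self._items else 0.0


window = SlidingWindow(window_s=60.0)
window.add(0.0, 10.0)
window.add(30.0, 20.0)
window.add(70.0, 30.0)  # the entry at t=0 has now left the 60-second window
```

After the third call the window holds two entries, with a total of 50.0 and an average of 25.0. Keeping a running total means each aggregate is available in constant time, which matters when the response budget is under 100 milliseconds.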
Actions should be available as a library that includes email, SMS, push notifications, calls to a third-party event engine, RESTful APIs, and web services, and the library should be easily extensible with a documented SDK.
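One common way to keep an action library extensible is a registry that new handlers plug into by name. The sketch below is an assumption about how such an SDK might look, not a description of any vendor's API; `register_action`, `dispatch`, and the stub handlers are hypothetical, and the handlers only format a string where a real one would call a gateway.

```python
from typing import Callable, Dict

ActionHandler = Callable[[str, str], str]

# Registry of available actions, keyed by name.
ACTIONS: Dict[str, ActionHandler] = {}


def register_action(name: str) -> Callable[[ActionHandler], ActionHandler]:
    """Decorator that adds a handler to the action library."""
    def decorator(handler: ActionHandler) -> ActionHandler:
        ACTIONS[name] = handler
        return handler
    return decorator


@register_action("sms")
def send_sms(customer_id: str, message: str) -> str:
    # Stub: a real handler would call an SMS gateway here.
    return f"SMS to {customer_id}: {message}"


@register_action("email")
def send_email(customer_id: str, message: str) -> str:
    # Stub: a real handler would hand off to a mail service here.
    return f"Email to {customer_id}: {message}"


def dispatch(name: str, customer_id: str, message: str) -> str:
    """Look up an action by name and run it."""
    if name not in ACTIONS:
        raise KeyError(f"unknown action: {name}")
    return ACTIONS[name](customer_id, message)


result = dispatch("sms", "c-17", "Your flight is delayed")
```

Adding a push notification or a call to a third-party engine then means writing one more decorated function, with no change to the code that dispatches actions.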
Security and audit are required for deployments, with authentication and logging of all use.
Logging and Monitoring should include the logging of all events, by event ID and timestamp. Scenarios and updates should also be logged by user.
Operations and testing include a change and release management process, with changes to system configuration and new scenarios applied without system interruption. New scenarios can also be launched in a “test” mode, or tested with simulated events, where events are logged but no actions are taken.
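The “test” mode described above can be sketched as a flag that keeps logging but suppresses actions. The `ScenarioRunner` class, the dropped-call rule, and the event fields are hypothetical and serve only to illustrate the idea.

```python
from typing import Any, Callable, Dict, List, Tuple

Event = Dict[str, Any]


class ScenarioRunner:
    """Evaluates one scenario; in test mode it logs matches but never acts."""

    def __init__(self, rule: Callable[[Event], bool],
                 action: Callable[[Event], None], test_mode: bool = False):
        self.rule = rule
        self.action = action
        self.test_mode = test_mode
        self.log: List[Tuple[str, bool]] = []  # (event ID, rule matched?)

    def handle(self, event: Event) -> None:
        matched = self.rule(event)
        self.log.append((event["id"], matched))  # every event is logged
        if matched and not self.test_mode:
            self.action(event)  # only a live scenario takes action


sent: List[Event] = []  # stands in for a real outbound channel
runner = ScenarioRunner(
    rule=lambda e: e["dropped_calls"] >= 3,
    action=sent.append,
    test_mode=True,
)
simulated = [{"id": "e1", "dropped_calls": 1}, {"id": "e2", "dropped_calls": 4}]
for event in simulated:
    runner.handle(event)
```

After the simulated run, `runner.log` shows that `e2` would have triggered the scenario while `sent` is still empty, so the scenario can be reviewed before it ever reaches a customer.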
Analytics: the system should include persistent stores for analytics, with predefined and customizable views, and be open to integration with third-party analytics systems such as R, MOA, or H2O. The system should also be capable of generating real-time alerts to users. The platform should support extensibility with optional modules, such as Frequent Pattern Analysis, Enhanced Real-Time clustering, and other analytical methods.
Business or technical events: scenario design and management
The ability to implement and update scenarios quickly, and to scale management to scores of scenarios, is determined by how events are designed. Most systems build on technical events (e.g., an update to a database). Scenarios are logical combinations of technical events, assembled as code by a programmer. Adding new scenarios or updating existing ones then depends on the speed and availability of a programmer who is intimately familiar with the system design. This is, unfortunately, the common denominator today. As we’ll discuss below, this approach is difficult to scale as scenarios grow in number.
Alternatively, events are qualified and exposed as a catalog of business events (e.g., a “new customer” or “dropped call”). Business events are combined into scenarios by marketing professionals and other non-programmers, using visual designers or simple languages. Scenarios built on business events are easily reviewed by management as part of release management; new scenarios can be implemented in minutes, and existing ones quickly updated.
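The difference between the two approaches can be shown with a small sketch. Technical events (database updates) are qualified once into named business events, and a scenario is then plain data that refers to those names. The catalog entries, table names, and scenario fields below are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional

TechnicalEvent = Dict[str, Any]

# Catalog: business event name -> rule that recognizes it in a technical event.
# A programmer maintains this once; scenarios never touch these details.
CATALOG: Dict[str, Callable[[TechnicalEvent], bool]] = {
    "new_customer": lambda e: e["table"] == "customers" and e["op"] == "insert",
    "dropped_call": lambda e: e["table"] == "calls" and e.get("status") == "dropped",
}


def qualify(event: TechnicalEvent) -> List[str]:
    """Return the business events that a technical event represents."""
    return [name for name, matches in CATALOG.items() if matches(event)]


# A scenario expressed as data, in business terms a non-programmer can review.
SCENARIO = {
    "name": "welcome_offer",
    "when": "new_customer",
    "action": "send_welcome_email",
}


def run_scenario(scenario: Dict[str, str], event: TechnicalEvent) -> Optional[str]:
    """Return the scenario's action if the event qualifies, else None."""
    return scenario["action"] if scenario["when"] in qualify(event) else None


action = run_scenario(SCENARIO, {"table": "customers", "op": "insert"})
```

Because the scenario is data rather than code, a visual designer or simple language could produce it, and a reviewer can read it without knowing which tables sit behind “new_customer”.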
As the number of scenarios grows they tend to overlap, and several scenarios can be triggered within a short time frame. This “cascade” of events is common, and can inundate customers with a deluge of uncoordinated actions. To avoid this, it’s important for systems to support scenario prioritization and constraints. Scenarios built on business events can be easily compared to detect overlap, and scenarios can be prioritized. Finally, systems should also support user-action constraints, so that users are not subjected to more than a configured maximum number of actions.
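Prioritization and user-action constraints can be combined in one small component. The sketch below is an illustration under stated assumptions: the `ActionGovernor` class is hypothetical, a lower number means a higher priority, and the limit is applied per customer over a rolling period, which is one reasonable reading of the constraint rather than the only one.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ActionGovernor:
    """Caps actions per customer and keeps the highest-priority ones."""

    def __init__(self, max_actions: int, period_s: float):
        self.max_actions = max_actions
        self.period_s = period_s
        self._sent: Dict[str, List[float]] = defaultdict(list)  # send times

    def select(self, customer_id: str, now: float,
               triggered: List[Tuple[int, str]]) -> List[str]:
        """Pick actions from (priority, action) pairs; lower number wins."""
        # Keep only the sends that still fall inside the rolling period.
        recent = [t for t in self._sent[customer_id] if t > now - self.period_s]
        budget = max(0, self.max_actions - len(recent))
        chosen = [action for _, action in sorted(triggered)[:budget]]
        self._sent[customer_id] = recent + [now] * len(chosen)
        return chosen


governor = ActionGovernor(max_actions=2, period_s=86400.0)  # two per day
cascade = [(3, "upsell_offer"), (1, "fraud_alert"), (2, "renewal_reminder")]
first = governor.select("c-17", 0.0, cascade)
second = governor.select("c-17", 3600.0, [(1, "fraud_alert")])
```

In this run `first` keeps only the two highest-priority actions from the cascade, and `second` is empty because the customer's daily budget is already spent. A real system would likely exempt critical actions such as fraud alerts from the cap; that policy choice is left out here for brevity.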
There are a lot of moving parts in a Streaming Analytics system, but these systems are among the most practical Big Data solutions. I took part in a pilot for a global wireless carrier, involving millions of customers, that was implemented in one month and hosted on a single AWS machine. Unlike most other Big Data strategies, a pilot of Streaming Analytics delivers immediate and easily recognized business value. The pilot system drove improved customer renewal rates, and the system has been rapidly expanded to a global footprint.
Many of the vendors that Gartner recognizes as leaders in Streaming Analytics offer good resources. I recently co-authored a comprehensive Guide to Evaluating Streaming Analytics solutions, which is available for download from EVAM, who co-sponsored the Guide: http://www.evam.com/evaluation-guide-to-streaming-analytics.