Data Streaming Revolution
Working with data streams ensures the timely and accurate analysis that enables enterprises to harness the value of the data they work so hard to collect.
This article is featured in the DZone Guide to Big Data, Business Intelligence, and Analytics – 2015 Edition.
The rise of Apache Spark and the general shift from batch to real-time processing have disrupted the traditional data stack over the last two years. But one of the last hurdles to getting actual value out of big data is on the analytics side, where the speed of querying and visualizing big data (and how effectively those visualizations translate into business value) is still a relatively young conversation, even though 87% of enterprises believe big data analytics will redefine the competitive landscape of their industries within the next three years.
Most engineers who are using legacy business intelligence tools are finding them woefully unprepared to handle the performance load of big data, while others who may be writing their own analytics with D3.js or similar tools are wrestling with the new backend challenges of fusing real-time data with other datastores.
Let’s take a look at the megatrend towards streaming architectures, and how this is shaking up analytics requirements for developers.
Data Naturally Exists in Streams
All commerce, whether conducted online or in person, takes place as a stream of events and transactions. In the beginning, the stream was recorded in a book—an actual book that held inventories and sales, with each transaction penned in on its own line on the page. Over time, this practice evolved. Books yielded to computers and databases, but practical limitations still constrained data processing to local operations. Later on, data was packaged, written to disk, and shipped between locations for further processing and analysis. Grouping the data stream into batches made it easier to store and transport.
Technology marches on, and it has now evolved to the point that, in many cases, batching is no longer necessary. Systems are faster, networks are faster and more reliable, and programming languages and databases have evolved to accommodate a more distributed streaming architecture. For example, physical retail stores used to close for a day each quarter to conduct inventory. Then they moved to batch analysis of their various locations on a weekly basis, and later on a daily basis. Now they keep a running inventory that is accurate through the most recent transaction. There are countless similar examples across every industry.
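To make the running-inventory idea concrete, here is a minimal Python sketch. It is an illustration only: the event shape (a SKU paired with a quantity change) and the function names are assumptions made for this example, not part of any particular retail system.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (sku, quantity_change).
# Sales are negative numbers, restocks are positive.
Transaction = Tuple[str, int]


def running_inventory(
    opening: Dict[str, int], transactions: Iterable[Transaction]
) -> Iterator[Dict[str, int]]:
    """Streaming style: yield a fresh stock snapshot after every transaction."""
    stock = defaultdict(int, opening)  # copies the opening counts
    for sku, delta in transactions:
        stock[sku] += delta
        yield dict(stock)


def batch_inventory(
    opening: Dict[str, int], transactions: Iterable[Transaction]
) -> Dict[str, int]:
    """Batch style: one answer, available only after the whole batch is read."""
    stock = defaultdict(int, opening)
    for sku, delta in transactions:
        stock[sku] += delta
    return dict(stock)


if __name__ == "__main__":
    opening = {"widget": 10, "gadget": 5}
    events = [("widget", -2), ("gadget", -1), ("widget", 20)]
    for snapshot in running_inventory(opening, events):
        print(snapshot)
```

Both functions arrive at the same final counts. The difference is that the streaming version has a usable answer after every event, while the batch version has nothing to report until the batch closes.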
So Why Are Analytics and Visualizations Still in Batch Mode?
Traditional, batch-oriented data warehouses pull data from multiple sources at regular intervals, bringing it to a central location and assembling it for analysis. This practice causes data management and security headaches that grow larger over time as the number of data sources and the size of each batch grow. It takes a lot of time to export batches from the data source and import them into the data warehouse. In very large organizations, for which time is of the essence, batching can cause conflicts with backup operations. And the process of batching, transporting, and analysis often takes so much time that it becomes impossible for a complex business to know what happened yesterday or even last week.
By contrast, with streaming-data analysis, organizations know they are working with the most recent version of the data because they stream the data on demand. By tapping into data sources only when they need the data, organizations eliminate the problems that storing and managing multiple versions of data present. Data governance and security are simplified; working with streaming data means not having to track and secure multiple batches.
We live in an on-demand world. It’s time to leave behind the model of the monolithic, complex, batch-oriented data warehouse and move toward a flexible architecture built for streaming-data analysis. Working with data streams ensures the timely and accurate analysis that enables enterprises to harness the value of the data they work so hard to collect, and tap into it to build competitive advantage.
Breaking Down the Barriers to Real-Time Data Analysis
Previously, building streaming-data analysis environments was complex and costly. It took months or even years to deploy. It required expensive, dedicated infrastructure; it suffered from a lack of interoperability; it required specialized developers and data architects; and it failed to adapt to rapid changes in the database world, such as the rise of unstructured data.
In the past few years we have witnessed a flurry of activity in the streaming-data analysis space, both in terms of the development of new software and in the evolution of hardware and networking technology. Always-on, low-latency, high-bandwidth networks are less expensive and more reliable than ever before. Inexpensive and fast memory and storage allow for more efficient data analysis.
Over the same period, we’ve witnessed the rise of many easy-to-use, inexpensive, and open-source streaming-data platform components. Apache Storm, a Hadoop-compatible add-on (open-sourced by Twitter) for rapid data transformation, has been implemented by The Weather Channel, Spotify, WebMD, and Alibaba.com. Apache Spark, a fast and general engine for large-scale data processing, supports SQL, machine learning, and streaming-data analysis. Apache Kafka, an open-source message broker, is widely used for consumption of streaming data. And Amazon Kinesis, a fully managed, cloud-based service for real-time data processing over large, distributed data streams, can continuously capture large volumes of data from streaming sources.
Checklist for Developers Building Analytics and Visualizations on Top of Big Data
The past few years have been witness to explosive growth in the number of streaming-data sources and the volume of streaming data. It’s no longer enough to look to historical data for business insight. Organizations require timely analysis of streaming data from such sources as the Internet of Things (IoT), social media, location, market feeds, news feeds, weather feeds, website clickstream analysis, and live transactional data. Examples of streaming-data analytics include telecommunications companies optimizing mobile networks on the fly using network device log and subscriber location data, hospitals decreasing the risk of nosocomial (hospital-originated) infections by capturing and analyzing real-time data from monitors on newborn babies, and office equipment vendors alerting service technicians to respond to impending equipment failures.
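As a rough illustration of the equipment-failure example, the sketch below raises an alert whenever the rolling average of a sensor reading crosses a threshold. The function name, the window size, and the threshold are all hypothetical values chosen for the example; a real deployment would tune them and would likely run this logic inside a stream processor rather than a plain Python loop.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def rolling_mean_alerts(
    readings: Iterable[float], window: int = 3, threshold: float = 80.0
) -> Iterator[Tuple[int, float]]:
    """Yield (index, rolling_mean) each time the mean of the last
    `window` readings exceeds `threshold`.

    Readings are consumed one at a time, so alerts are produced as
    the data arrives instead of after a batch has been collected.
    """
    recent = deque(maxlen=window)  # keeps only the newest `window` values
    for index, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            mean = sum(recent) / window
            if mean > threshold:
                yield index, mean


if __name__ == "__main__":
    # Hypothetical temperature readings from a single device.
    temperatures = [70, 72, 75, 85, 90, 95]
    for index, mean in rolling_mean_alerts(temperatures):
        print(f"reading {index}: rolling mean {mean:.1f} exceeds threshold")
```

Because only the last few readings are held in memory, the same approach works on an unbounded stream.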
As the charge to streaming analytics continues and the focus becomes the “time to analytics gap” (how long it takes from arrival of data to business value being realized), I see three primary ways that developers should rethink how they embed analytics into their applications:
- Simplicity of Use: Analytics are evolving beyond the data scientist's workbench and must be accessible to a broad range of business users. With streaming data, the visual interface is critical to making the data accessible to a non-developer audience, whether they are joining different data sources or interacting with that data at the speed of thought. Any developer bringing analytics into a business application has to deal with the “consumerization of IT” trend: business users expect the same convenience and intuitiveness that any mobile or web application affords.
- Speed / Performance First: With visualization comes the requirement to bring query results to users in near real time. Business users won’t tolerate spinning pinwheels while queries resolve (as was the case with older approaches that ran big data queries through JDBC connectors). Today we’re seeing analytics pushed into the stream (via Spark), which is emerging as the de facto approach for achieving sub-second query response times across billions of rows without having to move the data before it’s queried.
- Data Fusion: Embedded analytics capabilities must make multiple data sources appear as one. Businesses shouldn’t have to distinguish between “big data” and other forms of data. There’s just data, period (including non-streaming and static data), and it needs to be visualized and available within business applications for sub-second, interactive consumption.
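The data fusion point can be sketched in a few lines of Python. The example below enriches a stream of sales events with a static lookup table and keeps a running total per region, so a dashboard could read a current answer after every event. The table contents, field names, and function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator

# Hypothetical static reference data, e.g. loaded once from a relational database.
STORE_REGION = {"s1": "east", "s2": "west"}


def fuse(
    events: Iterable[dict], reference: Dict[str, str], default: str = "unknown"
) -> Iterator[dict]:
    """Attach a region from the static table to each streaming event."""
    for event in events:
        region = reference.get(event["store_id"], default)
        yield {**event, "region": region}


def revenue_by_region(fused_events: Iterable[dict]) -> Iterator[Dict[str, float]]:
    """Yield the running revenue per region after every event."""
    totals = defaultdict(float)
    for event in fused_events:
        totals[event["region"]] += event["amount"]
        yield dict(totals)


if __name__ == "__main__":
    stream = [
        {"store_id": "s1", "amount": 10.0},
        {"store_id": "s2", "amount": 5.0},
        {"store_id": "s1", "amount": 2.5},
        {"store_id": "s9", "amount": 1.0},  # store missing from the static table
    ]
    for totals in revenue_by_region(fuse(stream, STORE_REGION)):
        print(totals)
```

The consumer of `revenue_by_region` never needs to know which fields came from the stream and which came from the static table, which is the sense in which the two sources appear as one.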