Do THIS Not THAT for Modern BI: Stream, Don't Batch

Batches are so '90s. Streams are in now. But how can they be applied to modern BI frameworks and solutions? Read on for a great overview.


No data is born in a batch. It’s always created in a stream: a stream of transactions at a store, interactions with a website, or observations by an IoT sensor. However, in the BI world almost all data extraction, transport, enrichment, and loading has been done in batch. This batching habit started centuries ago, when merchants ran out of pages in their ledger books. And before today’s Internet, we actually used to mail disks and tapes to each other. Some of us remember racing floppy disks to Federal Express locations near airports in the evenings, because they had the latest cutoff times.

Today we have the technology and networks to stream data end-to-end, from its point of creation to the place we plan to store it forever, without ever batching it up. But most data still gets batched up anyway during ETL and other stages of the BI process, which can create the following problems, all of which a stream-only system handles better.

Consistency: Do you really know if each transaction is in one of the batches? What if some are missing, or in multiple batches? Has each batch been handled and processed in the same consistent manner? Having a single traceable stream means that everything is handled the same way and nothing is missing.

Security: Having batches means having copies of data. If there are multiple copies in various stages of transport or transformation, there is the potential for data to be intercepted, altered, replaced, lost, or otherwise mishandled. With all the data in one stream, there is only one thing to set up, understand, monitor, and secure.

Freshness: Batches imply latency. You can’t see and analyze the freshest data if you are waiting for it to be batched up, processed, and loaded. And it’s not just the data from the last hour or day that you may be missing; you are also missing any corrections and restatements of even older data. So you could be making decisions without the most recent data, and even on the basis of inaccurate historical data. A system that streams both the latest data and any changed data means that all decisions are made on the best data available at that moment.
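To make the consistency and freshness points concrete, here is a minimal Python sketch of a running total maintained directly from a stream. It is an illustration only, not how any particular BI engine works. The `Event` fields (`seq`, `txn_id`, `amount`) and the `StreamTotals` class are hypothetical names, and the sketch assumes the source stamps every event with a sequence number that increases by one.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Event:
    seq: int       # hypothetical per-stream sequence number, starting at 1
    txn_id: str    # hypothetical transaction identifier
    amount: float  # latest known amount for this transaction


@dataclass
class StreamTotals:
    """Keeps a running total that is updated one event at a time."""
    latest: Dict[str, Tuple[int, float]] = field(default_factory=dict)
    seen_seqs: Set[int] = field(default_factory=set)
    total: float = 0.0
    max_seq: int = 0

    def apply(self, event: Event) -> None:
        # Consistency: a replayed event is ignored, so nothing counts twice.
        if event.seq in self.seen_seqs:
            return
        self.seen_seqs.add(event.seq)
        self.max_seq = max(self.max_seq, event.seq)

        # Freshness: a newer event for the same transaction restates its
        # amount, and the total is adjusted by the difference right away.
        last_seq, last_amount = self.latest.get(event.txn_id, (0, 0.0))
        if event.seq > last_seq:
            self.total += event.amount - last_amount
            self.latest[event.txn_id] = (event.seq, event.amount)

    def missing_seqs(self) -> List[int]:
        # Consistency: gaps in the sequence show which events never arrived.
        return [s for s in range(1, self.max_seq + 1)
                if s not in self.seen_seqs]


# Made-up sample stream: one duplicate, one gap, one correction.
events = [
    Event(seq=1, txn_id="t1", amount=10.0),
    Event(seq=2, txn_id="t2", amount=5.0),
    Event(seq=2, txn_id="t2", amount=5.0),   # duplicate delivery
    Event(seq=4, txn_id="t1", amount=12.0),  # correction; seq 3 never arrived
]

totals = StreamTotals()
for event in events:
    totals.apply(event)

print(totals.total)           # 17.0
print(totals.missing_seqs())  # [3]
```

Because every event passes through the same `apply` call, the duplicate is dropped, the missing sequence number is visible immediately, and the restated amount for `t1` is reflected in the total as soon as it arrives.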

But it’s not just the data itself that has stream characteristics. Users also ask streams of questions of data. One analysis leads to the next question, which leads to the next. Across thousands of users within an organization, that turns into a massive, continuous stream of questions. One big challenge in modern BI is to have an engine that can sit in the middle and be the place where the stream of incoming data meets the stream of continuous questions, and that can do so at scale without relying on pre-aggregation, cubing, or other batch approaches. At Zoomdata, we spend a lot of time optimizing our engine for high levels of both data-stream and question-stream scale, without batching.

One other thing happens in a continuous stream: the evolution of an organization’s collective understanding of its data, both its structure and its contents. As users interact with data, the questions they ask and the answers they get create a stream of knowledge exhaust. People tend to drill from here to there. They tend to join A with B. Or people in department A or with job B tend to filter data this way or that way. By capturing this knowledge exhaust from the stream of analytics itself, enriching it with what we know about the people asking the questions, and then applying AI techniques, we can often predict what may interest a user next, or what else they may want to look at or consider.
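As a rough illustration of the idea, the sketch below counts which action tends to follow which, per department, and suggests the most common follower. This simple frequency count is a stand-in for the "AI techniques" mentioned above, not a description of what Zoomdata actually does. The class name, the action labels, and the users are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional, Tuple


class NextStepSuggester:
    """Learns action-to-action transitions from a stream of user activity."""

    def __init__(self) -> None:
        # (department, previous action) -> counts of the actions that followed
        self.transitions: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
        # user -> the last action that user performed
        self.last_action: Dict[str, str] = {}

    def observe(self, user: str, department: str, action: str) -> None:
        # Each observed action is enriched with the user's department.
        previous = self.last_action.get(user)
        if previous is not None:
            self.transitions[(department, previous)][action] += 1
        self.last_action[user] = action

    def suggest(self, department: str, action: str) -> Optional[str]:
        # Return the most frequent follow-up seen so far, if any.
        followers = self.transitions.get((department, action))
        if not followers:
            return None
        return followers.most_common(1)[0][0]


# Made-up activity stream: (user, department, action).
activity = [
    ("ann", "sales", "view:revenue"),
    ("ann", "sales", "drill:region"),
    ("bob", "sales", "view:revenue"),
    ("bob", "sales", "drill:region"),
    ("cy", "sales", "view:revenue"),
    ("cy", "sales", "filter:q4"),
]

suggester = NextStepSuggester()
for user, department, action in activity:
    suggester.observe(user, department, action)

print(suggester.suggest("sales", "view:revenue"))    # drill:region
print(suggester.suggest("finance", "view:revenue"))  # None
```

The suggester updates on every observed action, so its suggestions shift as usage shifts, with no periodic retraining batch.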

By taking the stream of an organization’s data, the stream of questions that users ask about the data, and the stream of things they do to the data, a modern BI engine can continuously learn and adapt to an organization, making that organization more efficient and competitive.

The rest of this blog series will build on this theme and dive deeper into some of the key areas.

Published at DZone with permission of

