Improve Your Data Solution Efficiency With Stream Processing
Today's data solutions—handling myriad data sources and massive data volume—are expensive. Stream processing reduces costs and brings real-time scalability.
Today's data solutions can quickly become expensive. NetApp reports that organizations see 30% data growth every 12 months and risk seeing their solutions fail if this growth isn't managed. Gartner echoes these concerns, stating that data solution costs tend to be an afterthought, not addressed until they're already a problem.
If you're a developer who builds solutions for a data-driven organization, then you know that resource costs for data solutions can balloon quickly.
So what can you do? In this article, we’ll look at:
- Why data solutions are expensive
- Techniques that developers can employ to improve the efficiency of their data solutions while also reducing costs
- How stream processing can play a major role in this improvement
Through it all, we’ll work toward crafting an effective plan that leads to a more efficient data solution.
Why Are Data Solutions Expensive?
First, let’s look at the factors that contribute to the increasing costs of data solutions. You probably already face many, if not all, of these situations:
- Increasing volume of data: Data is being generated at an accelerated rate, and the cost of storing and managing that data has also increased.
- A myriad of data sources and their associated complexity: Disparate data sources make data integration a herculean task that leads to increased data quality checks and management costs.
- Frequently updated technology: Inevitable technological advancements lead to frequent upgrades in hardware and software, resulting in higher costs.
- Data security and privacy concerns: Not only do we have to worry about how to store and manage data, but also how to protect it from bad actors. Data privacy is critical, demanding hefty investments in encryption, security, and maintenance.
- Data consolidation: Many organizations hoard a tremendous amount of data without first thinking through what data is actually needed for long-term storage and analysis. This leads to unnecessary storage costs, incompatible systems, and inefficient and unscalable solutions.
- Lack of skilled data professionals: According to Bloomberg, the big data market will grow to $229.4 billion by 2025. With that growth comes the need for specialized—and expensive—skills in technologies such as Apache Kafka, Apache Flink, Hadoop, Docker and containerization, DevOps practices, and others.
How Do You Reduce Your Data Solution Costs?
What can you do to help your organization keep these costs down? There are many strategies, including:
- Data compression and archiving (reduce data size and archive unused data)
- Data partitioning (make it easier and faster to process data)
- Data caching (keep frequently used data in high-speed storage)
- On-demand autoscaling
- Data governance measures (ensure accurate, complete, and well-managed data)
- Efficient data movement (improve routing from cloud and on-prem in order to speed insights)
- Cloud-agnostic deployment (reduce vendor lock-in, optimize costs)
- Using a multi-cloud approach
Let’s add one more to this list, the solution that we’ll look at in more detail in this article: stream processing.
What Is Stream Processing?
Stream processing, as the name suggests, is a data management strategy that involves ingesting (and acting on) continuous data streams—such as a user's clickstream journey, sensor data, or sentiment from a social media feed—in real time.
By using stream processing, developers can build data solutions with improved scalability and real-time processing for increased data quality. Stream processing solves many of the issues encountered with other data solutions, including rising costs, by giving you:
- Reduced storage costs: You handle the data in real time, so far less of it needs to be persisted before processing.
- Real-time scalability: Stream processing systems can handle large-scale data using a distributed computing architecture.
- Preventing the downstream flow of bad data: Stream processing enables real-time data validation, transformation, and filtering at the source.
- Faster decision-making to identify high-impact opportunities: Real-time analysis gives you real-time analytics—and real-time insights.
- Team efficiency through automation: Automated workflows in stream processing platforms reduce complexity by abstracting the low-level details of data processing. This enables your data team to focus on data analytics, visualization, and reporting.
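The "preventing the downstream flow of bad data" point above can be sketched in a few lines of Go. This is a simplified illustration, not any particular platform's API: a validation stage sits on the stream and drops malformed events before they reach consumers or storage.

```go
package main

import (
	"fmt"
	"strings"
)

// event is a simplified stream record; real systems carry richer payloads.
type event struct {
	UserID string
	Email  string
}

// validate filters a stream in flight, dropping malformed events so bad
// data never flows downstream.
func validate(in <-chan event) <-chan event {
	out := make(chan event)
	go func() {
		defer close(out)
		for e := range in {
			if e.UserID == "" || !strings.Contains(e.Email, "@") {
				continue // drop the bad record at the source
			}
			out <- e
		}
	}()
	return out
}

func main() {
	in := make(chan event, 3)
	in <- event{UserID: "u1", Email: "a@example.com"}
	in <- event{UserID: "", Email: "b@example.com"} // missing ID: dropped
	in <- event{UserID: "u3", Email: "not-an-email"} // malformed: dropped
	close(in)

	for e := range validate(in) {
		fmt.Println("clean:", e.UserID, e.Email)
	}
}
```

In a real pipeline the same pattern applies, with the filtering logic running inside the stream processor rather than a goroutine.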
Manage Stream Processing Effectively
Stream processing is powerful and opens many use cases, but managing the implementation is no easy task! Let’s look at some challenges associated with implementing stream processing from scratch.
- Anomaly detection: Anomalies in streaming data can be difficult and time-consuming to detect and address without specialized tools.
- Continuous delivery: Deploying updates to streaming applications can be challenging without disrupting the production version of the application.
- Database integration: Integrating data from different databases can be complex and time-consuming for developers. Handling schema management and scalability frequently becomes cumbersome when you're integrating databases. Migrating data from one database to another is costly and time-consuming, requiring additional infrastructure.
- Developer productivity: Developing streaming applications requires you to invest a significant amount of time in managing the infrastructure and configuration. Such efforts increase with the complex data processing logic, taking time away from actual application development.
*Figure: Stream processing when built from scratch*
Developers can circumvent all of these issues by using a stream processing framework, such as Turbine. A stream processing framework helps you to:
- Connect to an upstream resource to stream data in real-time
- Receive the data
- Process the data
- Send the data on to a destination
Here's what this looks like at a high level. Turbine supports data applications written in JavaScript, Go, Ruby, and Python. Let's look at an example of how a Go developer might interact with Turbine. The following code comes from this example application:
func (a App) Run(v turbine.Turbine) error {
	source, err := v.Resources("demopg")
	if err != nil {
		return err
	}

	// a collection of records, which can't be inspected directly
	records, err := source.Records("user_activity", nil)
	if err != nil {
		return err
	}

	// second return is dead-letter queue
	result := v.Process(records, Anonymize{})

	dest, err := v.Resources("s3")
	if err != nil {
		return err
	}

	err = dest.Write(result, "data-app-archive")
	if err != nil {
		return err
	}

	return nil
}

func (f Anonymize) Process(records []turbine.Record) []turbine.Record {
	for i, r := range records {
		hashedEmail := consistentHash(r.Payload.Get("email").(string))
		err := r.Payload.Set("email", hashedEmail)
		if err != nil {
			log.Println("error setting value: ", err)
			break
		}
		records[i] = r
	}
	return records
}

func consistentHash(s string) string {
	h := md5.Sum([]byte(s))
	return hex.EncodeToString(h[:])
}
In the above code, we see the following steps in action:
- Create an upstream source (named source) from a PostgreSQL database (named demopg).
- Fetch records (from the user_activity table) from that upstream source.
- Call v.Process, which performs the stream processing. This process iterates through the list of records and overwrites the email of each record with an encoded hash.
- Create a downstream destination (named dest) using AWS S3.
- Write the resulting stream-processed records to the destination.
As we can see, you only need a little bit of code for Turbine to process new records and stream them to the destination.
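The anonymization step is easy to reason about in isolation. Here's a framework-free Go sketch of the same transform, with a simplified map-based record type standing in for Turbine's Record API (which differs in practice); it mirrors the example's MD5-based consistentHash helper:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// record is a simplified stand-in for a stream record's payload.
type record map[string]string

// consistentHash mirrors the example's helper: the same input always
// yields the same digest, so hashed emails can still be joined on.
// (MD5 is used here to match the example, not as a security measure.)
func consistentHash(s string) string {
	h := md5.Sum([]byte(s))
	return hex.EncodeToString(h[:])
}

// anonymize overwrites each record's email with its hash, just as the
// Anonymize processor does in the Turbine example.
func anonymize(records []record) []record {
	for _, r := range records {
		r["email"] = consistentHash(r["email"])
	}
	return records
}

func main() {
	out := anonymize([]record{{"email": "jane@example.com"}})
	fmt.Println(out[0]["email"]) // a deterministic 32-character hex digest
}
```

Because the hash is deterministic, downstream analytics can still count and group by user without ever seeing a raw email address.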
Using a framework such as Turbine for stream processing brings several benefits, including:
- It reduces the need for storage and computing resources by using stream processing instead of batch processing.
- It provides a powerful mechanism to identify anomalous behavior in large volumes of streaming data, helping you address issues quickly without needing highly specialized real-time anomaly detection tools.
- Because continuous delivery is a big challenge for streaming applications, Turbine streaming data apps offer Feature Branch Deploys, allowing you to deploy the branch whenever it is ready—without impacting the production version of the application.
- It integrates any source database with any destination database by leveraging Change Data Capture (CDC), receiving real-time streams and publishing them downstream. The platform handles data transformation, processing logic, and orchestration through the Turbine app. As a developer, you no longer need to worry about schema management or scalability issues, so you can focus your time and effort on your core business needs.
- Turbine's code-first integration is aimed at helping you to focus on building applications, leading to faster development while also reducing the infrastructure typically needed to support stream processing.
Conclusion
Data-driven organizations are betting big on data assets to fuel their transformation journeys. As a developer, your mandate is to deliver better—and more cost-efficient—applications that can handle fast-growing datasets. Stream processing helps you achieve this, bringing improved data quality and faster decision-making, all while reducing storage costs and infrastructure complexity. With the right plan and the right stream processing platform to handle the data efficiently, you'll be well on your way!
Published at DZone with permission of Alvin Lee. See the original article here.