Improve Your Data Solution Efficiency With Stream Processing

Today's data solutions—handling myriad data sources and massive data volume—are expensive. Stream processing reduces costs and brings real-time scalability.

By Alvin Lee · May 16, 2023 · Analysis

Today's data solutions can quickly become expensive. NetApp reports that organizations see 30% data growth every 12 months and risk seeing their solutions fail if this growth isn’t managed. Gartner echoes these concerns, noting that data solution costs tend to be an afterthought, not addressed until they’re already a problem.

If you're a developer who builds solutions for a data-driven organization, then you know that resource costs for data solutions can balloon quickly.

So what can you do? In this article, we’ll look at:

  • Why data solutions are expensive
  • Techniques that developers can employ to improve the efficiency of their data solutions while also reducing costs
  • How stream processing can play a major role in this improvement

Through it all, we’ll work toward crafting an effective plan that leads to a more efficient data solution.

Why Are Data Solutions Expensive?

First, let’s look at the factors that contribute to the increasing costs of data solutions. You probably already face many, if not all, of these situations:

  • Increasing volume of data: Data is being generated at an accelerated rate, and the cost of storing and managing that data has also increased.
  • A myriad of data sources and their associated complexity: Disparate data sources make data integration a herculean task that leads to increased data quality checks and management costs.
  • Frequently updated technology: Inevitable technological advancements lead to frequent upgrades in hardware and software, resulting in higher costs.
  • Data security and privacy concerns: Not only do we have to worry about how to store and manage data, but also how to protect it from bad actors. Data privacy is critical, demanding hefty investments in encryption, security, and maintenance. 
  • Data consolidation: Many organizations hoard a tremendous amount of data without first thinking through what data is actually needed for long-term storage and analysis. This leads to unnecessary storage costs, incompatible systems, and inefficient and unscalable solutions.
  • Lack of skilled data professionals: According to Bloomberg, the big data market will grow to $229.4 billion by 2025. With that growth comes the need for specialized—and expensive—skills in technologies and practices such as Apache Kafka, Apache Flink, Hadoop, Docker, containerization, and DevOps.

How Do You Reduce Your Data Solution Costs?

What can you do to help your organization keep these costs down? There are many strategies, including:

  • Data compression and archiving (reduce data size and archive unused data)
  • Data partitioning (make it easier and faster to process data; see the sketch after this list)
  • Data caching (keep frequently used data in high-speed storage)
  • On-demand autoscaling
  • Data governance measures (ensure accurate, complete, and well-managed data)
  • Efficient data movement (improve routing between cloud and on-prem systems to speed time to insight)
  • Cloud-agnostic deployment (reduce vendor lock-in, optimize costs)
  • Using a multi-cloud approach
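
To make one of these concrete, here is a minimal sketch (not from the original article) of the data-partitioning idea: events are written under date-partitioned object keys so that downstream jobs read only the partitions they need. The key layout and names (events/, source=, dt=) are illustrative assumptions.

Go
 
package main

import (
	"fmt"
	"time"
)

// partitionKey builds an object-store key that partitions events by source and
// date, so a query that targets a single day scans only that day's partition.
func partitionKey(source string, eventTime time.Time, eventID string) string {
	return fmt.Sprintf("events/source=%s/dt=%s/%s.json",
		source, eventTime.Format("2006-01-02"), eventID)
}

func main() {
	key := partitionKey("web", time.Date(2023, 5, 16, 10, 0, 0, 0, time.UTC), "abc123")
	fmt.Println(key) // events/source=web/dt=2023-05-16/abc123.json
}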

Let’s add one more to this list, the solution that we’ll look at in more detail in this article: stream processing. 

What Is Stream Processing?

Stream processing, as the name suggests, is a data management strategy that involves ingesting (and acting on) continuous data streams—such as a user's clickstream journey, sensor data, or sentiment from a social media feed—in real time.
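
As a rough illustration (not from the original article), here is a minimal Go sketch of that pattern: each event is acted on as it arrives rather than being stored first and analyzed in a later batch job. The ClickEvent type and the page-view aggregate are hypothetical.

Go
 
package main

import "fmt"

// ClickEvent is a hypothetical clickstream record.
type ClickEvent struct {
	UserID string
	Page   string
}

// consume acts on each event as it arrives instead of storing it for a later
// batch job: here it simply keeps a running page-view count per page.
func consume(events <-chan ClickEvent) map[string]int {
	views := make(map[string]int)
	for ev := range events {
		views[ev.Page]++ // react immediately; no intermediate storage
	}
	return views
}

func main() {
	events := make(chan ClickEvent, 3)
	events <- ClickEvent{UserID: "u1", Page: "/home"}
	events <- ClickEvent{UserID: "u2", Page: "/home"}
	events <- ClickEvent{UserID: "u1", Page: "/pricing"}
	close(events)
	fmt.Println(consume(events)) // map[/home:2 /pricing:1]
}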

By using stream processing, developers can build data solutions with improved scalability, real-time processing, and higher data quality. Stream processing solves many of the issues encountered with other data solutions, including rising costs, by giving you:

  • Reduced storage costs: You handle the data in real time instead of storing it all for later processing.
  • Real-time scalability: Stream processing systems can handle large-scale data using a distributed computing architecture.
  • Prevention of the downstream flow of bad data: Stream processing enables real-time data validation, transformation, and filtering at the source (see the sketch after this list).
  • Faster decision-making to identify high-impact opportunities: Real-time analysis gives you real-time analytics—and real-time insights.
  • Team efficiency through automation: Automated workflows in stream processing platforms reduce complexity by abstracting the low-level details of data processing. This frees your data team to focus on data analytics, visualization, and reporting.
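
To illustrate the validation-at-the-source point, here is a minimal sketch (not from the original article) of dropping bad records before they are ever written downstream. The Event shape and the validation rules are hypothetical.

Go
 
package main

import (
	"fmt"
	"strings"
)

// Event is a hypothetical record flowing through the stream.
type Event struct {
	Email  string
	Amount float64
}

// valid rejects records that would pollute downstream systems: in this
// illustration, malformed emails and negative amounts.
func valid(e Event) bool {
	return strings.Contains(e.Email, "@") && e.Amount >= 0
}

// filterAtSource drops bad records before they are written downstream.
func filterAtSource(in []Event) []Event {
	var out []Event
	for _, e := range in {
		if valid(e) {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	events := []Event{
		{Email: "a@example.com", Amount: 10},
		{Email: "not-an-email", Amount: 5},
		{Email: "b@example.com", Amount: -3},
	}
	fmt.Println(filterAtSource(events)) // only the first record survives
}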

Manage Stream Processing Effectively

Stream processing is powerful and opens many use cases, but managing the implementation is no easy task! Let’s look at some challenges associated with implementing stream processing from scratch.

  • Anomaly detection: Anomalies in streaming data can be difficult and time-consuming to detect and address without specialized tools.
  • Continuous delivery: Deploying updates to streaming applications without disrupting the production version of the application is challenging.
  • Database integration: Integrating data from different databases can be complex and time-consuming for developers. Handling schema management and scalability frequently becomes cumbersome when you're integrating databases. Migrating data from one database to another is costly and time-consuming, requiring additional infrastructure.
  • Developer productivity: Developing streaming applications requires you to invest a significant amount of time in managing infrastructure and configuration. That effort grows as the data processing logic becomes more complex, taking time away from actual application development.

(Figure: bare stream processing in all its glory, when built from scratch)

Developers can circumvent all of these issues by using a stream processing framework, such as Turbine. A stream processing framework helps you to:

  1. Connect to an upstream resource to stream data in real-time
  2. Receive the data
  3. Process the data
  4. Send the data on to a destination

Here's what this looks like at a high level. Turbine supports data applications written in JavaScript, Go, Ruby, and Python. Let's look at an example of how a Go developer might interact with Turbine. The following code comes from this example application:

Go
 
// Imports assumed by this snippet (from the surrounding example application);
// the turbine package itself is provided by the Turbine Go SDK.
import (
	"crypto/md5"
	"encoding/hex"
	"log"
)

func (a App) Run(v turbine.Turbine) error {
	source, err := v.Resources("demopg")
	if err != nil {
		return err
	}
	// a collection of records, which can't be inspected directly
	records, err := source.Records("user_activity", nil)
	if err != nil {
		return err
	}
	// second return is dead-letter queue
	result := v.Process(records, Anonymize{})

	dest, err := v.Resources("s3")
	if err != nil {
		return err
	}
	err = dest.Write(result, "data-app-archive")
	if err != nil {
		return err
	}

	return nil
}

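// Anonymize implements the processing step used above: it replaces each
// record's "email" payload field with a consistent hash before the records
// continue downstream.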
func (f Anonymize) Process(records []turbine.Record) []turbine.Record {
	for i, r := range records {
		hashedEmail := consistentHash(r.Payload.Get("email").(string))
		err := r.Payload.Set("email", hashedEmail)
		if err != nil {
			log.Println("error setting value: ", err)
			break
		}
		records[i] = r
	}
	return records
}

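// consistentHash returns a deterministic, hex-encoded MD5 digest, so the same
// email always maps to the same anonymized value. (MD5 is used here for
// consistent pseudonymization, not for security.)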
func consistentHash(s string) string {
	h := md5.Sum([]byte(s))
	return hex.EncodeToString(h[:])
}


In the above code, we see the following steps in action:

  1. Create an upstream source (named source) from a PostgreSQL database (named demopg).
  2. Fetch records (from the user_activity table) from that upstream source.
  3. Call v.Process, which performs the stream processing. This step iterates through the list of records and overwrites the email of each record with a consistent, hex-encoded hash.
  4. Create a downstream destination (named dest) using AWS S3.
  5. Write the resulting stream-processed records to the destination.

As we can see, you only need a little bit of code for Turbine to process new records and stream them to the destination.

Using a framework such as Turbine for stream processing brings several benefits, including:

  • It reduces the need for storage and computing resources by using stream processing instead of batch processing.
  • It provides a powerful mechanism for identifying anomalous behavior in large volumes of streaming data, helping you detect and fix issues quickly, a task that otherwise calls for highly specialized real-time anomaly detection tools.
  • Because continuous delivery is a big challenge for streaming applications, Turbine streaming data apps offer Feature Branch Deploys, allowing you to deploy a branch whenever it is ready—without impacting the production version of the application.
  • It integrates any source database with any destination database by leveraging Change Data Capture (CDC), receiving real-time change streams and publishing them downstream (see the sketch after this list). Data transformation, processing logic, and orchestration are handled by the platform through the Turbine app. As a developer, you no longer need to worry about schema management or scalability issues and can focus your time and effort on your core business needs.
  • Turbine's code-first integration is aimed at helping you focus on building applications, leading to faster development while also reducing the infrastructure typically needed to support stream processing.
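
For a sense of what a CDC-driven flow involves, here is a minimal, hypothetical sketch (not Turbine's actual API or wire format) of routing change events to a destination. The event shape, loosely modeled on Debezium-style change records, and the op codes are illustrative assumptions.

Go
 
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// ChangeEvent models a simplified change-data-capture record: the operation
// ("c" = create, "u" = update, "d" = delete) plus the row state after the
// change. This shape is an illustrative assumption, loosely modeled on
// Debezium-style change events; it is not Turbine's actual wire format.
type ChangeEvent struct {
	Op    string          `json:"op"`
	After json.RawMessage `json:"after"`
}

// apply routes a single change event to a destination: upserts for creates
// and updates, deletes for deletes. The "destination" here is just a log line.
func apply(raw []byte) error {
	var ev ChangeEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return err
	}
	switch ev.Op {
	case "c", "u":
		log.Printf("upsert downstream row: %s", ev.After)
	case "d":
		log.Printf("delete downstream row")
	default:
		return fmt.Errorf("unknown operation %q", ev.Op)
	}
	return nil
}

func main() {
	// A hypothetical update event for an already-anonymized user row.
	_ = apply([]byte(`{"op":"u","after":{"id":1,"email":"<hashed>"}}`))
}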

Conclusion

Data-driven organizations are betting big on data assets to fuel their transformation journeys. As a developer, your mandate is to deliver better—and more cost-efficient—applications that can handle fast-growing datasets. Stream processing helps you achieve this, bringing improved data quality and faster decision-making, all while reducing storage costs and infrastructure complexity. With the right plan and the right stream processing platform to handle the data efficiently, you'll be well on your way!

Big data Database Stream processing Distributed Computing

Published at DZone with permission of Alvin Lee. See the original article here.

Opinions expressed by DZone contributors are their own.
