DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Last call! Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • Microsoft Azure Event Hubs
  • Building Real-Time Applications to Process Wikimedia Streams Using Kafka and Hazelcast
  • Applying Kappa Architecture to Make Data Available Where It Matters
  • Designing High-Volume Systems Using Event-Driven Architectures

Trending

  • Power BI Embedded Analytics — Part 2: Power BI Embedded Overview
  • *You* Can Shape Trend Reports: Join DZone's Software Supply Chain Security Research
  • Scalable System Design: Core Concepts for Building Reliable Software
  • Scalable, Resilient Data Orchestration: The Power of Intelligent Systems
  1. DZone
  2. Data Engineering
  3. Big Data
  4. Streaming Solution for Better Transparency

Streaming Solution for Better Transparency

For streaming analytics, there is a bounded timeline during which action must be taken to control process or asset parameters.

By 
Akshat Thakar user avatar
Akshat Thakar
·
Updated Jun. 22, 22 · Opinion
Likes (1)
Comment
Save
Tweet
Share
5.1K Views

Join the DZone community and get the full member experience.

Join For Free

What do you do when you have million-dollar equipment in your manufacturing pipeline giving you sleepless nights? To mitigate risk, you might create a digital counterpart of your physical asset, popularly known as the Digital twin, and leverage augmented intelligence derived from data streams. IoT makes the solution affordable and big data enables analytics at scale. For streaming analytics, there is a bounded timeline during which action needs to be taken to control process or asset parameters. Digital twin and stream analytics can help improve the availability of assets, improve quality in the manufacturing process and help in finding RCAs for failures.

For similar analytics use cases, I see Spark streaming best suited as part of the solution due to its open-source and easy-to-program APIs.

We will discuss the flawless design with respect to scalability, latency, and fault tolerance by leveraging the latest features of Spark, and Kafka.

Spark Framework

Spark currently offers two frameworks for spark stream processing -

  1. Spark streaming framework- This framework is based on Resilient Distributed Datasets [RDD], which process events in micro-batches.
Python
 
    kafkaDStream = KafkaUtils.createDirectStream(spark_streaming,[topic],{"metadata.broker.list": brokers})    

    def handler(message):
      records = message.collect()
      for record in records:
    		print(json.loads(record[1]))
    		payload = json.loads(record[1])
    		print(payload["temp_number"])
    kafkaDStream.foreachRDD(lambda rdd: handler(rdd))

2. Structural Streaming-based Spark framework - This framework is based on a data frame that is optimized for performance, and provides support for both micro-batches with latency(~100ms at best) and continuous stream processing with millisecond latency(~1ms). Continuous processing is a new, experimental streaming execution mode

Python
 
    df = (
      spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "region.azure.confluent.cloud:9092")
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(CLUSTER_API_KEY, CLUSTER_API_SECRET))
      .option("kafka.ssl.endpoint.identification.algorithm", "https")
      .option("kafka.sasl.mechanism", "PLAIN")
      .option("subscribe", kafka_topic_name)
      .option("startingOffsets", "earliest")
      .option("failOnDataLoss", "false")
      .load()
      .withColumn('fixedValue', fn.expr("substring(value, 6, length(value)-5)"))
      .select('topic', 'partition', 'offset', 'timestamp', 'fixedValue')
    )
    display(df)

Kafka Streaming Platform

To provide data buffering layer between data producer IoT devices, and backend spark consumers, the streaming platform plays a vital role with message reliable processing guarantees, message replay, and retention capability. The streaming platform enables a query mechanism for Spark consumers to poll events at regular intervals. These events are then added to the unbounded input table in the Spark structural streaming framework.

Fault Tolerance

Before structural streaming, it was the developer's nightmare to avoid duplicate writes for incoming messages. Structural Streaming makes the development model easy as "exactly-once" is enabled by using transactional data sources and checkpointing sinks for micro-batch trigger intervals. 

Kafka Event streaming platform serves as the data source for Spark streaming with offset-based commits. This offset capability enables Spark Structural Streaming applications the ability to restart and replay messages from any point in time. Structured Streaming can ensure exactly-once message processing semantics under any failure using replayable sources and idempotent sinks like a key-value store.

The streaming query can be configured with a checkpoint location, and the query will save all the offset progress information and the running aggregates to the checkpoint location.

These checkpointing and write-ahead logs help recover the previous good state in case of failure for re-generating RDDs. 

How to Scale Stream Processing?

By default, the number of executors required on the Spark cluster will be equal to the number of partitions. Increasing Kafka partition to Spark executor partition ratio will improve Spark Structural Stream throughput for consumption of messages while adding to processing cost.

If the data in the Kafka partition is needed to be further split for Spark processing, the Spark partition limit can be increased by the “minPartitions” configuration. For 10 partitions of Kafka stream, 20 partitions can be specified as “minPartitions” in Spark job configuration.

Event Timestamp Processing

Spark Structural Streaming processes events based on incoming event timestamp, thus enabling handling of late-arriving data and watermarking threshold for such events. Former Spark streaming framework can process events only based on system or processing timestamp. Thus Structural Streaming makes possible aggregations on Windows over Event-Time, say, last 5 minutes. Tumbling window[window duration is the same as sliding duration] and overlapping window[window duration is greater than sliding duration] aggregation are possible using event time. This feature is useful in case there is network latency for incoming IoT data.

Hybrid joins between streaming data, and static dataset has simpler data frame-based API with structural streaming. Join API can be leveraged to enrich incoming streaming data with static master data to ensure completeness of report data or for validation.

Platform and Service Offerings

Below are service offerings from leading cloud-based platforms. Based on the use case and service suitability, a services can be selected for the streaming solution. 

Platform brings in advantage for ease in management, monitoring and version updates for Spark and Kafka based services. I see certain advantages of opting Databricks for spark engine and Confluent for Kafka-

  1. Databricks Spark engine is 50 times optimized as compared to Apache Spark and platform provides collaborative environment for Data Engineers, Data Scientists to work together
  2. For structural data sources like databases, Confluent offers CDC connectors which can be useful for real time streaming use cases

Platform offerings for Spark and Kafka-



Platform


Azure
AWS
Cloudera
    GCP
Services Spark  Azure Databricks, HDInsight, Synapse Spark Engine EMR, Glue, Databricks Data Engineering Dataproc, Databricks
Kafka Event Hub, HDInsight, Confluent MSK, Confluent CDP Streams Messaging Confluent

Spark Structural Streaming and Kafka streaming platform together can be leveraged to deliver a true event-based, fault-tolerant, and highly scalable stream processing framework for real-time Digital twin streaming analytics use cases.

Big data Stream processing Event kafka

Opinions expressed by DZone contributors are their own.

Related

  • Microsoft Azure Event Hubs
  • Building Real-Time Applications to Process Wikimedia Streams Using Kafka and Hazelcast
  • Applying Kappa Architecture to Make Data Available Where It Matters
  • Designing High-Volume Systems Using Event-Driven Architectures

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!