Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Fast Data Processing Pipeline for Predicting Flight Delays Using Apache APIs (Part 2)

DZone's Guide to

Fast Data Processing Pipeline for Predicting Flight Delays Using Apache APIs (Part 2)

Learn how to use a Spark ML model in a Spark Streaming app and integrate Spark Streaming with MapR Event Streams to consume and produce messages using the Kafka API.

· AI Zone ·
Free Resource

Insight for I&O leaders on deploying AIOps platforms to enhance performance monitoring today. Read the Guide.

Check out Part 1 here!

According to Bob Renner, CEO of MapR Partner Liaison Technologies, in Forbes machine learning predictions for 2018, the possibility to blend machine learning with real-time transactional data flowing through a single platform is opening a world of new possibilities, such as enabling organizations to take advantage of opportunities as they arise. Leveraging these opportunities requires fast, scalable data processing pipelines that process, analyze, and store events as they arrive.

This is the second in a series of blogs that discuss the architecture of a data pipeline that combines streaming data with machine learning and fast storage. The first post discussed creating a machine learning model to predict flight delays. This second post will discuss using the saved model with streaming data to do real-time analysis of flight delays. The third post will discuss fast storage with MapR-DB.

Microservices, Data Pipelines, and ML Logistics

The microservice architectural style is an approach to developing an application as a suite of small, independently deployable services. A common architecture pattern combined with microservices is event-sourcing using an append-only publish-subscribe event stream such as MapR Event Streams (which provides a Kafka API).

Publish-Subscribe Event Streams With MapR-ES

A central principle of publish/subscribe systems is decoupled communications, wherein producers don't know who subscribes and consumers don't know who publishes. This system makes it easy to add new listeners or new publishers without disrupting existing processes. MapR-ES allows any number of information producers (potentially millions of them) to publish information to a specified topic. MapR-ES will reliably persist those messages and make them accessible to any number of subscribers (again, potentially millions). Topics are partitioned for throughput and scalability, producers are load-balanced, and consumers can be grouped to read in parallel. MapR-ES can scale to very high throughput levels, easily delivering millions of messages per second using very modest hardware.

You can think of a partition like a queue; new messages are appended to the end and messages are delivered in the order they are received.

Unlike a queue, messages are not deleted when read; they remain on the partition, available to other consumers. Messages, once published, are immutable and can be retained forever.

Not deleting messages when they are read allows for high performance at scale and also for processing of the same messages by different consumers for different purposes such as multiple views with polyglot persistence.

New subscribers of information can replay the data stream, specifying a starting point as far back as the data retention policy enables.

Data Pipelines

When you combine these messaging capabilities with the simple concept of microservices, you can greatly enhance the agility with which you build, deploy, and maintain complex data pipelines. Pipelines are constructed by simply chaining together multiple microservices, each of which listens for the arrival of some data, performs its designated task, and optionally publishes its own messages to a topic. Development teams can deploy new services or service upgrades more frequently and with less risk because the production version does not need to be taken offline. Both versions of the service simply run in parallel, consuming new data as it arrives and producing multiple versions of output. Both output streams can be monitored over time; the older version can be decommissioned when it ceases to be useful.

Machine Learning Logistics and Data Pipelines

Combining data pipelines with machine learning can handle the logistics of machine learning in a flexible way by:

  • Making input and output data available to independent consumers.
  • Managing and evaluating multiple models and easily deploying new models.

Architectures for these types of applications are discussed in more detail in the e-booksMachine Learning Logistics, Streaming Architecture, and Microservices and Containers.

Below is the data processing pipeline for this use case of predicting flight delays. This pipeline could be augmented to be part of the rendezvous architecture discussed in the O'Reilly Machine Learning Logistics ebook.

Spark Streaming Use Case Example Code

The following figure depicts the architecture for the part of the use case data pipeline discussed in this post:

  1. Flight data is published to a MapR Streams topic using the Kafka API.
  2. A Spark streaming application, subscribed to the first topic:
    1. Ingests a stream of flight data.
    2. Uses a deployed machine learning model to enrich the flight data with a delayed/not delayed prediction.
    3. Publishes the results in JSON format to another topic.
  3. A Spark streaming application subscribed to the second topic:
    1. Stores the input data and predictions in MapR-DB.

Example Use Case Data

You can read more about the data set in Part 1 of this series. The incoming and outgoing data is in JSON format, an example is shown below:

{"_id":"AA_2017-02-16_EWR_ORD_1124”,
"dofW”:4, "carrier":"AA", "origin":"EWR",
"dest":"ORD”, "crsdeptime":705,
"crsarrtime”:851, "crselapsedtime":166.0,"dist":719.0}

Spark Kafka Consumer Producer Code

Parsing the Dataset Records

We use a Scala case class and Structype to define the schema, corresponding to the input data.

Loading the Model

The Spark CrossValidatorModel class is used to load the saved model fitted on the historical flight data.

Spark Streaming Code

These are the basic steps for the Spark Streaming Consumer Producer code:

  1. Configure Kafka Consumer and Producer properties.
  2. Initialize a Spark StreamingContext object. Using this context, create a DStream that reads messages from a Topic.
  3. Apply transformations (which create new DStreams).
  4. Write messages from the transformed DStream to a Topic.
  5. Start receiving data and processing. Wait for the processing to be stopped.

We will go through each of these steps with the example application code.

1. Configure Kafka Consumer Producer Properties

The first step is to set the KafkaConsumer and KafkaProducer configuration properties, which will be used later to create a DStream for receiving/sending messages to topics. You need to set the following parameters:

  • Key and value deserializers for deserializing the message.
  • Auto offset reset to start reading from the earliest or latest message.
  • Bootstrap servers: This can be set to a dummy host:port since the broker address is not actually used by MapR Streams.

For more information on the configuration parameters, see the MapR Streams documentation.

2. Initialize a Spark StreamingContext Object

ConsumerStrategies.Subscribe, as shown below, is used to set the topics and Kafka configuration parameters. We use the KafkaUtils createDirectStream method with a StreamingContext, the consumer and location strategies, to create an input stream from a MapR Streams topic. This creates a DStream that represents the stream of incoming data where each message is a key-value pair. We use the DStream map transformation to create a DStream with the message values.

3. Apply Transformations (Which Create New DStreams)

We use the DStream foreachRDD method to apply processing to each RDD in this DStream. We read the RDD of JSON strings into a flight dataset. Then we display 20 rows with the dataset show method. We also create a temporary view of the dataset in order to execute SQL queries.

Here is example output from the df.show:

We transform the dataset with the model pipeline, which will transform the features according to the pipeline, estimate, and then return the predictions in a column of a new dataset. We also create a temporary view of the new dataset in order to execute SQL queries.

4. Write Messages From the Transformed DStream to a Topic

The dataset result of the query is converted to JSON RDD Strings, then the RDD sendToKafka method is used to send the JSON key-value messages to a topic (the key is null in this case).

Example message values (the output for temp.take(2) ) are shown below:

{"_id":"DL_2017-01-01_MIA_LGA_1489","dofW":7,"carrier":"DL","origin":"MIA","dest":"LGA","crsdephour":13,"crsdeptime":1315.0,"crsarrtime":1618.0,"crselapsedtime":183.0,"label":0.0,"pred_dtree":0.0}
{"_id":"DL_2017-01-01_LGA_MIA_1622","dofW":7,"carrier":"DL","origin":"LGA","dest":"MIA","crsdephour":8,"crsdeptime":800.0,"crsarrtime":1115.0,"crselapsedtime":195.0,"label":0.0,"pred_dtree":0.0}

5. Start Receiving Data and Processing It. Wait for the Processing to Be Stopped.

To start receiving data, we must explicitly call start() on the StreamingContext, then call awaitTermination to wait for the streaming computation to finish. We use ssc.remember to cache data for queries.

Streaming Data Exploration

Now, we can query the cached streaming data in the input temporary view flights, and the predictions temporary view flightsp. Below, we display a few rows from the Flights view:

Below, we display the count of predicted delayed/not delayed departures by Origin:

Below, we display the count of predicted delayed/not delayed departures by Destination:

Below, we display the count of predicted delayed/not delayed departures by Origin, Destination:

Summary

In this blog post, you learned how to use a Spark machine learning model in a Spark Streaming application and how to integrate Spark Streaming with MapR Event Streams to consume and produce messages using the Kafka API.

Code

Running the Code

All of the components of the use case architecture we just discussed can run on the same cluster with the MapR Converged Data Platform.

TrueSight is an AIOps platform, powered by machine learning and analytics, that elevates IT operations to address multi-cloud complexity and the speed of digital transformation.

Topics:
ai ,data analytics ,predictive analytics ,tutorial ,apache spark

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}