Fast Cars, Big Data: How Streaming Data Can Help Formula 1
A Formula 1 race is a high-speed example of the Internet of Things, where gathering, analyzing, and acting on tremendous amounts of data in real time is essential for staying competitive. The sport’s use of such information is so sophisticated that some teams are exporting their expertise to other industries, even for use on oil rigs. Within the industry, automobile companies such as Daimler and Audi are leveraging deep learning on the MapR Converged Data Platform to successfully analyze the continuous data generated from car sensors.
This blog post discusses a proof-of-concept demo, developed as part of a pre-sales project, that demonstrates an architecture for capturing and distributing large volumes of data, very quickly, for a Formula 1 team.
What’s the Point of Data in Motor Sports?
Formula 1 cars are some of the most heavily instrumented objects in the world.
Read more about The Formula 1 Data Journey
Formula 1 data engineers analyze data from roughly 150 sensors per car, tracking vital stats such as tire pressure, fuel efficiency, wind force, GPS location, and brake temperature in real time. Each sensor communicates with the track, the crew in the pit, a broadcast crew on-site, and a second team of engineers back in the factory. Together, the sensors can transmit 2 GB of data in one lap and 3 TB in a full race.
Watch WSJ's video "The F1 Big Data Explainer"
Data engineers make sense of the car’s speed, stability, aerodynamics, and tire degradation around a race track. The graph below shows an example of what race engineers analyze: RPM, speed, acceleration, gear, throttle, and braking, by lap.
More analysis is also completed at the team’s manufacturing base. Below is an example race strategy, analyzing whether it would be faster to do a pit stop tire change at lap 13 and lap 35 vs. lap 17 and lap 38.
The Challenge: How Do You Capture, Analyze, and Store This Amount of Data in Real Time at Scale?
The 2014 U.S. Grand Prix collected more than 243 terabytes of data in a race weekend, and now there is even more data. Formula 1 teams are looking for newer technologies and faster ways to move and analyze their data, both in the cockpits and at the factory.
Below is a proposed proof of concept architecture for a Formula 1 team:
MapR Edge in the cars provides an on-board storage system that buffers the most recent data and retries when transmission fails. MapR Edge addresses the need to capture, process, and provide backup for data transmission close to the source. A radio frequency link publishes the data from the edge to the referee “FIA” topic, and from there it is published to each team’s topic, where local analytics is done. The “team engineers” Stream replicates to the “trackside” and “factory” Streams, so that the data is pushed in real time for analytics. In a sport where seconds make a difference, it’s crucial that the trackside team can communicate quickly with the engineers at headquarters and team members around the world. MapR Streams replication provides an unusually powerful way to handle data across distant data centers at large scale and low latency.
The Demo Architecture
The demo architecture is considerably simplified because it needs to run on a single-node MapR sandbox. The demo does not use real cars; it uses the Open Racing Car Simulator (TORCS), a race car simulator often used for AI racing games and as a research platform.
The demo architecture is shown below. Sensor data from the TORCS simulator is published to a MapR Streams topic using the Kafka API. Two consumers are subscribed to the topic. One Kafka API consumer, a web application, provides a real-time dashboard using WebSockets and HTML5. Another Kafka API consumer stores the data in MapR-DB JSON, where analytics with SQL are performed using Apache Drill. You can download the demo code here: https://github.com/mapr-demos/racing-time-series.
How Do You Capture This Amount of Data in Real Time at Scale?
Older messaging systems track message acknowledgements on a per-message, per-listener basis. A new approach is needed to handle the volume of data for the Internet of Things.
MapR Streams is a distributed messaging system that enables producers and consumers to exchange events in real time via the Apache Kafka 0.9 API. In MapR Streams, topics are logical collections of messages, which organize events into categories and decouple producers from consumers.
Topics are partitioned for throughput and scalability. Producers are load balanced between partitions, and consumers can be grouped to read in parallel from multiple partitions within a topic for faster performance.
You can think of a partition like a queue; new messages are appended to the end, and messages are delivered in the order they are received.
Unlike a queue, events are not deleted when read; they remain on the partition, available to other consumers.
A key to high performance at scale, in addition to partitioning, is minimizing the time spent on disk reads and writes. With Kafka and MapR Streams, messages are persisted sequentially as they are produced and read sequentially when consumed. These design decisions mean that non-sequential reading and writing is rare, so messages can be handled at very high speeds at scale. Not deleting messages when they are read also allows the same messages to be processed by different consumers for different purposes.
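To make the consumer side concrete, here is a minimal sketch of a Kafka 0.9 API consumer that subscribes to the sensor topic and polls for new messages. It is illustrative rather than the demo repo's exact code: the topic path and group id are assumptions, and the demo's dashboard and database consumers do more work with each record.

import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SensorConsumer {

  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("group.id", "dashboard");   // consumers in the same group share the topic's partitions
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    // With plain Apache Kafka you would also set bootstrap.servers;
    // the MapR Streams client resolves the cluster from the stream path.

    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(Arrays.asList("/apps/racing/stream:sensors"));  // assumed topic path

    while (true) {
      // Reading does not delete anything: another consumer group can process the same messages.
      ConsumerRecords<String, String> records = consumer.poll(1000);
      for (ConsumerRecord<String, String> record : records) {
        System.out.printf("car=%s offset=%d value=%s%n",
            record.key(), record.offset(), record.value());
      }
    }
  }
}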
Example Producer Using the Kafka API
Below is an example producer using the Kafka Java API. For more information, see the MapR Streams Java applications documentation.
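The sketch below shows what such a producer can look like; the topic path, message key, and JSON payload are illustrative assumptions rather than the demo repo's exact code.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorProducer {

  // Topic path is an assumption; with MapR Streams it has the form "/stream-path:topic".
  private static final String TOPIC = "/apps/racing/stream:sensors";

  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    // With plain Apache Kafka you would also set bootstrap.servers;
    // the MapR Streams client resolves the cluster from the stream path.

    KafkaProducer<String, String> producer = new KafkaProducer<>(props);

    // One sensor reading as a JSON string, keyed by car id so that readings
    // from the same car land on the same partition and stay in order.
    String json = "{\"car\":\"car1\",\"timestamp\":1458141858,"
        + "\"sensors\":{\"Speed\":3.588583,\"Distance\":2003.023071,\"RPM\":1896.575806}}";

    producer.send(new ProducerRecord<>(TOPIC, "car1", json));  // asynchronous send
    producer.flush();   // make sure the record is on the wire before exiting
    producer.close();
  }
}

Keying each record by car id is a common choice here because it keeps every car's readings in publication order within a single partition.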
How to Store the Data
One of the challenges, when you have more than 243 terabytes of data every race weekend, is deciding where to store it. With a relational database and a normalized schema, related data is stored in different tables, and queries that join this data back together can become a bottleneck at this volume. For this application, MapR-DB JSON was chosen for its scalability and its flexibility with JSON. MapR-DB with a denormalized schema scales because data that is read together is stored together.
With MapR-DB (HBase API or JSON API), a table is automatically partitioned across a cluster by key range, providing fast reads and writes by row key.
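As an illustration, a consumer could persist each message with the MapR-DB JSON (OJAI) Java API roughly as follows. This is a minimal sketch, not the demo's actual writer: the table path is taken from the Drill queries later in this post, and the document construction is simplified.

import org.ojai.Document;

import com.mapr.db.MapRDB;
import com.mapr.db.Table;

public class TelemetryWriter {

  public static void main(String[] args) {
    // Table path matches the one queried with Drill below; MapR-DB partitions it by key range.
    Table table = MapRDB.getTable("/apps/racing/db/telemetry/all_cars");

    // One telemetry document; the "_id" row key combines timestamp and race time,
    // so readings sort naturally by time within the table.
    String json = "{\"_id\":\"1.458141858E9/0.324\",\"car\":\"car1\","
        + "\"timestamp\":1458141858,\"racetime\":0.324,"
        + "\"records\":[{\"sensors\":{\"Speed\":3.588583,"
        + "\"Distance\":2003.023071,\"RPM\":1896.575806}}]}";
    Document doc = MapRDB.newDocument(json);

    table.insertOrReplace(doc);  // upsert keyed by the document's _id
    table.flush();
    table.close();
  }
}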
JSON Schema Flexibility
JSON facilitates the natural evolution of your data schema during the life of your application. For example, in version 1, we have the following schema, where each JSON message groups sensor values for Speed, Distance, and RPM:
{ "_id":"1.458141858E9/0.324",
"car" = "car1",
"timestamp":1458141858,
"racetime”:0.324,
"records":
[
{
"sensors":{
"Speed":3.588583,
"Distance":2003.023071,
"RPM":1896.575806
},
{
"sensors":{
"Speed":6.755624,
"Distance":2004.084717,
"RPM":1673.264526
},
},
For version 2, you can easily capture more data values without changing the architecture of your application, simply by adding attributes as shown below:
...
  "records": [
    {
      "sensors": {
        "Speed": 3.588583,
        "Distance": 2003.023071,
        "RPM": 1896.575806,
        "Throttle": 37,
        "Gear": 2
      }
    },
    ...
As discussed before, MapR Streams allows the same messages to be processed for different purposes or views. With this type of architecture and flexible schema, you can easily create new services and new APIs, for example by adding Apache Spark Streaming or an Apache Flink Kafka Consumer for in-stream analysis such as aggregations, filtering, alerting, anomaly detection, and/or machine learning.
Processing the Data With Apache Flink
Apache Flink is an open source distributed data stream processor. Flink provides efficient, fast, consistent, and robust handling of massive streams of events as well as batch processing, a special case of stream processing. Stream processing of events is useful for filtering, creating counters and aggregations, correlating values, joining streams together, machine learning, and publishing to a different topic for pipelines.
The code snippet below uses the FlinkKafkaConsumer09 class to get a DataStream from the MapR Streams “sensors” topic. A DataStream is a potentially unbounded distributed collection of objects. The code then calculates the average speed over a time window of 10 seconds.
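The demo repository contains the full job; the sketch below shows one way to implement it with the Flink 1.x DataStream API, under the assumptions of a string payload, an assumed topic path and group id, and a naive speed parser (a real job would use a proper JSON deserializer).

import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class AverageSpeedJob {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Properties props = new Properties();
    props.setProperty("group.id", "flink-average-speed");   // assumed group id
    // With plain Apache Kafka you would also set bootstrap.servers;
    // the MapR Streams client resolves the cluster from the stream path.

    // Topic path is an assumption; with MapR Streams it has the form "/stream-path:topic".
    DataStream<String> messages = env.addSource(
        new FlinkKafkaConsumer09<>("/apps/racing/stream:sensors", new SimpleStringSchema(), props));

    // Turn each JSON message into (speed, 1) so sums and counts can be combined per window.
    DataStream<Tuple2<Double, Integer>> speeds = messages.map(
        new MapFunction<String, Tuple2<Double, Integer>>() {
          @Override
          public Tuple2<Double, Integer> map(String json) {
            return new Tuple2<>(extractSpeed(json), 1);
          }
        });

    // Tumbling 10-second window over all events: sum speeds and counts, then emit the average.
    speeds.timeWindowAll(Time.seconds(10))
        .reduce(new ReduceFunction<Tuple2<Double, Integer>>() {
          @Override
          public Tuple2<Double, Integer> reduce(Tuple2<Double, Integer> a,
                                                Tuple2<Double, Integer> b) {
            return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
          }
        })
        .map(new MapFunction<Tuple2<Double, Integer>, Double>() {
          @Override
          public Double map(Tuple2<Double, Integer> sumAndCount) {
            return sumAndCount.f0 / sumAndCount.f1;
          }
        })
        .print();

    env.execute("Average speed over 10-second windows");
  }

  // Naive extraction of the "Speed" value for illustration; real code would use a JSON parser.
  private static double extractSpeed(String json) {
    int i = json.indexOf("\"Speed\":");
    if (i < 0) {
      return 0.0;
    }
    int start = i + "\"Speed\":".length();
    int end = start;
    while (end < json.length() && "0123456789.-".indexOf(json.charAt(end)) >= 0) {
      end++;
    }
    return end > start ? Double.parseDouble(json.substring(start, end)) : 0.0;
  }
}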
Querying the Data With Apache Drill
Apache Drill is an open source, low-latency query engine for big data that delivers interactive SQL analytics at petabyte scale. Drill provides a massively parallel processing execution engine, built to perform distributed query processing across the various nodes in a cluster.
With Drill, you can use SQL to query and join data from files in JSON, Parquet, or CSV format, Hive, and NoSQL stores, including HBase, MapR-DB, and Mongo, without defining schemas.
Below is an example query to “show all” from the race car’s datastore:
SELECT *
FROM dfs.`/apps/racing/db/telemetry/all_cars`
LIMIT 10
And the results:
Below is an example query to show the average speed by car and race:
SELECT race_id, car,
       ROUND(AVG(`t`.`records`.`sensors`.`Speed`), 2) AS `mps`,
       ROUND(AVG(`t`.`records`.`sensors`.`Speed` * 18 / 5), 2) AS `kph`
FROM
(
  SELECT race_id, car, flatten(records) AS `records`
  FROM dfs.`/apps/racing/db/telemetry/all_cars`
) AS `t`
GROUP BY race_id, car
And the results:
All of the components of the use case architecture that we just discussed can run on the same cluster with the MapR Converged Data Platform.
Software
You can download the code and instructions to run this example from here: https://github.com/mapr-demos/racing-time-series.
Additional Resources
MapR Edge product information
Ebook: Data Where You Want It: Geo-Distribution of Big Data and Analytics
Ebook: Streaming Architecture: New Designs Using Apache Kafka and MapR Streams
MapR Streams Application Development
Getting Started with Apache Flink
Free Apache Flink Ebook
NorCom selects MapR converged data platform for foundation of deep learning framework for Mercedes