Autonomous Cars, Big Data, and Edge Computing: What You Need to Know

Self-driving cars need to take in large amounts of data and process it almost instantly. Read on for an overview of this process.

This article is featured in the new DZone Guide to Big Data: Volume, Variety, and Velocity. Get your free copy for insightful articles, industry stats, and more!

The driverless car has been a high-tech dream for decades. Now that broadband connectivity, cloud computing, and artificial intelligence are increasingly available, autonomous cars should go mainstream in the near future, provided certain technical and regulatory milestones are reached. But one more issue must be addressed before self-driving cars can reach critical mass: data. Specifically, the data analysis and storage requirements of autonomous cars present challenges beyond the capabilities of most current big data solutions.

Autonomous cars generate a staggering amount of data. Intel estimated one car generates terabytes of data in eight hours of operation. Cameras, radar/lidar, time-of-flight sensors, accelerometers, telemetry, and gyroscopes all generate data streams that must be analyzed in order to perform the calculations and adjustments required to safely navigate a car. That analysis needs to happen in real time if the car is to keep up with constantly changing driving conditions (other cars or pedestrians moving around the vehicle, changing weather and light conditions, traffic signs, and so on). These real-time performance requirements mean there's no time to upload data to a central server, conduct the necessary analytics, and then send instructions back to the car for execution. Data that is critical to safely navigating the car must be analyzed locally by the car itself. In essence, the car is an edge device in a cloud network.
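To make the latency argument concrete, here is a minimal back-of-the-envelope sketch of how far a car travels while it waits for a round trip to a remote server. The speed and latency figures are hypothetical values chosen only to illustrate the arithmetic; they are not measurements from any real network or vehicle.

```python
def distance_travelled_m(speed_kmh: float, latency_ms: float) -> float:
    """Return the distance in metres covered at speed_kmh during latency_ms."""
    speed_m_per_s = speed_kmh * 1000 / 3600  # convert km/h to m/s
    return speed_m_per_s * (latency_ms / 1000)  # convert ms to s


if __name__ == "__main__":
    # Hypothetical round-trip latencies, for illustration only.
    for latency_ms in (10, 100, 500):
        metres = distance_travelled_m(100, latency_ms)
        print(f"{latency_ms} ms at 100 km/h: {metres:.2f} m travelled")
```

Under these assumed figures, even a round trip of a fraction of a second corresponds to several metres of travel before any instruction could arrive, which is why safety-critical analysis has to stay on the vehicle.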

Not only does the car need to analyze data on its own, it must also learn to pick and choose between different data streams to identify the ones best suited for analysis at any given moment to keep the car driving safely. 

That last requirement, the need to determine what data is required to perform an analysis, is tricky. While predefined filters can help a car's machine learning routines learn what data to use and when to use it, those filters are generated by human engineers, so they can't be updated in real time. Accordingly, an autonomous car will need to run machine learning and analytics engines powerful enough to recognize, on their own, mission-critical data that requires immediate analysis and action, without involving a human in the analysis. Once input from a person is required, real-time decision-making based on data analysis is simply not possible.

We need analytics and machine learning algorithms for autonomous cars that can:

  • Identify data in all formats.

  • Recognize what data is required for mission-critical operations and perform analysis of that data locally.

  • Compress or aggregate non-critical data for uploading to the cloud for future use.

  • Schedule uploads of non-critical data from the car to the cloud when less expensive communications are available (for example, when the car is parked overnight at home and can access the owner's Wi-Fi instead of a metered cellular network).

  • Know how to call for historical data from the cloud so the AI can use it for future analytics.
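The requirements above can be sketched as a small triage loop running on the car. This is a minimal illustration under stated assumptions, not a real automotive stack: the `EdgeTriage` class, the `CRITICAL_SENSORS` set, and the `on_unmetered_network` flag are all hypothetical names, and a fixed sensor list stands in for the learned classification the article calls for.

```python
from dataclasses import dataclass, field
from statistics import mean

# Hypothetical: sensor types treated as mission-critical in this sketch.
# A real system would learn this rather than hard-code it.
CRITICAL_SENSORS = {"camera", "lidar", "radar"}


@dataclass
class Reading:
    sensor: str
    value: float


@dataclass
class EdgeTriage:
    pending: dict = field(default_factory=dict)   # sensor -> buffered values
    uploaded: list = field(default_factory=list)  # stand-in for cloud storage

    def ingest(self, reading: Reading) -> str:
        """Analyze critical data locally; buffer everything else."""
        if reading.sensor in CRITICAL_SENSORS:
            return self.analyze_locally(reading)
        self.pending.setdefault(reading.sensor, []).append(reading.value)
        return "buffered"

    def analyze_locally(self, reading: Reading) -> str:
        # Placeholder for on-board analytics; no real model is run here.
        return "analyzed"

    def flush(self, on_unmetered_network: bool) -> int:
        """Aggregate buffered data and upload it only on a cheap connection."""
        if not on_unmetered_network:
            return 0
        summaries = [
            {"sensor": sensor, "count": len(values), "mean": mean(values)}
            for sensor, values in self.pending.items()
        ]
        self.uploaded.extend(summaries)
        self.pending.clear()
        return len(summaries)
```

A caller would feed each reading to `ingest` as it arrives and call `flush(on_unmetered_network=True)` once the car is, for example, parked and connected to the owner's Wi-Fi. Aggregating to a count and a mean is just one possible compression choice.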

The last bullet is particularly important. An autonomous car manufacturer will be responsible for storing vast amounts of data generated by cars operating around the world, and much of that data will likely have no real value when initially captured. However, that data's value may be revealed in the future as the manufacturer's autonomous driving applications evolve and improve. Today's non-critical data can be useful for future applications, provided the data is properly stored and easily accessible. If they don't plan in advance for how to make data available whenever necessary, autonomous car vendors run the risk of creating a "dark data" problem. Dark data is the term used to describe data assets an organization collects but fails to take advantage of, either because it doesn't know how to use them or because it has forgotten it has them. This will be a particularly significant problem for self-driving cars because of the sheer volume of data they generate.

To address the dark data problem, autonomous car vendors need to move their data storage strategies away from data warehouse models and adopt emerging data storage models like data lakes. A detailed examination of the difference between a data warehouse and a data lake is beyond the scope of this article, but comparing a book with a library illustrates it. With a book (data warehouse), someone has already determined what content the book contains and how it is formatted, while a library (data lake) allows you to store whatever content you want in almost any format.

In other words, a data warehouse is a centralized platform for basic importing, exporting, and preprocessing of data gathered from a collection of linked systems using one data schema. A data lake is a distributed yet integrated data platform that supports data without a predefined schema, both structured and unstructured, and queries that data in real time by leveraging metadata to quickly find, transform, and load data between systems. Data lakes' support for both structured and unstructured data on the same platform is important, as autonomous car sensors generate data streams in very different formats that can't easily be stored in the same schema. Other key differences that distinguish a data lake from a data warehouse include:

  • Schema on read.

  • Unlimited storage.

  • The ability to access both raw and processed data.

  • The ability to link data from many individual clusters.
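Schema on read, the first item in the list above, can be sketched in a few lines. The idea is that raw records of different shapes are stored untouched, and a schema is applied only when a particular query reads them. The record layout, the `RAW_STORE` list, and the `read_with_schema` helper are hypothetical and stand in for whatever storage and query engine a real data lake would use.

```python
import json

# Raw records with different shapes, stored as-is (no shared schema).
RAW_STORE = [
    '{"sensor": "gps", "lat": 37.77, "lon": -122.42, "ts": 1}',
    '{"sensor": "accelerometer", "x": 0.1, "y": 0.0, "z": 9.8, "ts": 2}',
    '{"sensor": "gps", "lat": 37.78, "lon": -122.41, "ts": 3}',
]


def read_with_schema(raw_lines, sensor, fields):
    """Parse raw records at read time and project them onto the given fields."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if record.get("sensor") != sensor:
            continue  # this reader only cares about one sensor type
        # Missing fields become None instead of failing at write time.
        rows.append({name: record.get(name) for name in fields})
    return rows


# The schema (which fields, in which order) is chosen by the reader.
gps_track = read_with_schema(RAW_STORE, "gps", ["ts", "lat", "lon"])
```

Because nothing was forced into a schema when the data was written, a future application can read the same raw records with a different set of fields without reloading or migrating anything.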

Linking data between clusters is particularly important for autonomous cars, as it allows for the integration of different datasets from different geographic locations. Car OEMs are global companies with multiple offices and data centers scattered around the world. As more countries move to support autonomous cars, autonomous car vendors will want to use all the data generated by cars driving locally in the self-driving AI and ML algorithms they use to power their cars globally. As more vendors enter the autonomous driving market, the ones who ultimately win out will be those best prepared to analyze data at the local level and those who have cataloged their databases properly, so future autonomous applications can find the data they need, when they need it.
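As a rough illustration of what cataloging data across clusters could look like, the sketch below keeps one metadata entry per dataset and lets an application search by attribute, regardless of which cluster holds the data. The `DatasetEntry` fields, cluster names, and paths are all invented for this example and do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    cluster: str  # which cluster holds the data (hypothetical names)
    path: str     # location inside that cluster
    sensor: str   # what kind of data it is
    region: str   # where the cars that produced it were driving


# Hypothetical catalog spanning two clusters.
CATALOG = [
    DatasetEntry("cluster-eu", "/raw/lidar/2018/", "lidar", "EU"),
    DatasetEntry("cluster-us", "/raw/lidar/2018/", "lidar", "US"),
    DatasetEntry("cluster-us", "/raw/gps/2018/", "gps", "US"),
]


def find_datasets(catalog, **criteria):
    """Return entries whose attributes match every keyword criterion."""
    return [
        entry
        for entry in catalog
        if all(getattr(entry, key) == value for key, value in criteria.items())
    ]


# All lidar data, wherever it lives.
lidar_everywhere = find_datasets(CATALOG, sensor="lidar")
```

The point of the sketch is that the lookup runs against metadata only, so an application can discover relevant data in every region before moving any of it.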

Topics:
big data, autonomous cars, real-time data analysis, machine learning

Opinions expressed by DZone contributors are their own.