Petastorm: A Simple Approach to Deep Learning Models in Apache Parquet Format

Learn how to generate a Petastorm dataset that is compatible with different machine learning frameworks, analyze and manipulate the dataset, and more.

By Dr. Michael Garbade
Jan. 14, 21 · Tutorial

Petastorm is an open-source data access library that enables single-node or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format and from datasets already loaded as Apache Spark DataFrames. As Andrey, a U.S.-based Python engineer, notes, it supports popular Python-based machine learning (ML) frameworks, including TensorFlow, PyTorch, and PySpark. For more information, refer to the Petastorm GitHub page and the Petastorm API documentation.

Beyond those frameworks, Petastorm can also be used from pure Python code with NumPy. This makes it a practical choice for evaluating deep learning models on Parquet-formatted datasets, whether on a single machine or in a distributed setup.

The article will take you through:

  • Generating a Petastorm dataset that is compatible with different ML frameworks
  • Analyzing and manipulating the dataset
  • Parallelizing data loading and decoding operations

What Are Some Petastorm Features?

To support different training scenarios for autonomous driving algorithms, Petastorm provides efficient implementations of data sharding, row filtering, shuffling, and access to a subset of fields. It also supports time-series data through n-grams, which are windows of consecutive rows.
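To make these features concrete, here is a minimal standard-library sketch of what sharding, shuffling, row filtering, and field selection mean for a reader. It is an illustration only, not Petastorm's implementation: the `read_rows` function and its parameter names are hypothetical, and the data is held in memory as lists of dictionaries grouped into "row groups". Petastorm's `make_reader` exposes comparable options (for example `cur_shard`, `shard_count`, `predicate`, `schema_fields`, and `shuffle_row_groups`); check the API documentation for their exact behavior.

```python
import random
from typing import Any, Callable, Dict, Iterator, List, Optional, Sequence

Row = Dict[str, Any]


def read_rows(
    row_groups: Sequence[List[Row]],
    fields: Optional[Sequence[str]] = None,
    predicate: Optional[Callable[[Row], bool]] = None,
    cur_shard: int = 0,
    shard_count: int = 1,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> Iterator[Row]:
    """Yield rows from one worker's shard, optionally filtered and projected.

    Hypothetical helper for illustration; not part of Petastorm.
    """
    # Sharding: worker k of n takes every n-th row group, starting at k.
    my_groups = [g for i, g in enumerate(row_groups) if i % shard_count == cur_shard]

    # Shuffling: reorder whole row groups rather than individual rows.
    if shuffle:
        random.Random(seed).shuffle(my_groups)

    for group in my_groups:
        for row in group:
            # Row filtering: the predicate sees the full row before projection.
            if predicate is not None and not predicate(row):
                continue
            # Field subset: keep only the requested columns.
            yield row if fields is None else {name: row[name] for name in fields}
```

With two workers, calling `read_rows(groups, cur_shard=0, shard_count=2)` and `read_rows(groups, cur_shard=1, shard_count=2)` splits the row groups between them without overlap, so each worker trains on a disjoint part of the dataset.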

What Is the Structure of a Typical Dataset?

A typical dataset contains:

  • Multiple columns containing signals acquired during autonomous vehicle test runs by sensors such as cameras, radars, and lidar.
  • Manually generated labels that are stored as fields in a row.

The rows in a typical dataset are sorted chronologically and grouped by run. A typical row size ranges from 30 to 100.
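Because rows are ordered in time within each run, time-series windows (n-grams) can be built by sliding over consecutive rows. The sketch below is a hypothetical standalone function, not Petastorm's own `NGram` class. The field names `timestamp` and `run_id`, the `max_gap` parameter, and the rule that a window may not cross a run boundary are all assumptions made for this illustration.

```python
from typing import Any, Dict, Iterator, Sequence, Tuple

Row = Dict[str, Any]


def ngram_windows(
    rows: Sequence[Row],
    length: int,
    max_gap: float,
    timestamp_field: str = "timestamp",
    run_field: str = "run_id",
) -> Iterator[Tuple[Row, ...]]:
    """Yield windows of `length` consecutive rows from chronologically sorted rows.

    A window is kept only if all of its rows belong to the same run and every
    pair of neighbouring rows is at most `max_gap` apart in time.
    """
    for start in range(len(rows) - length + 1):
        window = rows[start:start + length]
        same_run = all(r[run_field] == window[0][run_field] for r in window)
        small_gaps = all(
            b[timestamp_field] - a[timestamp_field] <= max_gap
            for a, b in zip(window, window[1:])
        )
        if same_run and small_gaps:
            yield tuple(window)
```

Each yielded tuple can then be fed to a model that expects a short sequence of frames, while gaps in the recording or a switch to another run never end up inside a single training example.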

Generating a Petastorm Dataset That Is Compatible With Different ML Frameworks

To generate a dataset with Petastorm, you first define a Unischema, which is simply a data schema. This is the only place you need to define the schema, because Petastorm translates it into every supported framework format, including TensorFlow, pure Python, and PySpark.

Because the Unischema instance is serialized as a custom field in the Parquet store metadata, a path to the dataset is all you need to read the schema back.
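The two ideas above, defining the schema once and storing it with the data, can be sketched with the standard library alone. This is a toy model, not Petastorm's API: `ToyField`, `ToySchema`, and the `_schema.json` file are hypothetical stand-ins for the Unischema and the Parquet store metadata, and the mapping of non-scalar fields to a `binary` column is a simplifying assumption.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Dict, List, Optional, Tuple

Shape = Tuple[Optional[int], ...]


@dataclass(frozen=True)
class ToyField:
    name: str
    dtype: str    # element type, e.g. "int32" or "uint8"
    shape: Shape  # () for a scalar; None marks a variable-length dimension


class ToySchema:
    """Hypothetical single schema definition with several derived views."""

    def __init__(self, name: str, fields: List[ToyField]) -> None:
        self.name = name
        self.fields = {f.name: f for f in fields}

    def as_table_columns(self) -> Dict[str, str]:
        # Storage view: scalars keep their type, tensors are stored encoded.
        return {n: (f.dtype if f.shape == () else "binary") for n, f in self.fields.items()}

    def as_tensor_specs(self) -> Dict[str, Tuple[str, Shape]]:
        # Training view: every field is a typed tensor with a shape.
        return {n: (f.dtype, f.shape) for n, f in self.fields.items()}

    def save(self, dataset_dir: Path) -> None:
        # Store the schema next to the data so readers need only the path.
        payload = {"name": self.name, "fields": [asdict(f) for f in self.fields.values()]}
        (dataset_dir / "_schema.json").write_text(json.dumps(payload))

    @classmethod
    def load(cls, dataset_dir: Path) -> "ToySchema":
        payload = json.loads((dataset_dir / "_schema.json").read_text())
        fields = [ToyField(f["name"], f["dtype"], tuple(f["shape"])) for f in payload["fields"]]
        return cls(payload["name"], fields)
```

A writer would call `schema.save(dataset_dir)` once when the dataset is generated. Any later reader can call `ToySchema.load(dataset_dir)` and derive whichever view its framework needs, without repeating the field definitions.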

Analyzing and Manipulating the Dataset

Because the dataset is stored in the Parquet format, which Spark supports natively, you can analyze and manipulate it with standard Spark tools.

Parallelizing Data Loading and Decoding Operations

Petastorm offers two strategies for parallelizing data loading and decoding operations:

  1. Thread pool implementation
  2. Process pool implementation

The right choice depends on the kind of data you want to read.

In a typical scenario, as Andrey illustrates in his project, “Machine Learning Model: Python Sklearn & Keras,” the thread pool strategy is used when rows contain encoded, high-resolution images. In this case, most of the processing time is spent decoding the images in C++ code, which does not hold the Python Global Interpreter Lock (GIL), so threads can run in parallel.

The process pool strategy, on the other hand, is more appropriate when rows are small. In that case, most of the processing is done in pure Python code, so multiple processes must run in parallel to overcome the serialized execution imposed by the GIL.
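The trade-off between the two strategies can be sketched with Python's `concurrent.futures`. This is not Petastorm's implementation: `load_rows` and both decode functions are hypothetical, and `zlib.decompress` stands in for image decoding because it runs in C and releases the GIL while it works. In Petastorm itself, the pool is selected through `make_reader` arguments such as `reader_pool_type` and `workers_count`.

```python
import zlib
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def decode_compressed(payload: bytes) -> bytes:
    """Stand-in for image decoding: C code that releases the GIL."""
    return zlib.decompress(payload)


def decode_pure_python(values: List[int]) -> int:
    """Stand-in for small rows decoded in Python: holds the GIL throughout."""
    total = 0
    for v in values:
        total += v * v
    return total


def load_rows(
    rows: Iterable[T],
    decode: Callable[[T], R],
    pool_type: str = "thread",
    workers: int = 4,
) -> List[R]:
    """Decode rows in parallel with the chosen pool, preserving input order."""
    pools = {"thread": ThreadPoolExecutor, "process": ProcessPoolExecutor}
    if pool_type not in pools:
        raise ValueError(f"unknown pool_type: {pool_type!r}")
    with pools[pool_type](max_workers=workers) as pool:
        return list(pool.map(decode, rows))
```

For large compressed payloads, `load_rows(payloads, decode_compressed, pool_type="thread")` lets the threads overlap, whereas `decode_pure_python` only benefits from `pool_type="process"`. When using the process pool, the decode function must be defined at module level so it can be pickled, and the calling script should be guarded with `if __name__ == "__main__":` on platforms that start workers by spawning.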

Summary

Petastorm, an open-source data access library developed by Uber ATG, enables both single-machine and distributed training and evaluation of deep learning models directly from datasets in the Apache Parquet format.

This article discussed Petastorm as a go-to approach because it lets you define a single dataset and use it across frameworks. It also reviewed the supporting tools that help with evaluating deep learning models.

Petastorm supports popular Python-based machine learning frameworks such as PyTorch, PySpark, and TensorFlow.
