Why High Performance Storage is Important for AI Cloud Build

Learn the technology and architecture behind building AI Cloud and why high performance storage is important. Explore the latest benchmarks and understand the market.

By Anjul Sahu · Jan. 20, 26 · Analysis

The AI cloud market is growing rapidly worldwide: recent analyst reports project annual growth of 28% to 40% over the next five years, with the market reaching as much as $647 billion by 2030. The surge in AI cloud adoption, GPU-as-a-service platforms, and enterprise interest in AI “factories” has created new pressures and opportunities for product engineering and IT leaders. Regardless of which public cloud or private cluster you choose, one key differentiator sets each AI and HPC solution apart: the performance of storage.

While leading clouds often use the same GPUs and servers, the way data flows between compute, network, storage, and persistent layers determines everything from training speed to scalability. Understanding storage fundamentals will help you architect or select the right solution. We have previously covered how to build AI cloud solutions, and in this article we share what our hands-on experience in this space has taught us about storage.

Business and technology leaders now recognize that real-world AI breakthroughs require infrastructure with high bandwidth, low latency, and extreme parallelism. As deep learning and data-intensive analytics move from labs to production, GPU clusters run ever-larger models on ever-growing datasets.

Why Does Storage Matter in AI Workloads?

Storage plays an important role across the entire AI lifecycle. Let’s look at the three major areas: data preparation, training and tuning, and inference.

Data Preparation

Key Tasks

  • Scalable and performant storage to support transforming data for AI use
  • Protecting valuable raw and derived training datasets

Critical Capabilities

  • Storing large structured and unstructured datasets in many formats
  • Scaling under the pressure of map-reduce–like distributed processing commonly used for AI data transformation
  • Support for file and object access protocols to ease integration

Training and Tuning

Key Tasks

  • Providing training data to keep expensive GPUs fully utilized
  • Saving and restoring model checkpoints to protect training investments

Critical Capabilities

  • Sustaining read bandwidths necessary to keep training GPU resources busy
  • Minimizing time spent saving checkpoint data to limit training pauses
  • Scaling to meet the demands of data-parallel training in large clusters
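
To make the "keep GPUs busy" requirement concrete, here is a minimal back-of-the-envelope sketch of the read bandwidth a training job demands from storage. The function names and the cluster numbers (64 GPUs, 2,000 samples per second per GPU, 150 KB per sample) are hypothetical values chosen for illustration, not measurements from any real system.

```python
def required_read_bandwidth_gbps(num_gpus, samples_per_sec_per_gpu, bytes_per_sample):
    """Aggregate read bandwidth (decimal GB/s) needed so storage never stalls the GPUs."""
    return num_gpus * samples_per_sec_per_gpu * bytes_per_sample / 1e9


def gpu_utilization_ceiling(storage_gbps, demand_gbps):
    """Upper bound on GPU utilization when data loading is the only bottleneck."""
    return min(1.0, storage_gbps / demand_gbps)


# Hypothetical cluster: 64 GPUs, each consuming 2,000 samples/s of 150 KB each.
demand = required_read_bandwidth_gbps(64, 2_000, 150_000)
print(f"demand: {demand:.1f} GB/s")  # demand: 19.2 GB/s

# If storage sustains only half of the demand, GPUs can be busy at most half the time.
ceiling = gpu_utilization_ceiling(storage_gbps=9.6, demand_gbps=demand)
print(f"utilization ceiling: {ceiling:.0%}")  # utilization ceiling: 50%
```

The estimate ignores caching, prefetching, and compression, so treat it as a first-pass sizing aid rather than a prediction.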

Inference

Key Tasks

  • Safely storing and quickly delivering model artifacts for inference services
  • Providing data for batch inferencing

Critical Capabilities

  • Reliably storing expensive-to-produce model artifact data
  • Minimizing model artifact read latency for fast inference deployment
  • Sustaining read bandwidths necessary to keep inference GPU resources busy

High-Performance Storage Is Critical in the Checkpointing Process in AI Training

Checkpointing is a critical process in large-scale AI training, enabling models to periodically save and restore their state as training progresses. As model and dataset sizes expand into the billions of parameters and petabytes of data, this operation becomes increasingly demanding for storage infrastructure. Efficient checkpointing helps safeguard training progress against inevitable hardware failures and disruptions, while also allowing for fine-tuning, experimentation, and rapid recovery. However, frequent checkpointing can introduce performance overhead due to pauses in computation and intensive reads and writes to persistent storage, especially as distributed clusters grow to thousands of accelerators.
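
The cost of a synchronous checkpoint can be estimated with simple arithmetic. The sketch below is illustrative only: the 14 bytes per parameter figure is an assumption (roughly what mixed-precision training with an Adam-style optimizer might store per parameter), and the 70-billion-parameter model, 50 GB/s write bandwidth, and 30-minute interval are hypothetical inputs.

```python
def checkpoint_size_gb(num_params, bytes_per_param):
    """Checkpoint size in decimal GB for a given parameter count."""
    return num_params * bytes_per_param / 1e9


def checkpoint_write_seconds(size_gb, write_bw_gbps):
    """Time to write one checkpoint at a sustained sequential write bandwidth."""
    return size_gb / write_bw_gbps


def stall_fraction(write_seconds, interval_seconds):
    """Fraction of wall-clock time lost if training pauses for every checkpoint."""
    return write_seconds / (interval_seconds + write_seconds)


# Assumed inputs: 70B parameters, 14 bytes of state per parameter (assumption).
size = checkpoint_size_gb(70e9, 14)
write_s = checkpoint_write_seconds(size, write_bw_gbps=50)
lost = stall_fraction(write_s, interval_seconds=30 * 60)

print(f"checkpoint size: {size:.0f} GB")      # checkpoint size: 980 GB
print(f"write time:      {write_s:.1f} s")    # write time:      19.6 s
print(f"time lost:       {lost:.2%}")
```

Halving the write bandwidth doubles the pause, which is why checkpoint write throughput shows up directly in training cost.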

To address these challenges, modern AI storage architectures leverage strategies such as asynchronous checkpointing — where checkpoints are saved in the background, minimizing idle time — and hierarchical distribution, which reduces bottlenecks by having leader nodes manage data transfers within clusters. The result is faster training throughput, lower risk of lost work, and more efficient use of compute resources. Optimizing checkpoint size, frequency, and concurrent access patterns is vital to ensure high throughput and low latency, making high-performance, scalable storage systems an indispensable foundation for reliable and cost-effective AI model training at scale. You can read more about this in an AWS article.
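
As a rough illustration of asynchronous checkpointing, the sketch below copies the training state in memory and hands the copy to a background thread for writing, so the training loop only blocks for the in-memory snapshot. It uses only the Python standard library; the `AsyncCheckpointer` class, the file naming, and the toy `state` dictionary are hypothetical and are not the API of any training framework.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Snapshot state in memory, then persist it on a background thread."""

    def __init__(self, directory):
        self.directory = directory
        self._thread = None

    def save(self, state, step):
        self.wait()  # allow at most one checkpoint write in flight
        snapshot = copy.deepcopy(state)  # the only part that blocks training
        path = os.path.join(self.directory, f"ckpt_{step:06d}.pkl")
        self._thread = threading.Thread(target=self._write, args=(snapshot, path))
        self._thread.start()
        return path

    @staticmethod
    def _write(snapshot, path):
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
            f.flush()
            os.fsync(f.fileno())  # push data to stable storage before renaming
        os.replace(tmp, path)  # atomic rename: readers never see a partial file

    def wait(self):
        if self._thread is not None:
            self._thread.join()
            self._thread = None


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)


workdir = tempfile.mkdtemp()
ckpt = AsyncCheckpointer(workdir)
state = {"step": 0, "weights": [0.0] * 1000}
for step in range(1, 4):
    state["step"] = step
    state["weights"][0] += 1.0  # stand-in for a real training update
    last_path = ckpt.save(state, step)
ckpt.wait()  # make sure the final checkpoint is on disk before restoring
restored = load_checkpoint(last_path)
print(restored["step"], restored["weights"][0])  # 3 3.0
```

Writing to a temporary file and renaming it means a crash mid-write leaves the previous checkpoint intact. In a real cluster the write target would be the shared parallel file system, and its sustained write bandwidth determines how long each background write overlaps with training.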

What Kind of Storage Is Needed for AI and HPC Workloads?

For AI and HPC workloads, the demands extend well beyond ordinary enterprise storage. Key requirements include:

  • Parallel File Systems: Multiple servers and GPUs need to access datasets simultaneously. Systems such as Lustre, WEKA, VAST Data, CephFS, and DDN Infinia enable concurrent access, avoiding bottlenecks and improving throughput for distributed workloads.
  • High Throughput and Low Latency: Training GPT-like models or running simulations generates millions of read/write operations per second. Storage must deliver bandwidth in the tens to hundreds of GB/s and latency below 1 ms so that GPUs remain fed and productive.
  • POSIX Compliance: Many AI frameworks and HPC applications expect a traditional POSIX interface for seamless operation.
  • Scalability and Elasticity: Petabyte-scale capacity is the norm. Modern solutions allow horizontal scaling, adding performance and capacity as demand grows.
  • Data Integrity and Reliability: Enterprise-grade AI and HPC workloads require uninterrupted access to data. Redundancy, fault tolerance, and robust disaster recovery features are essential.

Typical Storage Specifications and Requirements

For modern AI cloud, AI factory, and GPU cloud infrastructure, expect:

  • Bandwidth: 15–512 GB/s (or higher for top-tier solutions)
  • IOPS: From 20,000 (entry-level) up to 800,000+
  • Latency: Sub-1 ms to 2 ms for parallel file systems
  • Capacity: 100 TB to multi-petabyte scale, often with tiering to object storage
  • Protocols: NFSv3/v4.1, SMB, Lustre, S3 (for hybrid and archival storage), HDFS, and native REST APIs

On-premises or hybrid deployments may include NVMe storage, CXL-enabled expansion, and advanced cooling to support high-density GPU clusters.
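
Bandwidth and IOPS figures are linked through the I/O size, which is worth keeping in mind when comparing specifications. The short sketch below shows the relationship; the helper names are our own, and the I/O sizes are example values.

```python
def bandwidth_gbps(iops, io_size_bytes):
    """Throughput in decimal GB/s delivered by a given IOPS rate and I/O size."""
    return iops * io_size_bytes / 1e9


def iops_needed(target_gbps, io_size_bytes):
    """IOPS required to reach a target throughput at a given I/O size."""
    return target_gbps * 1e9 / io_size_bytes


KIB = 1024
MIB = 1024 * 1024

# 800,000 IOPS of small 4 KiB reads is only about 3.3 GB/s...
small_io = bandwidth_gbps(800_000, 4 * KIB)
print(f"{small_io:.2f} GB/s")  # 3.28 GB/s

# ...while 15 GB/s of 1 MiB sequential reads needs only about 14,300 IOPS.
large_io_iops = iops_needed(15, MIB)
print(f"{large_io_iops:,.0f} IOPS")
```

In other words, a system that looks strong on IOPS can still fall short on bandwidth for large sequential checkpoint writes, and vice versa for small random reads.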

Requirements and Considerations by AI Lifecycle Stage

Reading Training Data

Requirements

  • Accommodate a wide range of read bandwidth requirements and I/O access patterns across different AI models
  • Deliver large amounts of read bandwidth to single GPU servers for the most demanding models

Considerations

  • Use high-performance, all-flash storage to meet needs
  • Leverage RDMA-capable storage protocols, when possible, for the most demanding requirements

Saving Checkpoints

Requirements

  • Provide large sequential write bandwidth for quickly saving checkpoints
  • Handle multiple large sequential write streams to separate files, especially in the same directory

Considerations

  • Understand checkpoint implementation details and behaviors for expected AI workloads
  • Determine time limits for completing checkpoints

Restoring Checkpoints

Requirements

  • Provide large sequential read bandwidth for quickly restoring checkpoints
  • Handle multiple large sequential read streams to the same checkpoint file

Considerations

  • Understand how often checkpoint restoration will be required
  • Determine acceptable time limits for restoration

Servicing GPU Clusters

Requirements

  • Meet performance requirements for mixed storage workloads from multiple simultaneous AI jobs
  • Scale capacity and performance as GPU clusters grow with business needs

Considerations

  • Consider scale-out storage platforms that can increase performance and capacity while providing shared access to data

Source: snia.org - John Cardente Talk


Storage Options for AI Cloud and HPC Workloads

To achieve next-generation AI and HPC results, enterprises and product teams should evaluate both commercial vendors and open-source platforms.

Open-Source Parallel File Systems

  • Ceph (CephFS): Highly flexible, POSIX-compliant, and scalable from small clusters to exabytes. Used in academic and commercial AI labs for robust file and object storage. Many early-stage AI factories are built on top of Ceph.
  • Lustre / DDN Lustre: Optimized for large-scale HPC and AI workloads; widely used in supercomputing and enterprise environments.
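  • IBM Spectrum Scale (GPFS): A high-performance parallel file system widely adopted in science and industry. Note that it is a proprietary IBM product rather than open source; it is listed here because it is commonly evaluated alongside Lustre and Ceph.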

Commercial AI and HPC Storage Solutions

  • VAST Data: Delivers extreme performance for AI storage, combining parallel file system performance with NAS and archive economics. VAST has been widely adopted by AI cloud providers such as CoreWeave and Lambda.
  • WEKA: Highly optimized for metadata and file access in AI and multi-tenant clusters, helping overcome bottlenecks in legacy systems. Customers include Yotta, Cohere, and Together.ai.
  • DDN: An industry leader in research, hybrid file-object storage, and scalable data intelligence for model training and analytics. Solutions such as Infinia and xFusionAI focus on both performance and efficiency for GPU workloads.
  • Pure Storage, Cloudian, IBM, Dell: Also recognized for delivering enterprise-grade AI and HPC storage platforms.

Many solutions integrate natively with public cloud object stores such as Amazon S3, Google Cloud Storage, and Azure Blob Storage, enabling hybrid architectures and data movement between on-premises and cloud tiers.

Product Examples and Use Cases

  • Ceph (Open Source): Used by research labs and private cloud teams to build petabyte-scale, resilient storage for AI and HPC clusters.
  • WEKA: Often deployed in AI factories with hundreds of GPUs running concurrent training jobs, thanks to elastic scaling and strong metadata performance.
  • VAST Data: Designed for high throughput across both small and large files, increasingly chosen for generative AI workloads and data-intensive analytics in fintech, healthcare, and media.
  • DDN: Supports hybrid deployment strategies and offers both parallel file systems and object storage in a unified stack.

Parallel file systems such as Lustre and Spectrum Scale facilitate near-instant recovery, zero-data-loss architectures, and compliance for regulated industries.

Identifying the Best Storage for Your Needs

Because every cloud environment is unique, the first step in creating a differentiated solution is to establish a baseline through hardware benchmarking. MLCommons benchmarking tools can be run directly on your hardware to gather reliable performance data.
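
Before running a full benchmark suite, a quick sanity probe can confirm that a mount point behaves roughly as expected. The sketch below is not part of the MLCommons tooling; it is a hypothetical, single-threaded probe written with the Python standard library that times one sequential write and one sequential read of a scratch file. Because the file was just written, the read is likely served from the operating system's page cache, so the read figure should be treated as an upper bound.

```python
import os
import tempfile
import time


def measure_sequential_throughput(directory, total_mb=64, block_mb=4):
    """Time a sequential write and read of a scratch file; returns MB/s figures."""
    block = os.urandom(block_mb * 1024 * 1024)
    blocks = total_mb // block_mb
    path = os.path.join(directory, "throughput_probe.bin")

    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # include the time to reach stable storage
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    bytes_read = 0
    with open(path, "rb") as f:
        while chunk := f.read(len(block)):
            bytes_read += len(chunk)
    read_s = time.perf_counter() - start

    os.remove(path)
    mb = blocks * block_mb
    return {
        "write_mb_s": mb / write_s,
        "read_mb_s": mb / read_s,  # likely inflated by the page cache
        "bytes_read": bytes_read,
    }


# Point `directory` at the file system under test; a temp dir is used here.
with tempfile.TemporaryDirectory() as scratch:
    result = measure_sequential_throughput(scratch, total_mb=16, block_mb=4)

print(f"write: {result['write_mb_s']:.0f} MB/s, read: {result['read_mb_s']:.0f} MB/s")
```

A single client with one stream says little about a parallel file system's aggregate performance, so numbers from a probe like this are only a starting point for the multi-client benchmarks described here.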

The latest MLPerf Storage v2.0 benchmark results from MLCommons highlight the increasingly critical role of storage performance in the scalability of AI training systems. Participation nearly doubled compared to the previous v1.0 round, reflecting rapid industry innovation. Storage solutions now support roughly twice the number of accelerators as before. The new iteration includes checkpointing benchmarks that address real-world scenarios faced by large AI clusters, where frequent hardware failures can disrupt training jobs. By simulating such events and evaluating storage recovery speeds, MLPerf Storage v2.0 provides valuable insights into how checkpointing ensures uninterrupted performance in large datacenter environments.
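
The link between failure rates and checkpoint speed can be illustrated with Young's classic approximation for the checkpoint interval, sqrt(2 × checkpoint cost × mean time between failures). This is a textbook rule of thumb that we bring in for illustration, not something the MLPerf Storage benchmark prescribes. The node MTBF, node count, and checkpoint cost below are hypothetical, and the cluster MTBF calculation assumes independent node failures.

```python
import math


def cluster_mtbf_seconds(node_mtbf_hours, num_nodes):
    """Cluster-level MTBF, assuming independent failures at the same rate per node."""
    return node_mtbf_hours * 3600 / num_nodes


def young_interval_seconds(checkpoint_seconds, mtbf_seconds):
    """Young's approximation for the checkpoint interval that minimizes lost time."""
    return math.sqrt(2 * checkpoint_seconds * mtbf_seconds)


# Hypothetical cluster: 1,024 nodes, each with a 50,000-hour MTBF,
# and a checkpoint that takes 20 seconds to write.
mtbf = cluster_mtbf_seconds(50_000, 1_024)
interval = young_interval_seconds(20, mtbf)

print(f"cluster MTBF: {mtbf / 3600:.1f} h")
print(f"suggested checkpoint interval: {interval / 60:.0f} min")
```

Under these assumptions, faster checkpoint writes shorten the suggested interval, which reduces the amount of work lost per failure. That is the mechanism by which storage write bandwidth affects resilience at scale.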

A broad spectrum of storage technologies participated in the benchmark, ranging from local storage and in-storage accelerators to object stores, reflecting the diversity of approaches in AI infrastructure. Over 200 results were submitted by 26 organizations worldwide, many participating for the first time, showcasing the growing global momentum behind the MLPerf initiative. The open-source, rigorously peer-reviewed benchmarking framework provides unbiased, actionable data for system architects, datacenter managers, and software vendors. MLPerf Storage is a go-to resource for designing resilient, high-performance AI training systems in a rapidly evolving technology landscape.

Conclusion: Building Your AI Cloud and HPC Strategy

As the AI cloud, GPU-as-a-service, and HPC landscape evolves, storage is no longer a background detail — it is the core differentiator for speed, scale, and future innovation. Vendor neutrality empowers organizations to architect best-of-breed systems, leveraging open-source foundations while integrating commercial solutions where appropriate. Every cloud or on-premises cluster benefits from storage designed specifically for AI and HPC, not just traditional workloads.

You can connect with the author via LinkedIn.

Published at DZone with permission of Anjul Sahu. See the original article here.

Opinions expressed by DZone contributors are their own.
