AI Data Storage: Challenges, Capabilities, and Comparative Analysis

Deep dive into the storage challenges in AI scenarios, critical storage capabilities, and comparative analysis of storage products.

Rui Su

Dec. 15, 25 · Analysis

The explosion in the popularity of ChatGPT has once again ignited a surge of excitement in the AI world. Over the past five years, AI has advanced rapidly and has found applications in a wide range of industries. As a storage company, we’ve had a front-row seat to this expansion, watching more and more AI startups and established players emerge across fields like autonomous driving, protein structure prediction, and quantitative investment.

AI scenarios have introduced new challenges to the field of data storage, and existing storage solutions are often inadequate to meet these demands. In this article, we'll take a deep dive into the storage challenges in AI scenarios, the critical storage capabilities, and a comparative analysis of storage products. We hope this post helps you make informed choices about AI data storage.

Storage Challenges for AI

AI scenarios have brought new data patterns:

High-Throughput Data Access Challenges

In AI scenarios, enterprises' growing use of GPUs has outpaced the I/O capabilities of the underlying storage systems. Enterprises require storage solutions that provide high-throughput data access to fully use the computing power of GPUs. For instance, in smart manufacturing, where high-precision cameras capture images for defect detection models, the training dataset may consist of only 10,000 to 20,000 high-resolution images. Each image is on the order of gigabytes in size, resulting in a total dataset size of 10 TB or more. If the storage system lacks the required throughput, it becomes the bottleneck during GPU training.
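
To make the bottleneck concrete, the following back-of-the-envelope sketch estimates how long one full pass over a dataset takes at a given sustained read throughput. The 10 TB figure comes from the example above; the throughput levels and the function name are illustrative assumptions, not measurements.

```python
def epoch_read_time_hours(dataset_tb: float, throughput_gb_s: float) -> float:
    """Hours needed to read the whole dataset once at a sustained throughput.

    Uses decimal units (1 TB = 1,000 GB) and ignores caching and decode time.
    """
    if throughput_gb_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = dataset_tb * 1000 / throughput_gb_s
    return seconds / 3600


# Assumed throughput levels, for illustration only.
for gb_s in (0.5, 2.0, 10.0):
    hours = epoch_read_time_hours(10, gb_s)
    print(f"{gb_s:>4} GB/s -> {hours:.2f} h per pass over 10 TB")
```

If the GPUs can consume data faster than the storage can deliver it, every extra hour in this estimate is time the GPUs spend waiting.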

Managing Storage for Billions of Files

AI scenarios need storage solutions that can handle and provide quick access to datasets with billions of files. For example, in autonomous driving, the training dataset consists of small images, each several hundred kilobytes in size and stored as an individual file. A single training set comprises tens of millions of such images, and the total training data amounts to billions or even 10 billion files. This creates a major challenge in effectively managing large numbers of small files.

Scalable Throughput for Hot Data

In areas like quantitative investing, financial market data is smaller than computer vision datasets. However, this data must be shared among many research teams, leading to hotspots where disk throughput is fully used but still cannot satisfy the application's needs. Storage solutions therefore need read throughput for hot data that can scale beyond what the disks holding a single copy can deliver.

The basic computing environment has also changed a lot.

These days, with cloud computing and Kubernetes getting so popular, more and more AI companies are setting up their data pipelines on Kubernetes-based platforms. Algorithm engineers request resources on the platform, write code in a Notebook to debug algorithms, use workflow engines like Argo and Airflow to plan data processing workflows, use Fluid to manage datasets, and use BentoML to deploy models into apps.

Cloud-native technologies have become a standard consideration when building storage platforms. As cloud computing matures, AI businesses are increasingly relying on large-scale distributed clusters. With a significant increase in the number of nodes in these clusters, storage systems face new challenges related to handling concurrent access from tens of thousands of pods within Kubernetes clusters.

IT professionals managing the underlying infrastructure face significant changes brought about by the evolving business scenarios and computing environments. Existing hardware-software-coupled storage solutions often suffer from several pain points, including a lack of elasticity, a lack of distributed high availability, and constraints on cluster scalability. Distributed file systems like GlusterFS and CephFS, and those designed for HPC, such as Lustre, BeeGFS, and GPFS, are typically designed for physical machines and bare-metal disks. While they can be deployed as large-capacity clusters, they cannot provide elastic capacity and flexible throughput, especially when dealing with storage demands on the order of tens of billions of files.

Key Capabilities for AI Data Storage

Considering these challenges, we’ll outline essential storage capabilities critical for AI scenarios, helping enterprises make informed decisions when selecting storage products.

POSIX Compatibility and Data Consistency

In the AI/ML domain, POSIX is the most common API for data access. Previous-generation distributed file systems, except HDFS, are also POSIX-compatible, but products on the cloud in recent years have not been consistent in their POSIX support:

Compatibility

Users should not solely rely on the description "POSIX-compatible product" to assess compatibility. You can use pjdfstest and the Linux Test Project (LTP) framework for testing.
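
Before running a full suite, a quick smoke test can already reveal gaps. The sketch below is a minimal illustration, not a substitute for pjdfstest or LTP: it exercises a handful of POSIX behaviors (rename over an existing file, hard links, symlinks, chmod, and reading a file after it has been unlinked) inside a directory you pass in. The function name and the set of checks are our own choices for illustration.

```python
import os
import shutil
import tempfile


def posix_smoke_test(mount_point: str) -> dict:
    """Run a few basic POSIX checks inside mount_point and report pass/fail."""
    workdir = tempfile.mkdtemp(dir=mount_point)
    src = os.path.join(workdir, "src.txt")
    dst = os.path.join(workdir, "dst.txt")
    hard = os.path.join(workdir, "hard.txt")
    soft = os.path.join(workdir, "soft.txt")
    results = {}

    def rename_replaces_target():
        with open(src, "w") as f:
            f.write("new")
        with open(dst, "w") as f:
            f.write("old")
        os.rename(src, dst)  # POSIX rename replaces an existing target
        with open(dst) as f:
            return f.read() == "new" and not os.path.exists(src)

    def hard_link():
        os.link(dst, hard)
        return os.path.samefile(dst, hard)

    def symlink():
        os.symlink(dst, soft)
        return os.readlink(soft) == dst

    def chmod():
        os.chmod(dst, 0o640)
        return (os.stat(dst).st_mode & 0o777) == 0o640

    def read_after_unlink():
        with open(dst) as f:
            os.unlink(dst)  # data must stay readable through the open handle
            return f.read() == "new"

    checks = [rename_replaces_target, hard_link, symlink, chmod, read_after_unlink]
    try:
        for fn in checks:
            # A check passes only if it returns True without raising OSError.
            try:
                results[fn.__name__] = bool(fn())
            except OSError:
                results[fn.__name__] = False
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
    return results


if __name__ == "__main__":
    # Point this at the mounted file system under evaluation.
    for name, ok in posix_smoke_test(tempfile.gettempdir()).items():
        print(f"{name}: {'PASS' if ok else 'FAIL'}")
```

Any FAIL here is a signal to run the full compatibility suites before trusting the "POSIX-compatible" label.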

Strong Guarantee of Data Consistency

Storage systems have different ways of ensuring consistency. File systems usually use strong consistency, while object storage systems often use eventual consistency. When choosing a storage system, check that its consistency model matches how your pipeline reads data that another process has just written.

User Mode or Kernel Mode

Early developers preferred kernel mode because it could optimize I/O operations. However, in recent years, more developers have been moving away from kernel mode for several reasons:

Using kernel mode ties the file system client to specific kernel versions. GPU and high-performance network card drivers often need compatibility with certain kernel versions. This combination of factors places a significant burden on kernel version selection and maintenance.
A fault in a kernel-mode client can freeze the host operating system. This is highly unfavorable for Kubernetes platforms, where many pods share one host.
The user-mode FUSE library has undergone continuous iterations, resulting in significant performance improvements. It has served JuiceFS customers well for various business needs, such as autonomous driving perception model training and quantitative investment strategy training. This suggests that in AI scenarios, the user-mode FUSE library is no longer a performance bottleneck.

Linear Scalability of Throughput

Different file systems employ different principles for scaling throughput. Previous-generation distributed storage systems like GlusterFS, CephFS, the HPC-oriented Lustre, BeeGFS, and GPFS primarily use all-flash solutions to build their clusters. In these systems, peak throughput equals the total performance of the disks in the cluster. To increase cluster throughput, users must scale the cluster by adding more disks.

However, when users have imbalanced needs for capacity and throughput, traditional file systems require scaling the entire cluster, leading to capacity wastage.

For example, a 500 TB cluster built from 8 TB hard drives with two replicas needs 126 drives, each delivering 150 MB/s of throughput. The theoretical maximum throughput of the cluster is about 18.9 GB/s (126 × 150 MB/s). If the application demands 60 GB/s throughput, there are two options:

Switching to 2 TB HDDs (with 150 MB/s throughput) and requiring 504 drives
Switching to 8 TB SATA SSDs (with 500 MB/s throughput) while maintaining 126 drives

The first option quadruples the number of drives, necessitating a corresponding increase in the number of cluster nodes. The second option, upgrading from HDDs to SSDs, also results in a significant cost increase.
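
The arithmetic behind these configurations can be checked with a short sketch. It assumes, as the text does, that peak throughput is simply the sum of per-drive throughput and that usable capacity is raw capacity divided by the replica count; the function names are ours.

```python
def usable_capacity_tb(drives: int, drive_tb: float, replicas: int) -> float:
    """Usable capacity in TB when every block is stored `replicas` times."""
    return drives * drive_tb / replicas


def peak_throughput_gb_s(drives: int, drive_mb_s: float) -> float:
    """Theoretical peak throughput in GB/s (1 GB = 1,000 MB)."""
    return drives * drive_mb_s / 1000


# (drive count, drive size in TB, per-drive throughput in MB/s)
configs = {
    "baseline: 126 x 8 TB HDD": (126, 8, 150),
    "option 1: 504 x 2 TB HDD": (504, 2, 150),
    "option 2: 126 x 8 TB SATA SSD": (126, 8, 500),
}

for name, (drives, drive_tb, mb_s) in configs.items():
    capacity = usable_capacity_tb(drives, drive_tb, replicas=2)
    throughput = peak_throughput_gb_s(drives, mb_s)
    print(f"{name}: {capacity:.0f} TB usable, {throughput:.1f} GB/s peak")
```

All three configurations provide the same usable capacity; only the drive count or drive type changes, which is exactly why throughput becomes expensive when it is tied to the disks.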

As you can see, it’s difficult to balance capacity, performance, and cost. Capacity planning based on these three perspectives becomes a challenge because we cannot predict the development, changes, and details of the real business.

Therefore, decoupling storage capacity from performance scaling would be a more effective approach for businesses to address these challenges.

In addition, handling hot data is a common problem in AI scenarios. An effective approach is to employ a cache grouping mechanism that automatically distributes hot data to different cache groups. This means that it automatically creates multiple copies of hot data during computation to achieve higher disk throughput, and these cache spaces are automatically reclaimed after computation.
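
The idea can be illustrated with a toy model. This is a simplified sketch of cache grouping in general, not the implementation of any particular product: clients are assigned to cache groups by rank, each group keeps its own copy of a hot object, and all copies are released after the job. The class and method names are hypothetical.

```python
class GroupedCache:
    """Toy model: each cache group holds its own copy of hot data."""

    def __init__(self, num_groups: int, backend: dict):
        self.num_groups = num_groups
        self.backend = backend  # stands in for the underlying storage
        self.groups = [dict() for _ in range(num_groups)]
        self.backend_reads = 0

    def group_of(self, client_rank: int) -> int:
        # Assumption: clients are spread over groups round-robin by rank.
        return client_rank % self.num_groups

    def read(self, client_rank: int, key: str) -> bytes:
        cache = self.groups[self.group_of(client_rank)]
        if key not in cache:
            # First access in this group: fetch once from the backend.
            cache[key] = self.backend[key]
            self.backend_reads += 1
        return cache[key]

    def copies(self, key: str) -> int:
        """Number of cache groups currently holding `key`."""
        return sum(key in group for group in self.groups)

    def release(self) -> None:
        """Reclaim all cache space once the computation has finished."""
        for group in self.groups:
            group.clear()


backend = {"market_data.bin": b"hot data"}
cache = GroupedCache(num_groups=4, backend=backend)
for rank in range(8):  # eight clients read the same hot file
    cache.read(rank, "market_data.bin")
print(cache.copies("market_data.bin"), cache.backend_reads)
cache.release()
```

With four groups, the hot file ends up in four caches and is fetched from the backend four times in total, so reads are served by four sets of disks instead of one.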

Managing Massive Amounts of Files

Efficiently managing a large number of files, such as 10 billion files, has three demands on the storage system:

Elastic scalability: JuiceFS users typically grow from tens of millions of files to hundreds of millions and then to billions. Growth of this magnitude cannot be absorbed by adding a few machines once; the storage cluster must be able to keep scaling horizontally by adding nodes as the business grows.
Data distribution during horizontal scaling: During system scaling, data distribution rules based on directory name prefixes may lead to uneven data distribution.
Scaling complexity: As the number of files increases, the ease of system scaling, stability, and the availability of tools for managing storage clusters become vital considerations. Some systems become more fragile as file numbers reach billions. Ease of management and high stability are crucial for business growth.
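
The skew caused by prefix-based distribution is easy to reproduce. The sketch below compares two hypothetical placement rules over a synthetic set of paths: one routes by the first character of the top-level directory name, the other by a hash of the full path. Both rules and the path layout are assumptions made for illustration; real systems use more elaborate schemes.

```python
import hashlib
from collections import Counter


def shard_by_prefix(path: str, num_shards: int) -> int:
    """Place a file by the first character of its top-level directory."""
    top_level = path.strip("/").split("/")[0]
    return ord(top_level[0]) % num_shards


def shard_by_hash(path: str, num_shards: int) -> int:
    """Place a file by a hash of its full path."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Synthetic dataset: 10 camera directories with 1,000 frames each.
paths = [
    f"/camera_{cam:02d}/frame_{i:06d}.jpg"
    for cam in range(10)
    for i in range(1000)
]

by_prefix = Counter(shard_by_prefix(p, 4) for p in paths)
by_hash = Counter(shard_by_hash(p, 4) for p in paths)
print("prefix rule:", dict(by_prefix))
print("hash rule:  ", dict(by_hash))
```

Because every directory name starts with the same character, the prefix rule sends all files to a single shard, while the hash rule spreads them roughly evenly.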

Concurrent Load Capacity and Feature Support in Kubernetes Environments

Some storage system specifications state a maximum number of concurrent accesses. Even so, users need to conduct stress testing based on their own workloads. As the number of clients grows, quality of service (QoS) management is required, including traffic control for each client and temporary read/write blocking policies.
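
Per-client traffic control is commonly built on a token bucket. The following is a minimal sketch of that general technique, not the mechanism of any specific storage product; the class name, parameters, and injectable clock are our own choices.

```python
import time


class TokenBucket:
    """Limit a client to `rate` bytes per second with bursts up to `burst`."""

    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate = rate        # tokens (bytes) added per second
        self.capacity = burst   # maximum tokens the bucket can hold
        self.tokens = burst     # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_consume(self, nbytes: float) -> bool:
        """Return True if the request fits within the client's budget."""
        now = self.clock()
        # Refill according to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


# One bucket per client: 100 MB/s sustained, 200 MB burst (assumed values).
MB = 1_000_000
buckets = {client: TokenBucket(rate=100 * MB, burst=200 * MB) for client in ("pod-a", "pod-b")}
print(buckets["pod-a"].try_consume(150 * MB))  # fits within the initial burst
print(buckets["pod-a"].try_consume(150 * MB))  # exceeds the remaining budget
```

A caller that receives False can queue or delay the request, which is one way to implement temporary read/write blocking for a noisy client.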

We must also note the design and supported features of the CSI driver in Kubernetes. For instance, consider how the mounting process is deployed and whether the driver supports ReadWriteMany, subPath mounting, quotas, and hot updates.

Cost Analysis

Cost analysis is a complex topic. It covers hardware and software purchases, but operational and maintenance costs are often overlooked. As AI companies grow, the volume of data increases significantly. Storage systems must scale in both capacity and throughput and be easy to adjust.

In the past, the procurement and scaling of systems like Ceph, Lustre, and BeeGFS in data centers involved lengthy planning cycles. It took months for hardware to arrive, be configured, and become operational. Time costs, though frequently ignored, were often the largest expense. Storage systems that allow elastic capacity and performance adjustments translate into faster time-to-market.

Another often overlooked cost is efficiency. In AI workflows, the data pipeline is extensive, involving many interactions with the storage system. Each step, from data collection, cleaning and conversion, labeling, feature extraction, training, and backtesting to production deployment, is influenced by how efficient the storage system is.

However, businesses typically use only a fraction (often less than 20%) of the entire dataset actively. This subset of hot data must be served quickly, while warm or cold data may be accessed rarely or not at all. It's hard to meet both needs in systems like Ceph, Lustre, and BeeGFS.

As a result, many teams use more than one storage system to meet different needs. To get a lot of space at low cost, a common strategy is to use an object storage system for archiving. Object storage is not known for its speed, yet it often ends up handling data ingestion, preprocessing, and cleaning in the data pipeline. While this may not be the best way to preprocess data, it's often the pragmatic choice due to the sheer volume of data. Engineers then have to wait a substantial amount of time for the data to be transferred to the file storage system used for model training.

Therefore, in addition to hardware and software costs of storage systems, total cost considerations should account for time costs invested in cluster operations (including procurement and supply chain management) and time spent managing data across multiple storage systems.

Storage System Comparison

Here's a comparative analysis of the storage products mentioned earlier for your reference:

Category	Product	POSIX compatibility	Elastic capacity	Maximum supported file count	Performance	Cost (USD)
	Amazon S3	Partially compatible through S3FS	Yes	Hundreds of billions	Medium to low	About $0.02/GB/month
	Alluxio	Partial compatibility	N/A	1 billion	Depends on cache capacity	N/A
Cloud file storage service	Amazon EFS	NFSv4.1 compatible	Yes	N/A	Depends on the data size. Throughput up to 3 GB/s, maximum 500 MB/s per client	$0.043~0.30/GB/month
	Azure	SMB & NFS for Premium	Yes	100 million	Performance scales with data capacity	$0.16/GiB/month
	GCP Filestore	NFSv3 compatible	Maximum 63.9 TB	Up to 67,108,864 files per 1 TiB capacity	Performance scales with data capacity	$0.36/GiB/month
Lustre	Lustre	Compatible	No	N/A	Depends on cluster disk count and performance	N/A
Lustre	Amazon FSx for Lustre	Compatible	Manual scaling, 1,200 GiB increments	N/A	Multiple performance types of 50~200 MB/s per 1 TB capacity	$0.073~0.6/GB/month
GPFS	GPFS	Compatible	No	10 billion	Depends on cluster disk count and performance	N/A
	BeeGFS	Compatible	No	Billions	Depends on cluster disk count and performance	N/A
	JuiceFS Cloud Service	Compatible	Elastic capacity, no maximum limit	10 billion	Depends on cache capacity	JuiceFS $0.02/GiB/month + AWS S3 $0.023/GiB/month
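
To turn the per-GiB prices in the table into a monthly bill, a short sketch helps. It uses only the rows priced per GiB, takes the list prices above at face value (they may be out of date and exclude request, transfer, and cache-disk charges), and assumes a 50 TiB dataset purely for illustration.

```python
# List prices per GiB per month, copied from the table above.
PRICE_PER_GIB_MONTH = {
    "JuiceFS Cloud Service + AWS S3": 0.02 + 0.023,
    "Azure": 0.16,
    "GCP Filestore": 0.36,
}


def monthly_cost_usd(capacity_tib: float, price_per_gib_month: float) -> float:
    """Monthly storage cost in USD for a given capacity (1 TiB = 1,024 GiB)."""
    return capacity_tib * 1024 * price_per_gib_month


capacity_tib = 50  # assumed dataset size, for illustration only
for product, price in PRICE_PER_GIB_MONTH.items():
    print(f"{product}: ${monthly_cost_usd(capacity_tib, price):,.0f} per month")
```

The same function can be reused with your own negotiated prices; the time costs discussed in the cost analysis section are not captured by this calculation.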

Conclusion

Over the last decade, cloud computing has rapidly evolved. Previous-generation storage systems designed for data centers couldn't harness the advantages brought by the cloud, notably elasticity. Object storage, a newcomer, offers unparalleled scalability, availability, and cost-efficiency. Still, it exhibits limitations in AI scenarios.

File storage, on the other hand, presents invaluable benefits for AI and other computational use cases. So, what does it take to build a file system ready for the AI era? It starts with the cloud. The key is to marry the virtually limitless scale of object storage with a high-performance file interface. The result? We get systems that are both powerful and practical. Storage is no longer a bottleneck — it's becoming a real driver of AI innovation.

Opinions expressed by DZone contributors are their own.