Advancing Application Performance With NVMe Storage, Part 2
Advancing Application Performance With NVMe Storage, Part 2
In the second part of this series, we take a look at how using NVMe storage provides a significant advantage over local SSDs.
Join the DZone community and get the full member experience.Join For Free
Using local SSDs inside of the GPU node delivers fast access to data during training, but introduces challenges that impact the overall solution in terms of scalability, data access, and data protection.
Normally, GPU nodes don't have much room for SSDs, which limits the opportunity to train very deep neural networks that need more data. For example, one well-respected vendor's standard solution is limited to 7.5TB of internal storage, and it can only scale to 30TB. In contrast, there are generally available NVMe solutions that can scale from 100TB to 1PB of shared NVMe storage at the performance of local NVMe SSDs, providing the opportunity to significantly increase the depth of the training for neural networks.
A number of today's GPU-based servers have the power to perform entire processing operations on their own, however, some workloads require more than a single GPU node, either to speed up operations by processing across multiple GPUs, or to process a machine learning model to large to fit into a single GPU. If the clustered GPU nodes all need access to the same dataset for their machine learning training, the data has to be copied to each CPU node, leading to capacity limitations and inefficient storage utilization. Alternatively, if the dataset is split among the nodes in the GPU cluster, then data is only stored locally and cannot be shared between the nodes, and there is no redundancy scheme (RAID/replication) to protect the data.
Because using local SSDs may not have the capacity to store the full dataset for machine learning or deep learning, some installations instead use local SSD as cache for a slower storage array to accelerate access to the working dataset. This leads to performance bottlenecks as the amount of data movement leads to delays in cached data being available on the SSDs. As datasets grow, local SSD caching becomes ineffective for feeding the GPU training models at the required speeds.
Shared NVMe storage can solve the performance challenge for GPU clusters by giving shared read/write data access to all nodes in the cluster at the performance of local SSDs. The need to cache or replicate datasets to all nodes in the GPU cluster is eliminated, improving the overall storage efficiency of the cluster. With some solutions offering support for up to 1PB of RAID protected, shared NVMe data, the GPU cluster can tackle massive deep learning training for improved results. For clustered applications, this type of solution is ideal for global filesystems such as IBM Spectrum Scale, Lustre, CEPH and others.
Use Case Scenario Example: Deep Learning Datasets
One vendor provides the hardware infrastructure that their customers use it to test a variety of applications. With simple connectivity via Ethernet (or InfiniBand), shared NVMe storage provides more capacity for deep learning datasets, which would allow them to expand the use cases that it offers to its customers.
Moving to Shared NVMe-oF Storage
Having now discussed the performance of NVMe inside of GPU nodes, let's explore the performance impacts of moving to shared NVMe-oF storage. For this discussion, we will use an example where performance testing would be focused on assessing single node performance of using shared NVMe storage relative to the local SSD inside of the GPU node.
Reasonable benchmark parameters and test objectives could be:
1. RDMA Performance: Test whether RDMA-based (remote direct memory access) connectivity at the core of the storage architecture could enable low-latency and high data throughput.
2. Network Performance: How would large quantities of data affect the network, and whether the network became a bottleneck during data transfers.
3. CPU Consumption: How much CPU power is used during large data transfers over the RDMA enabled NICs.
4. In general, whether RDMA technology could be a key component of an AI/ML computing cluster.
I have, in fact, been privy to similar benchmarks. For side-by-side testing, a TensorFlow benchmark with two different data models was utilized: ResNet-50, a 50-layer residual neural network, as well VGG-19, a 19-layer convolutional neural network that was trained on more than a million images from the ImageNet database. Both models were read-intensive as the neural network ingests massive amounts of data during both the training and processing phases of the benchmark. A single GPU node was used for all testing to maintain a common compute platform for all of the tests runs. The storage appliance was connected to the node via the NVMe-oF protocol over 50GbE/100GbE ports for the shared NVMe storage testing. For the final results, all of the tests used a common configuration of training batch size and quantity. During initial testing, different batch sizes were tested (32, 64, 128), but ultimately the testing was performed using the recommended settings.
A single GPU node was used for all testing to maintain a common compute platform for all of the tests runs. The NVMe appliance was connected to the node via the NVMe-oF protocol over 50GbE/100GbE ports for the shared NVMe storage testing. For the final results, all of the test runs used a common configuration of training batch size and quantity. During initial testing, different batch sizes were tested (32, 64, 128), but ultimately the testing was performed using the recommended settings.
In both image throughput and overall training time, the appliance exceeded the performance of the local NVMe SSD inside the GPU node by a couple of percentage points. This highlights one of the performance advantages of shared NVMe storage: the ability to spread volumes across all drives in the array gains the throughput advantages of multiple SSDs, which compensates for any latency impacts of moving to external storage. In other words, the improved image throughput performance means that more images can be processed in an hour/day/week when using shared NVMe storage than with local SSDs. Although the difference is just a few percentage points, this advantage will scale up as more GPU nodes are added to the compute cluster.
In addition, the training time with NVMe storage was much faster than with local SSDs, again highlighting the advantage of being able to bring the performance of multiple NVMe SSDs to bear in a shared volume. Combined with the scalability of the NVMe storage, this enables customers to not only speed up the performance of training, but to also leverage 100TB or more datasets to enable deep learning for improved results.
Opinions expressed by DZone contributors are their own.