Editor’s note: Ghislain Putois, who works on the Orange Innovation Data & AI team, also contributed to this project and blog post.
Using natural language processing (NLP) training, we achieved a training throughput of 935,343 tokens/sec and reached minimal validation loss in under 1 hour and 49 minutes on the WMT14 en-de Transformer translation task, a standardized English-to-German translation benchmark. We ran the task with Fairseq, Facebook's sequence-modeling toolkit, which is built on the PyTorch framework. This accomplishment was significant because we achieved our highest training speed so far using commercially available general-purpose GPUs. Our team consisted of members from Orange Silicon Valley and the Orange Innovation Data & AI organizations.
As we illustrated in our past experiment with peripheral component interconnect express (PCIe) fabric-based composable infrastructure (described on this blog in 2020), we have the ability to transform any off-the-shelf server into a multi-graphics-processing-unit (GPU) single-node supercomputer. This time we were able to obtain Nvidia A100 GPUs and a PCIe Gen4 composable stack from Liqid, a provider of composable disaggregated infrastructure (CDI) software solutions and intelligent fabrics. This was a collaboration between Orange Silicon Valley, the Orange Data & AI team, and engineers at Liqid.
Below is the description of the full pod configuration:
For this exercise, we selected one Dell server and, using Liqid Matrix™ CDI software and composable fabric, assigned 16 A100 GPUs (40 GB each), eight GPUs per JBOG (Just a Bunch of GPUs), along with Liqid's 16 TB NVM Express (NVMe) Deep Learning Cache, where the training data was stored.
Server specs: two-way AMD EPYC 7H12 (64 cores per socket), 1 TB memory
All GPUs had peer-to-peer communication enabled across the top-of-rack (ToR) PCIe fabric.
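As a sanity check for this kind of setup, pairwise peer-to-peer reachability can be probed programmatically. The sketch below is a minimal illustration, not our actual validation script: `p2p_matrix` and `all_pairs_enabled` are hypothetical helpers, and on a real machine with PyTorch installed the `can_access` predicate could be `torch.cuda.can_device_access_peer`.

```python
# Sketch: summarize pairwise GPU peer-to-peer (P2P) reachability.
# `can_access` is any predicate (dev, peer) -> bool; with PyTorch on real
# hardware you could pass torch.cuda.can_device_access_peer instead of the
# stand-in lambda used below.
def p2p_matrix(num_gpus, can_access):
    """Return an NxN matrix; entry [i][j] is True if GPU i can reach GPU j."""
    return [[i != j and can_access(i, j) for j in range(num_gpus)]
            for i in range(num_gpus)]

def all_pairs_enabled(matrix):
    """True when every distinct GPU pair has P2P enabled."""
    n = len(matrix)
    return all(matrix[i][j] for i in range(n) for j in range(n) if i != j)

# Hypothetical check over 16 fabric-attached GPUs, assuming the fabric
# exposes P2P between every pair (as in the configuration described above):
fabric_p2p = p2p_matrix(16, lambda dev, peer: True)
print(all_pairs_enabled(fabric_p2p))  # True when the full mesh is reachable
```

On a real node, any `False` off-diagonal entry would flag a GPU pair falling back to staged copies through host memory.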
The following tests were performed. ImageNet: training throughput, with a ResNet-50 model trained in TensorFlow at batch sizes of 512 and 768. Unlike last year's benchmark with Nvidia's RTX 8000 (48 GB), we were not able to fit a training batch of 1,024 within 40 GB of GPU memory.
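A rough back-of-envelope check makes the batch-size ceiling plausible. The assumption below is that per-GPU memory consumed by activations scales roughly linearly with batch size; this is a simplification, since parameter and optimizer state are batch-independent.

```python
# Back-of-envelope: a batch of 1,024 fit in 48 GB (RTX 8000) last year.
# Assuming memory use scales roughly linearly with batch size (a
# simplification), the largest batch a 40 GB A100 could hold is about:
batch_48gb = 1024
est_max_batch_40gb = batch_48gb * 40 // 48
print(est_max_batch_40gb)  # 853 -- consistent with 768 fitting and 1,024 not
```

The estimate of roughly 850 sits between the two observed data points: 768 trains, 1,024 runs out of memory.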
NLP: via the Transformer model for PyTorch. The Transformer is currently the backbone neural-network architecture for training complex, structured NLP use cases. We executed the standard WMT14_en_de benchmark to collect a standard baseline:
We also ran full training to minimal validation loss, which was reached within two hours with a batch size of 10,240 tokens (subword units produced by the SentencePiece tokenizer). We then pushed the batch size higher and, after many crashes, trials, and errors, were able to squeeze in a maximum batch size of 16,000 tokens. This allowed us to achieve a throughput of 935,343 tokens/sec and to reach the minimal validation loss (the objective function the neural network is trying to minimize) in under 1 hour and 49 minutes.
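A little arithmetic puts these numbers in perspective. The only assumption below is that the 16,000-token batch is a per-GPU limit (as Fairseq's `--max-tokens` option is); everything else comes from the figures above.

```python
# Rough totals implied by the NLP run (all inputs from the text above).
tokens_per_sec = 935_343
run_seconds = 1 * 3600 + 49 * 60           # 1 h 49 min = 6,540 s
total_tokens = tokens_per_sec * run_seconds
print(f"{total_tokens:,}")                 # ~6.1 billion tokens processed

# Assuming the 16,000-token batch is per GPU across 16 GPUs, one optimizer
# step consumes 256,000 tokens, giving the update rate:
tokens_per_step = 16_000 * 16
steps_per_sec = tokens_per_sec / tokens_per_step
print(round(steps_per_sec, 2))             # ~3.65 updates/sec
```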
Unlike our 2020 experiment on Gen3 fabric, we did not have access to a 20-GPU JBOG on Gen4 fabric. With four more GPUs, we expect that even higher throughput and a shorter time to minimal validation loss could be achieved for both ImageNet and NLP training. Based on the current results, however, we can conclude that we have built one of the fastest single-node deep learning supercomputers by leveraging composable architecture and commercial off-the-shelf general-purpose GPUs.
We thank the Liqid and Nvidia teams for their collaboration and support regarding this quick experiment.