Last year, I shared the results of our effort to build a new AI-capable deep learning supercomputer with the best possible configuration in a single system: 16 NVIDIA P100 GPUs. We crossed 2,000 images per second, our (figurative) Mach 2, while training ResNet-50 on the ImageNet dataset. At the time, we estimated that with access to the next-generation Volta GPUs we could reach Mach 5. So this year at SC18, the international conference for high performance computing, we teamed up with researchers at the U.S. Naval Research Laboratory and Liqid to achieve that goal.
We pulled together a 16-GPU system with NVIDIA Volta V100 32 GB cards and NVMe SSDs built with Intel® Optane™ on a Liqid PCIe fabric to deliver best-in-class read performance, since loading up to 20 GPUs with 32 GB of VRAM each demands very high throughput from the storage media. NVIDIA Volta is the most powerful GPU in the industry, with Tensor Cores specialized for deep learning. As standard practice we chose the PCIe version, which is slightly slower than the SXM2 module with NVLink in terms of clock speed and power headroom (250 W TDP). The large GPU memory was attractive for the minibatch sizes we could use for deep learning. We picked NVCaffe as the deep learning framework optimized for Volta platforms and ResNet-50 as the model, with a minibatch of 320 per GPU. Yes, that was the largest batch size we were able to fit on any GPU for ImageNet training with 16-bit floating point (FP16) precision.
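For readers who want the back-of-envelope numbers, here is a minimal sketch in Python (illustrative only; the "Mach 1 = 1,000 images per second" shorthand follows the figures above, and the data-parallel setup where every GPU processes its own minibatch of 320 is an assumption about the configuration):

```python
# Back-of-envelope sketch of the aggregate batch size and the "Mach" metric.
# Assumptions: data-parallel training, each GPU holding its own minibatch of 320,
# and Mach 1 defined as 1,000 images per second.

NUM_GPUS = 16
PER_GPU_BATCH = 320  # largest FP16 minibatch that fit in 32 GB of VRAM

aggregate_batch = NUM_GPUS * PER_GPU_BATCH  # images consumed per iteration
print(f"Aggregate batch per iteration: {aggregate_batch}")  # 5,120 images

def mach(images_per_sec: float) -> float:
    """Convert throughput to the (figurative) Mach scale used in this post."""
    return images_per_sec / 1000.0

print(f"Last year's P100 result: Mach {mach(2000):.1f}")  # Mach 2.0
```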
System Specs:
Compute:

2-way Intel Xeon E5-2699 v4 at 2.2 GHz (88 CPU threads in total) with 1,024 GB of LRDIMM. The system was built by CocoLink Corp. with 20-GPU capability in a 7RU form factor. This is the same system we used for our cryptocurrency mining experiment last year.
Storage:
The 16 GPUs with 32 GB of VRAM each add up to 512 GB of GPU memory, a huge amount to load from a slow SATA 3 SSD that lacked adequate read bandwidth and caused long delays in batch loads. This problem was eliminated when we switched to a Liqid NVMe PCIe SSD built with Intel M.2 NVMe drives and an internal PCIe fabric, delivering 1.6 million IOPS for random reads (4 KB blocks).
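To see why the SATA drive became the bottleneck, here is a rough, assumption-laden estimate in Python; the average ImageNet JPEG size (~110 KB) is a commonly cited figure, not a measurement from our run:

```python
# Rough estimate of the sustained read bandwidth needed to keep 16 GPUs fed.
# Assumptions: ~110 KB average ImageNet training JPEG and a ~10,000 images/sec
# target throughput; neither number is a measurement from this experiment.

AVG_JPEG_BYTES = 110 * 1024        # ~110 KB per training image (assumed)
TARGET_IMAGES_PER_SEC = 10_000     # the throughput the GPUs can sustain

required_read_bw = AVG_JPEG_BYTES * TARGET_IMAGES_PER_SEC  # bytes per second
print(f"Required sustained read bandwidth: {required_read_bw / 1e9:.2f} GB/s")
# ~1.1 GB/s, roughly double what a single SATA 3 SSD (~0.55 GB/s) can deliver,
# but comfortably within reach of an NVMe PCIe fabric.
```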
Training Benchmark:

Configuration: 16 NVIDIA Tesla V100 (32 GB) + CUDA 10 (FP16), ResNet-50, batch size 320 per GPU, image size 256, crop size 224. GPU application clocks were overclocked to 1,380 MHz.
We had initially anticipated that switching from the P100 to the V100 would allow us to cross Mach 5 with FP16 (5,000 images/sec), but the results more than doubled our expectations. Yes, we hit Mach 10!
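To put Mach 10 into wall-clock terms, here is a hedged sketch; the ImageNet-1k training-set size and the 90-epoch schedule are conventional figures for ResNet-50, not numbers reported from this specific run:

```python
# What ~10,000 images/sec means for wall-clock training time.
# Assumptions: ImageNet-1k training set (~1.28M images), throughput sustained
# end to end, and a conventional 90-epoch ResNet-50 schedule.

IMAGENET_TRAIN_IMAGES = 1_281_167
IMAGES_PER_SEC = 10_000            # "Mach 10"
EPOCHS = 90

seconds_per_epoch = IMAGENET_TRAIN_IMAGES / IMAGES_PER_SEC
total_hours = seconds_per_epoch * EPOCHS / 3600
print(f"~{seconds_per_epoch:.0f} s per epoch, ~{total_hours:.1f} h for {EPOCHS} epochs")
# roughly 128 s per epoch, or about 3.2 hours for a 90-epoch run
```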

We’d like to sincerely thank our partners at the U.S. Naval Research Laboratory for providing most of the V100 cards for the experiment, as well as the SC18 booth space and power. Additionally, this would not have been possible without the Liqid team, who provided the remaining V100 cards and the fast NVMe storage, along with onsite tuning support. The CocoLink team helped with most of the system build and assembly, as well as benchmark-related tuning. Last but not least, thanks to the onsite support from the NVIDIA Solution Architects, who helped troubleshoot some system performance issues before we could optimize the final performance with 16 GPUs.
Now, we are working on enabling the full capacity of the system with 20 V100s and some BIOS tuning with our partners CocoLink and ASRock. We’ll also investigate how we can go beyond 20 GPUs using a composable infrastructure solution from Liqid. As always, I’ll make sure to keep you posted along this amazing journey.
Editor’s note: Alexandre Delteil, a deep learning researcher at Orange, collaborated on the project and contributed to this report.