What is it like to build a groundbreaking supercomputer and then be asked to make it even faster? Here’s a look at what I have been doing over the past year.
In 2016, teams from Orange Silicon Valley and CoCoLink Korea, together with artificial intelligence researchers from Orange in France, showcased the world's highest-density deep learning supercomputer-in-a-box: 20 NVIDIA K40 GPUs on a single PCIe root complex, fully functional with NVIDIA's CUDA 7.5 software stack.
This year, the same team set out to upgrade the system, replacing the 20 NVIDIA K40 GPUs with the more powerful NVIDIA Tesla P100, released in April of last year. We ran into a constraint, and (spoiler alert) it wasn't the hardware.
The CUDA 8 software stack did not support configurations beyond 16 devices in a single node, which forced us to limit this build to 16 cards. That is still substantial: 57,344 CUDA cores. When the software evolves to support 20 NVIDIA P100 cards, the resulting 71,680 CUDA cores will deliver 186 TFLOPS of single-precision (FP32) performance and 94 TFLOPS of double-precision (FP64) performance, making the system a contender for the most powerful single-node deep learning supercomputer on the market, based on what we know today.
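The arithmetic behind those numbers is simple, using NVIDIA's published per-card figures for the PCIe Tesla P100 (3,584 CUDA cores, 9.3 TFLOPS FP32, 4.7 TFLOPS FP64):

```python
# Aggregate compute for 16 vs. 20 Tesla P100 (PCIe, 16 GB) cards.
CORES = 3584   # CUDA cores per card
FP32 = 9.3     # peak single-precision TFLOPS per card
FP64 = 4.7     # peak double-precision TFLOPS per card

for n in (16, 20):
    print(f"{n} GPUs: {n * CORES:,} CUDA cores, "
          f"{n * FP32:.0f} TFLOPS FP32, {n * FP64:.0f} TFLOPS FP64")
```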
We also wanted to ensure that the platform relied on open standards for inter-device communication, and that means PCIe. We wanted to give our users the flexibility to choose any PCIe-based accelerator card for deep learning, in any quantity up to the maximum of 16 slots, with the maximum compute per slot. Putting that all together, again based on what we know of the market today, we chose the fastest PCIe-based GPU for deep learning: the NVIDIA Tesla P100 16GB.
For optimal P2P performance in deep learning workloads, all GPUs were allocated under a single PCIe root complex, with bidirectional P2P bandwidth of approximately 25 GB/s. Driving that many powerful GPUs requires significant CPU power, so we chose two Intel Xeon E5-2699 v4 processors at 2.2 GHz (88 CPU threads in total) backed by 512 GB of system memory.
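One quick way to verify that topology is to ask the driver whether each pair of GPUs can reach the other peer-to-peer. Here is a minimal sketch using PyCUDA, our choice for illustration (any binding of the CUDA driver API's cuDeviceCanAccessPeer works, and nvidia-smi topo -m shows the same picture from the shell):

```python
import pycuda.driver as cuda

cuda.init()
n = cuda.Device.count()
devices = [cuda.Device(i) for i in range(n)]

# Under a single PCIe root complex, every pair should report peer access.
for i in range(n):
    for j in range(n):
        if i != j and not devices[i].can_access_peer(devices[j]):
            print(f"GPU {i} cannot reach GPU {j} peer-to-peer")
print(f"Checked P2P access across {n} GPUs")
```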
To pack all the GPUs into a 4RU chassis, we replaced the standard heatsinks with specially engineered single-width heatsinks. High-CFM fans provide sufficient airflow, keeping the GPUs at an optimal operating temperature at full throttle.
To test the capability of the system, we took it for a spin with Caffe2, the recently released deep learning framework created by a Facebook research team in collaboration with NVIDIA. We benchmarked it by training ResNet-50 on the ImageNet dataset and measuring image throughput during the training phase. All the P100 GPUs were overclocked to 1,328 MHz (from a base clock of 1,189 MHz).
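Throughput here means images consumed per second of wall-clock training time, averaged over many iterations. Below is a minimal sketch of that measurement, where train_iteration is a hypothetical stand-in for one full forward/backward/update step across all GPUs (not Caffe2's actual API):

```python
import time

def images_per_sec(train_iteration, global_batch_size, warmup=10, timed_iters=100):
    # train_iteration: callable running one training step across all GPUs
    # global_batch_size: images consumed per step (per-GPU batch x GPU count)
    for _ in range(warmup):        # let GPU clocks and cuDNN autotuning settle
        train_iteration()
    start = time.time()
    for _ in range(timed_iters):
        train_iteration()
    return timed_iters * global_batch_size / (time.time() - start)
```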
With a training throughput of 2,050 images/sec, we crossed the (figurative) Mach 2 barrier, making this the fastest single-node deep learning supercomputer built from commercially available components on the market as of this writing.
With the upcoming Volta GPUs (PCIe version), the same system can be upgraded to 20 V100 cards, and we are excited to see the acceleration that the new Tensor Cores and Caffe2's FP16 support will bring.
The outlook with the NVIDIA V100 (Volta) and its FP16 support is very promising. If we can load our system with 20 PCIe V100 GPUs, and the CUDA 9 and Caffe2 software stacks support them, it may be possible to exceed 5,000 images/sec with a single node. That's "Mach 5." We can't wait to see that happen!
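A quick back-of-the-envelope check on that projection, using only the numbers reported above:

```python
# Sanity check of the 5,000 images/sec projection (not a benchmark).
p100_rate = 2050 / 16     # ~128 images/sec per P100 in our ResNet-50 run
v100_needed = 5000 / 20   # 250 images/sec per V100 to hit the target
print(f"per-GPU speedup required: {v100_needed / p100_rate:.2f}x")
```

That works out to roughly a 2x per-GPU speedup, which Tensor Cores and FP16 arithmetic could plausibly deliver.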
Editor’s note: Deep learning researcher Alexandre Delteil collaborated on the project and also contributed to this report.