Editor’s note: Orange Silicon Valley recently worked with composable infrastructure company Liqid on a groundbreaking supercomputing project to solve some very real problems encountered at the data center level. Below is a first-hand account about the work and discovery that took place. Orange’s Olivier Varene also contributed to this article.
Multi-graphics processing unit (GPU) data center architectures are gaining popularity as a way to more efficiently support the uneven, data-intensive operations associated with deploying artificial intelligence and machine learning applications. As we experiment with ways to address the challenges of these workflows, we have looked at packing more powerful GPU accelerators into a single system to increase the speed of artificial intelligence (AI) computing.
Unfortunately, rapidly advancing techniques for GPU aggregation can get bottlenecked quickly by design and testing cycles. For example, to accommodate a new server design for GPU computing in a single environment or enclosure, server original equipment manufacturer (OEM) providers must redesign the peripheral component interconnect express (PCIe) fabric and chassis around the new CPU and motherboard. This usually requires 6-9 months of design and testing by the OEM before the system can go into mass production — a significant amount of time to wait in a rapidly advancing field.
Composable infrastructure offers a new way to manage resources in the data center: compute, network, and storage are treated as a pool of disaggregated resources that can be provisioned on demand, based on the requirements of any given application workload. By extending the PCIe fabric outside the physical constraints of a server via top-of-rack PCIe fabric switching, it is possible to create and “compose” bare metal systems on the fly with the necessary amount of resources.
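The disaggregated-pool idea described above can be sketched in a few lines of code. This is an illustrative model only: the class and method names are invented for this example and are not the Liqid API; a real fabric manager attaches physical devices over PCIe rather than shuffling Python lists.

```python
# Illustrative sketch of composable infrastructure: GPUs live in a shared,
# rack-level pool behind a PCIe fabric switch, and are logically attached to
# (or released from) bare-metal servers on demand. Names are hypothetical,
# not the Liqid API.

class GpuPool:
    """A pool of disaggregated GPUs reachable over a PCIe fabric."""

    def __init__(self, total_gpus):
        self.free = list(range(total_gpus))  # device IDs still unassigned
        self.assigned = {}                   # server name -> list of device IDs

    def compose(self, server, gpu_count):
        """Logically attach gpu_count GPUs from the pool to a server."""
        if gpu_count > len(self.free):
            raise RuntimeError("not enough free GPUs in the pool")
        devices = [self.free.pop() for _ in range(gpu_count)]
        self.assigned.setdefault(server, []).extend(devices)
        return devices

    def release(self, server):
        """Return a server's GPUs to the pool when its workload completes."""
        self.free.extend(self.assigned.pop(server, []))


pool = GpuPool(total_gpus=20)               # one fully populated JBOG
pool.compose("train-node-1", gpu_count=16)  # heavy training job
pool.compose("infer-node-1", gpu_count=4)   # light inference job
pool.release("train-node-1")                # training done: GPUs go back
print(len(pool.free))                       # 16 GPUs free for the next job
```

The key point the sketch captures is that no chassis or motherboard changes are involved: composition and release are pure software operations against the fabric.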
Working with the composable infrastructure provider Liqid, Orange Silicon Valley has identified a solution that addresses the manufacturer delays described above while also increasing performance and utilization for powerful GPU accelerators. We can simplify deployment by disaggregating servers — with CPU and memory only — from GPUs and other compute and storage devices, housing them in separate physical enclosures. PCIe fabric-based switching and composable infrastructure software enable servers and GPUs to be configured independently via software, without waiting for a new GPU server to be designed.
At the same time, with the right composability capabilities, we can logically assign any number of GPUs to the same server without any need for chassis redesign. Here is what we have accomplished in collaboration with Liqid:
1. We have designed the world’s highest-capacity JBOG (short for “Just a Bunch of GPUs”), capable of hosting 20 full-size, datacenter-class GPUs; it is a PCIe enclosure outside a server that is connected to the server via a software-defined PCIe fabric switch.
2. We have created one of the world’s fastest off-the-shelf, single-node, production-ready AI supercomputers, fully populated with 20 GPUs and one server connected via Liqid’s intelligent fabric and composable software.
3. We have achieved the highest single-node GPU memory capacity of 960GB, using 20 NVIDIA Quadro RTX 8000 GPUs, each equipped with 48GB of memory.
4. We can scale to 20 GPUs, and even to 40 GPUs with two JBOGs if needed, given the right BIOS support.
5. With composable software from Liqid we can reallocate GPUs to other users when heavy workloads are completed, preventing the valuable resources from sitting idle.
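The capacity figures in the list above follow from simple arithmetic, sketched here as a sanity check (the per-GPU memory and GPU counts are the figures stated in the article):

```python
# Back-of-the-envelope check of the capacities claimed above.
GPU_MEMORY_GB = 48   # NVIDIA Quadro RTX 8000
GPUS_PER_JBOG = 20   # one fully populated JBOG

single_node_memory = GPUS_PER_JBOG * GPU_MEMORY_GB
print(single_node_memory)        # 960 GB of GPU memory in a single node

max_gpus_two_jbogs = 2 * GPUS_PER_JBOG
print(max_gpus_two_jbogs)        # 40 GPUs when a second JBOG is attached
```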
For testing, we chose ResNet-50 model training with TensorFlow using a batch size of 1024 (made possible by the large 48GB memory on each GPU). We were able to achieve an image training throughput in excess of 15,000 images per second.
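To put that throughput in context, a rough calculation follows. It assumes the throughput is measured in images per second and uses the standard ImageNet-1K training set (roughly 1.28 million images), which is the usual dataset for ResNet-50 benchmarks; the exact dataset used is an assumption here.

```python
# Rough arithmetic on the benchmark result above.
THROUGHPUT = 15_000          # images/second, from the result above
BATCH_SIZE = 1024            # per-step batch size used in the test
IMAGENET_TRAIN = 1_281_167   # images in the ImageNet-1K training split (assumed dataset)

steps_per_sec = THROUGHPUT / BATCH_SIZE
epoch_seconds = IMAGENET_TRAIN / THROUGHPUT

print(round(steps_per_sec, 1))   # ~14.6 optimizer steps per second
print(round(epoch_seconds))      # ~85 seconds per training epoch
```

At that rate, a full pass over the dataset completes in under a minute and a half, which is what makes single-node training at this scale practical.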
In the old architectural model, every time AMD or another provider released a new processor, GPU systems and other resources needed to be redesigned to accommodate it. By configuring servers via composable software, however, this ceases to be a problem. Decoupled resources can be added (or removed) as needed, without the additional engineering that can add three to six months to the development cycle. In our case, we used a Dell PowerEdge R7515 with a single AMD EPYC 7702P 64-core processor and 1TB of DRAM.
We can now confirm that this is the industry’s first adaptive AI supercomputing block, delivering one of the fastest documented deep learning performance results while using all off-the-shelf components. In addition, we have optimized the cost of GPU deployment by aggregating the maximum number of GPUs in a JBOG, significantly reducing the overhead of deploying high-density computing. We are excited to see what real-world deployment of this advanced system will bring, and we look forward to reporting future results as they come.