Alibaba unveils the network and datacenter design it uses for large language model training

Alibaba has revealed its datacenter design for LLM training: an Ethernet-based network in which each host contains eight GPUs and nine NICs, each with two 200 Gb/sec ports.
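As a rough illustration of what those per-host figures add up to, the sketch below multiplies them out. It assumes the port speed is 200 gigabits per second and simply sums every port on every NIC; the variable names are illustrative and not taken from Alibaba's material.

```python
# Back-of-the-envelope host bandwidth from the figures quoted above.
GPUS_PER_HOST = 8
NICS_PER_HOST = 9
PORTS_PER_NIC = 2
PORT_GBPS = 200  # assumption: gigabits per second per port

ports_per_host = NICS_PER_HOST * PORTS_PER_NIC
host_gbps = ports_per_host * PORT_GBPS

print(f"ports per host: {ports_per_host}")
print(f"aggregate NIC bandwidth per host: {host_gbps} Gb/s ({host_gbps / 1000} Tb/s)")
```

This is a ceiling on raw port capacity only; it says nothing about how the ports are divided between training traffic and other uses.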

The tech giant, which also offers one of the best large language models (LLMs) around in its 110-billion-parameter Qwen model, says this design has been used in production for eight months. It aims to maximize the utilization of a GPU’s PCIe capabilities, increasing the send/receive capacity of the network.

Another feature that increases speed is the use of NVLink for the intra-host network, which provides more bandwidth between the GPUs within a host. Each port on the NICs is connected to a different top-of-rack switch, avoiding a single point of failure, a design that Alibaba calls rail-optimized.
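The fault-tolerance idea can be sketched in a few lines. This is a simplified model, not Alibaba's actual topology: it assumes a single pair of top-of-rack switches with the hypothetical names `tor-a` and `tor-b`, and wires port 0 of every NIC to one and port 1 to the other.

```python
NICS_PER_HOST = 9
TORS = ("tor-a", "tor-b")  # hypothetical switch names, one per NIC port


def wire_host(nics=NICS_PER_HOST, tors=TORS):
    """Map each (nic, port) pair to a ToR so a NIC's two ports never share a switch."""
    return {(nic, port): tors[port] for nic in range(nics) for port in range(len(tors))}


def nics_still_connected(wiring, failed_tor):
    """Return the NICs that keep at least one live link after one ToR fails."""
    return {nic for (nic, _port), tor in wiring.items() if tor != failed_tor}


wiring = wire_host()
for tor in TORS:
    survivors = nics_still_connected(wiring, tor)
    print(f"{tor} down: {len(survivors)}/{NICS_PER_HOST} NICs still reachable")
```

In this toy model, losing either switch halves each NIC's port count but leaves every NIC reachable, which is the property the single-ToR fault tolerance goal describes.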

Each pod contains 15,000 GPUs

A new type of network is required because traffic patterns in LLM training differ from those of general cloud computing: the traffic is low-entropy and bursty. There is also a higher sensitivity to faults and single points of failure.

“Based on the unique characteristics of LLM training, we decided to build a new network architecture specifically for LLM training. We should meet the following goals; scalability, high performance, and single-ToR fault tolerance,” the company said.

Another part of the infrastructure that was revealed was the cooling mechanism. As no vendors could provide a solution to keep chips below 105°C, the temperature at which switches begin to shut down, Alibaba designed and built its own vapor chamber heat sink, which uses more wick pillars at the center of the chip to carry heat away more efficiently.

The design for LLM training is organized into pods that contain 15,000 GPUs, and each pod can be located in a single datacenter. “All datacenter buildings in commission in Alibaba Cloud have an overall power constraint of 18MW, and an 18MW building can accommodate approximately 15K GPUs. In conjunction with HPN, each single building perfectly houses an entire Pod, making predominant links inside the same building,” Alibaba wrote.
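The quoted figures imply a per-GPU power budget, worked out in the sketch below. It assumes the full 18MW is divided evenly across 15,000 GPUs, so the result covers everything in the building (hosts, network, cooling), not the GPU chip alone.

```python
BUILDING_POWER_MW = 18      # power constraint per building, as quoted
GPUS_PER_BUILDING = 15_000  # approximate GPU count per building, as quoted

# Convert megawatts to watts, then divide evenly across GPUs.
watts_per_gpu = BUILDING_POWER_MW * 1_000_000 / GPUS_PER_BUILDING

print(f"all-in power budget per GPU: {watts_per_gpu:.0f} W")
```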

Alibaba also wrote that it expects model parameters to rise by an order of magnitude over the next several years, from one trillion to 10 trillion, and that its new architecture is planned to support this and scale to 100,000 GPUs.
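For a sense of scale, the sketch below estimates how many pods that target would take. It assumes growth comes from adding whole pods of the current 15,000-GPU size, which is an assumption for illustration; the article does not say how Alibaba would reach 100,000 GPUs.

```python
import math

TARGET_GPUS = 100_000
GPUS_PER_POD = 15_000  # current pod size, as quoted

# Round up: a partial pod still needs a whole building under this assumption.
pods_needed = math.ceil(TARGET_GPUS / GPUS_PER_POD)

print(f"pods needed for {TARGET_GPUS:,} GPUs: {pods_needed}")
```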

Via The Register