Nvidia says the new B200 GPU offers up to 20 petaflops of FP4 horsepower from its 208 billion transistors, and that a GB200 that combines two of those GPUs with a single Grace CPU can offer 30 times the performance of an H100 for LLM inference workloads while potentially being substantially more efficient. It “reduces cost and energy consumption by up to 25x” over an H100, says Nvidia.
On a GPT-3 LLM benchmark with 175 billion parameters, Nvidia says, the GB200 offers a somewhat more modest 7x the performance of an H100, along with 4x the training speed.
Nvidia told journalists one of the key differences is a second-gen transformer engine that doubles the compute, bandwidth, and model size by using four bits for each neuron instead of eight (thus the 20 petaflops of FP4 I mentioned earlier). A second key difference only comes when you link up huge numbers of these GPUs in a server: a next-gen NVLink networking solution that lets 576 GPUs talk to each other, with 1.8 terabytes per second of bidirectional bandwidth.
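If you want a feel for why halving the bits matters, here's a quick back-of-the-envelope sketch (my arithmetic, not Nvidia's): at half the bits per parameter, the same memory and bandwidth budget holds twice the model, which is where the "doubles the compute, bandwidth, and model size" framing comes from.

```python
# Back-of-the-envelope: why dropping from FP8 to FP4 doubles what fits
# in the same memory budget. Illustrative arithmetic only.

def model_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in gigabytes."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

params = 175  # a GPT-3-class model, in billions of parameters
print(f"FP8: {model_memory_gb(params, 8):.0f} GB")  # ~175 GB
print(f"FP4: {model_memory_gb(params, 4):.0f} GB")  # ~88 GB -- half the footprint
```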
Previously, Nvidia says, a cluster of just 16 GPUs would spend 60 percent of its time communicating and only 40 percent actually computing.
Nvidia is counting on companies buying large quantities of these GPUs, of course, and is packaging them in larger supercomputer-ready designs, like the GB200 NVL72, which plugs 36 CPUs and 72 GPUs into a single liquid-cooled rack for a total of 720 petaflops of AI training performance or 1,440 petaflops (aka 1.4 exaflops) of inference. Each tray in the rack contains either two GB200 chips or two NVLink switches, with 18 of the former and nine of the latter per rack. In total, Nvidia says one of these racks can support a 27-trillion-parameter model. GPT-4 is rumored to be around a 1.7-trillion-parameter model.
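Those rack totals line up with the per-chip figures above, if you want to check the math yourself (the specs are Nvidia's; the arithmetic is mine):

```python
# Sanity-checking the GB200 NVL72 rack math against the per-chip specs.

compute_trays = 18       # trays carrying GB200 superchips
gb200_per_tray = 2       # two GB200 superchips per compute tray
gpus_per_gb200 = 2       # each GB200 pairs two B200 GPUs with one Grace CPU
fp4_pflops_per_gpu = 20  # B200 peak FP4 throughput

gb200_chips = compute_trays * gb200_per_tray       # 36 superchips
cpus = gb200_chips                                 # 36 Grace CPUs
gpus = gb200_chips * gpus_per_gb200                # 72 B200 GPUs
inference_pflops = gpus * fp4_pflops_per_gpu       # 1,440 petaflops of FP4

print(cpus, gpus, inference_pflops)  # 36 72 1440
```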
The company says Amazon, Google, Microsoft, and Oracle are all already planning to offer the NVL72 racks in their cloud service offerings, though it’s not clear how many they’re buying.
And of course, Nvidia is happy to offer companies the rest of the solution, too. Here’s the DGX Superpod for DGX GB200, which combines eight systems in one for a total of 288 CPUs, 576 GPUs, 240TB of memory, and 11.5 exaflops of FP4 computing.
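And those Superpod totals are exactly eight times the NVL72 rack figures, a quick check (again, my arithmetic):

```python
# The DGX Superpod totals scale linearly from the NVL72 rack numbers.
systems = 8
print(systems * 36)    # 288 Grace CPUs
print(systems * 72)    # 576 B200 GPUs
print(systems * 1.44)  # 11.52 exaflops of FP4, which Nvidia rounds to 11.5
```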
Nvidia says its systems can scale to tens of thousands of the GB200 superchips, connected with 800Gbps networking via its new Quantum-X800 InfiniBand (for up to 144 connections) or Spectrum-X800 Ethernet (for up to 64 connections).