Microsoft has announced the launch of new Azure virtual machines (VMs) aimed specifically at ramping up cloud-based AI supercomputing capabilities.
The new ND H200 v5 series VMs are now generally available to Azure customers and are designed to help enterprises handle increasingly demanding AI workloads.
Harnessing the new VM series, users can supercharge foundation model training and inferencing capabilities, the tech giant revealed.
Scale, efficiency and performance
In a blog post, Microsoft said the new VM series is already being put to use by a raft of customers and partners to drive AI capabilities.
“The scale, efficiency, and enhanced performance of our ND H200 v5 VMs are already driving adoption from customers and Microsoft AI services, such as Azure Machine Learning and Azure OpenAI Service,” the company said.
Among these is OpenAI, which is using the new VM series to drive research and development and fine-tune ChatGPT for users, according to Trevor Cai, the company’s head of infrastructure.
“We’re excited to adopt Azure’s new H200 VMs,” he said. “We’ve seen that H200 offers improved performance with minimal porting effort, we are looking forward to using these VMs to accelerate our research, improve the ChatGPT experience, and further our mission.”
Under the hood of the H200 v5 series
Azure ND H200 v5 VMs are architected with Microsoft’s systems approach to “enhance efficiency and performance,” the company said, and each includes eight Nvidia H200 Tensor Core GPUs.
Microsoft said this addresses a growing ‘gap’ for enterprise users with regard to compute power.
GPUs are growing in raw computational capability faster than their attached memory and memory bandwidth, which has created a bottleneck for AI inferencing and model training, the tech giant said.
“The Azure ND H200 v5 series VMs deliver a 76% increase in High Bandwidth Memory (HBM) to 141GB and a 43% increase in HBM Bandwidth to 4.8 TB/s over the previous generation of Azure ND H100 v5 VMs,” Microsoft said in its announcement.
“This increase in HBM bandwidth enables GPUs to access model parameters faster, helping reduce overall application latency, which is a critical metric for real-time applications such as interactive agents.”
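As a quick sanity check on those percentages, the short Python sketch below back-calculates the previous-generation figures they imply. The function name `implied_baseline` is purely illustrative, and the results are derived from the quoted numbers rather than taken from a spec sheet.

```python
# Figures quoted in the announcement for the ND H200 v5 series.
H200_HBM_GB = 141.0
H200_BANDWIDTH_TBPS = 4.8
HBM_INCREASE = 0.76        # "76% increase" in HBM capacity
BANDWIDTH_INCREASE = 0.43  # "43% increase" in HBM bandwidth


def implied_baseline(new_value: float, fractional_increase: float) -> float:
    """Return the previous-generation value implied by a percentage increase."""
    return new_value / (1.0 + fractional_increase)


h100_hbm_gb = implied_baseline(H200_HBM_GB, HBM_INCREASE)
h100_bandwidth_tbps = implied_baseline(H200_BANDWIDTH_TBPS, BANDWIDTH_INCREASE)

print(f"Implied H100 HBM per GPU: {h100_hbm_gb:.1f} GB")
print(f"Implied H100 HBM bandwidth: {h100_bandwidth_tbps:.2f} TB/s")
```

The implied baselines come out at about 80GB of HBM per GPU and a little under 3.4 TB/s of bandwidth.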
The new VM series can also accommodate more complex large language models (LLMs) within the memory of a single machine, the company said. This improves performance and lets users avoid the costly overhead of running distributed applications across multiple VMs.
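To illustrate what fitting a model on a single machine means in practice, here is a rough sketch that compares a model's weight footprint with the combined HBM of an eight-GPU VM. It assumes 80GB per H100 GPU (the figure implied by the quoted 76% increase) and counts weights only, ignoring the key-value cache, activations and framework overhead. The helper names and the bytes-per-parameter table are illustrative assumptions, not Microsoft's sizing method.

```python
GPUS_PER_VM = 8
# HBM per GPU in GB; the H100 value is an assumption implied by the
# announcement's "76% increase ... to 141GB".
HBM_PER_GPU_GB = {"ND H100 v5": 80, "ND H200 v5": 141}

# Assumed storage cost per parameter for two common numeric formats.
BYTES_PER_PARAM = {"FP16/BF16": 2, "FP8": 1}


def weights_gb(params_billions: float, bytes_per_param: int) -> float:
    """Approximate weight footprint in GB (1 GB taken as 1e9 bytes)."""
    return params_billions * bytes_per_param


def fits_in_one_vm(params_billions: float, bytes_per_param: int,
                   hbm_per_gpu_gb: float) -> bool:
    """True if the weights alone fit in the VM's combined HBM."""
    total_hbm_gb = GPUS_PER_VM * hbm_per_gpu_gb
    return weights_gb(params_billions, bytes_per_param) <= total_hbm_gb


for vm_name, hbm_gb in HBM_PER_GPU_GB.items():
    for fmt, nbytes in BYTES_PER_PARAM.items():
        verdict = "fits" if fits_in_one_vm(405, nbytes, hbm_gb) else "does not fit"
        print(f"405B weights in {fmt} on {vm_name}: {verdict}")
```

Under these assumptions, a 405-billion-parameter model stored at 16-bit precision needs around 810GB for weights alone, which exceeds the 640GB of an eight-GPU H100 machine but sits within the 1,128GB of an eight-GPU H200 machine.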
Better management of GPU memory for model weights and batch sizes is also a key differentiator for the new VM series, Microsoft believes.
Current GPU memory limitations have a direct impact on throughput and latency for LLM-based inference workloads, and create additional costs for enterprises.
By drawing upon a larger HBM capacity, the H200 v5 VMs are capable of supporting larger batch sizes, which Microsoft said drastically improves GPU utilization and throughput compared to previous iterations.
“In early tests, we observed up to 35% throughput increase with ND H200 v5 VMs compared to the ND H100 v5 series for inference workloads running the LLAMA 3.1 405B model (with world size 8, input length 128, output length 8, and maximum batch sizes – 32 for H100 and 96 for H200),” the company said.
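The link between HBM capacity and batch size can be sketched with simple arithmetic: whatever memory is left after loading the weights is what remains for batching requests. The example below assumes 8-bit (FP8) weights and 80GB per H100 GPU; neither figure is stated in the announcement, so treat the result as illustrative rather than as a reconstruction of Microsoft's test setup.

```python
GPUS_PER_VM = 8                # "world size 8" in the quoted test
MODEL_PARAMS_BILLIONS = 405    # Llama 3.1 405B
ASSUMED_BYTES_PER_PARAM = 1    # assumption: FP8 weights (not stated in the source)


def headroom_gb(hbm_per_gpu_gb: float) -> float:
    """HBM left across the VM after the weights are loaded, in GB."""
    total_hbm_gb = GPUS_PER_VM * hbm_per_gpu_gb
    weights = MODEL_PARAMS_BILLIONS * ASSUMED_BYTES_PER_PARAM
    return total_hbm_gb - weights


h100_headroom = headroom_gb(80)    # assumption: 80GB per H100 GPU
h200_headroom = headroom_gb(141)   # 141GB per H200 GPU, per the announcement
ratio = h200_headroom / h100_headroom

print(f"H100 VM headroom: {h100_headroom:.0f} GB")
print(f"H200 VM headroom: {h200_headroom:.0f} GB")
print(f"Headroom ratio: {ratio:.2f}x")
```

Under these assumptions the H200 machine has roughly three times the spare memory, which is in the same range as the threefold difference in maximum batch size (32 versus 96) that Microsoft reported.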