Cerebras Systems announced on Tuesday that it has made a small version of Meta Platforms’ Llama perform as well as a much larger version by adding the increasingly popular generative artificial intelligence (AI) approach known as “chain of thought.” The AI computer maker announced the advance at the start of the annual NeurIPS conference on AI.
“This is a closed-source-only capability, but we wanted to bring this capability to the most popular ecosystem, which is Llama,” said James Wang, head of Cerebras’s product marketing effort, in an interview with ZDNET.
The project is the latest in a line of open-source projects Cerebras has done to demonstrate the capabilities of its purpose-built AI computer, the “CS-3,” which it sells in competition with the status quo in AI — GPU chips from the customary vendors, Nvidia and AMD.
Also: DeepSeek challenges OpenAI’s o1 in chain of thought – but it’s missing a few links
The company was able to train the open-source Llama 3.1 AI model, which uses only 70 billion parameters, to match or exceed the accuracy of the much larger 405-billion-parameter version of Llama on various benchmark tests.
Those tests include CRUX, a test of “complex reasoning tasks” developed at MIT and Meta, and LiveCodeBench, a code-generation benchmark developed at UC Berkeley, MIT, and Cornell University, among others.
Chain of thought can enable models built with less training time, data, and computing power to equal or surpass a larger model’s performance.
“Essentially, we’re now beating Llama 3.1 405B, a model that’s some seven times larger, just by thinking more at inference time,” said Wang.
The idea behind chain-of-thought processing is for the AI model to detail the sequence of calculations performed in pursuit of the final answer, to achieve “explainable” AI. Such explainable AI could conceivably give humans greater confidence in AI’s predictions by disclosing the basis for answers.
OpenAI has popularized the chain-of-thought approach with its recently released “o1” large language model.
Also: How laws strain to keep pace with AI advances and data theft
Cerebras’s answer to o1, dubbed “Cerebras Planning and Optimization,” or CePO, works by requiring Llama, at the time a prompt is submitted, to “produce a plan to solve the given problem step-by-step,” carry out the plan repeatedly, analyze the responses from each execution, and then select a “best of” answer.
“Unlike a traditional LLM, where the code is just literally token by token by token, this will look at its own code that it generated and see, does it make sense?” Wang explained. “Are there syntax errors? Does it actually accomplish what the person asks for? And it will run this kind of logic loop of plan execution and cross-checking multiple times.”
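Cerebras has not published CePO’s implementation, but the loop Wang describes can be sketched in a few lines of Python. Everything below, including the function names and the prompts, is a hypothetical placeholder for illustration rather than Cerebras code:

```python
# Rough sketch of the plan / execute / cross-check loop Wang describes.
# All names and prompts here are hypothetical placeholders; Cerebras has
# not published CePO's actual implementation.

def cepo_answer(llm, problem: str, num_attempts: int = 4) -> str:
    # Step 1: ask the model to lay out a step-by-step plan for the problem.
    plan = llm(f"Produce a plan to solve the given problem step-by-step:\n{problem}")

    candidates = []
    for _ in range(num_attempts):
        # Step 2: carry out the plan, producing a candidate answer.
        answer = llm(f"Follow this plan and solve the problem.\nPlan:\n{plan}\nProblem:\n{problem}")

        # Step 3: have the model cross-check its own output, e.g. checking
        # generated code for syntax errors and whether it does what was asked.
        review = llm(f"Check this answer for errors and rate it from 1 to 10:\n{answer}")
        candidates.append((review, answer))

    # Step 4: select a "best of" answer from the reviewed candidates.
    best = llm("Pick the best answer from these reviewed candidates:\n" +
               "\n---\n".join(f"{review}\n{answer}" for review, answer in candidates))
    return best
```

The `llm` argument stands in for whatever call actually invokes the model; the point is the repeated plan-execute-critique cycle, not any particular API.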
In addition to matching or exceeding the 405B version of Llama 3.1, Cerebras was able to take the latest Llama version, 3.3, and make it perform at the level of “frontier” large language models such as Anthropic’s Claude 3.5 Sonnet and OpenAI’s GPT-4 Turbo.
“This is the first time, I think, anyone has taken a 70B model, which is generally considered medium-sized, and achieved a frontier-level performance,” said Wang.
Also: AI startup Cerebras unveils the WSE-3, the largest chip yet for generative AI
Humorously, Cerebras also put Llama to the “Strawberry Test,” a prompt that alludes to the “strawberry” code name for OpenAI’s o1. When the number of r’s in the word is multiplied, as in “strrrawberrry,” and language models are asked how many r’s it contains, they often fail. Using chain of thought, Llama 3.1 was able to report the correct count for varying numbers of r’s.
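The ground truth for such a prompt is easy to check outside the model. A one-line Python check for the “strrrawberrry” variant mentioned above (illustrative only, not part of Cerebras’s test setup):

```python
# Count the ground truth the model is asked to reproduce.
word = "strrrawberrry"
print(word.count("r"))  # prints 6 for this variant
```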
From a corporate perspective, Cerebras is eager to demonstrate the hardware and software advantage of its AI computer, the CS-3.
The work on Llama was done on CS-3s using Cerebras’s WSE-3 chip, the world’s largest semiconductor. The company was able to run the Llama 3.1 70B model, as well as the newer Llama 3.3, with chain of thought without the typical lag seen in o1 and other models running on Nvidia and AMD chips, said Wang.
The chain-of-thought version of 3.1 70B is “the only reasoning model that runs in real time” when running on Cerebras CS-3s, the company claims. “OpenAI reasoning model o1 runs in minutes; CePO runs in seconds.”
Cerebras, which recently introduced what it calls “the world’s fastest inference service,” claims the CS-3 machines are 16 times faster than the fastest service on GPU chips, processing 2,100 tokens per second.
Also: AI startup Cerebras debuts ‘world’s fastest inference’ service – with a twist
Cerebras’s experiment supports a growing sense that not only the training of AI models but also inference in production is scaling to ever-larger computing needs as prompts become more complex.
In general, said Wang, the accuracy of large language models will improve in proportion to the amount of compute used, both in training and in inference; however, the factor by which the performance improves will vary depending on what approach is used in each case.
“Different techniques will scale with compute by different degrees,” said Wang. “The slope of the lines will be different. The remarkable thing — and why scaling laws are talked about — is the fact that it scales at all, and seemingly without end.”
Also: AI pioneer Cerebras opens up generative AI where OpenAI goes dark
“The classical view was that improvements would plateau and you would need algorithmic breakthroughs,” he said. “Scaling laws say, ‘No, you can just throw more compute at it with no practical limit.’ The type of neural network, reasoning method, etc. affects the rate of improvement, but not its scalable nature.”
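As a rough illustration of that point, a scaling law is usually written as a power law in compute, where the technique changes the exponent (the slope on a log-log plot) but not the overall trend. The constants and exponents below are invented for illustration and are not Cerebras or published figures:

```python
# Illustrative only: a scaling law says error falls as a power of compute C,
# error(C) = a * C ** (-alpha). A different technique changes alpha, the slope
# on a log-log plot, but more compute keeps helping either way.
# The constants below are invented for illustration.

def error(compute: float, a: float = 1.0, alpha: float = 0.3) -> float:
    return a * compute ** (-alpha)

for c in (1e3, 1e6, 1e9):
    print(f"compute={c:.0e}  technique_A={error(c, alpha=0.3):.5f}  "
          f"technique_B={error(c, alpha=0.5):.5f}")
```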
In different implementations, chain of thought can output either a verbose series of its intermediate results or a kind of status message saying something like “thinking.” Asked which Cerebras opted for, Wang said that he had not himself seen the actual output, but that “it’s probably verbose. When we release stuff that’s designed to serve Llama and open-source models, people like to see the intermediate results.”
Also on Tuesday, Cerebras announced it has shown “initial” training of a large language model that has one trillion parameters, in a research project conducted with Sandia National Laboratories, a laboratory run by the US Department of Energy.
The work was done on a single CS-3, combined with Cerebras’s purpose-built memory computer, the MemoryX. A special version of the MemoryX was boosted to 55 terabytes of memory to hold the model’s parameter weights, which were then streamed to the CS-3 over Cerebras’s dedicated networking computer, the SwarmX.
Also: Want generative AI LLMs integrated with your business data? You need RAG
The CS-3 system, Cerebras claims, would replace 287 of Nvidia’s top-of-the-line “Grace Blackwell 200” combined CPU-and-GPU chips that would be needed to access equivalent memory.
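That count is roughly what falls out of dividing the 55 terabytes of parameter memory by the high-bandwidth memory attached to a single Blackwell GPU. The 192 GB-per-GPU figure below is an assumption for illustration, not a number from Cerebras:

```python
import math

# Back-of-the-envelope check of the GPU count. The 192 GB of HBM per
# Blackwell GPU is an assumed figure for illustration, not from Cerebras.
model_memory_gb = 55 * 1000   # 55 TB of parameter memory
hbm_per_gpu_gb = 192          # assumed HBM per GPU
print(math.ceil(model_memory_gb / hbm_per_gpu_gb))  # 287
```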
The combination of the one CS-3 and the MemoryX takes up two standard telco racks of equipment, said Wang. The company claims that this takes less than one percent of the space and power of the equivalent GPU arrangement.
The MemoryX device uses commodity DRAM, known as DDR5, in contrast to GPU cards, which use more expensive high-bandwidth memory, or HBM.
“It does not touch the HBM supply chain so it’s extremely easy to procure, and it’s inexpensive,” said Wang.
Cerebras is betting the real payoff is in the programming model. To program the hundreds of GPUs in concert, said Wang, a total of 20,507 lines of code are needed to coordinate an AI model’s Python, C, C++, and shell code, among other resources. The same task can be carried out on the CS-3 machine with 565 lines of code.
“This is not just a need from a hardware perspective, it’s so much simpler from a programming perspective,” he said, “because you can drop this trillion-parameter model directly into this block of memory,” whereas the GPUs involve “managing” across “thousands of 80-gigabyte blocks” of HBM memory to coordinate parameters.
The research project trained the AI model, which was not disclosed, for 50 training steps, though it did not yet train it to “convergence,” meaning a finished state. Training a trillion-parameter model to convergence would require many more machines and more time.
Also: The best AI for coding (and what not to use)
However, Cerebras subsequently worked with Sandia to run the training on 16 of the CS-3 machines. Performance increased in a “linear” fashion, said Wang, meaning that training speed grows in proportion to the number of computers added to the cluster.
“The GPU has always claimed linear scaling, but it’s very, very difficult to achieve,” said Wang. “The whole point of our wafer-scale cluster is that because memory is this unified block, compute is separate, and we have a fabric in between, you do not have to worry about that.”
Although the work with Sandia did not train the model to convergence, such large-model training “is very important to our customers,” said Wang. “This is literally step one before you do a large run which costs so much money,” meaning, full convergence, he said.
One of the company’s largest customers, investment firm G42 of the United Arab Emirates, “is very much motivated to achieve a world-class result,” he said. “They want to train a very, very large model.”
Sandia will probably publish details of the experiment when it has some “final results,” said Wang.
The NeurIPS conference is one of the premier events in AI, often featuring the first public disclosure of breakthroughs. The full schedule for the one-week event can be found on the NeurIPS website.