Google’s new gaming AI aims past “superhuman opponent” and at “obedient partner”

Enlarge / Even hunt-and-fetch quests are better with a little AI help.

At this point in the progression of machine-learning AI, we’re accustomed to specially trained agents that can utterly dominate everything from Atari games to complex board games like Go. But what if an AI agent could be trained not just to play a specific game but also to interact with any generic 3D environment? And what if that AI was focused not only on brute-force winning but also on responding to natural language commands in that gaming environment?

Those are the kinds of questions animating Google’s DeepMind research group in creating SIMA, a “Scalable, Instructable, Multiworld Agent” that “isn’t trained to win, it’s trained to do what it’s told,” as research engineer Tim Harley put it in a presentation attended by Ars Technica. “And not just in one game, but… across a variety of different games all at once.”

Harley stresses that SIMA is still “very much a research project,” and the results achieved in the project’s initial tech report show there’s a long way to go before SIMA starts to approach human-level listening capabilities. Still, Harley said he hopes that SIMA can eventually provide the basis for AI agents that players can instruct and talk to in cooperative gameplay situations—think less “superhuman opponent” and more “believable partner.”

“This work isn’t about achieving high game scores,” as Google puts it in a blog post announcing its research. “Learning to play even one video game is a technical feat for an AI system, but learning to follow instructions in a variety of game settings could unlock more helpful AI agents for any environment.”

Learning how to learn

Google trained SIMA on nine very different open-world games in an attempt to create a generalizable AI agent.

To train SIMA, the DeepMind team focused on three-dimensional games and test environments controlled either from a first-person perspective or an over-the-shoulder third-person perspective. The nine games in its test suite, which were provided by Google’s developer partners, all prioritize “open-ended interactions” and eschew “extreme violence” while providing a wide range of different environments and interactions, from “outer space exploration” to “wacky goat mayhem.”
In an effort to make SIMA as generalizable as possible, the agent isn’t given any privileged access to a game’s internal data or control APIs. The system takes nothing but on-screen pixels as its input and provides nothing but keyboard and mouse controls as its output, mimicking “the [model] humans have been using [to play video games] for 50 years,” as the researchers put it. The team also designed the agent to work with games running in real time (i.e., at 30 frames per second) rather than slowing down the simulation for extra processing time like some other interactive machine-learning projects.

Animated samples of SIMA responding to basic commands across very different gaming environments.

While these restrictions increase the difficulty of SIMA’s tasks, they also mean the agent can be integrated into a new game or environment “off the shelf” with minimal setup and without any specific training regarding the “ground truth” of a game world. It also makes it relatively easy to test whether things SIMA has learned from training on previous games can “transfer” over to previously unseen games, which could be a key step to getting at artificial general intelligence.

For training data, SIMA uses video of human gameplay (and associated time-coded inputs) on the provided games, annotated with natural language descriptions of what’s happening in the footage. These clips are focused on “instructions that can be completed in less than approximately 10 seconds” to avoid the complexity that can develop with “the breadth of possible instructions over long timescales,” as the researchers put it in their tech report. Integration with pre-trained models like SPARC and Phenaki also helps the SIMA model avoid having to learn how to interpret language and visual data from scratch.

Source link