At its much-anticipated annual I/O event, Google this week announced some exciting new functionality coming to its Gemini AI model, particularly a multi-modal capability called Live, shown off in a pre-recorded video demo.
Although it sounds a lot like the “Live” feature on Instagram or TikTok, Live for Gemini refers to the ability to “show” Gemini your surroundings via your camera and have a two-way conversation with the AI in real time. Think of it as video-calling with a friend who knows everything about everything.
Also: I demoed Google’s Project Astra and it felt like the future of generative AI (until it didn’t)
This year, this kind of AI technology has appeared in a host of other devices, like the Rabbit R1 and the Humane AI Pin, two non-smartphone gadgets that launched this spring to a flurry of hopeful curiosity but ultimately didn’t loosen the smartphone’s grip.
Now that those devices have had their moment in the sun, Google has taken the stage with Gemini’s snappy, conversational multi-modal AI and brought the focus squarely back to the smartphone.
Google teased this functionality the day before I/O in a tweet that showed Gemini correctly identifying the stage at I/O, then providing additional context about the event and asking the user follow-up questions.
In the demo video at I/O, the user turns on their smartphone’s camera and pans around the room, asking Gemini to identify its surroundings and provide context on what it sees. Most impressive was not simply the responses Gemini gave, but how quickly they were generated, which yielded the natural, conversational interaction Google has been striving for.
Also: 3 new Gemini Advanced features unveiled at Google I/O 2024
Google’s so-called Project Astra is centered on bringing this cutting-edge AI technology down to the scale of the smartphone; that’s partly why, Google says, it built Gemini with multi-modal capabilities from the beginning. Getting the AI to respond and ask follow-up questions in real time, however, has apparently been the biggest challenge.
During its R1 launch demo in April, Rabbit showed off similar multi-modal AI technology that many lauded as an exciting feature. Google’s teaser video shows the company has been hard at work developing similar functionality for Gemini that, from the looks of it, might be even better.
Google isn’t alone in making multi-modal AI breakthroughs. Just a day earlier, OpenAI showed off its own updates during its Spring Update livestream, including GPT-4o, its newest AI model, which now enables ChatGPT to “see, hear, and speak.” During the demo, presenters showed the AI various objects and scenarios via their smartphones’ cameras, including a handwritten math problem and a presenter’s facial expressions, and the AI correctly identified each through a similar conversational back-and-forth with its users.
Also: Google’s new ‘Ask Photos’ AI solves a problem I have every day
When Google brings this feature to Gemini on mobile later this year, the company’s technology could jump to the front of the pack in the AI assistant race, particularly given Gemini’s exceedingly natural-sounding cadence and follow-up questions. The full breadth of its capabilities remains to be seen, but this development positions Gemini as perhaps the best-integrated multi-modal AI assistant.
Folks who attended Google I/O in person had a chance to try Gemini’s multi-modal AI for mobile in a controlled “sandbox” environment, and we can expect more hands-on experiences later this year.