I don’t have a relationship with ChatGPT despite lots of time spent using it. After all, it’s just a generative AI chatbot with a knack for answering questions and creating text and images — not a friend.
But after I spent a few days talking with ChatGPT in its new Advanced Voice Mode, which went into a limited trial earlier this month, I have to admit I started to feel more of a bond.
When OpenAI announced in its Spring Update that it would be enhancing ChatGPT’s voice functionality, the startup said it wanted users to have more natural conversations. That includes ChatGPT understanding your emotions and responding accordingly now, so you’re not just talking to a stoic bot.
Pretty cool, right? I mean, who doesn’t love a good conversation? But even OpenAI itself has some caveats about what this might mean.
The new voice and audio capabilities are powered by the company’s GPT-4o AI model, and OpenAI acknowledges that the more natural interaction could lead to anthropomorphization: users feeling the urge to start treating AI chatbots more like actual people. In a report this month, OpenAI found that content delivered with a human-like voice may make us more likely to believe hallucinations, which occur when an AI model delivers false or misleading information.
I know I felt the impulse to treat ChatGPT more like a person — especially since it has a voice from a human actor. When ChatGPT froze up at one point, I asked if it was okay. And this isn’t one-sided. When I sneezed, the AI said “Bless you.”
Voice queries in traditional search have been around for more than a decade, but now they’re all the rage among generative AI chatbots. Or at least two big ones, ChatGPT and Google Gemini. The latter’s conversational Gemini Live feature made its public debut at the Made By Google event last week that also introduced a new lineup of Pixel phones and a raft of AI features. Besides the similarities in conversational skills, Gemini Live and Advanced Voice Mode are both multimodal, meaning the interactions can involve photos and video as well as audio.
The idea has long been that most of us can talk faster than we type and that spoken language is a more natural interface for human-machine interactions. But a human-like voice changes the experience — and perhaps even our relationship with chatbots. And that’s the uncharted territory we’re entering now.
Getting started with Advanced Voice Mode
My access to Advanced Voice Mode came with the caveat that it is undergoing changes and there could be errors or times when it’s not available.
There are unspecified limits on how much you can use Advanced Voice Mode in a given day. OpenAI’s FAQs say you’ll receive a warning when you have 3 minutes left. Thereafter, you can use Standard Voice Mode, which is more limited in its ability to tackle topics and to offer “nuanced” responses. In my experience, Standard Voice Mode is harder to interrupt and is less likely to ask for feedback or to ask follow-up questions. It’s also less likely to give unsolicited advice and to understand emotion.
To access Advanced Voice Mode, you click on the voice icon in the bottom right corner when you pull up the ChatGPT app. You have to make sure the bar at the top of the screen says Advanced — I made the mistake of having an entire conversation in Standard Mode first. You can easily toggle between the two.
I had to choose one of four voices: Juniper, Ember, Breeze and Cove. (You can change it later.) There was initially a fifth, Sky, but CEO Sam Altman suspended it after actor Scarlett Johansson called out OpenAI for the similarity to her own voice.
I opted for Juniper because it was the only female voice, but also because two of the male voices — Ember and Cove — sounded alike.
Then I gave ChatGPT microphone access and we were good to go.
It’s hard not to refer to the voice as “she” since it is female. During our conversation, I asked if I should call it ChatGPT or Juniper and she — I mean, it — said, “You can call me ChatGPT, though Juniper has a nice ring to it. Is that a name you like?” So it seems ChatGPT doesn’t have complete self-awareness yet. Or at least Juniper doesn’t.
Comparing Advanced Voice Mode and Gemini Live
I started by asking what you can do with Advanced Voice Mode, but ChatGPT was as coy as OpenAI has been about it.
“Advanced Voice Mode is designed to offer more dynamic and responsive conversations,” the chatbot said. “With a bit more adaptability in depth, it can handle a wider range of topics and might offer more nuanced responses.”
My guess is this ambiguity is on purpose to not bias our experiences, so that we use our imaginations to try out a range of queries in this initial trial. (OpenAI didn’t have an on-the-record comment.)
Google’s Gemini Live also seems pretty capable when it comes to being dynamic and responsive. My colleague Lexy Savvides got some hands-on time with it and found you can have some pretty in-depth conversations in a fairly natural way, as if you’re talking to a human. You can interrupt both Advanced Voice Mode and Gemini Live, you can ask both to explain topics in simple terms, and both will accept multimodal inputs.
But because Gemini is part of Google’s ecosystem, you can also ask Gemini Live to take action in apps like Gmail, Google Maps and Google Calendar. ChatGPT can’t do that.
Can ChatGPT talk like an auctioneer?
One of the first examples of Advanced Voice Mode that early users shared on social media was asking ChatGPT to mimic an airline pilot. I wanted to ask something in that vein, but it was a little tricky because I couldn’t ask for a celebrity impersonation — that could get OpenAI in hot water again — so it had to be another job type with a distinct sound. (Accents are okay, too.)
I went with challenging the OpenAI chatbot to mimic a fast-talking auctioneer, like this man, where you can’t understand much beyond the prices he’s calling out.
It was a tall order, but I wanted to see what ChatGPT could do.
“I’m curious if you can do an impression of an auctioneer, like a fast-paced auctioneer at an auction where a prized bull is for sale,” I said. “I’m looking for an auctioneer who speaks really fast, where it’s like you can almost not even understand them other than the numbers, do you know what I mean?”
ChatGPT replied, “Got it.” There were no words on the screen. The AI just went for it:
“Alright, we’ve got a fine bull here. Do I hear 5,000? 5,000. Now, 6,000? 6,000. Do I hear seven? Seven. Can I get eight?” ChatGPT said. “It’s all about keeping the momentum and excitement up. How was that?”
It wasn’t quite what I envisioned, but it wasn’t bad.
ChatGPT in Standard Voice Mode gave much the same response but seemed to want to go on endlessly and was hard to interrupt. I tried to interject between 5,000, 6,000 and 7,000. It was already starting in on 8,000 before I finally got through with, “Okay, okay, okay. That was pretty good.”
Advanced Voice Mode also seemed more likely to solicit feedback on its responses.
I can see how this back and forth forges more of a relationship even if one of you isn’t human. I felt compelled to offer encouragement — think of the stage fright! And that’s even though I know ChatGPT doesn’t have feelings.
Can you make an elephant sound?
It was World Elephant Day earlier this month, so I asked for ChatGPT’s best elephant trumpet.
Advanced Voice Mode delivered an okay elephant call. The one-second trumpet didn’t sound so much like an elephant itself but rather like a person trying to sound like an elephant. But it was probably better than I could do myself.
While in Standard Voice Mode, ChatGPT said its best trumpet was coming, but then it sort of went into thinking mode for a while. I asked if it was okay and it said, “I’m here — just catching my breath after that elephant trumpet.”
I told it I didn’t actually hear the elephant trumpet and asked it to try again. It worked the second time.
Can you teach me Mandarin?
Since ChatGPT supports more than 50 languages, I wanted to test out something more practical. Mandarin Chinese is one of the most widely spoken languages in the world, so I asked for help learning an initial word or phrase. “But go easy on me,” I said.
It started off with “ni hao,” which means hello. It spoke the phrase, which is helpful, but I’d have appreciated seeing the pronunciation broken down on screen too.
“It’s a friendly and simple greeting. Want to give it a try?” ChatGPT said.
While both voice modes were encouraging, Advanced Voice Mode tended to ask more follow-up questions, like, “Any other words or phrases you’d like to learn while we’re at it?”
In the case of “xiexie,” or thank you, Advanced Voice Mode offered additional advice I didn’t get in Standard Voice Mode: “The tones are important in Mandarin, so make sure to go down, then up.”
It felt like I was talking to a kind, knowledgeable friend.
Can you help me with a physics problem?
I know ChatGPT can do math (we saw that in the Spring Update), but I was wondering about something harder. I have a friend who is a physics professor, so I asked for help.
He sent the following problem: “A cannonball is fired at an angle theta above the horizon at an initial velocity v. At what time will the cannonball hit the ground? How far from the firing position will the cannonball land? You may neglect air resistance.”
I wanted to show ChatGPT a visual, but it wasn’t obvious how to do that in Advanced Voice Mode. That didn’t become clear until I Xed out, when I saw a transcript of our conversation in the chat window and the option to share photos and files.
When I shared an image in the chat interface later, ChatGPT-4o had no trouble explaining how to solve for time of flight and range.
But when I was talking to ChatGPT, I had to read the problem out loud. It was able to verbally explain how to solve the problem, but the visual component in the more traditional experience was easier to understand.
For the record, ChatGPT arrived at the same answer as my professor friend for the first part: t = 2v sin(theta)/g.
However, ChatGPT got a different answer for range. I’ll have to show it to my professor friend to see what happened because it’s all kind of Greek to me.
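For anyone who wants to check the numbers themselves, here is a minimal Python sketch of the standard textbook solution. It is not what ChatGPT or my professor friend produced. It assumes the cannonball lands at the same height it was fired from, and it plugs in g = 9.81 m/s², a value the problem leaves symbolic. The function names and the sample inputs are made up for illustration.

```python
import math

# Assumed value of gravitational acceleration in m/s^2.
# The original problem leaves g symbolic.
G = 9.81


def time_of_flight(v, theta_deg, g=G):
    """Time for the ball to return to launch height: t = 2 v sin(theta) / g."""
    theta = math.radians(theta_deg)
    return 2 * v * math.sin(theta) / g


def horizontal_range(v, theta_deg, g=G):
    """Horizontal speed times flight time: R = v cos(theta) * t,
    which simplifies to v^2 sin(2 theta) / g."""
    theta = math.radians(theta_deg)
    return v * math.cos(theta) * time_of_flight(v, theta_deg, g)


if __name__ == "__main__":
    # Hypothetical inputs: 20 m/s launch speed at 45 degrees.
    v, angle = 20.0, 45.0
    print(f"time of flight: {time_of_flight(v, angle):.2f} s")
    print(f"range: {horizontal_range(v, angle):.2f} m")
```

The range is just the horizontal speed multiplied by the time in the air, so once the first answer is right, the second follows from it.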
If I’d had something like this in high school, I wouldn’t have struggled so much with AP physics.
Can you help me feel better?
Because Advanced Voice Mode supposedly can understand emotions and respond accordingly, I then tried to act as if I was really sad and said, “It’s just so hard. I don’t know if I’m ever going to get physics.”
While ChatGPT in Standard Voice Mode was nice and supportive, I’m not sure it really understood I was sad. But that could also be because I’m a bad actor.
Advanced Voice Mode seemed to be more empathetic, offering, “We can break down the concepts into smaller steps or we can tackle a different kind of problem to build up your confidence. How does that sound?”
See? This isn’t your run-of-the-mill chatbot experience. It’s blurring into something else entirely.