Most of OpenAI’s changes to ChatGPT involve what the AI-powered bot can do: questions it can answer, information it can access, improved underlying models. This time, though, it’s tweaking the way you use ChatGPT itself. The company is rolling out a new version of the service that lets you prompt the AI bot not just by typing sentences into a text box but also by speaking aloud or simply uploading a picture. The new features are rolling out to those who pay for ChatGPT over the next two weeks, and everyone else will get them “soon after,” according to OpenAI.
The voice chat part is pretty familiar: you tap a button and speak your question; ChatGPT converts it to text, feeds it to the large language model, gets an answer back, converts that back into speech, and speaks the answer out loud. It should feel just like talking to Alexa or Google Assistant, only — OpenAI hopes — the answers will be better thanks to the improved underlying tech. Most virtual assistants, it appears, are being rebuilt to rely on LLMs; OpenAI is just ahead of the game.
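For readers who want to see what that loop looks like in code, here’s a minimal sketch using OpenAI’s public Python SDK. To be clear, this is an illustration of the general speech-in, speech-out pattern, not a look inside the ChatGPT app: the model names (“whisper-1”, “gpt-4”, “tts-1”), the “nova” voice, and the file names are all stand-ins.

```python
# Minimal sketch of a speech-in, speech-out loop using OpenAI's public Python SDK.
# Model names ("whisper-1", "gpt-4", "tts-1") and the voice ("nova") are illustrative
# placeholders; ChatGPT's internal pipeline is not documented.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Speech to text: transcribe the user's spoken question.
with open("question.m4a", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Feed the transcribed text to the language model.
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = chat.choices[0].message.content

# 3. Text to speech: turn the answer back into audio and save it.
speech = client.audio.speech.create(
    model="tts-1",
    voice="nova",
    input=answer,
)
speech.write_to_file("answer.mp3")
```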
OpenAI’s excellent Whisper model does a lot of the speech-to-text work, and the company is rolling out a new text-to-speech model it says can generate “human-like audio from just text and a few seconds of sample speech.” You’ll be able to choose ChatGPT’s voice from five options, but OpenAI seems to think the model has vastly more potential than that. OpenAI is working with Spotify to translate podcasts into other languages, for instance, all while keeping the sound of the podcaster’s voice. There are lots of interesting uses for synthetic voices, and OpenAI could be a big part of that industry.
But the fact that you can build a capable synthetic voice from just a few seconds of audio also opens the door to all kinds of problematic use cases. “These capabilities also present new risks, such as the potential for malicious actors to impersonate public figures or commit fraud,” the company says in a blog post announcing the new features. The model isn’t available for broad use for precisely that reason, OpenAI says: it will be much more tightly controlled, restricted to specific use cases and partnerships.
The image search, meanwhile, is a bit like Google Lens. You snap a photo of whatever you’re interested in, and ChatGPT will try to suss out what you’re asking about and respond accordingly. You can also use the app’s drawing tool to help make your query clear, or speak or type questions to go along with the image. This is where ChatGPT’s back-and-forth nature is helpful: rather than doing a search, getting the wrong answer, and then doing another search, you can prompt the bot and refine the answer as you go. (This is a lot like what Google is doing with multimodal search, too.)
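OpenAI’s API offers a rough analogue of this flow: you can attach an image to a request alongside a text question and let a multimodal model answer about both. The sketch below assumes the “gpt-4o” model, a local photo, and a question about it; it’s an illustration of the pattern, not how the ChatGPT app itself is wired.

```python
# Rough sketch of an image-plus-text prompt against a multimodal model via the API.
# The model name ("gpt-4o"), the photo, and the question are illustrative choices.
import base64
from openai import OpenAI

client = OpenAI()

# Encode a local photo as a base64 data URL so it can be sent inline with the prompt.
with open("bike.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "How do I lower the seat on this bike?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```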
Obviously, image search has potential issues of its own. One is what could happen when you prompt a chatbot about a person: OpenAI says it has deliberately limited ChatGPT’s “ability to analyze and make direct statements about people,” both for accuracy and privacy reasons. That means one of the most sci-fi visions for AI — the ability to look at someone and ask, “Who is that?” — isn’t coming anytime soon. Which is probably a good thing.
Almost a year after ChatGPT’s initial launch, OpenAI still seems to be figuring out how to give its bot more features and capabilities without creating new sets of problems and downsides. With these releases, the company has attempted to walk that line by deliberately capping what its new models can do. But that approach won’t work forever. As more people use voice control and image search, and as ChatGPT inches closer to being a truly multimodal, useful virtual assistant, it’ll get harder and harder to keep the guardrails on.