OpenAI on Monday announced a slew of new additions to ChatGPT, its artificial intelligence chatbot, in a “Spring Update” event streamed in front of a live audience of employees at its San Francisco office. Mira Murati, the company’s chief technology officer, led the announcements alongside some of the engineers who worked on their development, while Sam Altman, OpenAI’s chief executive, live-posted from the audience on the social media website X. I highly recommend watching the entire presentation, as it is truly one of the most mind-blowing demonstrations one will ever see. It is just 26 minutes long and is available for free on OpenAI’s website. But here is the rundown of the main announcements:

  1. A new large language model, called GPT-4o, with the “o” standing for “omni.” It generates responses significantly faster than GPT-4 while being just as intelligent as the older version of the generative pre-trained transformer.
  2. A new, improved voice mode that integrates a live camera so ChatGPT can see and speak concurrently. Users can interrupt the robot while it speaks, and the model acts more expressively, tuning its responses to the user’s emotions.
  3. A native ChatGPT application for macOS that lets users summon the chatbot with a keyboard shortcut, share their screen to ask questions about it, and ask ChatGPT about the contents of their clipboard.

Again, the video presentation is compulsory viewing, even for the less technically inclined. No written summary will be able to describe the emotional rush felt while watching a robot act like a human being. The most compelling portion of the demonstration was when the two engineers spoke to the chatbot on an iPhone, through the app, and watched it rattle off eloquent, human-like responses to questions asked naturally. It really is something to behold.

However, something stuck out to me throughout the banter between the humans and the chatbot: the expressiveness. Virtual assistants, no matter how good their text-to-speech capabilities may be, still speak like inanimate non-player characters, in a way. Their responses are tailored specifically to the questions users pose, but they still sound pre-written and artificial. Humans frequently use filler words, like “um,” “uh,” and “like”; they take long pauses to finish thoughts before speaking them aloud; and they read and speak expressively, with each word sounding different each time. Emphasis might land on different parts of a word, or it might be said at a different speed; the point is, humans do not speak perfectly. They speak like humans.

The new voice mode of ChatGPT, powered by GPT-4o, speaks just like a real person would. It laughs, it takes pauses, it places emphasis on different parts of words and sentences, and it speaks loosely. It acts more like a compassionate friend than a professional assistant: it does not aim to be formal in any way, but it also tries to maintain some degree of clarity. It won’t meander the way a person might, but it does sound like it could. For example, when the chatbot viewed a piece of paper with the words “I ♥ ChatGPT,” it responded with odd care: “Oh, stop it, you’re making me blush!” Aside from the fact that robots cannot blush, the way it said “oh” and the pause that came after it carried the same expression and emotion they would have if a human had said it. The chatbot sounded surprised, befuddled, and flustered, even though it had prepared that response by solving what was essentially just a tough algebra problem.

Other instances, however, were pretty awkward: ChatGPT came across as very talkative in the demonstration, such as when the presenters made mistakes or asked the robot to wait a second. Instead of simply replying “Sure” or just firing back with an “mhmm” as a person would, it gave an annoyingly verbose answer: “Sure, I’d love to see it whenever you’re ready!” No person would speak like that unless they were trying to be extra flattering or appear overly attentive. It could be that ChatGPT’s makers programmed the robot to perform this way for the presentation just so the audience could hear more of the Scarlett Johansson-esque voice straight out of the movie “Her,” but the constant talkativeness broke the immersion and frankly made me want to tell it to quiet down a bit.

The robot also seemed oddly witty, as if it carried some sass in its responses. It wasn’t rude, of course, but it sounded like a very confident salesperson when it should have been more subdued. It liked to use words like “Whoops!” and sprinkled small bits of humor into its responses; again, signs of wordiness. I assume this is meant to make the robot sound more humanlike, because awkward silences are unpleasant and may lead users to think ChatGPT is still processing information or not ready to receive a request. In fact, while in voice mode, it is always processing information and always ready to receive requests. It can be interrupted with no qualms, it can be asked different questions, and it can wait on standby for more information. Because GPT-4o generates responses so quickly, there is almost no latency between questions, which is jarring to adjust to but also mimics personal interactions.

Because ChatGPT has no facial expressions, it has to rely on sometimes annoying audio cues to keep the conversation flowing. That doesn’t mean ChatGPT can’t sense users’ emotions or feelings, though: the omni-modal capabilities behind the “o” in GPT-4o let it pick up on tacit intricacies in speech. It can also use the camera to detect facial expressions, but the more interesting use was what it could do with its virtual apparatus. Not only can users speak to ChatGPT while it is looking at something by way of its “omni-modal” capabilities, but they can also share their computer screens and make selections on the fly to receive guidance from ChatGPT as if it were a friend looking over their shoulder. An intriguing demonstration was when the robot guided a user through solving a math equation, identifying mistakes as they were made on the paper without any additional input. That was seriously impressive. Another example involved writing code: ChatGPT could look at some code in a document and describe what it did, then make modifications to it.
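For readers who want to tinker before the new features roll out, that screen- and code-reading behavior maps loosely onto what OpenAI’s existing API already allows: sending an image alongside a text prompt in a single request. Here is a minimal sketch in Python, assuming the official `openai` SDK; the model name, file name, and prompt are placeholders for illustration, not the exact setup OpenAI used on stage.

```python
# Hypothetical sketch: asking a vision-capable model to explain a screenshot of code.
# Assumes the official `openai` Python package (v1+) and an OPENAI_API_KEY environment
# variable; the model name and file path are placeholders for illustration only.
import base64

from openai import OpenAI

client = OpenAI()

# Encode a local screenshot as a data URL so it can ride along with the text prompt.
with open("screenshot_of_code.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this code do? Suggest one change."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The desktop app presumably does something similar under the hood, turning the shared screen into image frames for the model, though that is my guess rather than anything OpenAI has detailed.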

GPT-4o’s underlying technology is still OpenAI’s flagship GPT-4 LLM, which remains available to paying customers, though I wouldn’t know why one would use it, as it’s worse and has lower usage limits. But the new LLM is now trained on audio and visual data in addition to text. Previously, as Murati described during the event, ChatGPT had to perform a dance of transcribing speech, describing images, processing the information like a normal LLM text query, and then finally running the answer through a text-to-speech model. GPT-4o performs all of those steps inherently as part of a single processing pipeline: it natively accepts multimodal input and processes it without first converting everything to text. It knows what objects are in real life, it knows how people speak, and it knows how to speak like them. It is truly advanced technology, and I can’t wait to use it when it launches “in the coming weeks.”
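To appreciate how much plumbing that old “dance” involved, here is a rough sketch of the chained approach using OpenAI’s public API, assuming the official `openai` Python SDK; the file names and voice are placeholders, and error handling is omitted.

```python
# Rough sketch of the three-model "dance" Murati described, using OpenAI's public API.
# Assumes the official `openai` Python package (v1+); the file names and voice are
# placeholders, and error handling is omitted for brevity.
from openai import OpenAI

client = OpenAI()

# 1. Speech-to-text: transcribe the user's spoken question with Whisper.
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Text reasoning: run the transcript through the language model like any text query.
#    Anything Whisper could not capture (tone, pacing, laughter) is already lost here.
completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = completion.choices[0].message.content

# 3. Text-to-speech: render the written answer back into audio.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=answer,
)
speech.write_to_file("answer.mp3")
```

GPT-4o’s pitch is that one model replaces all three of those calls, taking audio in and putting audio out directly, which is where the expressiveness and the near-instant responses come from.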

While the concept of a truly humanlike chatbot is still unsettling to me, I feel like we’ll all become accustomed to assistants like the one OpenAI announced on Monday. I also believe they’ll be more intertwined with our daily lives because of their deep integration with our current technology, like iPhones and Macs, unlike AI-focused devices (grifts) such as the ones from Humane and Rabbit. (The new Mac app is awesome.) It’s an exciting, amazing time for technology.