The unveiling of the new GPT-4o model by OpenAI marks a significant milestone in artificial intelligence, with the company claiming it represents a step closer to natural interaction between humans and computers.
The new model can take any combination of text, audio, and images as input and generate output in the same range of formats.
Moreover, it can recognize emotion, analyze facial expressions, be interrupted mid-sentence, translate spoken language in real time, and respond at close to human conversational speed.
Mira Murati, Chief Technology Officer at OpenAI, said during the presentation: “The standout feature of GPT-4o is that it brings GPT-4-level intelligence to everyone, including our free users. This is the first time we are taking a big step forward in terms of ease of use.”
During the presentation, OpenAI demonstrated GPT-4o translating directly between English and Italian, helping a researcher solve a linear equation in real time, and guiding another company executive through deep-breathing exercises by listening to his breathing.
OpenAI engineers and the Chief Technology Officer used a phone to show off the new capabilities. They asked the assistant to add more expression while telling a bedtime story, then abruptly asked it to switch to a robotic voice, and later to finish the story in a singing voice.
Later, they asked the assistant to look at the phone’s camera feed and respond to what appeared on screen. The assistant was also able to speak and respond seamlessly while acting as a translator.
These features mark a significant advance over ChatGPT’s current voice mode, in which users can talk to the system but interaction remains limited: the existing version cannot be interrupted and cannot respond to what the camera sees.
The letter “o” in GPT-4o stands for “omni,” highlighting the model’s multimodal capabilities.
OpenAI stated that GPT-4o was trained end to end across text, vision, and audio, meaning a single neural network processes all incoming and outgoing signals.
This differs from the company’s previous models, GPT-3.5 and GPT-4, which handled spoken questions by first converting speech into text; the model then worked only on that text, stripping out tone and emotion and making interactions slower.
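The announcement itself includes no code, but as a rough illustration of what a combined text-and-image request to the model can look like in practice, the minimal sketch below uses the OpenAI Python SDK; the prompt and image URL are hypothetical placeholders, and audio input is omitted since it is not part of the standard chat endpoint shown here.

```python
# Minimal sketch (not from the announcement): sending text plus an image
# to GPT-4o via the OpenAI Python SDK. The prompt and image URL are
# illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What equation is written on this whiteboard?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/whiteboard.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)  # the model's text reply
```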