ChatGPT maker OpenAI has unveiled its new flagship artificial intelligence (AI) model—GPT-4o—which is said to reason across audio, vision, and text in real time.
The new model offers GPT-4-level intelligence but is much faster and improves on GPT-4's capabilities across text, voice, and vision, the company said.
“Our new model GPT-4o is our best model ever. It is smart, it is fast, it is natively multimodal (!) … ,” OpenAI CEO Sam Altman posted on X (formerly Twitter).
“It is available to all ChatGPT users, including on the free plan! So far, GPT-4 class models have only been available to people who pay a monthly subscription. This is important to our mission; we want to put great AI tools in the hands of everyone,” he added.
GPT-4o, where the 'o' stands for 'omni', will be rolled out to all users for free with usage limits, while paid users will have higher capacity limits.
The new AI model was trained end-to-end across text, vision, and audio, which means all inputs and outputs are processed by the same neural network, said OpenAI, adding that it’s the firm’s first model combining all modalities.
According to the company, GPT-4o is "much better than any existing model" at understanding and discussing the images users share. For instance, users can snap a photo of a menu in a different language and converse with GPT-4o to translate it, learn about the food's history and significance, and get recommendations.
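For developers, the same image-understanding capability is exposed through OpenAI's API. Below is a minimal sketch using the OpenAI Python SDK's chat-completions interface; the menu image URL and the prompt text are hypothetical placeholders, not examples from OpenAI.

```python
# Minimal sketch: sending a menu photo to GPT-4o via the OpenAI
# Python SDK's chat-completions endpoint. The image URL below is
# a hypothetical placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Translate this menu to English and describe the dishes."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/menu.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```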
In the API, GPT-4o is 2x faster and has 5x higher rate limits than GPT-4 Turbo, the company said.
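Rate limits still apply per model, so API callers commonly wrap requests in a retry loop. The following is a minimal sketch assuming the OpenAI Python SDK's RateLimitError; the complete_with_backoff helper is hypothetical.

```python
# Minimal retry sketch: back off exponentially when GPT-4o's rate
# limit is hit. complete_with_backoff is a hypothetical helper.
import time

from openai import OpenAI, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def complete_with_backoff(prompt: str, retries: int = 5):
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
            )
        except RateLimitError:
            # Wait 1s, 2s, 4s, ... before retrying.
            time.sleep(2 ** attempt)
    raise RuntimeError("rate limit persisted after retries")
```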
"and with video mode!! pic.twitter.com/cpjKokEGVd," Altman posted on X on May 13, 2024.
OpenAI said GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation.
Future enhancements will enable more natural, real-time voice chats and the option to interact with ChatGPT through live video conversations.
“We recognise that GPT-4o’s audio modalities present a variety of novel risks. Today we are publicly releasing text and image inputs and text outputs. Over the upcoming weeks and months, we’ll be working on the technical infrastructure, usability via post-training, and safety necessary to release the other modalities,” the company said.
GPT-4o has undergone external red teaming with over 70 experts in social psychology, bias and fairness, and misinformation to pinpoint risks from newly added modalities.
Red teaming is a practice in which a group plays the role of an adversary to probe a system or plan for weaknesses, so that flaws can be found and fixed before the system is released.
Edited by Swetha Kannan