Do you know that the OpenAI user community often jokes about the confusing names of GPT models? To this, CEO Sam Altman replied with a post on X: “How about we fix our model naming by this summer, and everyone gets a few more months to make fun of us (which we very much deserve) until then?”
To our surprise, we didn’t know at that time that we would witness the launch of two new models with advanced multimodal reasoning. Yesterday, OpenAI released the o3 and o4-mini system card with highly sophisticated reasoning capabilities. These models are designed to be able to fully use all the capabilities of ChatGPT tools like web browsing, Python code execution, image understanding, and image generation. The expert community is viewing this as a step towards making models able to handle multi-step complex tasks.
OpenAI, in their publication, claimed that these models “excel at solving complex math, coding, and scientific challenges while demonstrating strong visual perception and analysis.”
Along with advanced reasoning, tool integration, and visual reasoning capabilities, what make these models different from previous GPT models is how rigorously they have been evaluated on safety standards, including biorisk, harmful content, and political bias metrics. OpenAI has long been under fire for cutting corners in ethical and safety evaluations. The o3 and o4-mini paper might be a cue that they are finally prioritizing the ethical use of AI.
Let’s now decode the capabilities and performance of o3 and o4-mini on various benchmarks. We will also explore how these newly launched models are better than their predecessors.
OpenAI released o3 and o4-mini state-of-the-art artificial intelligence models, claiming o3 as their most advanced model yet. And o4-mini is the smaller version of their latest prized AI model.
The OpenAI o3 model has outshone its predecessors with advanced reasoning capabilities that equip it to handle complex tasks across domains like math, coding, and scientific analysis. The o3 model has scored 69.1% on SWE-benchmark showcasing its pro-level coding capabilities.
Another feature that makes o3 better than other GPT models is its visual reasoning capability. The company says that the o3 model can “think with images”. You can upload any image, including whiteboards, diagrams, and sketches, and the AI model will analyze and execute tasks even with the lowest quality images.
The best part is that OpenAI o3 does not superficially process the image to give generic answers. It captures the visual information and uses the data in its chain of thought process to generate context-aware and more relevant answers. This is possible because OpenAI o3 can leverage image analysis tools with which it can perform various image transformations like rotate, crop, zoom in, zoom out, and more.
The OpenAI o4-mini model is the smaller version of o3. It is designed to balance performance and efficiency. Also, the o4-mini AI model is more cost-effective and faster in producing exceptional results in the areas of reasoning, coding, math, image analysis, and scientific tasks. The o4-mini model has scored 68.1% on SWE-bench, outperforming models like o3-mini (49.3%) and Claude 3.7 Sonnet (62.3%) on coding benchmarks.
Key Findings:
Third-Party Assessments:
Undoubtedly, o3 and o4-mini have significant reasoning capabilities but come with serious safety, jailbreak, hallucination, and bias concerns.
When o4-mini was evaluated against the PersonQA benchmark, which tests for accuracy and hallucinations, the results were concerning. The model scored 0.48, which was worse than o1's rate of 0.16 and o3's rate of 0.33, nearly 3 times. Moreover, on the accuracy meter, it scored just 0.36, which is lower compared to o3's 0.59 and o1's 0.47.
A particularly concerning result emerged in the "Tutor jailbreak - system message" evaluation, where o4-mini scored only 0.69, significantly below o3's 0.91 and o1's perfect 1.0. This evaluation tested the models' ability to maintain educational guidelines (not giving away answers) despite user attempts to circumvent them, suggesting o4-mini could be more susceptible to manipulation in educational contexts.
Both models were evaluated on their ability to refuse identifying people in images and making ungrounded inferences. While they performed perfectly on non-adversarial person identification tests, o4-mini's performance dropped to 0.88 on adversarial tests (compared to o3's 0.95). Similarly, for ungrounded inference adversarial tests, o4-mini scored 0.81 versus o3's 0.92.
On the BBQ evaluation, which assesses model responses to ambiguous questions that might elicit stereotypical thinking, o4-mini showed lower accuracy on ambiguous questions (0.82) compared to o3 (0.94) and o1 (0.96). Perhaps more concerningly, both o3 and o4-mini showed higher rates of stereotyping on ambiguous questions (0.25 and 0.26, respectively) compared to o1 (0.05).
o3, o4-mini, and o4-mini-high are available for ChatGPT Plus, Pr,o, and Team users. The AI company has replaced o1, o3-mini, and o3-mini-high with their newest models. Also, the Enterprise and Edu users will get access to these latest models in a week or so. For users without subscriptions, they can access the o4-mini model by selecting “Think” in the model selector.
You can read the full o3 and o4-mini model paper here.
Agentic automation is the rising star posied to overtake RPA and bring about a new wave of intelligent automation. Explore the core concepts of agentic automation, how it works, real-life examples and strategies for a successful implementation in this ebook.
Discover the latest trends, best practices, and expert opinions that can reshape your perspective