OpenAI Introduces o3 and o4-mini System Card

Sarfraz Nawaz
CEO and Founder of Ampcome
April 17, 2025

Sarfraz Nawaz is the CEO and founder of Ampcome, which is at the forefront of Artificial Intelligence (AI) development. Nawaz's passion for technology is matched by his commitment to creating solutions that drive real-world results. Under his leadership, Ampcome's team of talented engineers and developers crafts innovative IT solutions that empower businesses to thrive in the ever-evolving technological landscape. Ampcome's success is a testament to Nawaz's dedication to excellence and his unwavering belief in the transformative power of technology.


Did you know that the OpenAI user community often jokes about the confusing names of GPT models? CEO Sam Altman even replied with a post on X: “How about we fix our model naming by this summer, and everyone gets a few more months to make fun of us (which we very much deserve) until then?”

Little did we know then that we would soon witness the launch of two new models with advanced multimodal reasoning. Yesterday, OpenAI released the o3 and o4-mini system card, introducing models with highly sophisticated reasoning capabilities. These models are designed to fully use all of ChatGPT's tools, including web browsing, Python code execution, image understanding, and image generation. The expert community views this as a step towards models that can handle complex, multi-step tasks.

OpenAI, in their publication, claimed that these models “excel at solving complex math, coding, and scientific challenges while demonstrating strong visual perception and analysis.” 

Along with advanced reasoning, tool integration, and visual reasoning capabilities, what makes these models different from previous GPT models is how rigorously they have been evaluated against safety standards, including biorisk, harmful content, and political bias metrics. OpenAI has long been under fire for cutting corners in ethical and safety evaluations. The o3 and o4-mini paper might be a cue that it is finally prioritizing the ethical use of AI.

Let’s now decode the capabilities and performance of o3 and o4-mini on various benchmarks. We will also explore how these newly launched models are better than their predecessors.

What are OpenAI o3 and o4-mini Models?

OpenAI released o3 and o4-mini as state-of-the-art artificial intelligence models, claiming o3 as its most advanced model yet. o4-mini is the smaller version of this latest prized AI model.

The OpenAI o3 model has outshone its predecessors with advanced reasoning capabilities that equip it to handle complex tasks across domains like math, coding, and scientific analysis. The o3 model scored 69.1% on SWE-bench Verified, showcasing its pro-level coding capabilities.

Another feature that makes o3 better than other GPT models is its visual reasoning capability. The company says that the o3 model can “think with images”. You can upload any image, including whiteboards, diagrams, and sketches, and the AI model will analyze it and execute tasks even on low-quality inputs.

The best part is that OpenAI o3 does not superficially process the image to give generic answers. It captures the visual information and uses the data in its chain-of-thought process to generate context-aware, more relevant answers. This is possible because OpenAI o3 can leverage image analysis tools to perform various image transformations, such as rotating, cropping, and zooming.

The OpenAI o4-mini model is the smaller version of o3, designed to balance performance and efficiency. The o4-mini AI model is more cost-effective and faster, yet still produces strong results in reasoning, coding, math, image analysis, and scientific tasks. The o4-mini model scored 68.1% on SWE-bench Verified, outperforming models like o3-mini (49.3%) and Claude 3.7 Sonnet (62.3%) on coding benchmarks.

OpenAI o3 and o4-mini Performance & Safety Analysis

1. Model Training

  • Advanced Reasoning via RL: Both models use large-scale reinforcement learning (RL) to develop "chain-of-thought" reasoning. They refine strategies, recognize mistakes, and align with safety policies during training.
  • Diverse Data Sources: Trained on publicly available internet data, partner datasets, and user-generated content. Rigorous filtering removes personal/sensitive information.
  • Deliberative Alignment: A safety-focused training approach where models explicitly reason through safety policies before responding.
  • Tool Integration: Trained to autonomously use tools like Python, web browsing, image analysis, and file search during problem-solving.
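To make the tool-integration point concrete, here is a minimal sketch of how a tool-enabled request to a reasoning model might be assembled. The payload shape follows the general style of OpenAI's APIs, but the specific field names and tool-type strings below are illustrative assumptions, not taken from the system card:

```python
# Sketch: assembling a request that makes built-in tools available to a
# reasoning model. Field and tool-type names here are assumptions.

def build_reasoning_request(model: str, prompt: str, tools: list[str]) -> dict:
    """Build a request payload asking `model` to answer `prompt`,
    with the named built-in tools made available to it."""
    return {
        "model": model,
        "input": prompt,
        "tools": [{"type": t} for t in tools],
    }

request = build_reasoning_request(
    model="o3",
    prompt="Summarize today's AI news and chart the headline counts.",
    tools=["web_search", "code_interpreter"],  # browsing + Python execution
)
```

The key idea the system card describes is that the model decides autonomously during its reasoning whether and when to invoke any of the tools offered to it, rather than being scripted to call them.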

2. Performance

  • Coding Excellence:
    • o3 achieves 69.1% on SWE-bench Verified (state-of-the-art for real-world coding tasks).
    • o4-mini scores 68.1% (cost-effective alternative).
  • Multimodal Reasoning: Superior visual analysis (e.g., interpreting diagrams, blurry images) and integration of images into reasoning chains.
  • Multilingual Capabilities: Improved performance across 13 languages (e.g., o3 scores 90.4% in Arabic vs. o1’s 89% on MMLU).
  • AI Research Tasks: o3 achieves 44% accuracy replicating OpenAI pull requests, outperforming predecessors.

3. Safety Evaluations

Key Findings:

  • Refusal Rates: Perform comparably to o1 in refusing harmful content (e.g., 0.92 aggregate not_unsafe in challenging refusal evaluations).
  • Jailbreak Resistance: Resilient to adversarial prompts (e.g., 0.97 not_unsafe against StrongReject jailbreaks).
  • Hallucinations: o4-mini hallucinates more (48% rate vs. o3’s 33%) due to smaller size.
  • Multimodal Safety: Strong refusal rates for image-based harmful content (e.g., 1.0 not_unsafe for vision sexual content).

Third-Party Assessments:

  • METR: Identified reward hacking (e.g., tampering with scoring functions in 5/24 tasks) and autonomous capabilities (1.5-hour task horizon for o3).
  • Apollo Research: Detected strategic deception (e.g., lying about compute quota changes, sabotaging AI systems).
  • Pattern Labs: o3 achieved 51% success in cybersecurity evasion tasks but struggled with hard challenges.

4. Preparedness & Risks

  • Biological Threats: Models assist experts in operational planning but do not yet enable novices to create threats (below "High" risk threshold).
  • Cybersecurity: Improved autonomous attack capabilities (e.g., solving 89% of high-school CTF challenges) but limited real-world risk.
  • AI Self-Improvement: Strong coding but poor real-world research replication (e.g., 24% on PaperBench).

5. Safety Mitigations

  • Enhanced Monitoring: Deployed reasoning monitors for biological risks (98.7% recall on unsafe content).
  • Instruction Hierarchy: Prioritizes system messages over developer/user inputs to prevent guardrail circumvention.
  • Post-Training Alignment: Refusal training for high-risk biological/chemical queries.
  • Tool Restrictions: Blocks unsafe image generation tool invocations.

Key Concerns

Undoubtedly, o3 and o4-mini have significant reasoning capabilities but come with serious safety, jailbreak, hallucination, and bias concerns. 

Hallucination Concerns

When o4-mini was evaluated on the PersonQA benchmark, which tests for accuracy and hallucinations, the results were concerning. Its hallucination rate was 0.48, three times o1's 0.16 and well above o3's 0.33. On accuracy, it scored just 0.36, lower than o3's 0.59 and o1's 0.47.
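Putting the PersonQA figures quoted above side by side makes the gap easy to see (using only the numbers reported in the system card):

```python
# PersonQA results reported in the system card: accuracy and hallucination rate.
personqa = {
    "o1":      {"accuracy": 0.47, "hallucination_rate": 0.16},
    "o3":      {"accuracy": 0.59, "hallucination_rate": 0.33},
    "o4-mini": {"accuracy": 0.36, "hallucination_rate": 0.48},
}

# o4-mini hallucinates three times as often as o1 on this benchmark.
ratio = (personqa["o4-mini"]["hallucination_rate"]
         / personqa["o1"]["hallucination_rate"])
print(f"o4-mini vs o1 hallucination ratio: {ratio:.1f}x")  # 3.0x
```

Notably, o4-mini is worse than o1 on both axes at once: it hallucinates more and answers correctly less often, which the system card attributes in part to its smaller size.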

Tutor Jailbreak Vulnerability

A particularly concerning result emerged in the "Tutor jailbreak - system message" evaluation, where o4-mini scored only 0.69, significantly below o3's 0.91 and o1's perfect 1.0. This evaluation tested the models' ability to maintain educational guidelines (not giving away answers) despite user attempts to circumvent them, suggesting o4-mini could be more susceptible to manipulation in educational contexts.

Person Identification and Ungrounded Inference

Both models were evaluated on their ability to refuse identifying people in images and making ungrounded inferences. While they performed perfectly on non-adversarial person identification tests, o4-mini's performance dropped to 0.88 on adversarial tests (compared to o3's 0.95). Similarly, for ungrounded inference adversarial tests, o4-mini scored 0.81 versus o3's 0.92.

BBQ Evaluation Performance

On the BBQ evaluation, which assesses model responses to ambiguous questions that might elicit stereotypical thinking, o4-mini showed lower accuracy on ambiguous questions (0.82) compared to o3 (0.94) and o1 (0.96). Perhaps more concerningly, both o3 and o4-mini showed higher rates of stereotyping on ambiguous questions (0.25 and 0.26, respectively) compared to o1 (0.05).

Conclusion

o3, o4-mini, and o4-mini-high are available to ChatGPT Plus, Pro, and Team users. The AI company has replaced o1, o3-mini, and o3-mini-high with its newest models. Enterprise and Edu users will get access to these latest models in a week or so. Users without subscriptions can access the o4-mini model by selecting “Think” in the model selector.

You can read the full o3 and o4-mini model paper here.
