OpenAI recently unveiled its latest language models, o3 and o3-mini, which are designed to excel at reasoning benchmarks. They succeed the o1 model, which was released just two weeks earlier. OpenAI skipped the name o2 to avoid a conflict with the telecommunications brand O2. OpenAI CEO Sam Altman joked that the name o3 continues the company’s tradition of poor naming choices.
The o3 model sets new marks in technical benchmarks for programming, mathematics, and science. It scored 71.7% on the software engineering benchmark “SWE-Bench Verified,” a significant improvement over o1. In competitive programming, o3 reached an Elo rating of 2727 on the “Codeforces” benchmark, surpassing most human competitors. On “GPQA Diamond,” a benchmark of graduate-level science questions, o3 achieved 87.7%, outperforming typical experts with doctoral degrees.
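To give the Elo figure some intuition, the following is a minimal sketch of the standard Elo expected-score formula, which estimates how often a player with one rating would be expected to beat a player with another. It assumes the textbook Elo model; Codeforces uses its own rating variant, so the numbers are illustrative only. The function name and the opponent ratings are hypothetical and chosen for this example; only the 2727 rating comes from the article.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (between 0 and 1) of player A against player B
    under the standard Elo model. Assumption: textbook formula with a
    400-point scale, not Codeforces' exact rating system."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


O3_RATING = 2727  # rating reported for o3 in the article

# Hypothetical opponent ratings, chosen only for illustration.
for opponent in (1500, 2000, 2400, 2727):
    expected = elo_expected_score(O3_RATING, opponent)
    print(f"vs. rating {opponent}: expected score {expected:.3f}")
```

Under this model, an equal-rated opponent yields an expected score of 0.5, and the expected score rises toward 1 as the rating gap grows.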
To further showcase o3’s reasoning capabilities, OpenAI presented results from Epoch AI’s challenging “FrontierMath” benchmark. Here, o3 achieved over 25% accuracy, a significant leap from previous models, which scored under 2%. Another notable result came in the “ARC-AGI” reasoning benchmark, where o3 reached 87.5% accuracy in a high-compute configuration, surpassing the human baseline of approximately 85%. While this milestone is a step toward Artificial General Intelligence (AGI), o3 still struggles with some simple tasks, highlighting fundamental differences from human intelligence.
OpenAI also introduced o3-mini, which promises cost-effective reasoning performance. It is significantly faster and cheaper than o1 while delivering similar performance. Users can choose between three levels of reasoning effort (low, medium, and high). In a demonstration, OpenAI researchers showed how o3-mini can evaluate itself in real time by writing and executing its own evaluation routine. Altman joked that next time the model might be asked to improve itself.
Altman announced that both o3 and o3-mini would soon be available to selected safety and security researchers for testing. The aim is to identify potential vulnerabilities and misuse risks before the models are released to the public. A new method called “Deliberative Alignment” is intended to align the models more closely with safety guidelines, enabling them to better recognize and reject undesirable requests. According to Altman, o3-mini will be available by the end of January, with o3 following shortly after. Researchers interested in early access can apply until January 10.
Meanwhile, Google has announced its own reasoning-capable language model, “Gemini 2.0 Flash Thinking.” The model uses a “thinking mode” that reviews and refines answers before presenting them, and users will have the option to view the system’s “thoughts.” Initially, Google’s reasoning model will be available only in an experimental, limited version. The development was led by Noam Shazeer, a co-author of the influential Transformer paper “Attention Is All You Need.” Shazeer had left Google but returned following a deal between Google and his startup, Character.AI.