OpenAI recently introduced its new AI models, o3 and o3-mini, which have prompted much speculation. On the ARC benchmark, a task that is notoriously hard for AI, o3 achieved a success rate of around 85%, where previous programs managed only about 35%. Each ARC task presents a few example pairs of abstract grid patterns; the solver must infer the underlying transformation rule from those examples and apply it correctly to a new test pattern. However, o3 has so far only been evaluated on part of the ARC puzzles, and it required enormous computing power and cost, reportedly “thousands of US dollars” per task. OpenAI has not yet disclosed pricing or launch dates for o3, but speculation suggests it could cost far more than the current $200 monthly subscription for o1, possibly around $2,000 or more.
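To make the task format concrete, here is a toy sketch in Python. Real ARC tasks are JSON grids of color codes 0–9 and the hidden rules are far less obvious; the “mirror each row” rule below is invented purely for illustration, not taken from an actual ARC puzzle.

```python
# Toy sketch of the ARC task format. Real ARC tasks use grids of
# color codes 0-9; this simplified "mirror" rule is illustrative only.

# Two demonstration pairs: input grid -> output grid.
train_pairs = [
    ([[1, 0, 0],
      [0, 2, 0]],
     [[0, 0, 1],
      [0, 2, 0]]),
    ([[3, 3, 0],
      [0, 0, 4]],
     [[0, 3, 3],
      [4, 0, 0]]),
]

# The test input: the solver must infer the rule from the pairs
# above and apply it here.
test_input = [[5, 0, 6],
              [0, 7, 0]]

def inferred_rule(grid):
    # A solver that has correctly inferred "mirror each row".
    return [row[::-1] for row in grid]

expected = [[6, 0, 5],
            [0, 7, 0]]
assert inferred_rule(test_input) == expected
```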
How the o3 model works remains a matter of speculation, as OpenAI has not released details about its operation. What is known is that it is not simply a larger model. OpenAI and others long assumed that larger models trained on more data would automatically be more capable, but this approach appears to be hitting its limits: reports suggest that the performance gains of future models such as GPT-5 may be smaller, possibly because there is not enough high-quality training data left.
The AI industry has responded with a strategy called “test-time compute,” which addresses a key weakness of large language models. These models predict the next token from the input, append that token to the prompt, and repeat the process. This works well for fluent text but poorly for complex problems that require step-by-step exploration and reassessment. Models like o3 and Gemini 2 instead compute partial solutions first and internally check their quality before proceeding. Given a programming task, for example, the model might break it into subproblems, generate code for the first subproblem, and test that it works before continuing. In this way the model explores many candidate solutions and keeps the most promising one, an approach that applies well beyond programming tasks.
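A minimal sketch of such a generate-and-verify loop is below. OpenAI has not published how o3 actually orchestrates this, so both `sample_candidates` and `verify` are hypothetical stand-ins: the first for repeated language-model calls, the second for running unit tests (on code) or model-based scoring (elsewhere).

```python
import random

def sample_candidates(subproblem: str, n: int) -> list[str]:
    # Placeholder: a real system would query an LLM n times with
    # varied prompts or sampling temperatures.
    return [f"candidate {i} for {subproblem!r}" for i in range(n)]

def verify(subproblem: str, candidate: str) -> bool:
    # For code, this could run unit tests; for other domains it is
    # typically another model scoring the candidate, i.e. a heuristic
    # rather than a proof. Randomized here as a placeholder.
    return random.random() > 0.5

def solve(task: str, subproblems: list[str], n_samples: int = 8):
    # Decompose the task, sample several candidates per subproblem,
    # and keep the first candidate that passes verification.
    solution = []
    for sub in subproblems:
        passed = [c for c in sample_candidates(sub, n_samples)
                  if verify(sub, c)]
        if not passed:
            return None  # exploration failed; a real system might backtrack
        solution.append(passed[0])
    return solution

result = solve("parse a log file", ["read lines", "extract fields"])
print(result)
```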
This approach also explains why these models are expensive both to train and to operate: a single request is internally fanned out into thousands of slightly different sub-requests that the user never sees. OpenAI says o3 can adjust how much computing power it spends based on the complexity of the task. At its core, however, it is still a large language model doing the solving, so there is no guarantee of correct answers: the internal verification is itself model-based rather than a true logical or mathematical proof, and the model can still produce inaccurate results.
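The reported per-task costs follow directly from this fan-out. A back-of-envelope calculation shows how “thousands of US dollars” per task can arise; every figure below is an invented assumption, since OpenAI has not published o3’s sample counts or token prices.

```python
# Back-of-envelope cost estimate for massive test-time compute.
# All numbers are illustrative assumptions, not published o3 figures.

samples_per_task = 1_000         # hidden sub-requests per user request
tokens_per_sample = 50_000       # reasoning tokens generated per sample
price_per_million_tokens = 60.0  # USD, assumed output-token price

cost = samples_per_task * tokens_per_sample / 1e6 * price_per_million_tokens
print(f"~${cost:,.0f} per task")  # roughly $3,000, the same order as reported
```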
In summary, while the o3 model represents a significant advance in AI capability, particularly in abstract reasoning, its high cost and remaining potential for errors highlight ongoing challenges in AI development. As the industry continues to evolve, striking a balance among model performance, reliability, and cost remains crucial.