OpenAI CEO Sam Altman has offered insights into what we can expect next in AI. Rumors have been swirling since OpenAI introduced its latest AI models, o3 Mini and o3, in late December. These models have solved 85% of the tasks in one of the most challenging tests for artificial intelligence, the ARC test, which is based on the Abstract Reasoning Corpus (ARC). This is a significant breakthrough, as the best programs previously managed only about 35%.
ARC is particularly difficult for large language models because the task involves identifying the rules by which abstract graphical patterns change based on two examples, and then correctly applying these rules to a third pattern. However, o3 has only tackled a portion of the ARC puzzles so far. This effort consumed considerable computational resources and incurred high costs: “thousands of US dollars” per task, according to the prize initiators. OpenAI has not yet released prices for o3 or announced a general market launch date. Speculation online suggests that a subscription to the new model might cost not $200 per month, as is the case with o1, but $2,000 or more.
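To make the ARC task format described above more concrete, here is a minimal Python sketch. It is a toy illustration only: the grids, the three candidate rules, and the function names are assumptions made for this example, and real ARC puzzles use far richer rules than a fixed list of candidates.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]  # a small grid of color codes (hypothetical encoding)


def flip_horizontal(g: Grid) -> Grid:
    return [row[::-1] for row in g]


def flip_vertical(g: Grid) -> Grid:
    return g[::-1]


def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]


# Hypothetical, hand-picked set of rules the solver is allowed to consider.
CANDIDATE_RULES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}


def infer_rule(examples: List[Tuple[Grid, Grid]]) -> Optional[str]:
    """Return the first candidate rule that explains every example pair."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(inp) == out for inp, out in examples):
            return name
    return None


# Two worked examples (input grid, output grid), then a third pattern to solve.
examples = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
    ([[2, 3, 0]], [[0, 3, 2]]),
]
rule_name = infer_rule(examples)
test_input = [[5, 0, 0], [0, 7, 0]]
prediction = CANDIDATE_RULES[rule_name](test_input)
print(rule_name, prediction)
```

The hard part of the real test is that the rule is not drawn from a known list; it has to be discovered from the two examples alone.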
Is o3 truly worth this price? How o3 works is still a matter of speculation, as OpenAI has not disclosed details. It is clear, however, that it is not simply a larger model. Proponents of the “scaling hypothesis,” chief among them OpenAI, have long believed that larger AI models trained with more data would become increasingly powerful. However, scaling now seems to be reaching its limits. U.S. media, citing anonymous sources at OpenAI, report that the performance leap in the next model generation, GPT-5 and beyond, will be smaller. A lack of sufficient high-quality training data is cited as one reason.
The AI industry has responded with a strategy known as “test-time compute.” This strategy addresses a central weakness of large language models: they compute the most likely next token given the input, append it to the prompt, and repeat the process. This works well for generating text, but not for complex problems where the AI needs to explore possible solutions step by step and start over if it hits a dead end. Models like o3 or Gemini 2 first calculate partial solutions, internally verify their quality, and only then proceed to the next step. When given a programming task, for instance, the model could break it down into subproblems, generate code for the first subproblem, and check that it works before moving on. To find the optimal solution, the models explore numerous solution paths and select the best one.
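The idea of proposing a partial solution, checking it, and backing up at a dead end can be sketched in a few lines of Python. This is a toy stand-in, not a description of how o3 or Gemini 2 work internally: the number puzzle, the `propose` and `verify` helpers, and the step names are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

# Toy problem: turn a start number into a target using these two steps.
STEPS: Dict[str, Callable[[int], int]] = {
    "add3": lambda x: x + 3,
    "double": lambda x: x * 2,
}


def propose(value: int) -> List[str]:
    """Stand-in for the model proposing candidate next steps."""
    return list(STEPS)


def verify(value: int, target: int) -> bool:
    """Stand-in for the internal check: overshooting the target is a dead end."""
    return value <= target


def solve(value: int, target: int, max_steps: int,
          path: Optional[List[str]] = None) -> Optional[List[str]]:
    """Depth-first search: extend the path step by step, backtrack on dead ends."""
    path = path or []
    if value == target:
        return path
    if len(path) == max_steps:
        return None  # out of budget on this branch
    for name in propose(value):
        nxt = STEPS[name](value)
        if not verify(nxt, target):
            continue  # discard this partial solution
        result = solve(nxt, target, max_steps, path + [name])
        if result is not None:
            return result
    return None  # every continuation failed, so the caller tries another step


solution = solve(1, 11, max_steps=4)
print(solution)  # ['add3', 'double', 'add3']
```

Here the search first follows add3 three times, reaches 10, finds that both continuations overshoot, and backs up until it finds a path that works. A plain next-token loop has no such way to undo an earlier step.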
Subbarao Kambhampati from Arizona State University explains in a post how this might work: the language model generates a large number of “chains of thought” to solve a given problem step-by-step. The output of one sub-step is used as input for the next, allowing the model to progress incrementally and explore various solutions in parallel. In a special training process, solutions marked as correct by humans are given higher weight. This training, likely involving synthetic data, is repeated billions of times.
In operation, the model generates the solution paths that, according to its training, are most likely to lead to a solution. It then presumably selects the shortest one and presents it to the user. This would explain why these models are expensive in both training and operation: a query is internally transformed into thousands of slightly different sub-queries, which users never see. According to OpenAI, o3 can automatically adjust computational effort based on task complexity.
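The presumed selection step, generating many candidate paths and returning the shortest one that works, can be illustrated with a small Python sketch. It rests on several assumptions: the toy number puzzle, the random sampler standing in for the language model, and the function names are hypothetical. Unlike a real model, this toy can check each candidate exactly.

```python
import random
from typing import Callable, Dict, List, Optional

STEPS: Dict[str, Callable[[int], int]] = {
    "add3": lambda x: x + 3,
    "double": lambda x: x * 2,
}


def sample_path(rng: random.Random, max_len: int) -> List[str]:
    """Stand-in for the model sampling one candidate chain of steps."""
    length = rng.randint(1, max_len)
    return [rng.choice(list(STEPS)) for _ in range(length)]


def run(path: List[str], start: int) -> int:
    """Apply the steps in order and return the final value."""
    value = start
    for name in path:
        value = STEPS[name](value)
    return value


def best_of_n(start: int, target: int, n: int,
              max_len: int = 6, seed: int = 0) -> Optional[List[str]]:
    """Sample n hidden candidates, keep the correct ones, return the shortest."""
    rng = random.Random(seed)
    candidates = [sample_path(rng, max_len) for _ in range(n)]
    correct = [p for p in candidates if run(p, start) == target]
    return min(correct, key=len) if correct else None


# One user query fans out into 2,000 internal attempts; only one is shown.
answer = best_of_n(start=1, target=11, n=2000)
print(answer)
```

The cost grows with `n`, the number of hidden attempts, which is the toy counterpart of the compute bill described above. Adjusting effort to task complexity would correspond to choosing `n` per query.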
Despite the costs, there is no guarantee that the solution is correct. Without true logical or mathematical verification of the solution, the model can still hallucinate. Thus, while promising, these models are not yet entirely reliable.