OpenAI’s CEO, Sam Altman, provides a glimpse into what we can expect next in terms of AI advancements. Recently, OpenAI introduced its latest AI models, o3 Mini and o3, which have stirred up much speculation. These models tackled a challenging test based on the Abstraction and Reasoning Corpus (ARC) and achieved an impressive 85% success rate. Previously, the best AI programs managed only about 35%.
The ARC test is particularly difficult for large language models because it requires recognizing the rules that transform abstract graphic patterns based on two examples and then correctly applying these rules to a third pattern. However, o3 has only addressed part of the ARC puzzles so far. The AI consumed significant computing resources, resulting in high costs: “thousands of US dollars” per task, according to the prize initiators. OpenAI has not yet announced pricing for o3 or a release date for the general market. Speculation online suggests that a subscription to the new model might cost not just $200 per month, as it does for o1, but potentially $2,000 or more.
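To make the kind of puzzle described above concrete, here is a toy sketch of an ARC-style task: infer a transformation rule from two example pairs, then apply it to a third grid. The grid encoding, the four candidate rules, and the names `RULES` and `infer_rule` are assumptions made for illustration. Real ARC tasks draw on a far richer space of rules, which is what makes them hard.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]

# Hypothetical, deliberately tiny rule space (an assumption for this sketch).
RULES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": lambda g: [row[:] for row in g],
    "flip_horizontal": lambda g: [row[::-1] for row in g],
    "flip_vertical": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}


def infer_rule(examples: List[Tuple[Grid, Grid]]) -> Optional[str]:
    """Return the name of the first rule consistent with every example pair."""
    for name, rule in RULES.items():
        if all(rule(inp) == out for inp, out in examples):
            return name
    return None


# Two example pairs (input grid, output grid), as in the task description.
examples = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
    ([[2, 3], [0, 4]], [[3, 2], [4, 0]]),
]

# The third pattern, to which the inferred rule must be applied.
test_input = [[5, 0, 0], [0, 6, 0]]

rule_name = infer_rule(examples)
prediction = RULES[rule_name](test_input) if rule_name else None
print(rule_name, prediction)
```

With only four candidate rules, brute-force checking is trivial. The difficulty of the real benchmark lies in the fact that the rule is not drawn from a short, known list.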
How does the o3 model likely work? OpenAI has not disclosed details about its operation. It’s clear that it isn’t just a larger model. For a long time, proponents of the “scaling hypothesis,” including OpenAI, believed that larger AI models trained with more data would become increasingly powerful. However, scaling seems to be reaching its limits. U.S. media, citing anonymous sources at OpenAI, report that the performance leap in the next model generation (GPT-5 and beyond) will be smaller than in previous generations. A lack of sufficient high-quality training data is cited as a reason.
The AI industry responded with a strategy known as “test-time compute.” This strategy addresses a key weakness of large language models: they calculate the next token that matches the input, append the output to the prompt, and repeat the process. This works for texts but not for complex problems where the AI needs to try different solution paths and start over if it hits a dead end. Models like o3 or Gemini 2 first compute partial solutions, internally verify their quality, and then proceed to the next step. For instance, if given a programming task, the model might break it into subproblems, generate code for the first subproblem, and check if it runs before continuing.
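The propose, verify, and backtrack pattern described above can be sketched with a toy search problem. This is not how o3 works internally, since OpenAI has not disclosed that. It is a minimal illustration, assuming a hypothetical task (reach a target number from a start value using the steps "+3" and "*2") in which `propose` stands in for the model generating partial solutions and `verify` stands in for its internal quality check.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical step set for the toy task (an assumption for this sketch).
STEPS: Dict[str, Callable[[int], int]] = {
    "+3": lambda x: x + 3,
    "*2": lambda x: x * 2,
}


def propose(value: int) -> List[Tuple[str, int]]:
    """Stand-in for the model proposing possible next partial solutions."""
    return [(name, step(value)) for name, step in STEPS.items()]


def verify(value: int, target: int) -> bool:
    """Stand-in for the internal check: a value above the target is a dead end."""
    return value <= target


def solve(value: int, target: int, path: Tuple[str, ...] = ()) -> Optional[List[str]]:
    """Depth-first search with backtracking. Assumes value > 0 so each step grows."""
    if value == target:
        return list(path)
    for name, next_value in propose(value):
        if not verify(next_value, target):
            continue  # discard this partial solution
        result = solve(next_value, target, path + (name,))
        if result is not None:
            return result
    return None  # every continuation failed: the caller backtracks


print(solve(1, 11))
```

For the start value 1 and target 11, the search first follows "+3", "+3", "+3" to 10, finds that both continuations overshoot, and backs up before it reaches the working path. A plain next-token loop has no such mechanism: it would commit to the first continuation and keep going.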
Subbarao Kambhampati from Arizona State University explains that the language model likely generates many “chains of thought” to solve a given problem step by step. The output of one step becomes the input for the next, allowing the model to explore various solutions in parallel. In specialized training, human-marked correct solution paths are given higher weight. This process, possibly using synthetic data, is repeated billions of times.
During operation, the model generates solution paths that, according to its training, should lead to the solution. It likely selects the shortest path and presents it to the user. This explains why these models are expensive in both training and operation: a single query is internally transformed into thousands of slightly different subqueries, which users never see. OpenAI claims that o3 can automatically adjust computational effort based on task complexity.
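The idea of generating many candidate solution paths and returning the shortest successful one can be sketched as a simple "best of N" selection. Everything here is an assumption for illustration: the toy task (reach a target number using "+3" and "*2"), the random sampler standing in for the language model, and the names `sample_chain` and `best_of_n`. The actual selection mechanism in o3 is not public.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical step set for the toy task (an assumption for this sketch).
STEPS: Dict[str, Callable[[int], int]] = {
    "+3": lambda x: x + 3,
    "*2": lambda x: x * 2,
}


def sample_chain(start: int, target: int, rng: random.Random,
                 max_steps: int = 8) -> Tuple[List[str], int]:
    """Stand-in for one sampled chain of thought: random steps until the target
    is hit or the step budget runs out. Returns the chain and its final value."""
    value, chain = start, []
    for _ in range(max_steps):
        if value == target:
            break
        name = rng.choice(sorted(STEPS))
        value = STEPS[name](value)
        chain.append(name)
    return chain, value


def best_of_n(start: int, target: int, n: int = 1000,
              seed: int = 0) -> Optional[List[str]]:
    """Sample n chains, keep those that reach the target, return the shortest."""
    rng = random.Random(seed)
    candidates = [sample_chain(start, target, rng) for _ in range(n)]
    successful = [chain for chain, value in candidates if value == target]
    return min(successful, key=len) if successful else None


print(best_of_n(1, 11))
```

The cost structure mentioned in the text is visible even in this toy: answering one query means producing a thousand chains, almost all of which are discarded and never shown to the user.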
Despite advancements, there’s no guarantee that o3’s solutions are correct. The model still risks producing incorrect results or “hallucinations.” It operates without a genuine logical or mathematical verification of solutions, relying solely on a large language model.