Apple Intelligence, Apple’s AI system, has faced significant criticism over its generated news summaries, which were often strange and contained incorrect information. In response, Apple suspended the news-summary feature entirely while it works on a fix.
Apple’s developers had reason to anticipate these issues: their own study had already identified flaws in large language models (LLMs). The study found that these models do not reason formally but instead rely on sophisticated pattern matching. That approach is so fragile that merely changing the names in a problem can alter the result. Essentially, language models attempt to replicate the reasoning steps observed in their training data.
The researchers examined a broad set of major language models and found similar errors across the board, albeit to varying extents. While the models handled simple tasks reliably, the accuracy of their responses dropped by up to 65% when tasks were more complex or included irrelevant information.
An example provided by Apple’s AI experts illustrated this problem with a math question requiring genuine understanding. The question involved Oliver picking kiwis on different days, with the task being to calculate the total number of kiwis picked. However, when an irrelevant remark about the size of some kiwis was added, both OpenAI’s and Meta’s models incorrectly subtracted the smaller kiwis from the total count.
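The arithmetic behind this kind of question can be sketched in a few lines of Python. The daily counts and the number of smaller kiwis below are illustrative assumptions, since the article does not state the figures; the point is only that the size remark has no bearing on the total.

```python
# Illustrative figures only; the article does not give the actual numbers.
friday = 44
saturday = 58
sunday = 2 * friday          # assumed: twice as many as on Friday
smaller_than_average = 5     # irrelevant detail: size does not change the count

# Correct reasoning: every picked kiwi counts, whatever its size.
correct_total = friday + saturday + sunday

# The mistake described above: treating the size remark as a subtraction.
distracted_total = correct_total - smaller_than_average

print("correct total:   ", correct_total)
print("distracted total:", distracted_total)
```

With these assumed figures, the two totals differ by exactly the number of smaller kiwis, which is the signature of the error the researchers describe.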
Performance differed across the 20 LLMs tested. Even the most advanced model, OpenAI’s o1-preview, saw a 17.5% drop in accuracy, while its predecessor, GPT-4o, saw a 32% drop. Performance declined measurably even when only the numbers in the question were changed.
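One way to picture such a test is a question template whose names and numbers are swapped out while the underlying logic stays the same. The following is a minimal sketch under that assumption, not the researchers' actual setup: the template, the name list, the number ranges, and the stand-in solver are all hypothetical.

```python
import random
import re

# Hypothetical template; wording, names, and ranges are illustrative assumptions.
TEMPLATE = (
    "{name} picks {a} kiwis on Friday and {b} kiwis on Saturday. "
    "How many kiwis does {name} have?"
)
NAMES = ["Oliver", "Sophie", "Liam", "Mia"]


def make_variant(rng):
    """Return (question, expected_answer) with a fresh name and fresh numbers."""
    name = rng.choice(NAMES)
    a = rng.randint(10, 99)
    b = rng.randint(10, 99)
    return TEMPLATE.format(name=name, a=a, b=b), a + b


def sum_numbers_solver(question):
    """Stand-in for a model call: adds up every integer in the question."""
    return sum(int(n) for n in re.findall(r"\d+", question))


def accuracy(solver, variants):
    """Fraction of variants for which the solver returns the expected answer."""
    correct = sum(1 for question, answer in variants if solver(question) == answer)
    return correct / len(variants)


rng = random.Random(0)  # fixed seed so the variants are reproducible
variants = [make_variant(rng) for _ in range(5)]
for question, answer in variants:
    print(answer, "<-", question)
print("accuracy:", accuracy(sum_numbers_solver, variants))
```

A solver that truly follows the logic scores the same on every variant. Replacing `sum_numbers_solver` with a call to a real model would show whether its accuracy shifts when only the surface details change.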
The researchers concluded that these models have a critical flaw in recognizing relevant information for problem-solving. Their reasoning is not formal and relies heavily on pattern recognition. Despite finding these issues internally, Apple still released its model to its large user base, resulting in widespread criticism.
The integration of Apple Intelligence has not boosted sales, and the response to it has been overwhelmingly negative.