AI Struggles with “Humanity’s Last Exam” Despite Advancements


The latest and most advanced AI models reportedly score around 90 percent on standard benchmarks, meaning they complete a high share of the tasks in a standardized test. A new test called “Humanity’s Last Exam,” however, challenges even the most advanced models. Developed by Scale AI and the Center for AI Safety (CAIS), the benchmark gathered questions from nearly 1,000 experts across 50 countries, yielding a pool of about 70,000 questions. After human review, 13,000 questions were examined more closely, and 3,000 made it into the final test. The questions cover mathematics, the natural sciences, the humanities, and more, ranging from purely text-based tasks to ones requiring multimodal skills to interpret diagrams and images. As the name suggests, the experts consider this the ultimate test.
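Readers who want to inspect the questions themselves can load the benchmark’s public portion programmatically. The following Python sketch assumes the dataset is published on Hugging Face under the identifier `cais/hle` and that records carry fields such as `question`, `image`, and `category`; both the identifier and the field names are assumptions, not details stated in this article:

```python
# Minimal sketch: load and inspect "Humanity's Last Exam" questions.
# Assumes a Hugging Face dataset id of "cais/hle" and field names
# like "question", "image", and "category" -- adjust if they differ.
from datasets import load_dataset

dataset = load_dataset("cais/hle", split="test")
print(f"Number of questions: {len(dataset)}")

# Count text-only versus multimodal (image-based) questions.
multimodal = sum(1 for row in dataset if row.get("image"))
print(f"Multimodal questions: {multimodal}")
print(f"Text-only questions:  {len(dataset) - multimodal}")

# Show one example question and its subject category.
example = dataset[0]
print(example["question"][:300])
print("Category:", example.get("category"))
```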

One example question reads: “Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.” (Note: if the question contains a translation error, that is because, like common AI models, I am not an expert on birds.) More sample questions are available at lastexam.ai.

AI models including OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and OpenAI’s o1 were tested on the Last Exam, and all of them answered fewer than ten percent of the questions correctly. Given how quickly AI models are improving, the authors expect scores on even this test to rise significantly by the end of the year. It is worth noting that AI models are trained on such tasks, and it is not always clear whether a model solves a problem by reasoning and understanding or simply reproduces a memorized answer.
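In principle, scoring a model on such a benchmark comes down to asking each question, comparing the short final answer with the reference, and reporting the percentage correct. The sketch below illustrates that idea only; the `ask_model` callable and the exact-match scoring are hypothetical stand-ins, and the benchmark’s actual grading of free-form answers is more elaborate:

```python
# Minimal sketch of a benchmark evaluation loop.
# ask_model() is a hypothetical stand-in for any chat-model API call;
# real grading of free-form answers is more elaborate than exact match.
from typing import Callable

def evaluate(questions: list[dict], ask_model: Callable[[str], str]) -> float:
    """Return the percentage of exact-match correct answers."""
    correct = 0
    for item in questions:
        prediction = ask_model(item["question"]).strip().lower()
        if prediction == item["answer"].strip().lower():
            correct += 1
    return 100.0 * correct / len(questions)

# Usage with a dummy single-question set and a dummy model:
sample = [{"question": "2 + 2 = ? Provide a number.", "answer": "4"}]
print(f"Accuracy: {evaluate(sample, lambda q: '4'):.1f}%")
```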

A diagram in the paper compares the performance of various AI models on different benchmarks. The authors note that these are academic tasks; problems demanding genuine creativity or with open-ended results would call for different tests. The paper aims to give scientists and policymakers a common reference point for assessing AI capabilities.

Scale AI and CAIS are both based in San Francisco. Scale AI supplies datasets for AI training, while CAIS is a non-profit working on AI safety and ethics. Dan Hendrycks, co-founder of CAIS, has previously published a math benchmark of his own. For another math benchmark, FrontierMath, it recently came to light that OpenAI had funded its development through EpochAI; OpenAI’s o3 model performed best on that test, solving 25.2 percent of the tasks.
