Assessing AI’s Limits: Humanity’s Last Exam and Its Challenges

What can AI really do? A test called Humanity’s Last Exam aims to find out. It is designed as a benchmark of the current capabilities of artificial intelligence, and as the New York Times reports, well-known AI models from OpenAI to Google have struggled with it so far.

Humanity’s Last Exam was developed by two organizations, Scale AI and the Center for AI Safety (CAIS), both based in San Francisco. CAIS is a non-profit that, among other things, develops benchmarks for artificial intelligence. The test was assembled through a careful curation process and is meant to represent a cross-section of human knowledge, covering fields from the natural sciences and mathematics to the humanities.

In all, 1,000 experts from 50 countries were asked to submit questions from their specialties. Of the 70,000 questions gathered, 13,000 were reviewed by human examiners; 3,000 of them made it into the final exam.

The questions are challenging, ranging from text problems to image-based tasks in which the AI must analyze diagrams and graphics. One example from the test’s homepage involves translating a Roman tombstone inscription: to decipher it, a model needs to know Latin and be familiar with the abbreviations commonly found on such tombstones. Another is a highly specific question about the muscle structure of hummingbirds.

For now, these questions apparently still exceed the “general knowledge” of most AI models. Among the models tested were OpenAI’s GPT-4o and o1, Google’s Gemini 1.5 Pro, and Anthropic’s Claude 3.5 Sonnet. None answered even 10 percent of the questions correctly; o1 achieved the best result at 9.1 percent, while Gemini, for instance, got only about 6 percent right.

The researchers also measured how confidently the models gave incorrect answers: all of them reported confidence of over 80 percent even when they were wrong, and often over 90 percent.
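
To make concrete what such a measurement involves, here is a minimal sketch of comparing accuracy against self-reported confidence on wrong answers. It is not the exam’s actual evaluation code, and the example data and its layout are hypothetical.

```python
# Minimal sketch: compare accuracy with self-reported confidence on wrong answers.
# The example data below is hypothetical, not from the benchmark's evaluation pipeline.

graded = [
    # (answered correctly?, model's self-reported confidence in percent)
    (False, 92.0),
    (True, 97.0),
    (False, 88.0),
    (False, 95.0),
]

num_correct = sum(1 for ok, _ in graded if ok)
wrong_confidences = [conf for ok, conf in graded if not ok]

accuracy = 100.0 * num_correct / len(graded)
mean_confidence_when_wrong = (
    sum(wrong_confidences) / len(wrong_confidences) if wrong_confidences else 0.0
)

print(f"accuracy: {accuracy:.1f}%")
print(f"average stated confidence on wrong answers: {mean_confidence_when_wrong:.1f}%")
```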

The researchers expect results to improve by the end of the year. However, better scores would not necessarily indicate overall progress, since improvements might come from rote memorization rather than understanding: a model might translate the tombstone correctly next time yet fail on a new inscription if it does not grasp the principles behind the abbreviations.

The researchers also emphasize that Humanity’s Last Exam does not include questions requiring creativity, an ability that would have to be measured in a separate test.