Advancements and Challenges in AI: New Benchmarks, Tools, and Ethical Concerns

Two San Francisco organizations, Scale AI and the Center for AI Safety, have introduced a new AI benchmark called “Humanity’s Last Exam,” designed to challenge even the most advanced AI models. While leading models reportedly solve around 90% of tasks on common benchmarks, the best models complete only about 10% of the tasks on this more demanding test.

The “Last Exam” comprises 3,000 tasks selected from an original pool of 70,000 expert-written questions spanning academic fields such as science, mathematics, and the humanities. Even advanced models like GPT-4o, Claude, and Gemini struggle with many of them, for example questions about the skeletal structure of the hummingbird. However, the benchmark may not fully reflect a model’s capabilities, since knowledge tests of this kind can be passed through rote memorization.

OpenAI has released an early version of its first AI agent, called Operator. The agent interacts with the web like a human, using its own browser to type, click, and scroll. Users can watch each step visually and take over at any time. Operator is based on a new AI model called the “Computer-Using Agent” (CUA), which combines image-processing abilities with reasoning trained via reinforcement learning. OpenAI is collaborating with internet companies such as DoorDash and Uber, so users can order food, book tables, buy tickets, or request rides directly through Operator.
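OpenAI has not published CUA’s internals, but the behavior described above follows a familiar perceive-reason-act pattern: capture a screenshot, let a vision-capable model decide on one browser action, execute it, and repeat. The sketch below illustrates that loop only; all names (`Action`, `plan_next_action`, `run_agent`) are hypothetical stand-ins, not OpenAI’s API.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    target: str = ""   # e.g. a CSS selector to act on
    text: str = ""     # text to type, if any

def plan_next_action(screenshot: bytes, goal: str, step: int) -> Action:
    """Stand-in for the model call: maps a screenshot + goal to one action.

    A real agent would send the screenshot to a vision-language model here.
    This stub just types the goal once, then stops.
    """
    if step == 0:
        return Action("type", target="#search", text=goal)
    return Action("done")

def run_agent(goal: str, max_steps: int = 10) -> list[Action]:
    """Perceive-reason-act loop: screenshot -> model -> browser action."""
    history: list[Action] = []
    for step in range(max_steps):
        screenshot = b"<fake screenshot>"  # a real agent captures the live page here
        action = plan_next_action(screenshot, goal, step)
        history.append(action)
        if action.kind == "done":
            break
    return history
```

The key design point is that the model sees only pixels and emits low-level actions, which is what lets users watch each step and take over mid-task.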

Perplexity has also released its Assistant, which can book restaurant tables, write emails, book rides, and set reminders. The Perplexity Assistant is free on the Google Play Store but not yet available for iOS. OpenAI’s Operator is initially available only to selected customers via the ChatGPT Pro subscription, which costs $200 per month.

OpenAI has also improved the Canvas feature for code rendering in ChatGPT. Users can now use the o1 model in Canvas, and HTML and React code can be rendered directly, saving developers time. The code-rendering feature is available to all ChatGPT users, while the o1 model is reserved for paying customers. Canvas has been fully integrated into the ChatGPT desktop app for macOS, with enterprise and educational users receiving updates in the coming weeks.

The Chinese startup DeepSeek has put pressure on Meta and US chip stocks with its latest R1 model. DeepSeek’s AI chatbot R1 operates more efficiently and cost-effectively than Western offerings, such as OpenAI’s o1. DeepSeek offers its cloud API at prices up to 27 times cheaper than o1 and has made R1 available under an MIT license for commercial use. Meta has reportedly set up crisis teams to analyze DeepSeek’s technologies and adapt its cost-effective methods.

Nepenthes is a tool designed to trap AI web crawlers by leading them into an endless loop and feeding them meaningless content. It generates pages whose links point only back into the trap and which load slowly, tying up crawler resources. However, deploying Nepenthes may get a website removed from search engines such as Google.
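The looping-link idea can be illustrated with a few lines of code. The sketch below is a hypothetical simplification, not Nepenthes itself: each generated page links only to further trap pages, derived deterministically from the current path, so a crawler keeps discovering “new” URLs that lead nowhere. A real deployment would additionally throttle responses to waste crawler time.

```python
import hashlib

def trap_page(path: str, n_links: int = 5) -> str:
    """Generate one tarpit page whose links point only to other trap pages.

    Child paths are hashes of the current path, so every page yields
    fresh-looking URLs while the crawler never escapes the /trap/ space.
    """
    links = []
    for i in range(n_links):
        child = hashlib.sha256(f"{path}/{i}".encode()).hexdigest()[:12]
        links.append(f'<a href="/trap/{child}">page {child}</a>')
    return "<html><body>" + "\n".join(links) + "</body></html>"
```

Because the child URLs are derived from hashes, the page space is effectively infinite from the crawler’s perspective while requiring no storage on the server.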

Paul McCartney has criticized a proposed change to UK copyright law that would allow AI developers to use creative content from the internet without explicit permission, unless creators opt-out. McCartney warns this could lead to a “Wild West” where creators’ rights are disregarded.

Microsoft has integrated its video editing software Clipchamp into Microsoft 365 Copilot, allowing users to create video content and scripts from AI prompts. Clipchamp generates videos using stock material and AI-generated voiceovers, which can then be further edited and shared. The Clipchamp Copilot Video Creator will roll out worldwide, primarily targeting business customers, with general availability planned for February 2025.

A study from the University of Jena has highlighted how difficult it is to distinguish real tissue images from AI-generated ones. Participants classified images more quickly when their judgments were correct, but the task remained challenging overall. The authors suggest introducing technical standards to verify data provenance and prevent fraud in scientific publications.
