Report: AI is advancing beyond humans, we need new benchmarks

Stanford University released its AI Index Report 2024, which noted that AI’s rapid advancement makes benchmark comparisons with humans increasingly less relevant.

The annual report provides comprehensive insight into the trends and state of AI development. It says AI models are now improving so fast that the benchmarks we use to measure them are losing their usefulness.

Many industry benchmarks compare AI models with how well humans perform the same tasks. The Massive Multitask Language Understanding (MMLU) benchmark is a good example.

It uses multiple-choice questions to evaluate LLMs across 57 subjects, including math, history, law, and ethics. The MMLU has been the go-to AI benchmark since 2019.

The human baseline score on the MMLU is 89.8%, and back in 2019, the average AI model scored just over 30%. Just 5 years later, Gemini Ultra became the first model to beat the human baseline with a score of 90.04%.
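To make the comparison concrete, here is a minimal sketch of how a multiple-choice benchmark like this could be scored and checked against a human baseline. The record layout, the function names, and the toy answers are invented for illustration; only the 89.8% baseline and the 90.04% score come from the figures above. It also assumes a simple average over all questions, which may differ from how MMLU scores are officially aggregated.

```python
from collections import defaultdict

HUMAN_BASELINE = 0.898  # MMLU human baseline quoted in the article


def accuracy_by_subject(records):
    """Return accuracy per subject.

    records: iterable of (subject, predicted_choice, correct_choice).
    This layout is a hypothetical one chosen for the sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for subject, predicted, correct in records:
        totals[subject] += 1
        hits[subject] += int(predicted == correct)
    return {subject: hits[subject] / totals[subject] for subject in totals}


def overall_accuracy(records):
    """Fraction of all questions answered correctly (assumed aggregation)."""
    records = list(records)
    return sum(predicted == correct for _, predicted, correct in records) / len(records)


# Toy answers, invented for illustration only.
toy_records = [
    ("math", "A", "A"),
    ("math", "B", "C"),
    ("law", "D", "D"),
    ("law", "A", "A"),
]

score = overall_accuracy(toy_records)
print(f"overall accuracy: {score:.1%}")
print("beats human baseline:", score > HUMAN_BASELINE)

# The reported Gemini Ultra score clears the baseline by a narrow margin.
print("reported 90.04% beats baseline:", 0.9004 > HUMAN_BASELINE)
```

The gap between 90.04% and 89.8% is under a quarter of a percentage point, which is part of why a saturated benchmark stops being informative.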

The report notes that current “AI systems routinely exceed human performance on standard benchmarks.” The trends in the graph below seem to indicate that the MMLU and other benchmarks need replacing.

AI models have reached and exceeded human baselines in multiple benchmarks. Source: The AI Index 2024 Annual Report

AI models have reached performance saturation on established benchmarks such as ImageNet, SQuAD, and SuperGLUE, so researchers are developing more challenging tests.

One example is the Graduate-Level Google-Proof Q&A Benchmark (GPQA), which allows AI models to be benchmarked against really smart people, rather than average human intelligence.

The GPQA test consists of 448 tough graduate-level multiple-choice questions. Experts who hold or are pursuing PhDs in the relevant field answer the questions correctly 65% of the time.

The GPQA paper says that when asked questions outside their field, “highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web.”

Last month Anthropic announced that Claude 3 scored just under 60% with 5-shot CoT prompting. We’re going to need a bigger benchmark.

Claude 3 gets ~60% accuracy on GPQA. It’s hard for me to understate how hard these questions are – literal PhDs (in different domains from the questions) with access to the internet get 34%.

PhDs *in the same domain* (also with internet access!) get 65% – 75% accuracy. https://t.co/ARAiCNXgU9 pic.twitter.com/PH8J13zIef

david rein (@idavidrein), March 4, 2024

Human evaluations and safety

The report noted that AI still faces significant problems: “It cannot reliably deal with facts, perform complex reasoning, or explain its conclusions.”

Those limitations contribute to another AI system characteristic that the report says is poorly measured: AI safety. We don’t have effective benchmarks that allow us to say, “This model is safer than that one.”

That’s partly because it’s difficult to measure, and partly because “AI developers lack transparency, especially regarding the disclosure of training data and methodologies.”

The report noted an interesting industry trend: crowd-sourcing human evaluations of AI performance rather than relying on benchmark tests.

Ranking a model’s image aesthetics or prose is difficult to do with a test. As a result, the report says that “benchmarking has slowly started shifting toward incorporating human evaluations like the Chatbot Arena Leaderboard rather than computerized rankings like ImageNet or SQuAD.”
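One way crowd-sourced preferences can be turned into a ranking is an Elo-style rating, where each "I prefer this answer" vote nudges the winner up and the loser down. The sketch below is an illustration only: the report excerpt does not say how the Chatbot Arena Leaderboard computes its rankings, and the model names, starting rating of 1000, and K-factor of 32 are all assumptions.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def record_vote(ratings, winner, loser, k=32.0):
    """Update ratings in place after one human preference vote.

    k is an assumed K-factor; larger values react faster to each vote.
    """
    surprise = 1.0 - expected_score(ratings[winner], ratings[loser])
    delta = k * surprise
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical models and votes, invented for illustration.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [
    ("model_a", "model_b"),  # (preferred, not preferred)
    ("model_a", "model_b"),
    ("model_b", "model_a"),
]

for winner, loser in votes:
    record_vote(ratings, winner, loser)

leaderboard = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Unlike a fixed test set, a rating like this has no ceiling to saturate: it only says which model people prefer, not why.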

As AI models watch the human baseline disappear in the rear-view mirror, sentiment may eventually determine which model we choose to use.

The trends indicate that AI models will eventually be smarter than us and harder to measure. We may soon find ourselves saying, “I don’t know why, but I just like this one better.”

The post Report: AI is advancing beyond humans, we need new benchmarks appeared first on DailyAI.
