
    Report: AI is advancing beyond humans, we need new benchmarks

    April 17, 2024

Stanford University released its AI Index Report 2024, which notes that AI’s rapid advancement is making benchmark comparisons with humans less and less relevant.

The annual report provides a comprehensive look at the state of AI and the trends shaping it. It argues that AI models are now improving so quickly that the benchmarks we use to measure them are becoming obsolete.

Many industry benchmarks compare AI models with human performance on the same tasks. The Massive Multitask Language Understanding (MMLU) benchmark is a good example.

    It uses multiple-choice questions to evaluate LLMs across 57 subjects, including math, history, law, and ethics. The MMLU has been the go-to AI benchmark since 2019.
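As a rough sketch of how a benchmark like this is scored, the snippet below poses each multiple-choice question to a model and reports overall accuracy. The `ask_model` function and the question format are hypothetical stand-ins for illustration, not MMLU’s official evaluation harness.

```python
# Minimal sketch of MMLU-style multiple-choice scoring.
# `ask_model` is a hypothetical stand-in for whatever LLM API is being evaluated.

def ask_model(prompt: str) -> str:
    """Return the model's chosen option letter, e.g. 'B' (placeholder)."""
    raise NotImplementedError

def accuracy(questions: list[dict]) -> float:
    """Each question dict has 'question', 'choices' (letter -> text), and 'answer' (letter)."""
    correct = 0
    for q in questions:
        options = "\n".join(f"{letter}. {text}" for letter, text in q["choices"].items())
        prompt = f"{q['question']}\n{options}\nAnswer with a single letter."
        reply = ask_model(prompt).strip().upper()
        if reply.startswith(q["answer"].upper()):
            correct += 1
    return correct / len(questions)

# The "human baseline" is simply this same accuracy measured for human test-takers
# over thousands of questions spanning 57 subjects.
```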

    The human baseline score on the MMLU is 89.8%, and back in 2019, the average AI model scored just over 30%. Just 5 years later, Gemini Ultra became the first model to beat the human baseline with a score of 90.04%.

    The report notes that current “AI systems routinely exceed human performance on standard benchmarks.” The trends in the graph below seem to indicate that the MMLU and other benchmarks need replacing.

    AI models have reached and exceeded human baselines in multiple benchmarks. Source: The AI Index 2024 Annual Report

AI models have reached performance saturation on established benchmarks such as ImageNet, SQuAD, and SuperGLUE, so researchers are developing more challenging tests.

One example is the Graduate-Level Google-Proof Q&A Benchmark (GPQA), which measures AI models against domain experts rather than average human performance.

The GPQA test consists of 400 tough graduate-level multiple-choice questions. Experts who hold or are pursuing a PhD in the relevant field answer them correctly about 65% of the time.

    The GPQA paper says that when asked questions outside their field, “highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web.”

Last month, Anthropic announced that Claude 3 scored just under 60% on GPQA with 5-shot CoT (chain-of-thought) prompting. We’re going to need a bigger benchmark.


    Claude 3 gets ~60% accuracy on GPQA. It’s hard for me to understate how hard these questions are—literal PhDs (in different domains from the questions) with access to the internet get 34%.

PhDs *in the same domain* (also with internet access!) get 65% – 75% accuracy.

    — david rein (@idavidrein) March 4, 2024
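For readers unfamiliar with the prompting style mentioned above, here is a hedged sketch of what 5-shot chain-of-thought prompting generally looks like: five worked examples with visible reasoning are prepended to the target question. The formatting and field names are illustrative assumptions, not Anthropic’s published evaluation setup.

```python
# Illustrative 5-shot chain-of-thought (CoT) prompt builder.
# The example structure ('question', 'reasoning', 'answer') is an assumption for
# illustration; it does not reproduce Anthropic's exact GPQA evaluation harness.

def build_5shot_cot_prompt(worked_examples: list[dict], target_question: str) -> str:
    parts = []
    for ex in worked_examples[:5]:  # exactly five demonstrations = "5-shot"
        parts.append(
            f"Question: {ex['question']}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Answer: {ex['answer']}\n"
        )
    # The model is then asked to produce its own reasoning before the final answer.
    parts.append(f"Question: {target_question}\nReasoning:")
    return "\n".join(parts)
```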

    Human evaluations and safety

    The report noted that AI still faces significant problems: “It cannot reliably deal with facts, perform complex reasoning, or explain its conclusions.”

Those limitations contribute to another AI system characteristic that the report says is poorly measured: AI safety. We don’t have effective benchmarks that allow us to say, “This model is safer than that one.”

    That’s partly because it’s difficult to measure, and partly because “AI developers lack transparency, especially regarding the disclosure of training data and methodologies.”

The report also notes an interesting industry trend: crowd-sourcing human evaluations of AI performance rather than relying on benchmark tests alone.

    Ranking a model’s image aesthetics or prose is difficult to do with a test. As a result, the report says that “benchmarking has slowly started shifting toward incorporating human evaluations like the Chatbot Arena Leaderboard rather than computerized rankings like ImageNet or SQuAD.”
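As a sketch of how crowd-sourced preferences can become a leaderboard, the snippet below applies a simplified Elo-style update to pairwise human votes, the general approach the Chatbot Arena popularized. The constants and the update rule are simplifying assumptions, not the leaderboard’s exact methodology.

```python
# Simplified Elo-style rating from pairwise human votes (Chatbot Arena-like idea).
# K-factor and starting ratings are arbitrary illustrative choices.

def elo_update(rating_win: float, rating_lose: float, k: float = 32.0) -> tuple[float, float]:
    expected_win = 1.0 / (1.0 + 10 ** ((rating_lose - rating_win) / 400.0))
    delta = k * (1.0 - expected_win)
    return rating_win + delta, rating_lose - delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}

# Each vote records which model's answer a human preferred in a head-to-head comparison.
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])

print(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))
```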

    As AI models watch the human baseline disappear in the rear-view mirror, sentiment may eventually determine which model we choose to use.

    The trends indicate that AI models will eventually be smarter than us and harder to measure. We may soon find ourselves saying, “I don’t know why, but I just like this one better.”

    The post Report: AI is advancing beyond humans, we need new benchmarks appeared first on DailyAI.

