
    Understanding the Limitations of Large Language Models (LLMs): New Benchmarks and Metrics for Classification Tasks

    July 2, 2024

Large Language Models (LLMs) have shown impressive performance on a wide range of tasks in recent years, especially classification tasks. These models perform remarkably well when the options they are given include the gold label, i.e., the correct answer. A significant limitation, however, is that if the gold label is deliberately left out of the options, LLMs still choose one of the remaining possibilities, even though none of them is correct. This raises serious concerns about how much these models actually understand in classification scenarios.

In the context of LLMs, this inability to express uncertainty, that is, to recognize that none of the offered labels is correct, raises two primary concerns:

Versatility and Label Processing: LLMs can work with any set of labels, even ones whose accuracy is debatable. To avoid misleading users, they should ideally imitate human behavior by selecting the correct label when it is present or pointing out when it is absent. Traditional classifiers, which rely on a fixed, predetermined label set, do not offer this flexibility.

Discriminative vs. Generative Capabilities: Because LLMs are primarily designed as generative models, their discriminative capabilities often receive less attention. High scores on existing benchmarks suggest that classification tasks are easy, but those benchmarks may not reflect human-like behavior when no correct option is offered, which can lead to an overestimate of how useful LLMs really are.
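To make the failure mode concrete, the following minimal Python sketch shows the same classification instance posed once with the gold label among the options and once without it. The prompt wording, the intent names, and the "none of the above" instruction are illustrative assumptions, not the exact format used in the paper.

```python
# A minimal sketch (assumed prompt format) of the two evaluation conditions:
# the same instance with and without the gold label among the options.

def build_prompt(text: str, options: list[str]) -> str:
    """Format a single classification query as a multiple-choice prompt."""
    lines = [f"Utterance: {text}", "Which intent does this express?"]
    lines += [f"{i + 1}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the number of the correct intent, "
                 "or say 'none of the above' if no option fits.")
    return "\n".join(lines)

gold = "card_arrival"
distractors = ["lost_or_stolen_card", "card_not_working", "declined_transfer"]

with_gold = build_prompt("When will my new card get here?", [gold] + distractors)
without_gold = build_prompt("When will my new card get here?", distractors)

# In the with-gold condition a capable model should pick `card_arrival`;
# in the without-gold condition a human-like model should decline to choose
# rather than pick one of the wrong options anyway.
print(with_gold)
print(without_gold)
```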

In recent research, three classification tasks have been assembled as benchmarks to support further work in this area.

    BANK77: An intent classification task.

    MC-TEST: A multiple-choice question-answering task.

EQUINFER: A newly developed task that asks which of four candidate equations is correct, given the surrounding paragraphs of a scientific paper.

This set of benchmarks has been named KNOW-NO, as it covers classification problems with different label-space sizes, label lengths, and scopes, including both instance-level and task-level label spaces.
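The sketch below illustrates, under assumptions, how a with-gold and a without-gold option set might be derived from an ordinary labeled example drawn from a task-level label space such as BANK77's fixed intent set; for instance-level tasks such as MC-TEST and EQUINFER, each question already carries its own candidate set, so the gold option would be replaced rather than sampled. The sampling scheme and option count are illustrative, not the paper's.

```python
import random

# A minimal sketch, under stated assumptions, of turning a labeled example
# into a with-gold and a without-gold classification instance.

def with_and_without_gold(gold: str, label_space: list[str],
                          n_options: int = 4, seed: int = 0):
    """Return (options_with_gold, options_without_gold) for one instance."""
    rng = random.Random(seed)
    distractors = rng.sample([l for l in label_space if l != gold], n_options - 1)
    with_gold = sorted(distractors + [gold])
    # Without-gold: every offered option is wrong, so a human-like answer is
    # to refuse to choose rather than to pick one anyway.
    extra = rng.choice([l for l in label_space
                        if l != gold and l not in distractors])
    without_gold = sorted(distractors + [extra])
    return with_gold, without_gold

# Illustrative BANK77-style intent labels (a small subset, for the sketch only).
bank77_labels = ["card_arrival", "lost_or_stolen_card", "card_not_working",
                 "declined_transfer", "exchange_rate", "top_up_failed"]
print(with_and_without_gold("card_arrival", bank77_labels))
```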

A new metric named OMNIACCURACY has also been presented to assess LLMs' performance more accurately. It evaluates LLMs' classification skills by combining their results along the two dimensions of the KNOW-NO framework, which are as follows.

ACCURACY-W/-GOLD: This measures the conventional accuracy when the correct label is provided.

    ACCURACY-W/O-GOLD: This measures accuracy when the correct label is not available.

OMNIACCURACY seeks to better approximate human-level discriminative intelligence in classification tasks by capturing an LLM's capacity to handle both situations: those in which the correct label is present and those in which it is not.
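As a rough illustration, the sketch below scores a model along the two KNOW-NO dimensions and combines them. How the paper judges a without-gold answer to be correct and how it combines the two accuracies into OMNIACCURACY are not reproduced here; counting an explicit refusal as correct and taking the simple mean are assumptions made for this sketch.

```python
# A minimal sketch, under assumptions, of scoring along the two dimensions.

ABSTAIN = "none of the above"

def accuracy_with_gold(predictions, golds):
    """Fraction of instances where the model picked the gold label."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)

def accuracy_without_gold(predictions):
    """Fraction of instances where the model declined to pick a wrong option."""
    return sum(p == ABSTAIN for p in predictions) / len(predictions)

def omni_accuracy(acc_with, acc_without):
    """Combine the two dimensions; a simple mean is assumed here."""
    return (acc_with + acc_without) / 2

preds_with    = ["card_arrival", "exchange_rate", "card_arrival"]
golds         = ["card_arrival", "top_up_failed", "card_arrival"]
preds_without = [ABSTAIN, "card_not_working", ABSTAIN]

acc_w  = accuracy_with_gold(preds_with, golds)   # 2/3
acc_wo = accuracy_without_gold(preds_without)    # 2/3
print(acc_w, acc_wo, omni_accuracy(acc_w, acc_wo))
```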

    The team has summarized their primary contributions as follows.

This study is the first to draw attention to the limitations of LLMs when the correct answer is absent from a classification task's options.

CLASSIFY-W/O-GOLD has been introduced, a new framework that formalizes this task and is used to assess LLMs accordingly.

The KNOW-NO benchmark has been presented, comprising one newly created task and two well-known classification tasks. Its purpose is to assess LLMs in the CLASSIFY-W/O-GOLD scenario.

The OMNIACCURACY metric has been proposed, which combines the outcomes when correct labels are present and when they are absent in order to evaluate LLM performance in classification tasks. It provides a more thorough assessment of the models' capabilities and a clearer picture of how well they function across situations.

Check out the Paper. All credit for this research goes to the researchers of this project.

    The post Understanding the Limitations of Large Language Models (LLMs): New Benchmarks and Metrics for Classification Tasks appeared first on MarkTechPost.
