
    Understanding the Limitations of Large Language Models (LLMs): New Benchmarks and Metrics for Classification Tasks

    July 2, 2024

Large Language Models (LLMs) have shown impressive performance on a wide range of tasks in recent years, especially classification tasks. These models perform remarkably well when the options they are given include the gold label, i.e., the correct answer. A significant limitation, however, is that if the gold label is deliberately left out of the options, LLMs still choose one of the remaining possibilities, even though none of them is correct. This raises serious concerns about how much these models actually understand in classification scenarios.

In the context of LLMs, this inability to express uncertainty, that is, to recognize that none of the offered labels is correct, raises two primary concerns:

Versatility and Label Processing: LLMs can work with any set of labels, even ones whose accuracy is debatable. To avoid misleading users, they should ideally imitate human behavior by selecting the correct label when it is present or pointing out when it is absent. Traditional classifiers, which rely on a fixed, predetermined label set, do not offer this flexibility.

Discriminative vs. Generative Capabilities: Because LLMs are primarily designed as generative models, their discriminative capabilities often receive less attention. High scores on existing benchmarks suggest that classification tasks are easy, but those benchmarks may not reflect human-like behavior when no correct option is offered, which can lead to an overestimate of how useful LLMs really are.
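To make the failure mode concrete, the following minimal Python sketch shows the same classification instance posed once with the gold label among the options and once without it. The prompt wording, the intent names, and the "none of the above" instruction are illustrative assumptions, not the exact format used in the paper.

```python
# A minimal sketch (assumed prompt format) of the two evaluation conditions:
# the same instance with and without the gold label among the options.

def build_prompt(text: str, options: list[str]) -> str:
    """Format a single classification query as a multiple-choice prompt."""
    lines = [f"Utterance: {text}", "Which intent does this express?"]
    lines += [f"{i + 1}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the number of the correct intent, "
                 "or say 'none of the above' if no option fits.")
    return "\n".join(lines)

gold = "card_arrival"
distractors = ["lost_or_stolen_card", "card_not_working", "declined_transfer"]

with_gold = build_prompt("When will my new card get here?", [gold] + distractors)
without_gold = build_prompt("When will my new card get here?", distractors)

# In the with-gold condition a capable model should pick `card_arrival`;
# in the without-gold condition a human-like model should decline to choose
# rather than pick one of the wrong options anyway.
print(with_gold)
print(without_gold)
```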

In recent research, three classification tasks have been assembled as benchmarks to support further work in this area.

    BANK77: An intent classification task.

    MC-TEST: A multiple-choice question-answering task.

EQUINFER: A newly developed task that asks which of four candidate equations is correct, given the surrounding paragraphs of a scientific paper.

This set of benchmarks has been named KNOW-NO, as it covers classification problems with different label-space sizes, label lengths, and scopes, including both instance-level and task-level label spaces.
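The sketch below illustrates, under assumptions, how a with-gold and a without-gold option set might be derived from an ordinary labeled example drawn from a task-level label space such as BANK77's fixed intent set; for instance-level tasks such as MC-TEST and EQUINFER, each question already carries its own candidate set, so the gold option would be replaced rather than sampled. The sampling scheme and option count are illustrative, not the paper's.

```python
import random

# A minimal sketch, under stated assumptions, of turning a labeled example
# into a with-gold and a without-gold classification instance.

def with_and_without_gold(gold: str, label_space: list[str],
                          n_options: int = 4, seed: int = 0):
    """Return (options_with_gold, options_without_gold) for one instance."""
    rng = random.Random(seed)
    distractors = rng.sample([l for l in label_space if l != gold], n_options - 1)
    with_gold = sorted(distractors + [gold])
    # Without-gold: every offered option is wrong, so a human-like answer is
    # to refuse to choose rather than to pick one anyway.
    extra = rng.choice([l for l in label_space
                        if l != gold and l not in distractors])
    without_gold = sorted(distractors + [extra])
    return with_gold, without_gold

# Illustrative BANK77-style intent labels (a small subset, for the sketch only).
bank77_labels = ["card_arrival", "lost_or_stolen_card", "card_not_working",
                 "declined_transfer", "exchange_rate", "top_up_failed"]
print(with_and_without_gold("card_arrival", bank77_labels))
```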

A new metric named OMNIACCURACY has also been presented to assess LLMs' performance more accurately. It evaluates LLMs' classification skills by combining their results along the two dimensions of the KNOW-NO framework, which are as follows.

ACCURACY-W/-GOLD: This measures the conventional accuracy when the correct label is provided.

    ACCURACY-W/O-GOLD: This measures accuracy when the correct label is not available.

OMNIACCURACY seeks to better approximate human-level discriminative intelligence in classification tasks by capturing an LLM's capacity to handle both situations: those in which the correct label is present and those in which it is not.
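As a rough illustration, the sketch below scores a model along the two KNOW-NO dimensions and combines them. How the paper judges a without-gold answer to be correct and how it combines the two accuracies into OMNIACCURACY are not reproduced here; counting an explicit refusal as correct and taking the simple mean are assumptions made for this sketch.

```python
# A minimal sketch, under assumptions, of scoring along the two dimensions.

ABSTAIN = "none of the above"

def accuracy_with_gold(predictions, golds):
    """Fraction of instances where the model picked the gold label."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)

def accuracy_without_gold(predictions):
    """Fraction of instances where the model declined to pick a wrong option."""
    return sum(p == ABSTAIN for p in predictions) / len(predictions)

def omni_accuracy(acc_with, acc_without):
    """Combine the two dimensions; a simple mean is assumed here."""
    return (acc_with + acc_without) / 2

preds_with    = ["card_arrival", "exchange_rate", "card_arrival"]
golds         = ["card_arrival", "top_up_failed", "card_arrival"]
preds_without = [ABSTAIN, "card_not_working", ABSTAIN]

acc_w  = accuracy_with_gold(preds_with, golds)   # 2/3
acc_wo = accuracy_without_gold(preds_without)    # 2/3
print(acc_w, acc_wo, omni_accuracy(acc_w, acc_wo))
```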

    The team has summarized their primary contributions as follows.

This study is the first to draw attention to the limitations of LLMs when the correct answer is absent from a classification task's options.

CLASSIFY-W/O-GOLD has been introduced, a new framework that formalizes this task and is used to assess LLMs accordingly.

The KNOW-NO benchmark has been presented, comprising one newly created task and two well-known classification tasks. Its purpose is to assess LLMs in the CLASSIFY-W/O-GOLD scenario.

The OMNIACCURACY metric has been proposed, which combines the outcomes when correct labels are present and when they are absent in order to evaluate LLM performance in classification tasks. It provides a more thorough assessment of the models' capabilities and a clearer picture of how well they function across situations.

Check out the Paper. All credit for this research goes to the researchers of this project.

    The post Understanding the Limitations of Large Language Models (LLMs): New Benchmarks and Metrics for Classification Tasks appeared first on MarkTechPost.
