    Teaching AI to Say ‘I Don’t Know’: A New Dataset Mitigates Hallucinations from Reinforcement Finetuning

    June 6, 2025

    Reinforcement finetuning (RFT) uses reward signals to guide a large language model toward desirable behavior. The method sharpens the model’s ability to produce logical and structured outputs by reinforcing correct responses. Yet a challenge persists: ensuring that these models also know when not to respond, particularly when faced with incomplete or misleading questions that have no definite answer.
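
    To make the mechanics concrete, here is a minimal sketch of the kind of verifiable, exact-match reward often used when reinforcement finetuning is applied to math problems. The function names and matching logic are illustrative assumptions, not the paper’s implementation; note how a refusal earns the same zero reward as a wrong answer, which is the failure mode discussed below.

    # A minimal sketch (not the paper's code) of a verifiable,
    # exact-match RFT reward for math problems.

    def extract_final_answer(completion: str) -> str:
        """Naively treat the last non-empty line as the final answer."""
        lines = [line.strip() for line in completion.strip().splitlines() if line.strip()]
        return lines[-1] if lines else ""

    def standard_rft_reward(completion: str, reference_answer: str) -> float:
        """Reward 1.0 for a matching final answer and 0.0 otherwise.
        A refusal such as "I don't know" scores 0.0 exactly like a wrong
        answer, so training pressure pushes the model away from refusing."""
        return 1.0 if extract_final_answer(completion) == reference_answer.strip() else 0.0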

    The problem arises when language models, after reinforcement finetuning, begin to lose their ability to refuse to answer unclear or ambiguous queries. Instead of signaling uncertainty, the models tend to produce confidently stated but incorrect responses. This phenomenon, identified in the paper as the “hallucination tax,” highlights a growing risk. As models are trained to perform better, they may also become more likely to hallucinate answers in situations where silence would be more appropriate. This is especially hazardous in domains that require high trust and precision.

    Tools currently used to train large language models often overlook the importance of refusal behavior. Reinforcement finetuning frameworks tend to reward only correct answers and penalize incorrect ones, ignoring cases where the valid response is no answer at all. Because the reward never reinforces refusal, the result is overconfident models: the paper shows that refusal rates dropped to near zero across multiple models after standard RFT, demonstrating that current training fails to address hallucination properly.

    Researchers from the University of Southern California developed the Synthetic Unanswerable Math (SUM) dataset. SUM introduces implicitly unanswerable math problems by modifying existing questions, for example by removing key information or introducing logical inconsistencies. The researchers used DeepScaleR as the base dataset and employed the o3-mini model to generate high-quality unanswerable questions. The synthetic dataset aims to teach models to recognize when a problem lacks sufficient information and to respond accordingly.
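
    The paper’s actual generation prompt is not reproduced here, but a hypothetical sketch of the rewriting step might look like the following, using the OpenAI Python client as a stand-in for the o3-mini calls; the instruction text and function name are invented for illustration.

    # Hypothetical sketch of SUM-style data generation: ask a strong model
    # to rewrite a solvable problem so that key information is missing or
    # logically inconsistent, making it implicitly unanswerable.
    from openai import OpenAI

    client = OpenAI()

    REWRITE_INSTRUCTION = (
        "Rewrite the following math problem so that it becomes unanswerable, "
        "for example by removing a key quantity or introducing a logical "
        "inconsistency. Keep the wording natural and plausible.\n\n"
        "Problem: {problem}"
    )

    def make_unanswerable(problem: str) -> str:
        response = client.chat.completions.create(
            model="o3-mini",
            messages=[{"role": "user", "content": REWRITE_INSTRUCTION.format(problem=problem)}],
        )
        return response.choices[0].message.content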

    SUM’s core technique is to mix answerable and unanswerable problems during training. Questions are modified to become ambiguous or unsolvable while remaining plausible, and the training prompts instruct models to say “I don’t know” for unanswerable inputs. By mixing SUM data into the reinforcement finetuning set at a ratio of just 10%, models begin to leverage inference-time reasoning to evaluate uncertainty. This allows them to refuse answers more appropriately without impairing their performance on solvable problems; a sketch of such a refusal-aware setup follows.
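
    Here is a minimal sketch of such a setup, assuming a 10% mixing ratio and a refusal string matched verbatim; the field names, mixing helper, and reward logic are illustrative assumptions rather than the paper’s code.

    import random

    REFUSAL = "I don't know"

    def mix_training_data(answerable, unanswerable, sum_ratio=0.10):
        """Blend SUM-style unanswerable items in so that they make up
        roughly sum_ratio (10% here) of the resulting training set."""
        n_extra = int(len(answerable) * sum_ratio / (1.0 - sum_ratio))
        mixed = answerable + random.sample(unanswerable, min(n_extra, len(unanswerable)))
        random.shuffle(mixed)
        return mixed

    def refusal_aware_reward(completion: str, example: dict) -> float:
        """Credit correct answers on answerable items and explicit
        refusals on unanswerable ones; anything else earns no reward."""
        stripped = completion.strip()
        final = stripped.splitlines()[-1].strip() if stripped else ""
        if example["answerable"]:
            return 1.0 if final == example["reference_answer"] else 0.0
        return 1.0 if REFUSAL.lower() in final.lower() else 0.0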

    Performance analysis shows significant improvements. After training with SUM, the Qwen2.5-7B model increased its refusal rate from 0.01 to 0.73 on the SUM benchmark and from 0.01 to 0.81 on the UMWP benchmark. On the SelfAware dataset, refusal accuracy rose dramatically from 0.01 to 0.94. Llama-3.1-8B-Instruct showed a similar trend, with refusal rates improving from 0.00 to 0.75 on SUM and from 0.01 to 0.79 on UMWP. Despite these gains in refusal behavior, accuracy on answerable datasets such as GSM8K and MATH-500 remained stable, with most changes between 0.00 and -0.05. This minimal drop indicates that refusal training can be introduced without major sacrifices in task performance.

    This study outlines a clear trade-off between improved reasoning and trustworthiness. Reinforcement finetuning, while powerful, tends to suppress cautious behavior. The SUM dataset corrects this by teaching models to recognize what they cannot solve. With only a small addition to training data, language models become better at identifying the boundaries of their knowledge. This approach marks a significant step in making AI systems not just smarter but also more careful and honest.


    Check out the Paper and Dataset on Hugging Face. All credit for this research goes to the researchers of this project.

    The post Teaching AI to Say ‘I Don’t Know’: A New Dataset Mitigates Hallucinations from Reinforcement Finetuning appeared first on MarkTechPost.
