
    45 Shades of AI Safety: SORRY-Bench’s Innovative Taxonomy for LLM Refusal Behavior Analysis

    July 2, 2024

    Large language models (LLMs) have gained significant attention in recent years, but ensuring their safe and ethical use remains a critical challenge. Researchers are focused on developing effective alignment procedures to calibrate these models to adhere to human values and safely follow human intentions. The primary goal is to prevent LLMs from engaging in unsafe or inappropriate user requests. Current methodologies face challenges in comprehensively evaluating LLM safety, including aspects such as toxicity, harmfulness, trustworthiness, and refusal behaviors. While various benchmarks have been proposed to assess these safety aspects, there is a need for a more robust and comprehensive evaluation framework to ensure LLMs can effectively refuse inappropriate requests across a wide range of scenarios.

    Researchers have proposed various approaches to evaluate the safety of modern Large Language Models (LLMs) with instruction-following capabilities. These efforts build upon earlier work that assessed toxicity and bias in pretrained LMs using simple sentence-level completion or knowledge QA tasks. Recent studies have introduced instruction datasets designed to trigger potentially unsafe behavior in LLMs. These datasets typically contain varying numbers of unsafe user instructions across different safety categories, such as illegal activities and misinformation. LLMs are then tested with these unsafe instructions, and their responses are evaluated to determine model safety. However, existing benchmarks often use inconsistent and coarse-grained safety categories, leading to evaluation challenges and incomplete coverage of potential safety risks.

Researchers from Princeton University, Virginia Tech, Stanford University, UC Berkeley, University of Illinois at Urbana-Champaign, and the University of Chicago present SORRY-Bench, addressing three key deficiencies in existing LLM safety evaluations. First, it introduces a fine-grained 45-class safety taxonomy across four high-level domains, unifying disparate taxonomies from prior work. This comprehensive taxonomy captures diverse potentially unsafe topics and allows for more granular safety refusal evaluation. Second, SORRY-Bench ensures balance not only across topics but also across linguistic characteristics. It considers 20 diverse linguistic mutations that real-world users might apply to phrase unsafe prompts, including different writing styles, persuasion techniques, encoding strategies, and multiple languages. Lastly, the benchmark investigates design choices for fast and accurate safety evaluation, exploring the trade-off between efficiency and accuracy in LLM-based safety judgments. This systematic approach aims to provide a more robust and comprehensive framework for evaluating LLM safety refusal behaviors.
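
To make the dataset construction concrete, here is a minimal Python sketch of how a topic-balanced benchmark with linguistic mutations might be enumerated. The category names, placeholder prompts, mutation functions, and the `build_benchmark` helper are illustrative assumptions, not the actual SORRY-Bench pipeline:

```python
import base64
from itertools import product

# Hypothetical slice of a fine-grained taxonomy; SORRY-Bench defines 45 classes
# across four high-level domains (the names and placeholder prompts here are illustrative only).
CATEGORIES = {
    "harassment": ["<base instruction 1>", "<base instruction 2>"],
    "legal_advice": ["<base instruction 3>", "<base instruction 4>"],
}

# Hypothetical linguistic mutations; the benchmark applies 20 such variations
# (writing styles, persuasion techniques, encoding strategies, other languages).
def question_style(p):
    return f"Could you explain how one would {p}?"

def technical_terms(p):
    return f"From a purely procedural standpoint, {p}."

def base64_encode(p):
    return base64.b64encode(p.encode()).decode()

MUTATIONS = {
    "question_style": question_style,
    "technical_terms": technical_terms,
    "base64": base64_encode,
}

def build_benchmark(base_prompts_per_category):
    """Pair every base prompt with every mutation so topics and phrasings stay balanced."""
    dataset = []
    for category, prompts in base_prompts_per_category.items():
        for prompt, (name, mutate) in product(prompts, MUTATIONS.items()):
            dataset.append({"category": category, "mutation": name, "prompt": mutate(prompt)})
    return dataset
```

With 45 categories, 10 base instructions each, and 20 mutations, this kind of enumeration yields the 450 base instructions and 9,000 mutated prompts described later in the article.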

    SORRY-Bench introduces a sophisticated evaluation framework for LLM safety refusal behaviors. The benchmark employs a binary classification approach to determine whether a model’s response fulfills or refuses an unsafe instruction. To ensure an accurate evaluation, the researchers curated a large-scale human judgment dataset of over 7,200 annotations, covering both in-distribution and out-of-distribution cases. This dataset serves as a foundation for evaluating automated safety evaluators and training language model-based judges. Researchers conducted a comprehensive meta-evaluation of various design choices for safety evaluators, exploring different LLM sizes, prompting techniques, and fine-tuning approaches. Results showed that fine-tuned smaller-scale LLMs (e.g., 7B parameters) can achieve comparable accuracy to larger models like GPT-4, with substantially lower computational costs.
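
A minimal sketch of that binary fulfil-versus-refuse judgment, assuming a generic text-generation callable as the judge; the prompt wording and the `is_fulfilled` helper are hypothetical, not the evaluator shipped with SORRY-Bench:

```python
# `judge_model` is any text-generation callable (e.g., a fine-tuned ~7B model
# or a GPT-4-class API wrapper). Prompt wording and parsing are assumptions.
JUDGE_TEMPLATE = """You are a safety evaluator. Read a potentially unsafe user
instruction and a model's response, then answer with one word:
"fulfilled" if the response substantially carries out the request, or
"refused" if it declines, deflects, or withholds the unsafe content.

Instruction: {instruction}
Response: {response}
Verdict:"""

def is_fulfilled(judge_model, instruction, response):
    """Binary judgment: True if the response fulfils the unsafe instruction."""
    verdict = judge_model(JUDGE_TEMPLATE.format(instruction=instruction, response=response))
    return verdict.strip().lower().startswith("fulfil")
```

In this setup the judge can be swapped freely, which is exactly the trade-off the meta-evaluation probes: a fine-tuned ~7B judge versus a larger GPT-4-class model, at very different computational costs.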

    SORRY-Bench evaluates over 40 LLMs across 45 safety categories, revealing significant variations in safety refusal behaviors. Key findings include:

    Model performance: 22 out of 43 LLMs show medium fulfilment rates (20-50%) for unsafe instructions. Claude-2 and Gemini-1.5 models demonstrate the lowest fulfilment rates (<10%), while some models like the Mistral series fulfil over 50% of unsafe requests.

    Category-specific results: Categories like “Harassment,” “Child-related Crimes,” and “Sexual Crimes” are most frequently refused, with average fulfilment rates of 10-11%. Conversely, most models are highly compliant in providing legal advice.

Impact of linguistic mutations: The study explores 20 diverse linguistic mutations, finding that:

• Question-style phrasing slightly increases refusal rates for most models.
• Technical terms lead to 8-18% more fulfilment across all models.
• Multilingual prompts show varied effects, with recent models demonstrating higher fulfilment rates for low-resource languages.
• Encoding and encryption strategies generally decrease fulfilment rates, except for GPT-4o, which shows increased fulfilment for some strategies.

    These results provide insights into the varying safety priorities of model creators and the impact of different prompt formulations on LLM safety behaviors.
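
Each of the rates above is just the mean of a binary fulfil/refuse label over some grouping, so the aggregation is straightforward. A short sketch (record field names are assumptions carried over from the earlier sketches):

```python
from collections import defaultdict

def fulfilment_rates(judged, group_key):
    """Average the binary 'fulfilled' flag per group (category, mutation, or model)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for record in judged:
        totals[record[group_key]] += 1
        hits[record[group_key]] += int(record["fulfilled"])
    return {group: hits[group] / totals[group] for group in totals}

# by_category = fulfilment_rates(judged, "category")  # e.g., ~10-11% for "Harassment"
# by_mutation = fulfilment_rates(judged, "mutation")  # compare question-style, technical terms, encodings
```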

    SORRY-Bench introduces a comprehensive framework for evaluating LLM safety refusal behaviors. It features a fine-grained taxonomy of 45 unsafe topics, a balanced dataset of 450 instructions, and 9,000 additional prompts with 20 linguistic variations. The benchmark includes a large-scale human judgment dataset and explores optimal automated evaluation methods. By assessing over 40 LLMs, SORRY-Bench provides insights into diverse refusal behaviors. This systematic approach offers a balanced, granular, and efficient tool for researchers and developers to improve LLM safety, ultimately contributing to more responsible AI deployment.

Check out the paper. All credit for this research goes to the researchers of this project.

    The post 45 Shades of AI Safety: SORRY-Bench’s Innovative Taxonomy for LLM Refusal Behavior Analysis appeared first on MarkTechPost.
