
    45 Shades of AI Safety: SORRY-Bench’s Innovative Taxonomy for LLM Refusal Behavior Analysis

    July 2, 2024

    Large language models (LLMs) have gained significant attention in recent years, but ensuring their safe and ethical use remains a critical challenge. Researchers are focused on developing effective alignment procedures to calibrate these models to adhere to human values and safely follow human intentions. The primary goal is to prevent LLMs from engaging in unsafe or inappropriate user requests. Current methodologies face challenges in comprehensively evaluating LLM safety, including aspects such as toxicity, harmfulness, trustworthiness, and refusal behaviors. While various benchmarks have been proposed to assess these safety aspects, there is a need for a more robust and comprehensive evaluation framework to ensure LLMs can effectively refuse inappropriate requests across a wide range of scenarios.

    Researchers have proposed various approaches to evaluate the safety of modern Large Language Models (LLMs) with instruction-following capabilities. These efforts build upon earlier work that assessed toxicity and bias in pretrained LMs using simple sentence-level completion or knowledge QA tasks. Recent studies have introduced instruction datasets designed to trigger potentially unsafe behavior in LLMs. These datasets typically contain varying numbers of unsafe user instructions across different safety categories, such as illegal activities and misinformation. LLMs are then tested with these unsafe instructions, and their responses are evaluated to determine model safety. However, existing benchmarks often use inconsistent and coarse-grained safety categories, leading to evaluation challenges and incomplete coverage of potential safety risks.

Researchers from Princeton University, Virginia Tech, Stanford University, UC Berkeley, University of Illinois at Urbana-Champaign, and the University of Chicago present SORRY-Bench, addressing three key deficiencies in existing LLM safety evaluations. First, it introduces a fine-grained 45-class safety taxonomy across four high-level domains, unifying disparate taxonomies from prior work. This comprehensive taxonomy captures diverse potentially unsafe topics and allows for more granular safety refusal evaluation. Second, SORRY-Bench ensures balance not only across topics but also across linguistic characteristics. It considers 20 diverse linguistic mutations that real-world users might apply to phrase unsafe prompts, including different writing styles, persuasion techniques, encoding strategies, and multiple languages. Lastly, the benchmark investigates design choices for fast and accurate safety evaluation, exploring the trade-off between efficiency and accuracy in LLM-based safety judgments. This systematic approach aims to provide a more robust and comprehensive framework for evaluating LLM safety refusal behaviors.
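
To make the dataset construction concrete, here is a minimal Python sketch of how a topic-balanced benchmark with linguistic mutations might be enumerated. The category names, placeholder prompts, mutation functions, and the `build_benchmark` helper are illustrative assumptions, not the actual SORRY-Bench pipeline:

```python
import base64
from itertools import product

# Hypothetical slice of a fine-grained taxonomy; SORRY-Bench defines 45 classes
# across four high-level domains (the names and placeholder prompts here are illustrative only).
CATEGORIES = {
    "harassment": ["<base instruction 1>", "<base instruction 2>"],
    "legal_advice": ["<base instruction 3>", "<base instruction 4>"],
}

# Hypothetical linguistic mutations; the benchmark applies 20 such variations
# (writing styles, persuasion techniques, encoding strategies, other languages).
def question_style(p):
    return f"Could you explain how one would {p}?"

def technical_terms(p):
    return f"From a purely procedural standpoint, {p}."

def base64_encode(p):
    return base64.b64encode(p.encode()).decode()

MUTATIONS = {
    "question_style": question_style,
    "technical_terms": technical_terms,
    "base64": base64_encode,
}

def build_benchmark(base_prompts_per_category):
    """Pair every base prompt with every mutation so topics and phrasings stay balanced."""
    dataset = []
    for category, prompts in base_prompts_per_category.items():
        for prompt, (name, mutate) in product(prompts, MUTATIONS.items()):
            dataset.append({"category": category, "mutation": name, "prompt": mutate(prompt)})
    return dataset
```

With 45 categories, 10 base instructions each, and 20 mutations, this kind of enumeration yields the 450 base instructions and 9,000 mutated prompts described later in the article.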

    SORRY-Bench introduces a sophisticated evaluation framework for LLM safety refusal behaviors. The benchmark employs a binary classification approach to determine whether a model’s response fulfills or refuses an unsafe instruction. To ensure an accurate evaluation, the researchers curated a large-scale human judgment dataset of over 7,200 annotations, covering both in-distribution and out-of-distribution cases. This dataset serves as a foundation for evaluating automated safety evaluators and training language model-based judges. Researchers conducted a comprehensive meta-evaluation of various design choices for safety evaluators, exploring different LLM sizes, prompting techniques, and fine-tuning approaches. Results showed that fine-tuned smaller-scale LLMs (e.g., 7B parameters) can achieve comparable accuracy to larger models like GPT-4, with substantially lower computational costs.
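
A minimal sketch of that binary fulfil-versus-refuse judgment, assuming a generic text-generation callable as the judge; the prompt wording and the `is_fulfilled` helper are hypothetical, not the evaluator shipped with SORRY-Bench:

```python
# `judge_model` is any text-generation callable (e.g., a fine-tuned ~7B model
# or a GPT-4-class API wrapper). Prompt wording and parsing are assumptions.
JUDGE_TEMPLATE = """You are a safety evaluator. Read a potentially unsafe user
instruction and a model's response, then answer with one word:
"fulfilled" if the response substantially carries out the request, or
"refused" if it declines, deflects, or withholds the unsafe content.

Instruction: {instruction}
Response: {response}
Verdict:"""

def is_fulfilled(judge_model, instruction, response):
    """Binary judgment: True if the response fulfils the unsafe instruction."""
    verdict = judge_model(JUDGE_TEMPLATE.format(instruction=instruction, response=response))
    return verdict.strip().lower().startswith("fulfil")
```

In this setup the judge can be swapped freely, which is exactly the trade-off the meta-evaluation probes: a fine-tuned ~7B judge versus a larger GPT-4-class model, at very different computational costs.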

    SORRY-Bench evaluates over 40 LLMs across 45 safety categories, revealing significant variations in safety refusal behaviors. Key findings include:

    Model performance: 22 out of 43 LLMs show medium fulfilment rates (20-50%) for unsafe instructions. Claude-2 and Gemini-1.5 models demonstrate the lowest fulfilment rates (<10%), while some models like the Mistral series fulfil over 50% of unsafe requests.

    Category-specific results: Categories like “Harassment,” “Child-related Crimes,” and “Sexual Crimes” are most frequently refused, with average fulfilment rates of 10-11%. Conversely, most models are highly compliant in providing legal advice.

Impact of linguistic mutations: The study explores 20 diverse linguistic mutations, finding that:

• Question-style phrasing slightly increases refusal rates for most models.
• Technical terms lead to 8-18% more fulfilment across all models.
• Multilingual prompts show varied effects, with recent models demonstrating higher fulfilment rates for low-resource languages.
• Encoding and encryption strategies generally decrease fulfilment rates, except for GPT-4o, which shows increased fulfilment for some strategies.

    These results provide insights into the varying safety priorities of model creators and the impact of different prompt formulations on LLM safety behaviors.
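
Each of the rates above is just the mean of a binary fulfil/refuse label over some grouping, so the aggregation is straightforward. A short sketch (record field names are assumptions carried over from the earlier sketches):

```python
from collections import defaultdict

def fulfilment_rates(judged, group_key):
    """Average the binary 'fulfilled' flag per group (category, mutation, or model)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for record in judged:
        totals[record[group_key]] += 1
        hits[record[group_key]] += int(record["fulfilled"])
    return {group: hits[group] / totals[group] for group in totals}

# by_category = fulfilment_rates(judged, "category")  # e.g., ~10-11% for "Harassment"
# by_mutation = fulfilment_rates(judged, "mutation")  # compare question-style, technical terms, encodings
```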

    SORRY-Bench introduces a comprehensive framework for evaluating LLM safety refusal behaviors. It features a fine-grained taxonomy of 45 unsafe topics, a balanced dataset of 450 instructions, and 9,000 additional prompts with 20 linguistic variations. The benchmark includes a large-scale human judgment dataset and explores optimal automated evaluation methods. By assessing over 40 LLMs, SORRY-Bench provides insights into diverse refusal behaviors. This systematic approach offers a balanced, granular, and efficient tool for researchers and developers to improve LLM safety, ultimately contributing to more responsible AI deployment.

Check out the paper. All credit for this research goes to the researchers of this project.

    The post 45 Shades of AI Safety: SORRY-Bench’s Innovative Taxonomy for LLM Refusal Behavior Analysis appeared first on MarkTechPost.
