
    Training LLMs to self-detoxify their language

    April 14, 2025

    As we mature from childhood, our vocabulary — as well as the ways we use it — grows, and our experiences become richer, allowing us to think, reason, and interact with others with specificity and intention. Accordingly, our word choices evolve to align with our personal values, ethics, cultural norms, and views. Over time, most of us develop an internal “guide” that enables us to learn the context behind a conversation; it also frequently directs us away from sharing information and sentiments that are, or could be, harmful or inappropriate. As it turns out, large language models (LLMs) — which are trained on extensive, public datasets and therefore often have biases and toxic language baked in — can gain a similar capacity to moderate their own language.

    A new method from MIT, the MIT-IBM Watson AI Lab, and IBM Research, called self-disciplined autoregressive sampling (SASA), allows LLMs to detoxify their own outputs, without sacrificing fluency. 

    Unlike other detoxifying methods, this decoding algorithm learns a boundary between toxic/nontoxic subspaces within the LLM’s own internal representation, without altering the model’s parameters, requiring retraining, or relying on an external reward model. Then, during inference, the algorithm assesses the toxicity value of the partially generated phrase: the tokens (words) already generated and accepted, together with each potential new token that could reasonably come next, are scored for their proximity to the classifier boundary. Next, it selects a word option that places the phrase in the nontoxic space, ultimately offering a fast and efficient way to generate less-toxic language.
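
    As a rough illustration of that scoring step, the sketch below (a simplification, not the authors’ released code) treats the value of a partial phrase as its sentence embedding’s signed score against a linear boundary; the embedding function embed and the classifier parameters w and b are placeholders for the components described later in the article.

        import numpy as np

        def phrase_value(partial_text, embed, w, b):
            """Signed score of a partial phrase against a linear toxic/nontoxic boundary.

            Positive values place the phrase in the nontoxic space, negative values in
            the toxic space. (Illustrative only: embed, w, and b are hypothetical.)
            """
            x = embed(partial_text)  # vector representation of the partial phrase
            return float(np.dot(w, x) + b)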

    “We wanted to find out a way with any existing language model [that], during the generation process, the decoding can be subject to some human values; the example here we are taking is toxicity,” says the study’s lead author Ching-Yun “Irene” Ko PhD ’24, a former graduate intern with the MIT-IBM Watson AI Lab and a current research scientist at IBM’s Thomas J. Watson Research Center in New York.

    Ko’s co-authors include Luca Daniel, professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and Ko’s graduate advisor; and several members of the MIT-IBM Watson AI Lab and/or IBM Research — Pin-Yu Chen, Payel Das, Youssef Mroueh, Soham Dan, Georgios Kollias, Subhajit Chaudhury, and Tejaswini Pedapati. The work will be presented at the International Conference on Learning Representations.

    Finding the “guardrails”

    The training resources behind LLMs almost always include content collected from public spaces like the internet and other readily available datasets. As such, curse words and bullying or otherwise unpalatable language are part of the mix, although some of it appears in the context of literary works. It then follows that LLMs can innately produce — or be tricked into generating — dangerous and/or biased content, which often contains disagreeable words or hateful language, even from innocuous prompts. Further, it’s been found that they can learn and amplify language that is undesirable or even detrimental for many applications and downstream tasks — leading to the need for mitigation or correction strategies.

    There are many ways to achieve robust language generation that’s fair and value-aligned. Some methods retrain the LLM on a sanitized dataset, which is costly, takes time, and may alter the LLM’s performance; others employ external reward models during decoding, along with sampling or beam search, which take longer to run and require more memory. In the case of SASA, Ko, Daniel, and the IBM Research team developed a method that leverages the autoregressive nature of LLMs and, using a decoding-based strategy during the LLM’s inference, gradually steers the generation — one token at a time — away from unsavory or undesired outputs and toward better language.

    The research group achieved this by building a linear classifier that operates on the learned subspace from the LLM’s embedding. When LLMs are trained, words with similar meanings are placed close together in vector space and further away from dissimilar words; the researchers hypothesized that an LLM’s embedding would therefore also capture contextual information, which could be used for detoxification. The researchers used datasets that contained sets of a prompt (the first half of a sentence or thought), a response (the completion of that sentence), and a human-attributed annotation, such as toxic or nontoxic, or preferred or not preferred, with continuous labels from 0 to 1 denoting increasing toxicity. A Bayes-optimal classifier was then applied to learn and figuratively draw a line between the binary subspaces within the sentence embeddings, represented by positive values (the nontoxic space) and negative values (the toxic space).
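
    In practice, such a boundary can be approximated with any linear classifier fit on sentence embeddings of the annotated responses. The sketch below is a hedged stand-in for the paper’s Bayes-optimal classifier, using an off-the-shelf logistic regression and thresholding the continuous 0-1 toxicity labels into two classes; the variable names and the threshold are illustrative assumptions.

        # Sketch: learn a linear toxic/nontoxic boundary in embedding space.
        # embeddings: (N, d) array of sentence embeddings taken from the LLM.
        # toxicity:   (N,) array of human-attributed scores in [0, 1].
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def fit_toxicity_boundary(embeddings, toxicity, threshold=0.5):
            labels = (np.asarray(toxicity) < threshold).astype(int)  # 1 = nontoxic, 0 = toxic
            clf = LogisticRegression(max_iter=1000).fit(embeddings, labels)
            w, b = clf.coef_[0], clf.intercept_[0]  # positive side of w.x + b = nontoxic space
            return w, b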

    The SASA system then works by re-weighting the sampling probabilities of the newest potential tokens based on their values and the generated phrase’s distance to the classifier boundary, with the goal of remaining close to the original sampling distribution.

    To illustrate, if the model is generating potential token #12 in a sentence, the LLM will look over its full vocabulary for a reasonable word based on the 11 words that came before it, and, using top-k and top-p filtering, it will narrow the choices down to roughly 10 tokens to select from. SASA then evaluates each of those tokens in the partially completed sentence for its proximity to the classifier boundary (i.e., the value of tokens 1-11, plus each potential token 12). Tokens that produce sentences in the positive space are encouraged, while those in the negative space are penalized. Additionally, the further a candidate sits from the classifier boundary, the stronger the impact.
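
    A minimal sketch of that per-token step is below, assuming a Hugging Face-style causal language model and re-using the w, b boundary from the earlier sketch; the exponential re-weighting and the parameter beta are illustrative choices, not the published SASA formulation.

        import torch

        def reweighted_next_token(model, ids, w, b, top_k=10, beta=5.0):
            """Sample the next token after re-weighting top-k candidates by their
            distance to the toxic/nontoxic boundary (illustrative, not SASA's release)."""
            probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
            top_p, top_i = probs.topk(top_k)  # roughly 10 candidate tokens
            margins = []
            for tok in top_i:  # score tokens 1-11 plus each candidate token 12
                cand = torch.cat([ids, tok.view(1, 1)], dim=1)
                h = model(cand, output_hidden_states=True).hidden_states[-1][0, -1]
                margins.append(h @ torch.as_tensor(w, dtype=h.dtype) + float(b))
            margins = torch.stack(margins)
            reweighted = top_p * torch.exp(beta * margins)  # encourage positive, penalize negative
            reweighted = reweighted / reweighted.sum()      # stay close to the original distribution
            return top_i[torch.multinomial(reweighted, 1)]  # sampled next-token id

    In this sketch, a larger beta pushes harder toward the nontoxic side at the cost of drifting further from the model’s original distribution, mirroring the fluency trade-off discussed below.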

    “The goal is to change the autoregressive sampling process by re-weighting the probability of good tokens. If the next token is likely to be toxic given the context, then we are going to reduce the sampling probability for those prone to be toxic tokens,” says Ko. The researchers chose to do it this way “because the things we say, whether it’s benign or not, is subject to the context.”

    Tamping down toxicity for value matching

    The researchers evaluated their method against several baseline interventions with three LLMs of increasing size, all of them autoregressive transformer models: GPT2-Large, Llama2-7b, and Llama 3.1-8b-Instruct, with 762 million, 7 billion, and 8 billion parameters respectively. For each prompt, the LLM was tasked with completing the sentence/phrase 25 times, and PerspectiveAPI scored each completion from 0 to 1, with anything over 0.5 considered toxic. The team looked at two metrics: the average maximum toxicity score over the 25 generations for all the prompts, and the toxic rate, which was the probability of producing at least one toxic phrase over 25 generations. Reduced fluency (and therefore increased perplexity) was also analyzed. SASA was tested on completing prompts from the RealToxicityPrompts (RPT), BOLD, and AttaQ datasets, which contain naturally occurring English sentence prompts.
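
    Both toxicity metrics are simple to compute once each completion has a Perspective-style score in [0, 1]; the helper below is a hypothetical illustration of that bookkeeping, not the evaluation harness used in the study.

        import numpy as np

        def toxicity_metrics(scores, threshold=0.5):
            """scores: (n_prompts, 25) array of per-generation toxicity scores in [0, 1].

            Returns the average maximum toxicity over the 25 generations per prompt,
            and the toxic rate: the fraction of prompts with at least one generation
            scoring above the threshold.
            """
            per_prompt_max = np.asarray(scores).max(axis=1)
            return per_prompt_max.mean(), (per_prompt_max > threshold).mean()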

    The researchers ramped up the complexity of their trials for detoxification by SASA, beginning with nontoxic prompts from the RPT dataset and looking for harmful sentence completions. Then they escalated to more challenging prompts from RPT that were more likely to produce concerning results, and also applied SASA to the instruction-tuned model to assess whether their technique could further reduce unwanted outputs. They also used the BOLD and AttaQ benchmarks to examine the general applicability of SASA in detoxification. With the BOLD dataset, the researchers further looked for gender bias in language generations and tried to achieve a balanced toxic rate between the genders. Lastly, the team looked at runtime, memory usage, and how SASA could be combined with word filtering to achieve healthy and/or helpful language generation.

    “If we think about how human beings think and react in the world, we do see bad things, so it’s not about allowing the language model to see only the good things. It’s about understanding the full spectrum — both good and bad,” says Ko, “and choosing to uphold our values when we speak and act.”

    Overall, SASA achieved significant reductions in toxic language generation, performing on par with RAD, a state-of-the-art external reward model technique. However, it was universally observed that stronger detoxification came with a decrease in fluency. Before intervention, the LLMs produced more toxic responses for female-labeled prompts than for male-labeled ones; SASA, however, was able to significantly cut down harmful responses for both, making the rates more equal. Similarly, word filtering on top of SASA did markedly lower toxicity levels, but it also hindered the ability of the LLM to respond coherently.

    A great aspect of this work is that it’s a well-defined, constrained optimization problem, says Ko, meaning that the balance between natural-sounding, open language generation and the need to reduce unwanted language can be achieved and tuned.

    Further, Ko says, SASA could work well for multiple attributes in the future: “For human beings, we have multiple human values. We don’t want to say toxic things, but we also want to be truthful, helpful, and loyal … If you were to fine-tune a model for all of these values, it would require more computational resources and, of course, additional training.” Because SASA is so lightweight, it could easily be applied in these circumstances: “If you want to work with multiple values, it’s simply checking the generation’s position in multiple subspaces. It only adds marginal overhead in terms of the compute and parameters,” says Ko, leading to more positive, fair, and principle-aligned language.
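
    Under the same assumptions as the earlier sketches, checking multiple values would amount to scoring each candidate against several learned boundaries at once and combining the margins; the aggregation below (taking the minimum, so a token must sit on the acceptable side of every boundary) is a hypothetical illustration rather than the paper’s formulation.

        import torch

        def combined_margin(h, boundaries):
            """Aggregate a candidate's signed distances to several value boundaries.

            boundaries: list of (w, b) pairs, one per attribute (toxicity, truthfulness, ...).
            """
            margins = [h @ torch.as_tensor(w, dtype=h.dtype) + float(b) for w, b in boundaries]
            return torch.stack(margins).min()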

    This work was supported, in part, by the MIT-IBM Watson AI Lab and the National Science Foundation.
