
    Rethinking Toxic Data in LLM Pretraining: A Co-Design Approach for Improved Steerability and Detoxification

    May 14, 2025

    In the pretraining of LLMs, the quality of training data is crucial in determining model performance. A common strategy involves filtering out toxic content from the training corpus to minimize harmful outputs. While this approach aligns with the principle that neural networks reflect their training data, it introduces a tradeoff. Removing toxic content can reduce the diversity and richness of the data, potentially weakening the model’s ability to understand or identify toxicity and degrading performance in downstream tasks like question answering. This creates a dilemma: retaining too much toxic data increases harmful outputs, while excessive filtering restricts the model’s overall capabilities. However, with the growing emphasis on post-training interventions, fewer models are deployed directly after pretraining, suggesting that the balance between data quality and quantity may be managed more effectively in later stages.

    Approaches to detoxifying LLMs typically fall into two categories: finetuning-based and decoding-based. Finetuning methods, such as reinforcement learning from human feedback (RLHF) and Direct Preference Optimization (DPO), align model behavior with human values or curated datasets. While effective, they often compromise the model’s original abilities and can be bypassed or undone through further training. Controlled generation techniques, on the other hand, adjust outputs during inference, using methods like vocabulary shifting, self-debiasing, or external expert models. These strategies can reduce toxicity but often incur high computational costs and impair language fluency. A newer line of work explores modifying internal representations, on the assumption that linear structures in hidden states can be manipulated to produce specific behavioral outcomes.
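    To make the decoding-based family concrete, here is a minimal sketch of a logit-adjustment step in the spirit of external-expert methods: the base model’s next-token logits are nudged away from the preferences of a toxicity-prone “anti-expert” model. The tensors, the helper function, and the alpha weight are illustrative assumptions rather than the exact formulation of any particular method.

    import torch

    def detoxified_logits(base_logits: torch.Tensor,
                          anti_expert_logits: torch.Tensor,
                          alpha: float = 0.5) -> torch.Tensor:
        # Push the base model's next-token distribution away from tokens
        # that the anti-expert (a toxicity-prone model) would favor.
        return base_logits - alpha * anti_expert_logits

    # Toy example over a 10-token vocabulary with random stand-in logits.
    vocab_size = 10
    base = torch.randn(vocab_size)
    anti = torch.randn(vocab_size)
    adjusted = detoxified_logits(base, anti, alpha=0.5)
    next_token = torch.argmax(torch.softmax(adjusted, dim=-1))
    print("chosen token id:", int(next_token))

    In practice the adjusted logits would be recomputed at every decoding step, which is one reason such methods add inference cost relative to running the base model alone.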

    Researchers from Harvard University re-evaluate data quality in LLM training by exploring a co-design approach that integrates pre- and post-training. They find that pretraining on toxic data, while increasing base-model toxicity, enhances the model’s internal representation of toxicity, making it easier to suppress during post-training. Using OLMo-1B models trained on varied mixes of clean and toxic data, they show that toxicity becomes more linearly separable and easier to control. Experiments with prompting and inference-time intervention reveal improved detoxification without compromising general performance, suggesting that incorporating toxic data can lead to more controllable and robust language models.

    To study the effects of toxic data on LLM pretraining, the researchers trained a series of OLMo-1B models with increasing proportions of toxic content (from 0% to 25%) while keeping the amount of clean data constant. They found that moderate inclusion of toxic data improves general language capability (measured by MMLU) and toxicity detection (via ToxiGen). Probing experiments revealed that models trained with toxic data formed stronger, more separable internal representations of toxicity. Statistical analysis and token-level visualization further confirmed that such models identify toxic content more accurately, supporting the idea that exposure to toxic examples enhances concept learning without significantly harming general performance.
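    As a rough illustration of the probing idea, the sketch below fits a linear classifier to hidden-state vectors labeled toxic versus non-toxic and reports how linearly separable the concept is. The features here are synthetic stand-ins generated around a single planted “toxicity” direction; an actual probe would use activations extracted from the pretrained models.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    hidden_dim, n_samples = 256, 2000

    # Simulate hidden states in which one direction carries the toxicity signal.
    toxicity_direction = rng.normal(size=hidden_dim)
    labels = rng.integers(0, 2, size=n_samples)
    states = rng.normal(size=(n_samples, hidden_dim)) + np.outer(labels, toxicity_direction)

    # Fit a linear probe and measure how well it separates the two classes.
    X_train, X_test, y_train, y_test = train_test_split(states, labels, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("probe accuracy:", probe.score(X_test, y_test))

    Higher probe accuracy on held-out activations is the kind of evidence used to argue that toxicity becomes a more linearly separable concept as more toxic data is included in pretraining.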

    The study then asks whether exposure to toxic data during pretraining can improve a model’s ability to be detoxified through post-training methods. Using Inference-Time Intervention (ITI), prompting, supervised finetuning (SFT), and DPO, the researchers find that models trained with up to 10% toxic data (e.g., from 4chan) show improved alignability. These models respond better to detoxification techniques, achieving lower toxicity with minimal performance loss. Additionally, when tested against adversarial red-teaming attacks, models pretrained with toxic data and steered using ITI showed greater robustness, indicating that such exposure may strengthen the model’s internal representation of harmful content.
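    The toy sketch below shows the general mechanism behind this kind of inference-time steering: a fixed “toxicity” direction is subtracted from a layer’s output through a PyTorch forward hook. ITI itself intervenes on selected attention heads using probe-derived directions and tuned strengths, so the module, direction, and strength here are illustrative placeholders only.

    import torch
    import torch.nn as nn

    hidden_dim = 64
    layer = nn.Linear(hidden_dim, hidden_dim)          # stand-in for a transformer sublayer
    toxicity_dir = torch.randn(hidden_dim)
    toxicity_dir = toxicity_dir / toxicity_dir.norm()  # unit-norm steering direction (placeholder)
    strength = 5.0                                     # illustrative intervention strength

    def steer(module, inputs, output):
        # Shift the activation away from the toxicity direction;
        # returning a value from a forward hook replaces the layer's output.
        return output - strength * toxicity_dir

    handle = layer.register_forward_hook(steer)
    x = torch.randn(1, hidden_dim)
    steered = layer(x)
    handle.remove()
    print("projection onto toxicity direction:", float(steered @ toxicity_dir))

    The appeal of this style of intervention is that it leaves the model weights untouched and can be applied or removed at inference time, which is why a cleaner internal representation of toxicity makes it more effective.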

    In conclusion, the study revisits the assumption that excluding toxic data during pretraining improves language model quality. Through theoretical and empirical analyses using OLMo-1B models, the authors show that increasing toxic data in pretraining leads to more disentangled representations of toxicity, making it easier to control during post-training. While base models trained on toxic data initially generate more harmful content, detoxification techniques like ITI are more effective on them. Results on benchmark datasets show a better balance between reducing toxicity and retaining general capabilities. The work suggests that some “bad” data can enhance model steerability and alignment.


    Check out the Paper. All credit for this research goes to the researchers of this project.
