    This AI Paper from Cohere Enhances Language Model Stability with Automated Detection of Under-trained Tokens in LLMs

    May 14, 2024

    Tokenization is essential in computational linguistics, particularly in the training and functionality of large language models (LLMs). This process involves dissecting text into manageable pieces or tokens, which is foundational for model training and operations. While effective tokenization can significantly enhance a model’s performance, issues arise when tokens within the model’s vocabulary are underrepresented or absent in the training datasets, leading to what researchers term ‘glitch tokens.’ When encountered in new input data, these tokens can destabilize a model and produce unpredictable outputs.
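As a concrete illustration of splitting text into tokens (a toy sketch, not the algorithm any particular model uses — the vocabulary here is invented), a greedy longest-match tokenizer against a fixed subword vocabulary looks like this:

```python
# Toy subword tokenizer: greedy longest-match against a fixed, illustrative vocabulary.
VOCAB = {"token", "tok", "iz", "ation", "a", "t", "i", "o", "n", "z"}

def tokenize(text, vocab=VOCAB):
    """Split text into the longest matching vocabulary entries, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, shrinking until a match is found.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary match: fall back to the single character.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("tokenization"))  # → ['token', 'iz', 'ation']
```

Real tokenizers (BPE, WordPiece, unigram) are learned from data rather than hand-written, but the output shape — a sequence of vocabulary pieces — is the same.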

A prevalent issue in LLMs is the misalignment between tokenizer training and model training. Tokenizers are often trained separately, on datasets that can differ significantly from the data used to train the model itself. This mismatch can leave parts of the vocabulary under-trained, turning those entries into glitch tokens. The infamous “_SolidGoldMagikarp” token is a well-known example that can induce unwanted model behaviors, such as hallucinations or nonsensical outputs.

    Conventional methods for identifying under-trained tokens typically involve manual checks of the tokenizer’s behavior, examining how tokens are encoded and decoded, or analyzing their frequency in the training data. However, these methods are not scalable for the increasingly large and complex LLMs being developed today.
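One such manual check is an encode/decode round trip: a token is suspect if encoding the string it decodes to never yields that token id back. A minimal sketch, with a toy vocabulary and a lowercasing normalization step that are both invented for illustration:

```python
# Toy round-trip check. A real check would use an actual tokenizer library;
# here id 3 is unreachable because encoding lowercases its input first.
ID_TO_TOKEN = {0: "he", 1: "llo", 2: "hello", 3: "Hello"}

def encode(text):
    """Greedy longest-match encoding over the toy vocabulary, after lowercasing."""
    text = text.lower()  # normalization that makes some vocabulary entries unreachable
    by_token = {tok: i for i, tok in ID_TO_TOKEN.items()}
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in by_token:
                ids.append(by_token[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"untokenizable character: {text[i]!r}")
    return ids

def suspect_tokens():
    """Flag token ids whose own string does not round-trip to that single id."""
    return [tid for tid, tok in ID_TO_TOKEN.items() if encode(tok) != [tid]]

print(suspect_tokens())  # → [3]
```

Running this per-token over a vocabulary of 100k+ entries, and then inspecting each flagged token by hand, is exactly the kind of labor the automated approach below is meant to replace.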

Researchers from Cohere introduce a novel approach that uses the model’s own embedding weights to automate and scale the detection of under-trained tokens. By analyzing the embedding matrix, the method identifies tokens whose embedding weights deviate significantly from those of well-represented tokens. Calculating the variance and distribution of embedding weights, and comparing them against a normative profile of adequately trained tokens, provides a systematic way to pinpoint glitch tokens.
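The paper’s exact scoring is not reproduced here, but the core idea — flag tokens whose embedding rows look as if they were never updated during training — can be sketched on synthetic data. The matrix, the planted outlier indices, and the robust z-score threshold below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 64

# Synthetic embedding matrix: most rows "trained" (moved well away from init),
# a handful left near a tiny initialization, mimicking under-trained tokens.
emb = rng.normal(0.0, 1.0, size=(vocab_size, dim))
undertrained = [7, 42, 512]
emb[undertrained] = rng.normal(0.0, 0.01, size=(len(undertrained), dim))

# Flag tokens whose embedding norm is an extreme low outlier, using a robust
# z-score based on the median absolute deviation (MAD) of all norms.
norms = np.linalg.norm(emb, axis=1)
median = np.median(norms)
mad = np.median(np.abs(norms - median))
z = (norms - median) / (1.4826 * mad)  # ~standard-normal scale for Gaussian data
flagged = np.where(z < -6.0)[0]
print(sorted(flagged.tolist()))  # → [7, 42, 512]
```

Because the statistic is computed from the embedding matrix alone, the same scan runs in seconds on any model checkpoint, which is what makes the approach scale where manual inspection does not.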

    The study demonstrated the effectiveness of this new method by applying it to several well-known models, including variations of Google’s BERT and OpenAI’s GPT series. The analysis identified a substantial percentage of the tokenizer’s vocabulary, up to 10% in some cases, as under-trained. These tokens were often specialized or infrequently used words, which exhibited the most significant discrepancies in embedding weight patterns.

    This research has significant implications for the development and maintenance of LLMs. By employing automated techniques to detect and rectify under-trained tokens, developers can enhance the accuracy and robustness of language models. This advancement is crucial as LLMs are increasingly used in various applications, from automated writing aids to sophisticated conversational agents.

In conclusion, this research highlights a critical vulnerability in LLM training and presents a scalable solution to mitigate it. Automated detection of under-trained tokens enables more robust training processes, helping ensure that every token in a model’s vocabulary is adequately prepared for real-world input, and provides a more reliable foundation for natural language processing tools.

Check out the Paper. All credit for this research goes to the researchers of this project.

    The post This AI Paper from Cohere Enhances Language Model Stability with Automated Detection of Under-trained Tokens in LLMs appeared first on MarkTechPost.
