
    This AI Paper from Cohere Enhances Language Model Stability with Automated Detection of Under-trained Tokens in LLMs

    May 14, 2024

    Tokenization is essential in computational linguistics, particularly in the training and operation of large language models (LLMs). It splits text into smaller units, or tokens, which the model consumes during both training and inference. While effective tokenization can significantly improve a model’s performance, problems arise when tokens in the model’s vocabulary are underrepresented or absent in the training data, producing what researchers call ‘glitch tokens.’ When these tokens appear in new input, they can destabilize the model and produce unpredictable outputs.
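
To make the idea of splitting text into tokens concrete, here is a minimal sketch of greedy longest-match tokenization over a fixed vocabulary. The vocabulary `TOY_VOCAB`, the `tokenize` function, and the `<unk>` fallback are illustrative assumptions; production tokenizers (for example, byte-pair encoding) are considerably more involved.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary (toy sketch)."""
    tokens = []
    i = 0
    max_len = max(len(t) for t in vocab)
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary entry matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # No vocabulary entry matches here: emit a placeholder and skip one character.
            tokens.append("<unk>")
            i += 1
    return tokens


# Hypothetical vocabulary, chosen only for this illustration.
TOY_VOCAB = {"token", "ization", " ", "is", "key", "t", "o", "k", "e", "n", "i", "s", "y"}

print(tokenize("tokenization is key", TOY_VOCAB))
# ['token', 'ization', ' ', 'is', ' ', 'key']
```

A vocabulary entry such as "ization" only receives a useful embedding if it actually occurs in the model's training data; an entry that never occurs is the kind of token the article calls a glitch token.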

    A prevalent issue in LLMs is the misalignment between tokenizer training and model training. Tokenizers are often trained separately, on datasets that can differ significantly from the data used to train the model. This mismatch can leave some tokens in the vocabulary under-trained. The notorious “_SolidGoldMagikarp” glitch token is a well-known example: it can induce unwanted model behaviors, such as hallucinations or nonsensical outputs.

    Conventional methods for identifying under-trained tokens typically involve manual checks of the tokenizer’s behavior, examining how tokens are encoded and decoded, or analyzing their frequency in the training data. However, these methods are not scalable for the increasingly large and complex LLMs being developed today.
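
The frequency-based check described above can be sketched as follows, assuming the training corpus is already available as lists of tokens. The function name `find_rare_tokens`, the `min_count` parameter, and the toy data are hypothetical; the sketch also hints at why the approach scales poorly, since it requires a full pass over the training data.

```python
from collections import Counter


def find_rare_tokens(vocab, tokenized_corpus, min_count=1):
    """Return vocabulary tokens seen fewer than min_count times in the corpus."""
    # Count every token occurrence across all documents.
    counts = Counter(tok for doc in tokenized_corpus for tok in doc)
    # Counter returns 0 for tokens that never occur, so absent tokens are flagged too.
    return sorted(tok for tok in vocab if counts[tok] < min_count)


# Toy data, made up for this illustration.
vocab = ["the", "cat", "sat", "mat", "SolidGold"]
corpus = [["the", "cat", "sat"], ["the", "mat"]]

print(find_rare_tokens(vocab, corpus))
# ['SolidGold']
```

Raising `min_count` turns the same check into a search for rarely seen tokens rather than only absent ones.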

    Researchers from Cohere introduce an approach that uses the model’s embedding weights to automate and scale the detection of under-trained tokens. The method analyzes the model’s embedding matrix for anomalies that indicate insufficient training, identifying tokens whose embedding weights deviate significantly from those of well-represented tokens. It pinpoints glitch tokens systematically by calculating the variance and distribution of embedding weights and comparing them against a normative model of adequately trained tokens.
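
One way to picture weight-based detection is an outlier test on the embedding matrix. The sketch below is not the paper's exact indicator: it assumes, for illustration only, that the L2 norm of each token's embedding row is the signal, and it flags rows whose norm is far from the median using the median absolute deviation (MAD). The function name, the threshold of 3.5, and the toy matrix are all assumptions.

```python
import math
import statistics


def flag_outlier_embeddings(embeddings, threshold=3.5):
    """Flag token ids whose embedding norm is a robust outlier.

    embeddings: list of equal-length float lists, one row per token id.
    The norm-based score is an assumed stand-in for the paper's indicators.
    """
    norms = [math.sqrt(sum(x * x for x in row)) for row in embeddings]
    median = statistics.median(norms)
    mad = statistics.median(abs(n - median) for n in norms)
    if mad == 0:
        # At least half the norms are identical, so the score is undefined;
        # this sketch flags nothing in that case.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [
        i for i, n in enumerate(norms)
        if 0.6745 * abs(n - median) / mad > threshold
    ]


# Toy embedding matrix: token 5 has a near-zero embedding, as if barely updated.
toy_embeddings = [
    [1.0, 0.0],
    [0.0, 1.1],
    [0.9, 0.0],
    [0.0, 1.2],
    [0.8, 0.0],
    [0.01, 0.0],
]

print(flag_outlier_embeddings(toy_embeddings))
# [5]
```

Median and MAD are used here instead of mean and standard deviation so that the outliers being searched for do not distort the baseline they are compared against.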

    The study demonstrated the effectiveness of this new method by applying it to several well-known models, including variations of Google’s BERT and OpenAI’s GPT series. The analysis identified a substantial percentage of the tokenizer’s vocabulary, up to 10% in some cases, as under-trained. These tokens were often specialized or infrequently used words, which exhibited the most significant discrepancies in embedding weight patterns.

    This research has significant implications for the development and maintenance of LLMs. By employing automated techniques to detect and rectify under-trained tokens, developers can enhance the accuracy and robustness of language models. This advancement is crucial as LLMs are increasingly used in various applications, from automated writing aids to sophisticated conversational agents.

    In conclusion, this research highlights a critical vulnerability in LLM training and presents a scalable way to mitigate it. Automated detection of under-trained tokens supports more robust training processes and helps ensure that every token in a model’s vocabulary is adequately prepared for real-world use, paving the way for more reliable and effective natural language processing tools.

    All credit for this research goes to the researchers of this project.

    The post This AI Paper from Cohere Enhances Language Model Stability with Automated Detection of Under-trained Tokens in LLMs appeared first on MarkTechPost.
