
    Unveiling Attention Sinks: The Functional Role of First-Token Focus in Stabilizing Large Language Models

    April 9, 2025

    LLMs often exhibit a peculiar behavior in which the first token in a sequence draws unusually high attention, a phenomenon known as an "attention sink." Although this token often seems unimportant, it frequently dominates attention across many heads in Transformer models. While prior research has explored when and how attention sinks occur, the reasons behind their emergence and their functional role remain unclear. These attention patterns bear on practical challenges and optimizations in LLMs, such as quantization, key-value caching, streaming attention, and even security vulnerabilities, underscoring their significance and the need for deeper understanding.

    Researchers from the University of Oxford, NUS, and Google DeepMind investigated why attention sinks, where models focus heavily on the first token, emerge in LLMs. Contrary to past efforts to reduce them, they argue that these sinks serve a functional role: they prevent over-mixing of token representations, which can lead to collapse or instability in deep Transformers. The ⟨bos⟩ token often attracts the majority of attention, limiting the spread of perturbations and stabilizing the model. Experiments on models such as Gemma 7B and Llama 3.1 405B confirm that attention sinks become more prominent in deeper models and over longer contexts, supporting this theory.

    The study explores how decoder-only Transformers, the architecture behind most modern language models, use attention mechanisms to process sequences token by token. In such models, each token can only attend to past tokens due to causal masking. A recurring phenomenon in these models is the emergence of “attention sinks”—tokens like the beginning-of-sequence (⟨bos⟩) that disproportionately attract attention across multiple heads and layers. While these sinks were previously seen as artifacts of large key and query activations, this work argues that they are vital in maintaining stable representations, especially in long sequences. By concentrating attention, sinks prevent excessive mixing of information across layers, helping to preserve the uniqueness of token representations.
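    To make the mechanism concrete, here is a minimal NumPy sketch (not from the paper) of causally masked softmax attention, the pattern described above. In this setting, an attention sink would show up as the first column of the weight matrix carrying most of each row's mass; here the sketch only demonstrates the causal structure itself.

```python
import numpy as np

def causal_attention_weights(q, k):
    """Compute causally masked softmax attention weights.

    q, k: arrays of shape (seq_len, d). Query position i may only
    attend to key positions j <= i (causal masking), so each row of
    the result is a distribution over past tokens.
    """
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                       # (seq_len, seq_len)
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[mask] = -np.inf                              # block future positions
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    w = np.exp(scores)                                  # exp(-inf) == 0 exactly
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 8))
k = rng.normal(size=(6, 8))
w = causal_attention_weights(q, k)
# Each row sums to 1; the strictly upper triangle is exactly zero.
```

In a trained model exhibiting a sink, `w[:, 0]` would be large across most rows; the paper's claim is that this concentration is what keeps the remaining entries, and hence the mixing between non-initial tokens, small.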

    The study connects attention sinks to problems like rank collapse and over-squashing, which degrade model performance by compressing diverse inputs into indistinct representations. It uses mathematical tools like Jacobian norms to show how attention sinks reduce sensitivity to perturbations, effectively acting as stabilizers that prevent representational collapse. Experiments on models like Gemma 7B confirm that removing attention sinks increases information diffusion, while their presence maintains sharper, more localized attention patterns. Thus, attention sinks are not just a side effect but a structural feature that supports the Transformer’s ability to handle deep and long-range dependencies.
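    The stabilizing effect is easy to see in a toy calculation. The sketch below (an illustration under simplified assumptions, not the paper's Jacobian analysis) compares two hand-built causal attention patterns: a uniform one, and one where 90% of each row's mass sits on the first token. Because the output at position i is o_i = Σ_j w_ij v_j, perturbing one value vector changes each output in proportion to the corresponding attention weight, so the sink pattern damps how far the perturbation spreads.

```python
import numpy as np

def output_perturbation(weights, v, j, delta):
    """Per-position norm of the change in attention outputs when
    value vector j is perturbed by `delta` (outputs o_i = sum_j w_ij v_j)."""
    v2 = v.copy()
    v2[j] += delta
    return np.linalg.norm(weights @ v2 - weights @ v, axis=-1)

n, d = 8, 4
rng = np.random.default_rng(1)
v = rng.normal(size=(n, d))
delta = np.ones(d)

# Uniform causal attention: each token mixes all past tokens equally.
uniform = np.tril(np.ones((n, n))) / np.arange(1, n + 1)[:, None]

# Sink-like attention: 90% of each row's mass is placed on token 0.
sink = 0.1 * uniform
sink[:, 0] += 0.9

effect_uniform = output_perturbation(uniform, v, 3, delta)
effect_sink = output_perturbation(sink, v, 3, delta)
# The sink pattern shrinks the perturbation's effect on later tokens.
```

Here the perturbation at position 3 reaches later positions attenuated by a factor of ten under the sink pattern, a crude analogue of the reduced Jacobian sensitivity the study measures.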

    The study investigates whether the beginning-of-sequence (⟨bos⟩) token holds any special role in forming attention sinks in language models. Through a series of experiments using different data packing and masking strategies, the researchers find that attention sinks consistently form at the first token of the input, whether or not it is explicitly marked as ⟨bos⟩. However, when ⟨bos⟩ is fixed at the start of every sequence during pretraining, the model learns to rely on it more heavily to stabilize attention and prevent over-mixing of token representations. Removing ⟨bos⟩ during inference in such models leads to a collapse in sink formation and a significant drop in performance. This highlights that although the first token always plays a role in anchoring attention, the training setup—especially the consistent presence of ⟨bos⟩—greatly strengthens this effect.
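    The consequence of removing the sink token can also be caricatured numerically. In this hypothetical sketch (synthetic matrices, not the paper's experiment), dropping the first token from a sink-dominated attention pattern and renormalizing each row forces the mass to spread over the remaining tokens, which shows up as higher row entropy, i.e. more mixing:

```python
import numpy as np

def row_entropy(w):
    """Shannon entropy of each attention row (higher = more mixing)."""
    p = np.where(w > 0, w, 1.0)          # avoid log(0); 0 * log(0) := 0
    return -(w * np.log(p)).sum(axis=-1)

n = 8
# Sink pattern: most mass on token 0, remainder spread causally.
base = np.tril(np.ones((n, n))) / np.arange(1, n + 1)[:, None]
sink = 0.1 * base
sink[:, 0] += 0.9

# Drop the first (sink) token and renormalize each row, loosely
# mimicking removal of a fixed <bos> token at inference time.
no_sink = sink[1:, 1:]
no_sink = no_sink / no_sink.sum(axis=-1, keepdims=True)

# Attention spreads out once the sink is gone: mean row entropy rises.
```

The real failure mode is more severe, since a model pretrained with a fixed ⟨bos⟩ has learned representations that depend on the sink being there, but the direction of the effect (more diffusion once the sink is removed) matches the study's finding.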

    In conclusion, the study argues that attention sinks are a structural solution to challenges like over-squashing and excessive mixing in deep Transformers. Directing attention toward the initial token—typically ⟨bos⟩—helps the model reduce its sensitivity to input noise and retain distinct token representations over long contexts. The findings also show that context length, model depth, and training configurations significantly affect how and where sinks form. By offering theoretical insights and empirical validation, the work presents attention sinks not as quirks but as components contributing to large language models’ stability and efficiency.


    Check out the Paper. All credit for this research goes to the researchers of this project.


    The post Unveiling Attention Sinks: The Functional Role of First-Token Focus in Stabilizing Large Language Models appeared first on MarkTechPost.

