    From Softmax to SSMax: Enhancing Attention and Key Information Retrieval in Transformers

    February 4, 2025

    Transformer-based language models process text by analyzing relationships between words rather than reading them in order, using attention mechanisms to focus on the most relevant tokens. Handling longer text, however, is challenging: the Softmax function that distributes attention weakens as the input size grows, a phenomenon known as attention fading. Attention weights spread thinly across many tokens, blurring the distinction between important and unimportant words and making it harder for the model to learn from long texts. Without a modification to the attention mechanism, the model fails to concentrate on essential information and performs poorly on large inputs.

    Current approaches to improving length generalization in Transformer-based models include positional encoding schemes, sparse attention, extended training on longer texts, and enhanced attention mechanisms. These methods scale poorly and demand substantial computational resources, making them inefficient for long inputs. The Softmax function used to distribute attention in Transformers degrades as the input size grows: with more tokens, Softmax produces flatter probability distributions, reducing the emphasis on key tokens. This attention fading severely limits the model’s ability to process long text.
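    To make the flattening concrete, the short NumPy sketch below (illustrative numbers only, not from the paper) computes the Softmax weight of a single “key” token whose logit exceeds all others by a fixed gap; as the context grows, that weight collapses toward zero even though the logit gap never changes.

```python
import numpy as np

def key_token_weight(n, logit_gap=2.0):
    """Softmax over n attention logits in which a single 'key' token
    exceeds all others by logit_gap; returns its attention weight."""
    logits = np.zeros(n)
    logits[0] = logit_gap                      # the important token
    weights = np.exp(logits - logits.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights[0]

for n in (16, 256, 4096, 65536):
    print(f"context size {n:>6}: key token weight = {key_token_weight(n):.5f}")
```

    Running this shows the key token’s attention weight dropping from roughly a third of the total at a context of 16 tokens to a vanishing fraction at tens of thousands of tokens.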

    To mitigate attention fading in Transformers, a researcher from The University of Tokyo proposed Scalable-Softmax (SSMax), a modification of the Softmax function that maintains attention on important tokens even as the input size increases. Unlike Softmax, which lets attention spread thinly as the input grows, SSMax adjusts a scaling factor according to the input size so that the highest-scoring value remains dominant, preventing loss of focus on key information in larger contexts. Concretely, the attention logits are multiplied by a factor that grows with the logarithm of the input length, so the model concentrates on the most relevant tokens when their scores stand out and still spreads attention when scores are similar. SSMax integrates easily into existing architectures with minimal changes, requiring only a simple multiplication in the attention computation, as sketched below.
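    Based on the description above, one reading of the change is that the attention logits are multiplied by s * log(n) before the usual Softmax, where n is the context length and s is a learnable scaling parameter. The PyTorch sketch below reflects that reading; the function name, shapes, and parameter handling are illustrative assumptions, not the paper’s reference implementation.

```python
import math
import torch
import torch.nn.functional as F

def ssmax(scores: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """Scalable-Softmax over the last dimension: logits are scaled by
    s * log(n) before the softmax, where n is the number of keys and
    s is a learnable scalar (assumed reading of the paper's description)."""
    n = scores.size(-1)
    return F.softmax(s * math.log(n) * scores, dim=-1)

# Drop-in use inside standard scaled dot-product attention (shapes illustrative):
batch, heads, q_len, kv_len, d = 2, 8, 128, 1024, 64
q = torch.randn(batch, heads, q_len, d)
k = torch.randn(batch, heads, kv_len, d)
v = torch.randn(batch, heads, kv_len, d)
s = torch.nn.Parameter(torch.ones(1))            # learnable scaling parameter

scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (batch, heads, q_len, kv_len)
attn = ssmax(scores, s)                          # replaces F.softmax(scores, dim=-1)
out = attn @ v                                   # (batch, heads, q_len, d)
```

    Under this scaling, the key token’s weight from the earlier sketch no longer collapses as the context grows, because the log(n) factor re-sharpens the distribution at larger input sizes.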

    To evaluate the impact of replacing Softmax with SSMax in the attention layers, the researcher ran experiments on training efficiency, long-context generalization, key information retrieval, and attention allocation. Six configurations were tested: standard Softmax, SSMax with and without the scaling parameter, SSMax with a bias parameter, and two models in which Softmax was replaced with SSMax after or during pretraining. SSMax consistently improved training efficiency and long-context generalization, reducing test loss across extended sequence lengths. The Needle-In-A-Haystack test showed that SSMax significantly enhanced key information retrieval in long contexts, while removing the scaling parameter or adding a bias degraded performance. Models in which Softmax was replaced with SSMax post-training or late in pretraining showed partial improvements but failed to match fully trained SSMax models.

    In summary, the proposed method improves Transformer attention by counteracting attention fading and strengthening length generalization, making models more effective on long-context tasks. Its adaptability benefits both newly trained and existing models, positioning it as a strong alternative to Softmax. Future work can optimize SSMax for efficiency and integrate it into emerging Transformer architectures to enhance long-context understanding in real-world applications.

    Check out the Paper. All credit for this research goes to the researchers of this project.
