
Meta AI Proposes Multi-Token Attention (MTA): A New Attention Method Which Allows LLMs to Condition Their Attention Weights on Multiple Query and Key Vectors

    April 2, 2025

Large Language Models (LLMs) benefit significantly from attention mechanisms, which enable effective retrieval of contextual information. Traditional attention methods, however, rely on single-token attention, where each attention weight is computed from a single pair of query and key vectors. This design inherently constrains the model’s ability to discern contexts that require integrating signals from multiple tokens, limiting its effectiveness on complex linguistic dependencies. For example, identifying sentences that contain both “Alice” and “rabbit” is challenging, because conventional attention mechanisms struggle to combine multiple separate attention signals efficiently without substantially increasing model complexity.
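
For concreteness, below is a minimal PyTorch sketch of the standard single-token scaled dot-product attention that MTA generalizes. The function name and shapes are illustrative; the point is that each logit depends on exactly one query-key pair.

```python
import torch
import torch.nn.functional as F

def single_token_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model)
    d = q.size(-1)
    # Each logit[i, j] is a function of exactly one (q_i, k_j) pair, so no
    # single weight can encode a condition such as "attend here only when
    # both 'Alice' and 'rabbit' appear nearby".
    logits = q @ k.transpose(-2, -1) / d ** 0.5
    weights = F.softmax(logits, dim=-1)
    return weights @ v
```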

Meta AI addresses this limitation by introducing Multi-Token Attention (MTA), an attention mechanism that conditions attention weights simultaneously on multiple query and key vectors. MTA integrates convolution operations over queries, keys, and attention heads, enhancing the precision and efficiency of contextual information retrieval. Specifically, the MTA framework consists of two convolutional components: key-query convolution, which aggregates multiple token signals within individual attention heads, and head mixing convolution, which shares information among different attention heads. Additionally, the implementation employs group normalization with depth-dependent scaling to stabilize gradient flow and improve training.
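
A hedged PyTorch sketch of these two convolutional components may make the structure concrete. The kernel sizes and head-group size below are illustrative assumptions rather than the paper’s exact configuration, and both the causal masking a decoder would need and the group normalization step are omitted for brevity.

```python
import torch.nn as nn

class MTASketch(nn.Module):
    def __init__(self, n_heads=8, head_group=2, q_kernel=5, k_kernel=11):
        super().__init__()
        # Key-query convolution: a 2D convolution over the (query, key)
        # plane of each head's attention logits; groups=n_heads keeps the
        # heads separate while letting neighboring token pairs interact.
        self.kq_conv = nn.Conv2d(n_heads, n_heads,
                                 kernel_size=(q_kernel, k_kernel),
                                 padding="same", groups=n_heads)
        # Head-mixing convolution: a 1x1 convolution across the head
        # dimension that shares attention within small groups of heads.
        self.head_mix = nn.Conv2d(n_heads, n_heads, kernel_size=1,
                                  groups=n_heads // head_group)
```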

At a technical level, MTA modifies conventional attention by applying a two-dimensional convolution to the attention logits prior to softmax normalization. This convolution allows neighboring queries and keys to influence one another’s attention scores, enabling the mechanism to identify contextual relationships that span multiple tokens more precisely. Consequently, the model aggregates local token interactions efficiently, without substantially increasing the number of parameters or the dimensionality of the attention vectors. Moreover, the head convolution promotes effective knowledge transfer among attention heads, selectively amplifying relevant context signals while attenuating less pertinent ones. Collectively, these enhancements yield a more robust attention mechanism capable of capturing complex multi-token interactions.
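
The following usage sketch, reusing the MTASketch module above with assumed toy shapes, shows where each operation sits in the pipeline: the key-query convolution is applied to the logits before softmax, and head mixing afterward (the exact placement of head mixing here is an assumption for illustration).

```python
import torch
import torch.nn.functional as F

B, H, L, D = 2, 8, 128, 64                    # assumed toy dimensions
q, k, v = (torch.randn(B, H, L, D) for _ in range(3))
mta = MTASketch(n_heads=H)                    # from the sketch above

logits = q @ k.transpose(-2, -1) / D ** 0.5   # (B, H, L, L) attention logits
logits = mta.kq_conv(logits)                  # neighboring q-k pairs interact
weights = F.softmax(logits, dim=-1)
weights = mta.head_mix(weights)               # heads exchange context signals
out = weights @ v                             # (B, H, L, D) attended values
```

Because the convolution only mixes nearby logits within a fixed kernel, the added parameter count is small and independent of sequence length, which is consistent with the efficiency claim above.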

    Empirical evaluations validate the efficacy of MTA across several benchmarks. In a structured motivating task explicitly designed to illustrate the shortcomings of single-token attention mechanisms, MTA demonstrated near-perfect performance, achieving an error rate of only 0.1%, in contrast to standard Transformer models that exhibited error rates above 50%. Further large-scale experiments involving an 880M-parameter model trained on 105 billion tokens showed MTA consistently outperforming baseline architectures. MTA achieved superior validation perplexity scores across datasets such as arXiv, GitHub, and Wikipedia. Specifically, in tasks requiring extended context comprehension, such as Needle-in-the-Haystack and BabiLong benchmarks, MTA significantly exceeded the performance of standard Transformer models. In the Needle-in-the-Haystack task with 4K token contexts containing multiple needles, MTA attained accuracies ranging from 67% to 97.6%, surpassing standard models by substantial margins.

    In summary, Multi-Token Attention (MTA) presents a refined advancement in attention mechanisms by addressing fundamental limitations of traditional single-token attention. Leveraging convolutional operations to concurrently integrate multiple query-key interactions, MTA enhances the ability of language models to handle intricate contextual dependencies. These methodological improvements facilitate more precise and efficient performance, particularly in scenarios involving complex token interactions and long-range contextual understanding. Through targeted modifications to standard attention mechanisms, MTA contributes meaningfully to the evolution of more sophisticated, accurate, and computationally efficient language models.


Check out the Paper. All credit for this research goes to the researchers of this project.


Source: MarkTechPost