
    FlashAttention-3 Released: Achieves Unprecedented Speed and Precision with Advanced Hardware Utilization and Low-Precision Computing

    July 12, 2024

    FlashAttention-3, the latest release in the FlashAttention series, is designed to address the inherent bottlenecks of the attention layer in Transformer architectures. These bottlenecks limit the performance of large language models (LLMs) and of applications that require long-context processing.
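    To see why attention is a bottleneck, consider a minimal NumPy sketch of standard attention (illustrative only, not the FlashAttention kernel): it materializes the full N × N score matrix, so memory traffic grows quadratically with sequence length.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference attention: materializes the full N x N score matrix,
    so memory traffic grows quadratically with sequence length N."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (N, N) score matrix
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
N, d = 128, 64
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = naive_attention(Q, K, V)
print(out.shape)  # (128, 64)
```

    At a 128K-token context, that score matrix alone would hold 128K × 128K entries per head; avoiding this memory traffic is precisely what the FlashAttention series targets.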

    Its predecessors, FlashAttention and FlashAttention-2, changed how attention mechanisms operate on GPUs by minimizing memory reads and writes. The technique has been widely adopted across libraries to accelerate Transformer training and inference, contributing significantly to the dramatic increase in LLM context length in recent years: from 2-4K tokens in models like GPT-3 to 128K tokens in GPT-4, and even up to 1 million tokens in models such as Llama 3.

    Despite these advancements, FlashAttention-2 could only achieve 35% utilization of the theoretical maximum FLOPs on the H100 GPU, highlighting a gap between potential and actual performance. FlashAttention-3 seeks to bridge this gap by leveraging new hardware capabilities in modern GPUs. Specifically, it introduces three main techniques to enhance attention speed on Hopper GPUs: exploiting the asynchrony of Tensor Cores and TMA to overlap computation and data movement, interleaving block-wise matrix multiplication and softmax operations, and utilizing incoherent processing to leverage hardware support for FP8 low-precision computations.


    One of the standout features of FlashAttention-3 is its ability to exploit the asynchrony of the Tensor Cores and TMA, overlapping computation with data movement through warp specialization and interleaved operations. With warp specialization, separate producer and consumer warps manage TMA and WGMMA operations. FlashAttention-3 also overlaps GEMM (general matrix multiply) and softmax both across and within warpgroups: its pingpong scheduling technique ensures that while one warpgroup performs GEMM operations, another handles softmax calculations, keeping GPU resources fully utilized.
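    The interleaving of block-wise matrix multiplication and softmax rests on the online-softmax rescaling trick, sketched below in simplified, single-threaded NumPy. The real kernel runs this on shared-memory tiles with overlapping warpgroups, but the accumulator logic is the same idea:

```python
import numpy as np

def blocked_attention(Q, K, V, block=32):
    """Block-wise attention with an online softmax: K and V are consumed
    in tiles, and a running row max (m) and denominator (l) rescale the
    partial output, so the full N x N score matrix never materializes."""
    N, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(N, -np.inf)          # running max per query row
    l = np.zeros(N)                  # running softmax denominator
    for j in range(0, K.shape[0], block):
        s = Q @ K[j:j + block].T / np.sqrt(d)      # partial scores (N, tile)
        m_new = np.maximum(m, s.max(axis=1))
        p = np.exp(s - m_new[:, None])
        scale = np.exp(m - m_new)                  # rescale old accumulators
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ V[j:j + block]
        m = m_new
    return out / l[:, None]

# Quick check against a direct softmax(Q K^T / sqrt(d)) V reference
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((96, 16)) for _ in range(3))
s = Q @ K.T / np.sqrt(16)
w = np.exp(s - s.max(axis=1, keepdims=True))
ref = (w / w.sum(axis=1, keepdims=True)) @ V
print(np.allclose(blocked_attention(Q, K, V), ref))  # True
```

    Because each tile's contribution alternates a matrix multiply with softmax bookkeeping, the two phases can be assigned to different warpgroups and overlapped, which is what the pingpong schedule exploits.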

    FlashAttention-3 also makes significant use of low-precision FP8 computation, which doubles Tensor Core throughput compared to FP16. To preserve accuracy despite the reduced precision, it applies incoherent processing: a Hadamard transform with random signs spreads outliers across dimensions, which reduces quantization error and makes FP8 a robust option for high-performance LLMs.
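    A small NumPy sketch (illustrative only, not the production FP8 kernel) shows how a random-sign Hadamard transform spreads an outlier so that a uniform quantizer wastes less of its range:

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform over the last axis
    (length must be a power of two); it is its own inverse."""
    x = np.asarray(x, dtype=np.float64).copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

def quantize(x, levels=256):
    """Uniform symmetric quantizer; one outlier inflates the scale
    and so degrades every other element's precision."""
    scale = np.abs(x).max() / (levels / 2 - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(1)
d = 256
v = rng.standard_normal(d)
v[3] = 50.0                                    # inject an outlier

signs = rng.choice([-1.0, 1.0], size=d)
direct_err = np.linalg.norm(quantize(v) - v)
v_rot = fwht(signs * v)                        # rotation spreads the outlier
v_back = signs * fwht(quantize(v_rot))         # FWHT is its own inverse
rot_err = np.linalg.norm(v_back - v)
print(rot_err < direct_err)  # True
```

    The rotation is orthogonal, so it costs nothing in exact arithmetic, yet the quantization error after the round trip is several times smaller than quantizing the raw vector.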

    With FP16, FlashAttention-3 is 1.5 to 2 times faster than FlashAttention-2, reaching up to 740 TFLOPS, or 75% of the theoretical maximum FLOPS on H100 GPUs. With FP8, it reaches close to 1.2 PFLOPS, with 2.6 times smaller numerical error than a baseline FP8 attention implementation.
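    As a quick arithmetic check on these figures (the peak numbers below are commonly cited H100 SXM dense Tensor Core specs, not values stated in this article, so treat them as assumptions):

```python
# Utilization implied by the reported throughput. The peak figures are
# commonly cited H100 SXM dense Tensor Core specs, NOT from the article
# itself -- treat them as assumptions.
fp16_peak_tflops = 989.0    # dense FP16/BF16 peak (assumption)
fp8_peak_tflops = 1979.0    # dense FP8 peak (assumption)

fp16_util = 740 / fp16_peak_tflops      # reported 740 TFLOPS with FP16
fp8_util = 1200 / fp8_peak_tflops       # reported ~1.2 PFLOPS with FP8
print(f"{fp16_util:.0%}, {fp8_util:.0%}")  # 75%, 61%
```

    Under those assumptions, the FP16 figure matches the article's 75% utilization claim, while the FP8 path still leaves headroom relative to peak.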


    These advancements are underpinned by NVIDIA’s CUTLASS library, which provides powerful abstractions that allow FlashAttention-3 to harness the capabilities of Hopper GPUs. By rewriting FlashAttention to incorporate these new features, Dao AI Lab has unlocked substantial efficiency gains, enabling new model capabilities such as extended context lengths and improved inference speeds.

    In conclusion, the release of FlashAttention-3 represents a paradigm shift in designing and implementing attention mechanisms in large language models. Dao AI Lab has demonstrated how closely aligning algorithmic innovations with hardware advancements can yield significant performance gains. As the field continues to evolve, such breakthroughs will be crucial in pushing the boundaries of what is possible with large language models and their applications across domains.

    Check out the Blog, Paper, and GitHub. All credit for this research goes to the researchers of this project.


    The post FlashAttention-3 Released: Achieves Unprecedented Speed and Precision with Advanced Hardware Utilization and Low-Precision Computing appeared first on MarkTechPost.
