FlashAttention-3, the latest release in the FlashAttention series, is designed to address the inherent bottlenecks of the attention layer in Transformer architectures, bottlenecks that limit the performance of large language models (LLMs) and of applications requiring long-context processing.
The FlashAttention series, including its predecessors FlashAttention and FlashAttention-2, has revolutionized how attention mechanisms operate on GPUs by minimizing memory reads and writes. The technique has been widely adopted across libraries to accelerate Transformer training and inference, and it has contributed significantly to the dramatic increase in LLM context length in recent years: from 2-4K tokens in models like GPT-3 to 128K tokens in GPT-4 and even up to 1 million tokens in models such as Llama 3.
Despite these advancements, FlashAttention-2 achieves only about 35% of the theoretical maximum FLOPS on the H100 GPU, highlighting a gap between potential and realized performance. FlashAttention-3 seeks to close this gap by leveraging new hardware capabilities in modern GPUs. Specifically, it introduces three main techniques to speed up attention on Hopper GPUs: exploiting the asynchrony of the Tensor Cores and TMA to overlap computation and data movement, interleaving block-wise matrix multiplication and softmax operations, and leveraging hardware support for FP8 low-precision computation together with incoherent processing to keep quantization error in check.
One of the standout features of FlashAttention-3 is its exploitation of the asynchrony of the Tensor Cores and TMA, which allows computation and data movement to overlap. Warp specialization assigns separate producer and consumer warps: producers issue TMA data transfers while consumers execute WGMMA (warpgroup matrix multiply) instructions. On top of this, FlashAttention-3 overlaps GEMM (general matrix multiply) and softmax operations both across and within warpgroups. The inter-warpgroup "pingpong" schedule lets one warpgroup run its GEMMs while another computes softmax, after which the two swap roles, so the comparatively slow exponentials of softmax do not leave the Tensor Cores idle.
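To make concrete which operations this scheduling overlaps, the sketch below is a minimal NumPy reference for the block-wise attention loop: each inner iteration contains two GEMMs (Q·Kᵀ and P·V) with an online-softmax update in between. It is purely conceptual; the real FlashAttention-3 kernel is Hopper CUDA built on TMA and WGMMA, and the function name `blockwise_attention` and the block size here are illustrative choices, not part of the FlashAttention API.

```python
import numpy as np

def blockwise_attention(Q, K, V, block_size=128):
    """Conceptual single-head FlashAttention-style loop (not the CUDA kernel).

    Each iteration contains two GEMMs (S = Q_i K_j^T and O += P V_j) with an
    online softmax in between -- the operations that FlashAttention-3
    overlaps across warpgroups on Hopper GPUs.
    """
    seq_len, head_dim = Q.shape
    scale = 1.0 / np.sqrt(head_dim)
    O = np.zeros_like(Q, dtype=np.float32)

    for i in range(0, seq_len, block_size):
        Qi = Q[i:i + block_size] * scale
        m = np.full(Qi.shape[0], -np.inf)        # running row max
        l = np.zeros(Qi.shape[0])                # running softmax denominator
        Oi = np.zeros((Qi.shape[0], head_dim))

        for j in range(0, seq_len, block_size):
            Kj, Vj = K[j:j + block_size], V[j:j + block_size]
            S = Qi @ Kj.T                        # GEMM 1: Q K^T
            m_new = np.maximum(m, S.max(axis=1)) # softmax: update running stats
            P = np.exp(S - m_new[:, None])
            alpha = np.exp(m - m_new)
            l = alpha * l + P.sum(axis=1)
            Oi = alpha[:, None] * Oi + P @ Vj    # GEMM 2: P V
            m = m_new

        O[i:i + block_size] = Oi / l[:, None]
    return O

# Quick check against a naive attention implementation
Q, K, V = (np.random.randn(256, 64).astype(np.float32) for _ in range(3))
ref = np.exp((Q @ K.T) / np.sqrt(64))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(blockwise_attention(Q, K, V), ref, atol=1e-4)
```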
FlashAttention-3 also exploits low-precision FP8 computation, which doubles Tensor Core throughput relative to FP16. Lower precision normally amplifies quantization error, especially in the presence of outlier activations, so FlashAttention-3 counters this with incoherent processing: multiplying the query and key by a random-sign Hadamard transform spreads outliers across all dimensions without changing the attention scores, substantially reducing quantization error and making FP8 attention robust enough for high-performance LLMs.
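The following NumPy sketch illustrates the incoherent-processing idea under stated assumptions: NumPy has no FP8 dtype, so a crude per-tensor uniform quantizer (the hypothetical helper `quantize`) stands in for FP8, and the `hadamard` helper is likewise illustrative rather than FlashAttention-3 code. Because the random-sign Hadamard matrix M is orthogonal, (QM)(KM)ᵀ equals QKᵀ exactly, while the outlier feature no longer dictates the quantization scale.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix of size n (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize(x, levels=256):
    """Per-tensor symmetric uniform quantizer, a crude stand-in for FP8:
    the step size is set by the largest |value|, so a single outlier makes
    quantization of every other value coarser."""
    step = np.abs(x).max() / (levels // 2 - 1)
    return np.round(x / step) * step

d = 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((128, d))
K = rng.standard_normal((128, d))
Q[:, 3] *= 50.0  # inject an outlier feature, as often seen in LLM activations
K[:, 3] *= 50.0

# Incoherent processing: multiply Q and K by the same random-sign Hadamard
# matrix M. M is orthogonal, so (Q M)(K M)^T == Q K^T exactly, but the outlier
# coordinate is spread evenly across all d dimensions.
signs = rng.choice([-1.0, 1.0], size=d)
M = hadamard(d) * signs  # scale each column by a random sign

exact = Q @ K.T
err_plain = np.abs(quantize(Q) @ quantize(K).T - exact).mean()
err_incoh = np.abs(quantize(Q @ M) @ quantize(K @ M).T - exact).mean()
print(f"mean |error| without incoherent processing: {err_plain:.2f}")
print(f"mean |error| with    incoherent processing: {err_incoh:.2f}")
```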
In benchmarks, FlashAttention-3 is 1.5 to 2 times faster than FlashAttention-2 with FP16, reaching up to 740 TFLOPS, about 75% of the theoretical maximum on H100 GPUs. With FP8, FlashAttention-3 approaches 1.2 PFLOPS while achieving 2.6 times smaller numerical error than a baseline FP8 attention implementation.
These advancements are underpinned by NVIDIA’s CUTLASS library, which provides powerful abstractions that allow FlashAttention-3 to harness the capabilities of Hopper GPUs. By rewriting FlashAttention to incorporate these new features, Dao AI Lab has unlocked substantial efficiency gains, enabling new model capabilities such as extended context lengths and improved inference speeds.
In conclusion, the release of FlashAttention-3 represents a major step forward in how attention mechanisms are designed and implemented for large language models. By closely aligning algorithmic innovations with hardware advancements, Dao AI Lab has demonstrated how targeted optimizations can deliver significant performance gains. As the field continues to evolve, such breakthroughs will be crucial in pushing the limits of what is possible with large language models and their applications across domains.
Check out the Blog, Paper, and GitHub. All credit for this research goes to the researchers of this project.