
    FLUTE: A CUDA Kernel Designed for Fused Quantized Matrix Multiplications to Accelerate LLM Inference

    July 26, 2024

Large Language Models (LLMs) face deployment challenges due to latency issues caused by memory bandwidth constraints. Researchers use weight-only quantization to address this, compressing LLM parameters to lower precision. This approach improves latency and reduces GPU memory requirements. Implementing it effectively requires custom mixed-type matrix-multiply kernels that move, dequantize, and process weights efficiently. Existing kernels such as bitsandbytes, Marlin, and BitBLAS have shown significant speed-ups but are often limited to 4-bit quantization. Recent advances in odd-bit and non-uniform quantization methods highlight the need for more flexible kernels that support a wider range of settings to maximize the potential of weight quantization in LLM deployment.
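To make the setup concrete, here is a minimal NumPy sketch of weight-only uniform quantization (our own illustration, not FLUTE's implementation): weights are split into groups, each group stores one floating-point scale, and values are rounded to signed 4-bit codes.

```python
import numpy as np

def quantize_4bit(w: np.ndarray, group_size: int = 128):
    """Quantize a flat weight vector to int4 codes plus per-group FP scales."""
    w = w.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0  # int4 range is [-8, 7]
    codes = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (codes.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
codes, scales = quantize_4bit(w)
print(np.abs(w - dequantize(codes, scales)).max())  # small reconstruction error
```

A fused kernel performs the `dequantize` step on-chip, right before the multiply, so only the compact codes and scales ever cross the memory bus.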

Researchers have attempted to solve LLM deployment challenges using weight-only quantization. Uniform quantization converts full-precision weights to lower-precision intervals, while non-uniform methods such as lookup table (LUT) quantization offer more flexibility. Existing kernels like bitsandbytes, Marlin, and BitBLAS move quantized weights from main memory to on-chip SRAM and perform matrix multiplications after dequantizing them to floating point. They show significant speed-ups but often specialize in 4-bit uniform quantization, with LUT-quantization kernels underperforming. Non-uniform methods such as SqueezeLLM and NormalFloat face trade-offs between lookup table size and quantization granularity, and non-uniformly quantized operations cannot utilize GPU accelerators optimized for floating-point calculations. This highlights the need for kernels that use quantized representations to minimize memory movement while still exploiting GPU-native floating-point matrix multiplications, balancing the benefits of quantization with hardware optimization.
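The following sketch contrasts this with lookup-table dequantization in the spirit of NormalFloat: each 4-bit code becomes a gather into a 16-entry table of non-uniformly spaced values, followed by a per-group rescale. The table below is illustrative, not the actual NF4 codebook.

```python
import numpy as np

# 16 non-uniform levels, denser near zero (NF4-like in spirit, values illustrative)
table = np.sinh(np.linspace(-2.0, 2.0, 16)) / np.sinh(2.0)

def lut_dequantize(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # One gather per weight: table[code] * per-group scale
    return table[codes] * scales

codes = np.random.randint(0, 16, size=(8, 128))   # 4-bit codes
scales = np.random.rand(8, 1).astype(np.float32)  # per-group scales
w_hat = lut_dequantize(codes, scales)             # reconstructed weights
```

On a GPU, that gather is where the trouble starts: naive table lookups from shared memory serialize under bank conflicts, which is what FLUTE's vectorized, duplicated tables (described below) are designed to avoid.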

Researchers from the Massachusetts Institute of Technology, the High School of Mathematics Plovdiv, Carnegie Mellon University, MBZUAI, and Petuum Inc. introduce the Flexible Lookup-Table Engine (FLUTE), an innovative approach for deploying weight-quantized LLMs that focuses on low-bit and non-uniform quantization. It addresses three main challenges: handling sub-8-bit matrices, optimizing lookup-table-based dequantization, and improving workload distribution for small batches and low-bit-width weights. FLUTE overcomes these issues through three key strategies: offline weight restructuring, a shared-memory lookup table for efficient dequantization, and Stream-K partitioning for optimized workload distribution. This enables FLUTE to manage the complexities of low-bit and non-uniform quantization in LLM deployment, improving efficiency and performance in scenarios where traditional methods fall short.

FLUTE performs flexible mixed-type matrix multiplications for weight-quantized LLMs. It addresses the key challenges of deploying low-bit and non-uniform quantized models through three main strategies:

Offline Matrix Restructuring: FLUTE reorders quantized weights offline to suit Tensor Core operations, handling non-standard bit widths (e.g., 3-bit) by splitting weights into bit-slices and recombining them in registers (a toy sketch of this bit-slicing follows the list).

    Vectorized Lookup in Shared Memory: To optimize dequantization, FLUTE uses a vectorized lookup table stored in shared memory, accessing two elements simultaneously. It also employs table duplication to reduce bank conflicts.

Stream-K Workload Partitioning: FLUTE implements Stream-K decomposition to distribute work evenly across SMs, mitigating wave-quantization issues in low-bit, low-batch scenarios (a simplified model appears after the next paragraph).
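As noted in the first strategy, here is a toy NumPy sketch of the bit-slicing idea for 3-bit weights (an assumed illustration, not FLUTE's actual offline layout): because 3 does not divide standard 8/16/32-bit containers evenly, each code is split into a 2-bit slice and a 1-bit slice that pack cleanly at power-of-2 boundaries and recombine with shifts and ORs, operations that map naturally onto registers.

```python
import numpy as np

def pack_3bit(codes: np.ndarray):
    lo = (codes & 0b011).astype(np.uint8)  # lower 2 bits -> 2-bit slice buffer
    hi = (codes >> 2).astype(np.uint8)     # top bit      -> 1-bit slice buffer
    return lo, hi

def unpack_3bit(lo: np.ndarray, hi: np.ndarray) -> np.ndarray:
    return (hi << 2) | lo                  # recombine in registers at load time

codes = np.random.randint(0, 8, size=16).astype(np.uint8)  # 3-bit codes in [0, 8)
lo, hi = pack_3bit(codes)
assert np.array_equal(unpack_3bit(lo, hi), codes)
```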

    These innovations allow FLUTE to efficiently fuse dequantization and matrix multiplication operations, optimizing memory usage and computational throughput. The kernel employs a sophisticated pipeline of data movement between global memory, shared memory, and registers, utilizing GPU hardware capabilities for maximum performance in weight-quantized LLM deployments.
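To make the third strategy concrete, here is a simplified model of Stream-K partitioning (our own illustration with made-up tile and SM counts): the K-loop iterations of all output tiles are flattened into one iteration space and split evenly across SMs, so no SM idles through a partially filled final wave.

```python
num_sms = 108                        # e.g., an A100
tiles = 150                          # output tiles = (M/tile_m) * (N/tile_n)
k_iters_per_tile = 32                # K-loop steps per tile
total = tiles * k_iters_per_tile     # flattened iteration space

# Tile-per-CTA launch: ceil(150 / 108) = 2 waves, the second only 42/108 full
# (wave quantization). Stream-K instead hands each SM an equal flat slice:
base, extra = divmod(total, num_sms)
assignments = []
start = 0
for sm in range(num_sms):
    end = start + base + (1 if sm < extra else 0)
    assignments.append((start, end))  # [start, end) in the flat space
    start = end

# A slice that ends mid-tile writes out a partial accumulator, which is then
# reduced with the SM that owns the remainder of that tile.
assert assignments[-1][1] == total
```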

FLUTE shows impressive performance across various matrix shapes on both A6000 and A100 GPUs. On the A6000, it occasionally approaches the theoretical maximum speedup of 4x. Its performance is also consistent across batch sizes, unlike other LUT-compatible kernels, which typically achieve similar speedups only at batch size 1 and degrade rapidly as batch size increases. FLUTE also compares favorably even to Marlin, a kernel highly specialized for FP16 inputs and uniform-quantized INT4 weights, demonstrating its ability to handle both uniform and non-uniform quantization schemes efficiently.

FLUTE demonstrates superior performance in LLM deployment across various quantization settings. The learned NF (NormalFloat) quantization approach outperforms standard methods and combines well with AWQ. FLUTE's flexibility allows experiments with different bit widths and group sizes, nearly matching 16-bit baseline perplexity with small group sizes. End-to-end latency tests using the vLLM framework showed meaningful speedups across various configurations, including with Gemma-2 models. A group size of 64 was found to balance quality and speed effectively. Overall, FLUTE proves to be a versatile and efficient solution for quantized LLM deployment, offering improved performance across multiple scenarios.
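A back-of-the-envelope calculation (our own arithmetic, not a figure from the paper) shows why group size trades quality against compression: each group carries one extra FP16 scale, so the effective cost is roughly bits + 16/group_size bits per weight.

```python
# Effective bits per weight for grouped quantization with FP16 scales
for bits in (3, 4):
    for g in (32, 64, 128):
        eff = bits + 16 / g
        print(f"{bits}-bit, group {g:>3}: {eff:.2f} bits/weight "
              f"({16 / eff:.1f}x smaller than FP16)")
```

At 4 bits with group size 64, that works out to 4.25 bits per weight, about a 3.8x reduction over FP16, consistent with smaller groups costing more memory for better fidelity.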

    FLUTE is a CUDA kernel designed to accelerate LLM inference through fused quantized matrix multiplications. It offers flexibility in mapping quantized to de-quantized values via lookup tables and supports various bit widths and group sizes. FLUTE’s performance is demonstrated through kernel-level benchmarks and end-to-end evaluations on state-of-the-art LLMs like LLaMA-3 and Gemma-2. Tested on A6000 and A100 GPUs in single and tensor parallel setups, FLUTE shows efficiency across unquantized, 3-bit, and 4-bit configurations. This versatility and performance make FLUTE a promising solution for accelerating LLM inference using advanced quantization techniques.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

