
    CPU-GPU I/O-Aware LLM Inference Reduces Latency in GPUs by Optimizing CPU-GPU Interactions

    December 7, 2024

    LLMs are driving major advances in research and development today, and both research objectives and methodologies have shifted noticeably toward LLM-centric approaches. However, serving these models is expensive, which puts large-scale use out of reach for many. Reducing inference latency is therefore a significant challenge, especially in dynamic applications that demand responsiveness.

    The KV cache is central to autoregressive decoding in LLMs. During the pre-filling phase of inference it stores the key-value pairs produced by multi-headed attention, and during decoding each newly generated token's KV pair is appended to it. Caching these intermediate key and value activations instead of recomputing them reduces the per-step cost of attention from quadratic to linear in the sequence length. The efficiency gain comes at a price: the cache grows linearly with batch size, sequence length, and model size. Once it exceeds what the GPU can hold, it must be offloaded to the CPU, and moving it back and forth introduces bottlenecks that increase latency and reduce throughput.
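
    The mechanics are easy to see in a few lines. Below is a minimal sketch of KV caching during decoding, assuming PyTorch and toy tensor shapes; it is an illustration, not the paper's implementation:

```python
# Minimal KV-cache sketch for autoregressive decoding (toy shapes).
# Each decode step appends one new K/V pair instead of recomputing
# keys/values for every previous token, so memory grows linearly
# with the number of generated tokens.
import torch

batch, n_heads, head_dim = 2, 8, 64
k_cache = torch.empty(batch, n_heads, 0, head_dim)   # grows along dim=2
v_cache = torch.empty(batch, n_heads, 0, head_dim)

def decode_step(q, k_new, v_new):
    """Append this step's K/V, then attend over the full cache."""
    global k_cache, v_cache
    k_cache = torch.cat([k_cache, k_new], dim=2)      # (B, H, t, D)
    v_cache = torch.cat([v_cache, v_new], dim=2)
    scores = (q @ k_cache.transpose(-2, -1)) / head_dim ** 0.5
    return torch.softmax(scores, dim=-1) @ v_cache    # (B, H, 1, D)

for _ in range(4):                                    # four decode steps
    q = torch.randn(batch, n_heads, 1, head_dim)
    out = decode_step(q,
                      torch.randn(batch, n_heads, 1, head_dim),
                      torch.randn(batch, n_heads, 1, head_dim))

print(k_cache.shape)  # torch.Size([2, 8, 4, 64]) -- one K/V pair per step
```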

    The PCIe interface becomes the limiting factor, especially when the cache is transferred from the CPU back to the GPU for computation. A slow PCIe link can push latency an order of magnitude above normal levels, leaving the GPU idle for much of each step.
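
    A rough back-of-the-envelope calculation shows why. The numbers below are illustrative assumptions (a 32-layer model, fp16 cache, roughly 32 GB/s of usable PCIe 4.0 x16 bandwidth), not figures from the paper:

```python
# KV cache size = 2 (K and V) * layers * batch * seq_len * heads * head_dim * bytes
layers, batch, seq_len = 32, 16, 4096
heads, head_dim, fp16_bytes = 32, 128, 2
kv_bytes = 2 * layers * batch * seq_len * heads * head_dim * fp16_bytes

pcie_bw = 32e9                      # ~32 GB/s usable PCIe 4.0 x16 (assumed)
transfer_s = kv_bytes / pcie_bw
print(f"KV cache ~ {kv_bytes / 1e9:.1f} GB, transfer ~ {transfer_s * 1e3:.0f} ms")
# -> roughly 34 GB and about a second of pure transfer time, far longer than
#    the GPU needs to run one decode step's attention over that cache.
```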

    Previous work has tried to mitigate slow PCIe performance, but these approaches often fall short because data-transfer time and GPU computation time are mismatched, particularly at large batch and context sizes. Other approaches rely on CPU resources, which then become the limiting factor themselves. This article discusses a novel approach to PCIe and GPU optimization.

    Researchers at the University of Southern California propose an efficient CPU-GPU I/O-aware LLM inference method that optimizes PCIe utilization. It leverages partial KV cache recomputation and asynchronous overlapping to address the system bottleneck of loading large KV caches. Instead of transferring the entire KV cache, their approach transfers smaller activation segments to the GPU, which then reconstructs the full cache from them. The key lies in computing attention scores from the reconstructed cache with minimal information loss.
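
    The reconstruction step itself is essentially the attention projections applied on the GPU. Here is a minimal sketch of that idea, assuming a single layer and hypothetical shapes: the activations X hold one hidden vector per token, while K plus V hold two, so the segment crosses PCIe in roughly half the bytes.

```python
# Rebuild a KV segment on the GPU from its layer-input activations.
# W_k and W_v are the attention projection weights, already resident
# on the GPU as part of the model; only x_segment crosses PCIe.
import torch

hidden, segment_len = 4096, 512
W_k = torch.randn(hidden, hidden)              # key projection (toy weights)
W_v = torch.randn(hidden, hidden)              # value projection (toy weights)

x_segment = torch.randn(segment_len, hidden)   # small activation segment from CPU
k_segment = x_segment @ W_k                    # reconstructed keys
v_segment = x_segment @ W_v                    # reconstructed values
print(k_segment.shape, v_segment.shape)        # (512, 4096) each
```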

    The authors propose a fully automated method for determining the recomputation and communication split. The system consists of three modules designed to minimize GPU latency:

    1. Profiler Module: Collects system hardware information, such as PCIe bandwidth and GPU processing speed.
    2. Scheduler Module: Formulates the problem as a linear programming task to determine the optimal KV split point using hardware information and user configuration. The objective is to maximize the overlap between computation and communication processes (a simplified stand-in for this split-point choice is sketched after this list).
    3. Runtime Module: Coordinates data transfer between the two devices and manages memory allocations.
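
    As a simplified stand-in for the Scheduler's linear program, the toy function below balances GPU recomputation time against PCIe transfer time so that both finish together. It assumes a single layer and purely linear cost models, and the constants are illustrative; the paper's actual formulation also has to account for moving the activations of the recomputed segment and for per-layer scheduling.

```python
def split_point(seq_len, kv_bytes_per_token, pcie_bw, recompute_s_per_token):
    """Number of leading tokens whose K/V the GPU should recompute.

    Solves s * recompute_s_per_token = (seq_len - s) * transfer_s_per_token,
    i.e. recomputing the head segment and transferring the tail segment take
    the same time, so neither side waits on the other.
    """
    transfer_s_per_token = kv_bytes_per_token / pcie_bw
    s = seq_len * transfer_s_per_token / (recompute_s_per_token + transfer_s_per_token)
    return int(s)

# Illustrative numbers: 0.5 MB of K/V per token, 32 GB/s PCIe,
# 5 microseconds of GPU recompute per token.
print(split_point(4096, 0.5e6, 32e9, 5e-6))   # -> ~3103 tokens recomputed
```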

    The Scheduler Module, which is responsible for finding the optimal KV split, works in two ways:

    • Row-by-Row Schedule: Reduces latency with a row-by-row execution plan in which the GPU begins reconstructing the KV cache while the remaining activations are still loading asynchronously.
    • Column-by-Column Schedule: Maximizes throughput and accommodates large-batch inference by reusing model weights across batches. It overlaps the transmission of the KV cache and activations with the computation of multi-headed attention (MHA) across multiple batches, instead of processing each layer sequentially within a batch.

    Using a six-process communication parallelism strategy, the Runtime Module then enables concurrent GPU computation and CPU-GPU communication. A sketch of the row-by-row overlap follows.
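
    The sketch below illustrates the overlap pattern with PyTorch CUDA streams, reduced to a single layer with hypothetical tensor names; it is not the authors' runtime, only the general idea of hiding a host-to-device copy behind reconstruction work:

```python
# Overlap: reconstruct the head of the KV cache on the default stream
# while the tail of the cache is copied host-to-device on a side stream.
import torch

hidden, split, seq_len = 4096, 1024, 4096
W_k = torch.randn(hidden, hidden, device="cuda", dtype=torch.float16)
W_v = torch.randn(hidden, hidden, device="cuda", dtype=torch.float16)
x_head = torch.randn(split, hidden, dtype=torch.float16).pin_memory()
kv_tail = torch.randn(2, seq_len - split, hidden, dtype=torch.float16).pin_memory()

copy_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):                     # async H2D copy of the tail
    kv_tail_gpu = kv_tail.to("cuda", non_blocking=True)

x_gpu = x_head.to("cuda", non_blocking=True)             # small activation segment
k_head, v_head = x_gpu @ W_k, x_gpu @ W_v                # overlaps with the copy
torch.cuda.current_stream().wait_stream(copy_stream)     # join before attention
K = torch.cat([k_head, kv_tail_gpu[0]])
V = torch.cat([v_head, kv_tail_gpu[1]])
print(K.shape, V.shape)                                  # (4096, 4096) each
```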

    The authors tested the proposed framework for efficient LLM inference using an NVIDIA A100 GPU connected to a CPU via a PCIe 4.0 x16 interface. Experiments were conducted with two objectives to assess the framework’s performance:

    • Latency-Oriented Workload: The proposed method outperformed baselines, reducing latency by 35.8%.
    • Throughput-Oriented Workload: The method achieved up to a 29% improvement relative to the baseline.

    Conclusion:

    The CPU-GPU I/O-aware method reduces latency while increasing throughput in LLM inference. By recomputing part of the KV cache on the GPU and overlapping that recomputation with data transmission, it minimizes GPU idle time and improves overall efficiency.

