CPU-GPU I/O-Aware LLM Inference Reduces Latency in GPUs by Optimizing CPU-GPU Interactions

Large language models (LLMs) are driving major advances in research and development today, and research objectives and methodologies have shifted markedly toward an LLM-centric approach. However, LLMs are expensive to run, which puts large-scale use out of reach for many. Reducing inference latency is therefore a significant challenge, especially in dynamic applications that demand responsiveness.

The KV cache is used for autoregressive decoding in LLMs. During the pre-filling phase of inference, it stores the key and value activations computed in multi-headed attention; during the decoding stage, the KV pairs for each new token are appended to it. Because these intermediate activations are reused rather than recomputed, the attention cost of each decoding step drops from quadratic to linear in sequence length. The KV cache improves efficiency, but it grows linearly with batch size, sequence length, and model size. As it grows, it can exceed the memory capacity of the GPU, and offloading it to the CPU introduces several bottlenecks, increasing latency while reducing throughput.
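To make the linear growth concrete, here is a minimal sketch of the standard size estimate for a KV cache. The function name and the model dimensions are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer, head and token."""
    # The factor 2 accounts for one tensor of keys and one tensor of values.
    return 2 * num_layers * num_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical model: 32 layers, 32 heads of dimension 128, FP16 (2 bytes per value).
small = kv_cache_bytes(32, 32, 128, seq_len=1024, batch_size=1)
large = kv_cache_bytes(32, 32, 128, seq_len=1024, batch_size=32)

print(f"batch 1:  {small / 2**30:.2f} GiB")
print(f"batch 32: {large / 2**30:.2f} GiB")
```

With these assumed dimensions, a single 1024-token sequence needs 0.5 GiB, and a batch of 32 such sequences needs 16 GiB, which is already a large share of a single GPU's memory.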

The PCIe interface becomes a limiting factor, especially when the cache is transferred from the CPU back to the GPU for computation. A slow PCIe link can push latency an order of magnitude above normal levels, leaving the GPU idle for substantial periods.
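A rough back-of-the-envelope estimate shows why the link matters. The sketch below assumes a hypothetical 16 GiB cache and the approximate theoretical peak of a PCIe 4.0 x16 link (about 31.5 GB/s per direction); sustained bandwidth in practice is lower, so this is an optimistic bound.

```python
def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Time to move num_bytes over a link with the given bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


# Assumption: approximate theoretical peak of PCIe 4.0 x16, one direction.
PCIE4_X16_PEAK = 31.5e9  # bytes per second

# Assumption: a hypothetical 16 GiB KV cache held in CPU memory.
cache_bytes = 16 * 2**30

t = transfer_seconds(cache_bytes, PCIE4_X16_PEAK)
print(f"full-cache transfer: {t * 1000:.0f} ms")
```

Under these assumptions, moving the whole cache takes roughly half a second, time during which the GPU has nothing to compute unless the transfer is shortened or overlapped with other work.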

Previous work has attempted to mitigate slow PCIe performance, but these approaches often fall short because data transfer and GPU computation times are mismatched, particularly with large batch and context sizes. Others depend on CPU resources, which then become the limiting factor. This article discusses a novel approach to PCIe and GPU optimization.

Researchers at the University of Southern California propose an efficient CPU-GPU I/O-aware LLM inference method that optimizes PCIe utilization. It leverages partial KV cache recomputation and asynchronous overlapping to address the system bottleneck of loading large KV caches. Their process transfers smaller activation segments of the cache to the GPU rather than the entire KV cache. The GPU then reconstructs the full KV cache from these activations. The key lies in computing attention scores that ensure minimal information loss.
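The idea of partial recomputation can be illustrated with a toy example. This is a sketch under simplifying assumptions, not the authors' implementation: keys and values are taken to be plain linear projections of the input activations (no bias, no positional encoding), the matrices are tiny, and names such as `split`, `X_sent`, and `K_gpu` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Toy sizes: 4 cached tokens, hidden size 3. All values are illustrative.
X = [[1, 0, 2], [0, 1, 1], [2, 1, 0], [1, 1, 1]]  # per-token input activations
W_K = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]           # key projection weights
W_V = [[0, 1, 0], [1, 0, 1], [0, 0, 1]]           # value projection weights

# The full cache as it would sit in CPU memory.
K_full, V_full = matmul(X, W_K), matmul(X, W_V)

split = 2  # hypothetical split point: tokens [0, split) are recomputed on the GPU

# Sent over PCIe: activations for the first part, ready-made K/V for the rest.
X_sent = X[:split]
K_sent, V_sent = K_full[split:], V_full[split:]

# "GPU side": recompute the first part, then append the transferred remainder.
K_gpu = matmul(X_sent, W_K) + K_sent
V_gpu = matmul(X_sent, W_V) + V_sent
assert K_gpu == K_full and V_gpu == V_full

# Count the values that crossed the link compared with sending K and V whole.
elements_full = 2 * len(X) * len(W_K[0])
elements_sent = split * len(X[0]) + 2 * (len(X) - split) * len(W_K[0])
print(elements_sent, "of", elements_full, "elements transferred")
```

In this toy setting the activations for a token are half the size of its keys and values combined, so every token moved to the recomputed side halves its share of PCIe traffic at the cost of extra GPU work.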

The authors propose a fully automated method for determining recomputation and communication splits. The system consists of three modules that together minimize GPU latency:

Profiler Module: Collects system hardware information, such as PCIe bandwidth and GPU processing speed.
Scheduler Module: Formulates the problem as a linear programming task to determine the optimal KV split point using hardware information and user configuration. The objective is to maximize the overlap between computation and communication processes.
Runtime Module: Coordinates data transfer between the two devices and manages memory allocations.
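The scheduler's task of choosing a split point can be sketched with a simplified cost model. The paper formulates this as a linear programming problem; the version below swaps in a brute-force search over integer split points, and the cost model, function names, and per-token timings are all assumptions made for illustration. In a real system the timings would come from the profiler's measurements.

```python
def step_latency(split, seq_len, act_time, kv_time, recompute_time):
    """Estimated latency (arbitrary time units) for one split point.

    Assumed model: activations for tokens [0, split) are sent first; the GPU
    then recomputes their K/V while the remaining K/V is still being sent.
    """
    t_activations = split * act_time          # transfer of activations
    t_remaining_kv = (seq_len - split) * kv_time  # transfer of the rest of the cache
    t_recompute = split * recompute_time      # GPU recomputation of the first part
    return t_activations + max(t_recompute, t_remaining_kv)


def best_split(seq_len, act_time, kv_time, recompute_time):
    """Exhaustively search every split point and return the fastest one."""
    return min(
        range(seq_len + 1),
        key=lambda s: step_latency(s, seq_len, act_time, kv_time, recompute_time),
    )


# Hypothetical per-token costs: K/V transfer costs twice the activation transfer.
SEQ_LEN, ACT, KV, RECOMPUTE = 900, 1, 2, 1

split = best_split(SEQ_LEN, ACT, KV, RECOMPUTE)
baseline = step_latency(0, SEQ_LEN, ACT, KV, RECOMPUTE)
tuned = step_latency(split, SEQ_LEN, ACT, KV, RECOMPUTE)
print(f"split={split}, latency {baseline} -> {tuned}")
```

With these made-up costs, the best split recomputes the first 600 of 900 tokens and cuts the estimated latency from 1800 to 1200 units. The optimum sits where recomputation and the remaining transfer finish at the same time, which is the overlap the scheduler aims to maximize.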

The Scheduler Module, which is responsible for finding the optimal KV split, works in two ways:

Row-by-Row Schedule: Reduces latency with a row-by-row execution plan. Here, the GPU begins reconstructing the KV cache while the remaining activations are loading asynchronously.
Column-by-Column Schedule: Maximizes throughput and accommodates large-batch inference by reusing model weights across batches. Instead of processing each layer sequentially within a batch, it overlaps the transmission of the KV cache and activations with the computation of MHA (multi-headed attention) across multiple batches.

In addition, the Runtime Module uses a six-process communication parallelism strategy to enable concurrent GPU computation and CPU-GPU communication.
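The benefit of overlapping transfer with computation can be seen in a small timing model. This is a generic two-stage pipeline sketch, not the paper's six-process runtime: it assumes transfers run back to back, the GPU can buffer chunks that arrive early, and the chunk timings are hypothetical.

```python
def sequential_time(transfer, compute):
    """Total time when every chunk is transferred, then computed, one after another."""
    return sum(transfer) + sum(compute)


def overlapped_time(transfer, compute):
    """Total time when the link loads the next chunk while the GPU works on the current one."""
    link_free = 0  # time at which the link finishes delivering the current chunk
    gpu_free = 0   # time at which the GPU finishes computing the current chunk
    for t, c in zip(transfer, compute):
        link_free += t                           # transfers run back to back
        gpu_free = max(gpu_free, link_free) + c  # compute waits for its data
    return gpu_free


# Hypothetical per-chunk timings in arbitrary units (e.g. one chunk per layer).
transfer = [3, 3, 3, 3]
compute = [2, 2, 2, 2]

seq = sequential_time(transfer, compute)
ovl = overlapped_time(transfer, compute)
print(f"sequential: {seq}, overlapped: {ovl}")
```

Here the sequential plan takes 20 units and the overlapped plan 14. Because transfer is the slower stage in this example, the overlapped total is dominated by the link, which is why reducing the volume of transferred data and overlapping it with computation work well together.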

The authors tested the proposed framework for efficient LLM inference using an NVIDIA A100 GPU connected to a CPU via a PCIe 4.0 x16 interface. Experiments were conducted with two objectives to assess the framework's performance:

Latency-Oriented Workload: The proposed method outperformed baselines, reducing latency by 35.8%.
Throughput-Oriented Workload: The method achieved up to a 29% improvement relative to the baseline.

Conclusion:

The CPU-GPU I/O-aware LLM inference method efficiently reduces latency while increasing throughput in LLM inference. It leverages partial KV cache recomputation and overlaps it with data transmission to minimize idle GPU time and enhance efficiency.

The post CPU-GPU I/O-Aware LLM Inference Reduces Latency in GPUs by Optimizing CPU-GPU Interactions appeared first on MarkTechPost.
