KAIST and DeepAuto AI Researchers Propose InfiniteHiP: A Game-Changing Long-Context LLM Framework for 3M-Token Inference on a Single GPU

February 16, 2025

In large language models (LLMs), processing extended input sequences demands significant compute and memory, leading to slower inference and higher hardware costs. The attention mechanism, a core component, exacerbates these challenges because its cost grows quadratically with sequence length. In addition, maintaining prior context in a key-value (KV) cache incurs high memory overhead, limiting scalability.
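
To make the scaling concrete, here is a quick back-of-envelope sketch (assuming a Llama-3-8B-style configuration; all constants are illustrative and not taken from the paper): attention FLOPs grow quadratically with sequence length, while the KV cache grows linearly yet still reaches hundreds of gigabytes at millions of tokens.

```python
# Back-of-envelope cost model for long contexts, assuming a
# Llama-3-8B-style config (32 layers, 8 KV heads, head dim 128, fp16).
# Illustrative only; these numbers are not taken from the paper.

def attention_flops_per_layer(seq_len: int, hidden: int = 4096) -> float:
    # QK^T and the attention-weighted V product each cost
    # roughly 2 * seq_len^2 * hidden FLOPs: quadratic in seq_len
    return 2 * 2 * seq_len**2 * hidden

def kv_cache_bytes(seq_len: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # 2x for keys and values; linear in sequence length
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

for n in (8_192, 131_072, 3_000_000):
    print(f"{n:>9} tokens: {attention_flops_per_layer(n) / 1e12:12.1f} "
          f"TFLOPs/layer, KV cache {kv_cache_bytes(n) / 1e9:6.1f} GB")
```

Under these assumptions, the KV cache alone for a 3-million-token context is roughly 390 GB, far beyond a single 48GB GPU, which is why cache offloading becomes unavoidable at this scale.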

A key limitation of LLMs is their inability to handle sequences longer than their trained context window. Most models degrade on extended inputs because of inefficient memory management and growing attention computation costs. Existing solutions often rely on fine-tuning, which is resource-intensive and requires high-quality long-context datasets. Without an efficient method for context extension, tasks such as document summarization, retrieval-augmented generation, and long-form text generation remain constrained.

Several approaches have been proposed to tackle long-context processing. FlashAttention2 (FA2) minimizes redundant memory operations during attention computation, yet it does not address the underlying computational inefficiency. Some models employ selective token attention, either static or dynamic, to reduce processing overhead. KV cache eviction strategies remove older tokens selectively, but they risk permanently discarding important contextual information (a toy sketch of such an eviction policy follows the list below). HiP Attention offloads infrequently used tokens to external memory, but it lacks efficient cache management, leading to increased latency. Despite these advances, no method has effectively addressed all three key challenges:

1. Long-context generalization
2. Efficient memory management
3. Computational efficiency
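
For reference, here is the toy eviction sketch promised above, in the spirit of attention-sink-plus-sliding-window schemes (the class name and defaults are invented for illustration, not taken from any of the cited papers). Its defining weakness is that evicted entries are unrecoverable:

```python
from collections import deque

class SinkWindowKVCache:
    """Toy static eviction: keep the first `n_sink` tokens plus a sliding
    window of recent tokens; everything in between is dropped forever."""

    def __init__(self, n_sink: int = 4, window: int = 1024):
        self.n_sink = n_sink
        self.sink = []                       # never evicted
        self.recent = deque(maxlen=window)   # old entries silently fall off

    def append(self, kv_entry):
        if len(self.sink) < self.n_sink:
            self.sink.append(kv_entry)
        else:
            self.recent.append(kv_entry)     # may evict the oldest entry

    def visible(self):
        # Tokens evicted from `recent` cannot be recovered later,
        # no matter how relevant they become to a future query.
        return self.sink + list(self.recent)
```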

Researchers from KAIST and DeepAuto.ai introduced InfiniteHiP, a framework that enables efficient long-context inference while mitigating memory bottlenecks. The model achieves this through a hierarchical token pruning algorithm that dynamically removes less relevant context tokens. This modular pruning strategy selectively retains the tokens that contribute most to attention computation, significantly reducing processing overhead. The framework also incorporates adaptive RoPE (Rotary Positional Embedding) adjustments, allowing models to generalize to longer sequences without additional training. In addition, InfiniteHiP employs a novel KV cache offloading mechanism that transfers less frequently accessed tokens to host memory while ensuring efficient retrieval. These techniques enable the model to process up to 3 million tokens on a single 48GB GPU, which the authors present as the most scalable long-context inference method to date.
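
The article does not include the authors' offloading code, so the following PyTorch snippet is only a minimal sketch of the general hot/cold idea (the `OffloadedKVBlock` class and its policy are assumptions, not the paper's implementation): cold KV blocks are parked in pinned host RAM and streamed back to the GPU when they are needed again.

```python
import torch

class OffloadedKVBlock:
    """Minimal hot/cold KV block: cold blocks live in pinned host RAM
    and are streamed back to the GPU on demand. Illustrative only."""

    def __init__(self, k: torch.Tensor, v: torch.Tensor):
        self.k, self.v = k, v

    def offload(self) -> None:
        # pin_memory() keeps the host copy page-locked, so later
        # host-to-device transfers can be fast and asynchronous
        self.k = self.k.cpu().pin_memory()
        self.v = self.v.cpu().pin_memory()

    def fetch(self, device: str = "cuda"):
        # non_blocking=True lets the copy overlap with GPU compute
        self.k = self.k.to(device, non_blocking=True)
        self.v = self.v.to(device, non_blocking=True)
        return self.k, self.v
```

Unlike eviction, nothing here is lost: a block that was offloaded can always be fetched back if a later query turns out to need it.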

The core innovation of InfiniteHiP is its multi-stage pruning mechanism, which progressively refines context selection across stages. Tokens are first divided into fixed-length chunks, and each chunk is scored by its contribution to the attention computation. A top-K selection step retains only the most critical chunks and drops the rest. Unlike other hierarchical pruning models, InfiniteHiP's pruning is entirely parallelized, making it computationally efficient. The KV cache management system optimizes memory utilization by dynamically offloading less important context tokens while keeping them retrievable. The model also applies different RoPE interpolation methods at different attention layers, facilitating smooth adaptation to long sequences.
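
A simplified, single-query sketch of such multi-stage chunked top-K selection might look as follows (the chunk size, scoring proxy, and stage schedule are illustrative assumptions; the real InfiniteHiP kernel is a parallelized GPU implementation):

```python
import torch

def hierarchical_prune(q: torch.Tensor, k: torch.Tensor,
                       chunk: int = 64, top_k: int = 8,
                       stages: int = 2) -> torch.Tensor:
    """Toy multi-stage pruning: split keys into fixed-length chunks,
    score each chunk by a cheap proxy for its attention contribution,
    keep the top-K chunks, then repeat at finer granularity."""
    idx = torch.arange(k.size(0))            # global token indices
    for _ in range(stages):
        n = (k.size(0) // chunk) * chunk     # drop ragged tail for clarity
        kc = k[:n].view(-1, chunk, k.size(-1))
        scores = (kc @ q).max(dim=1).values  # proxy: best dot product per chunk
        keep = scores.topk(min(top_k, kc.size(0))).indices
        mask = (keep[:, None] * chunk + torch.arange(chunk)).flatten()
        k, idx = k[:n][mask], idx[:n][mask]  # survivors go to the next stage
        chunk = max(chunk // 2, 1)           # finer chunks each stage
    return idx                               # token positions to attend to

# e.g. q = torch.randn(128); k = torch.randn(100_000, 128)
# sparse attention would then use only k[hierarchical_prune(q, k)]
```

The point of the chunked proxy score is that relevance can be estimated per chunk at a small fraction of the cost of full attention, so most of the context is never touched by the expensive computation.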

The model demonstrates an 18.95× speedup in attention decoding for a one-million-token context compared to traditional methods, without any additional training. The KV cache offloading technique reduces GPU memory consumption by up to 96%, making the approach practical for large-scale applications. In benchmark evaluations such as LongBench and ∞Bench, InfiniteHiP consistently outperforms state-of-the-art methods, achieving a 9.99% higher relative score than InfLLM. Decoding throughput also increases by 3.2× on consumer GPUs (RTX 4090) and 7.25× on enterprise-grade GPUs (L40S).

In conclusion, the research team addressed the major bottlenecks of long-context inference with InfiniteHiP. The framework extends LLM capabilities by integrating hierarchical token pruning, KV cache offloading, and RoPE generalization, enabling pre-trained models to process extended sequences without losing context or incurring prohibitive computational costs. The method is scalable, hardware-efficient, and applicable to AI applications that require long-memory retention.


Check out the Paper, Source Code, and Live Demo. All credit for this research goes to the researchers of this project.
