
    SGLang: A Structured Generation Language for Efficient Execution of Complex Language Model Programs

    July 28, 2024

Recent advances in LLM capabilities have broadened their usability by enabling them to perform a wider range of general tasks autonomously. Yet the existing methods for expressing and running LM programs, although widely used, remain inefficient. Two main obstacles stand in the way of effective LM program use. First, the non-deterministic character of LLMs makes programming LM programs tedious and complex: LM software development routinely involves parallelism mechanisms, multiple input modalities, brittle output parsing, experimental prompt tuning, and substantial string manipulation, which greatly diminishes the readability of even the most basic applications. Second, and most crucially, LM program execution wastes memory and computational resources on redundant calculations.

A group of researchers from Stanford University, UC Berkeley, Shanghai Jiao Tong University, and Texas A&M University introduced SGLang, a Structured Generation Language for LLMs, to address these problems. The basic premise is to exploit the multi-call structure of LM programs systematically to speed up their execution. The system comprises a front-end language and a back-end runtime: the front end simplifies LM program development, while the runtime accelerates execution, and the two can operate separately or together for optimal performance. SGLang provides primitives for generation (extend, gen, and select) and for controlling parallelism (fork and join). Because SGLang interoperates with Python's libraries and control flow, users can build sophisticated prompting workflows in natural Python syntax.
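To make the primitives concrete, here is a minimal, hypothetical sketch of what extend, gen, select, and fork do to a growing prompt state. The class and the stubbed model below are illustrative only, not SGLang's actual API (which exposes these primitives through decorated Python functions).

```python
# Hypothetical sketch of SGLang-style primitives over a prompt state.
# StubModel stands in for a real LLM backend so the control flow runs.

class PromptState:
    """Holds the growing prompt; primitives append to it or branch it."""
    def __init__(self, text=""):
        self.text = text

    def extend(self, chunk):
        # extend: append literal text to the prompt.
        self.text += chunk
        return self

    def gen(self, model, max_tokens=16):
        # gen: append model-generated text.
        self.text += model(self.text, max_tokens)
        return self

    def select(self, model, choices):
        # select: append the candidate the model scores highest.
        best = max(choices, key=lambda c: model.score(self.text, c))
        self.text += best
        return self

    def fork(self, n):
        # fork: create n independent branches sharing this prefix;
        # "join" is simply gathering the branches' results afterwards.
        return [PromptState(self.text) for _ in range(n)]


class StubModel:
    def __call__(self, prompt, max_tokens):
        return " <generated>"
    def score(self, prompt, choice):
        return len(choice)  # toy scoring rule for illustration

model = StubModel()
s = PromptState("Q: What is SGLang?\nA:").gen(model)
branches = s.fork(2)
joined = [b.extend(f" branch-{i}").text for i, b in enumerate(branches)]
choice_state = PromptState("Pick:").select(model, ["yes", "maybe"])
```

The fork/join pair is what lets the runtime execute independent branches in parallel while they all reuse the shared prefix.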

The team also presented an interpreter and a compiler for SGLang. The interpreter manages the prompt state as a stream and dispatches primitive operations to it for asynchronous execution, correctly handling synchronization and intra-program parallelism. Further optimization is possible by tracing and compiling SGLang programs. On the runtime side, the researchers propose several new optimizations to speed up execution. The first, RadixAttention, enables automatic KV cache reuse across multiple generation calls. Current inference engines wastefully discard a request's KV cache once processing finishes, making it impossible to reuse the cache for subsequent calls and drastically slowing execution. Instead, SGLang keeps the KV cache of all requests in a radix tree managed as an LRU cache. The radix tree enables efficient prefix matching, insertion, and eviction, so the KV cache is handled much like a conventional cache, and a cache-aware scheduling policy lets the runtime exploit diverse reuse patterns.
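The core idea behind RadixAttention can be sketched with a toy prefix tree: match a new request against the longest cached token prefix, insert finished requests for later reuse, and evict least-recently-used leaves under memory pressure. The KV tensors themselves are stubbed out here (only prefix lengths are tracked), and all names are illustrative rather than SGLang's internals.

```python
# Toy sketch of the RadixAttention idea: per-token-prefix KV reuse
# via a trie with LRU leaf eviction. Real KV tensors are omitted.

class TrieNode:
    def __init__(self):
        self.children = {}   # token -> TrieNode
        self.last_used = 0   # logical clock for LRU eviction

class RadixKVCache:
    def __init__(self):
        self.root = TrieNode()
        self.clock = 0

    def match_prefix(self, tokens):
        """Return the length of the longest cached prefix of `tokens`."""
        self.clock += 1
        node, depth = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            node.last_used = self.clock
            depth += 1
        return depth

    def insert(self, tokens):
        """Cache a finished request's tokens so later calls can reuse them."""
        self.clock += 1
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())
            node.last_used = self.clock

    def evict_lru_leaf(self):
        """Evict the least-recently-used leaf (one cached token's KV)."""
        def leaves(node, path):
            if not node.children:
                yield node, path
            for t, c in node.children.items():
                yield from leaves(c, path + [(node, t)])
        victim, path = min(leaves(self.root, []),
                           key=lambda np: np[0].last_used)
        if path:
            parent, tok = path[-1]
            del parent.children[tok]

cache = RadixKVCache()
cache.insert([1, 2, 3, 4])               # first request's tokens
hit = cache.match_prefix([1, 2, 3, 9])   # new request reuses 3 cached tokens
cache.insert([1, 2, 7])
cache.evict_lru_leaf()                   # drops the coldest leaf (token 4)
```

Evicting leaves rather than interior nodes matters: an interior node's KV state is still a live prefix for the subtrees below it, so only the coldest frontier of the tree can be safely reclaimed.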

The second technique is a compressed finite-state machine, which speeds up constrained decoding of structured outputs. Current systems enforce constraints only on the next token, masking out the probabilities of forbidden tokens, so they can decode just one token at a time. SGLang instead analyzes the constraints and builds a compressed finite-state machine that merges runs of forced tokens into a single, shorter transition wherever possible, allowing multiple tokens to be decoded in one step.

Finally, SGLang can optimize multi-call programs even for API-only models, such as OpenAI's GPT-4, via a third technique called API speculative execution. Using SGLang, the researchers built LLM applications spanning agent control, reasoning, retrieval-augmented generation pipelines, JSON decoding, multi-turn chat, multi-modality processing, and few-shot learning benchmarks.
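One plausible reading of API speculative execution can be sketched as follows: the first gen call requests tokens past its stop condition, and if the speculated continuation matches the program's next literal, the second gen is served from it without a second API round trip. The stubbed endpoint and matching logic below are assumptions for illustration, not the paper's exact mechanism.

```python
# Hypothetical sketch of API speculative execution: one API call
# speculatively covers two gen calls in a multi-call program.

def run_program(api):
    calls = 0
    def call(prompt):
        nonlocal calls
        calls += 1
        return api(prompt)

    # First gen: let the API overshoot past the "." stop condition.
    text = "Name:"
    overshoot = call(text)                   # e.g. " Alice. Age: 30."
    first, _, rest = overshoot.partition(".")
    text += first + "."

    # The program appends a literal, then needs a second gen.
    literal = " Age:"
    if rest.startswith(literal):
        # Speculation hit: reuse the cached continuation, no new call.
        second = rest[len(literal):].partition(".")[0]
    else:
        # Speculation miss: fall back to a second API call.
        second = call(text + literal).partition(".")[0]
    text += literal + second + "."
    return text, calls

result, n_calls = run_program(lambda prompt: " Alice. Age: 30.")
```

When the speculation hits, a two-gen program costs one API call instead of two, which is the whole point for pay-per-token API models.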

On NVIDIA A10G and A100 GPUs, the team evaluated performance with various models, including Llama-7B/70B, Mixtral-8x7B, LLaVA-v1.5-7B (image), and LLaVA-NeXT-34B (video). Based on the experimental results, SGLang outperforms existing programming and inference systems such as Guidance, vLLM, and LMQL, achieving up to 6.4x higher throughput across various workloads, models, and hardware configurations.

Even though SGLang has come a long way, certain limitations remain and point to interesting directions for future research. These include: adding support for more output modalities; extending RadixAttention across levels of the memory hierarchy (e.g., DRAM and disk); enabling RadixAttention with fuzzy semantic matching; adding higher-level primitives to SGLang; fixing the starvation problem in cache-aware scheduling; and improving the SGLang compiler's scheduling and memory planning, among other advanced static optimizations.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

    The post SGLang: A Structured Generation Language for Efficient Execution of Complex Language Model Programs appeared first on MarkTechPost.
