Recent advances in LLM capabilities have broadened their usefulness by enabling them to carry out a wider range of general activities autonomously. Yet the existing methods for expressing and running LM programs, while widely used, remain inefficient. Two main obstacles stand in the way of effective LM program use. First, the non-deterministic nature of LLMs makes writing LM programs tedious and complex: incorporating parallelism mechanisms, handling multiple input modalities, parsing brittle outputs, tuning prompts experimentally, and performing substantial string manipulation are all routine tasks in LM software development, and this complexity greatly diminishes the readability of even the simplest applications. Second, and most crucially, LM program execution wastes memory and compute because of redundant calculations.
To tackle these problems, a group of researchers from Stanford University, UC Berkeley, Shanghai Jiao Tong University, and Texas A&M University introduced SGLang, a Structured Generation Language for LLMs. The basic premise is to exploit the multi-call structure of LM programs in a systematic way to speed up their execution. The system comprises a front-end language and a back-end runtime: the front end makes LM programs easier to write, while the runtime accelerates their execution. The two components can operate separately or together for the best performance. SGLang provides primitives for controlling parallelism (fork and join) and generation (extend, gen, and select). Because SGLang is embedded in Python and works with its libraries and control flow, users can build sophisticated prompting workflows with natural syntax.
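A minimal sketch of what a program built from these primitives might look like is shown below. The decorator, gen, select, and fork calls follow SGLang's Python frontend as described above, but exact argument names and details may differ between releases; treat this as an illustration rather than canonical API usage.

```python
# Hedged sketch of an SGLang-style program using the primitives named above.
import sglang as sgl

@sgl.function
def triage(s, question):
    # extend: append constant text to the prompt state
    s += "Question: " + question + "\n"
    # select: constrained choice among a fixed set of continuations
    s += "Difficulty: " + sgl.select("difficulty", choices=["easy", "hard"]) + "\n"
    # gen: free-form generation bound to a named variable
    s += "Answer: " + sgl.gen("answer", max_tokens=128, stop="\n") + "\n"
    # fork: explore two independent critiques on parallel copies of the state
    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += f"Critique #{i + 1}: " + sgl.gen("critique", max_tokens=64)
    # join: merge results back by reading the variables captured in each fork
    s += "Critiques: " + " | ".join(f["critique"] for f in forks)
```

Running such a function (for example via its `run()` method against an SGLang backend) hands each primitive to the interpreter, which dispatches them asynchronously as described next.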
The team also presented an interpreter and a compiler for SGLang. The interpreter manages the prompt state as a stream, submitting primitive operations to it for asynchronous execution while handling synchronization and intra-program parallelism appropriately. Further optimizations can be obtained by tracing and compiling the SGLang program. On the runtime side, the researchers propose several new optimizations to speed up the execution of SGLang programs. The first, RadixAttention, enables automatic KV cache reuse across multiple generation calls. Current inference engines discard a request's KV cache once processing finishes, which prevents reuse by subsequent calls and drastically slows execution. Instead, SGLang keeps the KV cache of all requests in a radix tree with an LRU eviction policy, treating the KV cache like a conventional cache. The radix tree enables efficient prefix matching, insertion, and eviction, and a cache-aware scheduling policy lets the runtime exploit diverse reuse patterns.
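To make the idea concrete, here is an illustrative sketch (not the actual SGLang runtime) of a prefix cache keyed by token sequences with LRU eviction. The real RadixAttention stores GPU KV tensors at tree nodes, compresses runs of tokens onto edges, and coordinates eviction with the scheduler; this toy version stores one token per node and only tracks sizes.

```python
# Illustrative RadixAttention-style prefix cache: token sequences live in a tree,
# lookups reuse the longest cached prefix, and leaves are evicted in LRU order.
import time

class Node:
    def __init__(self):
        self.children = {}      # token id -> Node
        self.last_used = 0.0    # LRU timestamp
        # The real system would also hold the KV-cache block(s) for this token here.

class PrefixCache:
    def __init__(self, capacity_tokens):
        self.root = Node()
        self.capacity = capacity_tokens
        self.size = 0

    def match_prefix(self, tokens):
        """Return the length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            node.last_used = time.monotonic()
            matched += 1
        return matched

    def insert(self, tokens):
        """Insert `tokens`, evicting least-recently-used leaves if over capacity."""
        node = self.root
        for t in tokens:
            if t not in node.children:
                node.children[t] = Node()
                self.size += 1
            node = node.children[t]
            node.last_used = time.monotonic()
        while self.size > self.capacity:
            self._evict_lru_leaf()

    def _evict_lru_leaf(self):
        # Remove the least recently used leaf (a prefix no other entry depends on).
        def leaves(node, parent, tok):
            if not node.children:
                yield node, parent, tok
            for t, child in node.children.items():
                yield from leaves(child, node, t)
        victim = min(leaves(self.root, None, None),
                     key=lambda x: x[0].last_used, default=None)
        if victim and victim[1] is not None:
            del victim[1].children[victim[2]]
            self.size -= 1
```

With such a cache, a shared system prompt inserted once lets later requests skip recomputing the KV entries for that prefix; the cache-aware scheduler in the real runtime additionally orders requests to maximize these hits.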
The second technique is a compressed finite state machine, which speeds up constrained decoding of structured outputs. Existing systems enforce constraints only on the next token by masking out the probabilities of forbidden tokens, so they can decode just one token at a time. SGLang instead analyzes the constraint and builds a compressed finite-state machine that merges consecutive single-choice transitions into one shorter path wherever feasible, allowing multiple tokens to be decoded in a single step.
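The sketch below illustrates the core idea on a toy automaton for a tiny JSON-like output: wherever the constraint allows exactly one next token, that forced run is emitted in one step, and the model is only consulted at genuine branch points. The real system works at the tokenizer level and handles masking and retokenization, which this sketch omits.

```python
# Toy "compressed FSM" constrained decoding: state -> {allowed_token: next_state}.
FSM = {
    0: {'{': 1},
    1: {'"name"': 2},
    2: {':': 3},
    3: {'"Alice"': 4, '"Bob"': 4},   # the only genuinely free choice
    4: {'}': 5},
    5: {},                            # accepting state
}

def forced_run(state):
    """Collect the maximal run of tokens the FSM forces from `state`."""
    tokens = []
    while len(FSM[state]) == 1:
        tok, nxt = next(iter(FSM[state].items()))
        tokens.append(tok)
        state = nxt
    return tokens, state

def constrained_decode(model_pick):
    """Decode one constrained output; `model_pick` chooses among allowed tokens."""
    state, output = 0, []
    while FSM[state]:
        forced, state = forced_run(state)
        output.extend(forced)          # emitted in a single step, no model calls
        if FSM[state]:                 # a real branch: query the model once
            tok = model_pick(list(FSM[state]))
            output.append(tok)
            state = FSM[state][tok]
    return output

# Example: the model is queried only at the single branching point.
print(constrained_decode(lambda allowed: allowed[0]))
# -> ['{', '"name"', ':', '"Alice"', '}']
```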
Finally, SGLang can also optimize multi-call programs that target API-only models such as OpenAI's GPT-4; for this, the team presents a third technique called API speculative execution. Using SGLang, the researchers built LLM applications spanning agent control, reasoning, retrieval-augmented generation pipelines, JSON decoding, multi-turn chat, multi-modality processing, and few-shot learning benchmarks.
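The article only names the technique, but a plausible reading is that one API call is allowed to generate past a gen call's stop point, and the surplus text, if it matches the program's template, fills in later gen calls without extra requests. The sketch below is purely illustrative under that assumption; the function names are hypothetical and not the SGLang API.

```python
# Hedged sketch of speculative execution for a two-field template "Name: ... / Job: ...".
# Ideally one over-generating API call fills both fields; otherwise fall back to one
# call per field, as a plain interpreter would do.
import re

def fill_name_and_job(api_generate):
    """`api_generate(prompt, max_tokens)` stands in for an API-only model call."""
    # Speculative call: over-generate instead of stopping at the first newline.
    text = api_generate("Name:", max_tokens=64)
    m = re.match(r"\s*(.+)\nJob:\s*(.+)", text)
    if m:
        # Speculation succeeded: both fields recovered from a single call.
        return {"name": m.group(1).strip(), "job": m.group(2).strip()}
    # Fallback: one call per field.
    name = api_generate("Name:", max_tokens=16).split("\n")[0].strip()
    job = api_generate(f"Name: {name}\nJob:", max_tokens=16).split("\n")[0].strip()
    return {"name": name, "job": job}

# Stubbed backend for demonstration:
fake = lambda prompt, max_tokens: " Alice\nJob: Engineer\n" if prompt == "Name:" else "?"
print(fill_name_and_job(fake))  # -> {'name': 'Alice', 'job': 'Engineer'}
```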
The team evaluated performance on NVIDIA A10G and A100 GPUs with various models, including Llama-7B/70B, Mixtral-8x7B, LLaVA-v1.5-7B (image), and LLaVA-NeXT-34B (video). The experimental results show that SGLang outperforms existing programming and inference systems such as Guidance, vLLM, and LMQL, achieving up to 6.4× higher throughput across a range of workloads, models, and hardware configurations.
Although SGLang has come a long way, some limitations remain and point to interesting directions for future research: adding support for more output modalities, extending RadixAttention across levels of the memory hierarchy (e.g., DRAM and disk), enabling RadixAttention to use fuzzy semantic matching, providing higher-level primitives, fixing the starvation problem in cache-aware scheduling, and improving the SGLang compiler's scheduling and memory planning, among other advanced static optimizations.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.