
    Alibaba Introduces Group Sequence Policy Optimization (GSPO): An Efficient Reinforcement Learning Algorithm that Powers the Qwen3 Models

    August 7, 2025

    Reinforcement learning (RL) plays a crucial role in scaling language models, enabling them to solve complex tasks such as competition-level mathematics and programming through deeper reasoning. However, achieving stable and reliable training dynamics is a challenge when scaling RL with larger computational resources. Current state-of-the-art algorithms, such as GRPO, struggle with serious stability issues when training gigantic language models, often resulting in catastrophic failures. These instabilities arise from the misapplication of importance sampling weights, which introduces high-variance noise; this noise accumulates over longer responses, is amplified by clipping mechanisms, and ultimately causes model collapse and stalls progress.
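
    To see why this noise grows with response length, note that a sequence-level importance weight is a product of per-token ratios, so small per-token fluctuations compound. The following toy simulation (my illustration, not code from the paper) shows per-token ratios that each wobble only slightly around 1 producing a sequence weight whose variance explodes as responses get longer:

        import math
        import random

        random.seed(0)

        def sequence_weight(length, sigma=0.05):
            # Product of per-token ratios exp(N(0, sigma^2)); the log-weight is a
            # random walk, so its variance grows linearly with response length.
            log_w = sum(random.gauss(0.0, sigma) for _ in range(length))
            return math.exp(log_w)

        for length in (10, 100, 1000):
            samples = [sequence_weight(length) for _ in range(2000)]
            mean = sum(samples) / len(samples)
            var = sum((w - mean) ** 2 for w in samples) / len(samples)
            print(f"length={length:4d}  mean~{mean:.2f}  variance~{var:.2f}")

    This is exactly the variance that GSPO's length normalization (described below) is designed to tame, while GRPO's per-token weights inject the same per-token noise directly into the gradient.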

    Existing methods like PPO and GRPO rely on mechanisms like clipping to address off-policy learning challenges, where responses are drawn from outdated policies. However, these approaches face limitations due to their ill-posed objectives, particularly in large models handling long-response tasks. GRPO’s token-level importance sampling introduces high-variance noise that can trigger irreversible model collapse. Attempts to recover from collapse through hyperparameter tuning or checkpoint restoration fail, highlighting a fundamental design flaw. The mismatch between token-level corrections and sequence-level rewards underscores the need for a new approach that optimizes directly at the sequence level to ensure stability and scalability.
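
    To make the mismatch concrete, the sketch below (my notation, assuming access to per-token log-probabilities under the current and old policies) contrasts the three candidate ratios: GRPO's per-token ratios, the naive full-sequence product, and a length-normalized sequence ratio of the kind GSPO adopts:

        import torch

        # Per-token log-probs for one sampled response of length T = 4
        # (illustrative numbers; real values come from the two policies).
        logp_new = torch.tensor([-1.2, -0.7, -2.1, -0.9])  # current policy
        logp_old = torch.tensor([-1.0, -0.8, -1.9, -1.1])  # rollout policy

        delta = logp_new - logp_old
        token_ratios = torch.exp(delta)         # GRPO: one noisy weight per token
        product_ratio = torch.exp(delta.sum())  # naive sequence ratio: variance grows with T
        seq_ratio = torch.exp(delta.mean())     # length-normalized sequence-level ratio

        print(token_ratios, product_ratio, seq_ratio)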

    Researchers from Alibaba Inc. have proposed Group Sequence Policy Optimization (GSPO), an RL algorithm designed to train LLMs. GSPO’s main innovation lies in its theoretically grounded importance ratio, derived from sequence likelihood, which aligns with the principles of importance sampling. Moreover, it calculates normalized rewards as advantages for multiple responses to a query, promoting consistency between sequence-level rewards and optimization goals. Empirical evaluations reveal that GSPO significantly outperforms GRPO in stability, efficiency, and overall performance. By resolving stability challenges in training large Mixture-of-Experts (MoE) models, GSPO eliminates the need for complex stabilization techniques.
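
    Putting these pieces together, here is a minimal sketch of a GSPO-style objective for one query with a group of G sampled responses. The function name, tensor layout, and padding handling are my simplifications; only the length-normalized sequence ratio, group-normalized advantages, and sequence-level clipping follow the description above (the default clipping ranges are the values reported in the next paragraph):

        import torch

        def gspo_loss(logp_new, logp_old, rewards, eps_low=3e-4, eps_high=4e-4):
            # logp_new, logp_old: (G, T) per-token log-probs for G responses
            #   under the current and the rollout policy (old policy detached).
            # rewards: (G,) scalar reward, one per response.
            # Length-normalized sequence-level importance ratio.
            s = torch.exp((logp_new - logp_old).mean(dim=-1))          # (G,)
            # Group-normalized rewards serve as the advantages.
            adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)
            # Clip the whole response's ratio, not individual tokens.
            s_clipped = torch.clamp(s, 1.0 - eps_low, 1.0 + eps_high)
            return -torch.min(s * adv, s_clipped * adv).mean()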

    For the experiments, the researchers use a cold-start model fine-tuned from Qwen3-30B-A3B-Base, reporting training reward curves and model performance curves across the AIME’24, LiveCodeBench, and CodeForces benchmarks. During training, the rollout data in each batch is split into four mini-batches for gradient updates. GSPO clips entire responses rather than individual tokens, with clipping ranges set to 3e-4 and 4e-4 in its formulation. This leads to a two-order-of-magnitude difference in the fraction of clipped tokens compared to GRPO. Despite removing far more tokens from the gradient estimate, GSPO achieves higher training efficiency, underscoring how noisy GRPO’s token-level estimates are.
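
    Because clipping a single sequence discards all of its tokens from the gradient, the token-level clipped fraction being compared can be measured roughly as below (my instrumentation, not the authors' code):

        import torch

        def clipped_token_fraction(s, lengths, eps_low=3e-4, eps_high=4e-4):
            # s: (G,) sequence-level ratios; lengths: (G,) response lengths.
            # A sequence outside the clipping range drops all of its tokens.
            clipped = (s < 1.0 - eps_low) | (s > 1.0 + eps_high)
            return (clipped.float() * lengths.float()).sum() / lengths.sum()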

    GSPO offers significant advantages for MoE training: by keeping expert activations consistent across gradient updates, it stabilizes a process in which GRPO struggles with expert-activation volatility. This removes the need for workarounds like Routing Replay, simplifying the infrastructure and allowing models to use their full capacity. On the infrastructure side, GSPO’s sequence-level optimization reduces the dependency on token-level likelihoods, making it more robust to the precision mismatch between training and inference engines. This permits direct use of inference-engine likelihoods, avoiding costly recomputation and improving efficiency in partial rollouts and multi-turn RL, and it streamlines RL infrastructure for large-scale language model training overall.

    In conclusion, researchers introduced Group Sequence Policy Optimization (GSPO), an RL algorithm designed for training LLMs. GSPO builds on the principles of importance sampling and introduces sequence-level clipping, rewarding, and optimization to overcome the instability and inefficiency seen in GRPO. Its superior training stability, efficiency, and scalability, particularly for MoE models, make it a strong algorithmic foundation. The advances enabled by GSPO have played a key role in the remarkable performance of the Qwen3 models, and building on GSPO as a foundational approach, the researchers plan to expand their RL methods, opening the door to further progress in AI.


    Check out the Paper.

    The post Alibaba Introduces Group Sequence Policy Optimization (GSPO): An Efficient Reinforcement Learning Algorithm that Powers the Qwen3 Models appeared first on MarkTechPost.
