
    Alibaba Introduces Group Sequence Policy Optimization (GSPO): An Efficient Reinforcement Learning Algorithm that Powers the Qwen3 Models

    August 7, 2025

    Reinforcement learning (RL) plays a crucial role in scaling language models, enabling them to solve complex tasks such as competition-level mathematics and programming through deeper reasoning. However, achieving stable and reliable training dynamics is a challenge when scaling RL with larger computational resources. Current state-of-the-art algorithms, such as GRPO, struggle with serious stability issues when training gigantic language models, often resulting in catastrophic failures. These instabilities arise from the misapplication of importance sampling weights, which introduces high-variance noise; this noise accumulates over longer responses, is amplified by clipping mechanisms, and ultimately causes model collapse and stalls progress.
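
    To see why this noise grows with response length, note that a sequence-level importance weight is a product of per-token ratios, so small per-token fluctuations compound. The following toy simulation (my illustration, not code from the paper) shows per-token ratios that each wobble only slightly around 1 producing a sequence weight whose variance explodes as responses get longer:

        import math
        import random

        random.seed(0)

        def sequence_weight(length, sigma=0.05):
            # Product of per-token ratios exp(N(0, sigma^2)); the log-weight is a
            # random walk, so its variance grows linearly with response length.
            log_w = sum(random.gauss(0.0, sigma) for _ in range(length))
            return math.exp(log_w)

        for length in (10, 100, 1000):
            samples = [sequence_weight(length) for _ in range(2000)]
            mean = sum(samples) / len(samples)
            var = sum((w - mean) ** 2 for w in samples) / len(samples)
            print(f"length={length:4d}  mean~{mean:.2f}  variance~{var:.2f}")

    This is exactly the variance that GSPO's length normalization (described below) is designed to tame, while GRPO's per-token weights inject the same per-token noise directly into the gradient.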

    Existing methods like PPO and GRPO rely on mechanisms like clipping to address off-policy learning challenges, where responses are drawn from outdated policies. However, these approaches face limitations due to their ill-posed objectives, particularly in large models handling long-response tasks. GRPO’s token-level importance sampling introduces high-variance noise that can trigger irreversible model collapse. Attempts to recover from collapse through hyperparameter tuning or checkpoint restoration fail, highlighting a fundamental design flaw. The mismatch between token-level corrections and sequence-level rewards underscores the need for a new approach that optimizes directly at the sequence level to ensure stability and scalability.
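
    To make the mismatch concrete, the sketch below (my notation, assuming access to per-token log-probabilities under the current and old policies) contrasts the three candidate ratios: GRPO's per-token ratios, the naive full-sequence product, and a length-normalized sequence ratio of the kind GSPO adopts:

        import torch

        # Per-token log-probs for one sampled response of length T = 4
        # (illustrative numbers; real values come from the two policies).
        logp_new = torch.tensor([-1.2, -0.7, -2.1, -0.9])  # current policy
        logp_old = torch.tensor([-1.0, -0.8, -1.9, -1.1])  # rollout policy

        delta = logp_new - logp_old
        token_ratios = torch.exp(delta)         # GRPO: one noisy weight per token
        product_ratio = torch.exp(delta.sum())  # naive sequence ratio: variance grows with T
        seq_ratio = torch.exp(delta.mean())     # length-normalized sequence-level ratio

        print(token_ratios, product_ratio, seq_ratio)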

    Researchers from Alibaba Inc. have proposed Group Sequence Policy Optimization (GSPO), an RL algorithm designed to train LLMs. GSPO’s main innovation lies in its theoretically grounded importance ratio, derived from sequence likelihood, which aligns with the principles of importance sampling. Moreover, it calculates normalized rewards as advantages for multiple responses to a query, promoting consistency between sequence-level rewards and optimization goals. Empirical evaluations reveal that GSPO significantly outperforms GRPO in stability, efficiency, and overall performance. By resolving stability challenges in training large Mixture-of-Experts (MoE) models, GSPO eliminates the need for complex stabilization techniques.
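
    Putting these pieces together, here is a minimal sketch of a GSPO-style objective for one query with a group of G sampled responses. The function name, tensor layout, and padding handling are my simplifications; only the length-normalized sequence ratio, group-normalized advantages, and sequence-level clipping follow the description above (the default clipping ranges are the values reported in the next paragraph):

        import torch

        def gspo_loss(logp_new, logp_old, rewards, eps_low=3e-4, eps_high=4e-4):
            # logp_new, logp_old: (G, T) per-token log-probs for G responses
            #   under the current and the rollout policy (old policy detached).
            # rewards: (G,) scalar reward, one per response.
            # Length-normalized sequence-level importance ratio.
            s = torch.exp((logp_new - logp_old).mean(dim=-1))          # (G,)
            # Group-normalized rewards serve as the advantages.
            adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)
            # Clip the whole response's ratio, not individual tokens.
            s_clipped = torch.clamp(s, 1.0 - eps_low, 1.0 + eps_high)
            return -torch.min(s * adv, s_clipped * adv).mean()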

    For the experiments, the researchers use a cold-start model fine-tuned from Qwen3-30B-A3B-Base, reporting training reward curves and model performance curves across the AIME’24, LiveCodeBench, and CodeForces benchmarks. During training, the rollout data in each batch is split into four mini-batches for gradient updates. GSPO clips entire responses rather than individual tokens, with clipping ranges set to 3e-4 and 4e-4 in its formulation. This leads to a two-order-of-magnitude difference in the fraction of clipped tokens compared to GRPO. Despite removing far more tokens from the gradient estimate, GSPO achieves higher training efficiency, underscoring how noisy GRPO’s token-level estimates are.
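
    Because clipping a single sequence discards all of its tokens from the gradient, the token-level clipped fraction being compared can be measured roughly as below (my instrumentation, not the authors' code):

        import torch

        def clipped_token_fraction(s, lengths, eps_low=3e-4, eps_high=4e-4):
            # s: (G,) sequence-level ratios; lengths: (G,) response lengths.
            # A sequence outside the clipping range drops all of its tokens.
            clipped = (s < 1.0 - eps_low) | (s > 1.0 + eps_high)
            return (clipped.float() * lengths.float()).sum() / lengths.sum()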

    GSPO offers significant advantages for MoE training: by keeping expert activations consistent across gradient updates, it stabilizes a process in which GRPO struggles with expert-activation volatility. This removes the need for workarounds like Routing Replay, simplifying the infrastructure and allowing models to use their full capacity. On the infrastructure side, GSPO’s sequence-level optimization reduces the dependency on token-level likelihoods, making it more robust to the precision mismatch between training and inference engines. This permits direct use of inference-engine likelihoods, avoiding costly recomputation and improving efficiency in partial rollouts and multi-turn RL, and it streamlines RL infrastructure for large-scale language model training overall.

    In conclusion, researchers introduced Group Sequence Policy Optimization (GSPO), an RL algorithm designed for training LLMs. GSPO builds on the principles of importance sampling and introduces sequence-level clipping, rewarding, and optimization to overcome the instability and inefficiency seen in GRPO. Its superior training stability, efficiency, and scalability, particularly for MoE models, make it a strong algorithmic foundation. The advances enabled by GSPO have played a key role in the remarkable performance of the Qwen3 models, and building on GSPO as a foundational approach, the researchers plan to expand their RL methods, opening the door to further progress in AI.


    Check out the Paper.

    The post Alibaba Introduces Group Sequence Policy Optimization (GSPO): An Efficient Reinforcement Learning Algorithm that Powers the Qwen3 Models appeared first on MarkTechPost.
