
    Alibaba Introduces Group Sequence Policy Optimization (GSPO): An Efficient Reinforcement Learning Algorithm that Powers the Qwen3 Models

    August 7, 2025

    Reinforcement learning (RL) plays a crucial role in scaling language models, enabling them to solve complex tasks such as competition-level mathematics and programming through deeper reasoning. However, achieving stable and reliable training dynamics is challenging when scaling RL to larger computational budgets. Current state-of-the-art algorithms, such as GRPO, suffer from serious stability issues when training very large language models, often resulting in catastrophic failures. These instabilities stem from the misapplication of token-level importance sampling weights, which introduces high-variance noise; the noise accumulates over longer responses and is amplified by the clipping mechanism, causing model collapse and stalling progress.
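The compounding of per-token importance ratios can be illustrated with a toy simulation (a hypothetical noise model of our own, not the paper's setup): when each token contributes a small independent log-ratio perturbation, the spread of the cumulative sequence ratio grows with response length.

```python
import numpy as np

# Hypothetical illustration: per-token log importance ratios with small noise.
# Over a length-T response, token-level weights compound multiplicatively, so
# the spread of the cumulative sequence ratio grows with T.
rng = np.random.default_rng(0)

def sequence_ratio_spread(T, n_samples=10_000, sigma=0.05):
    # Each token contributes an i.i.d. log-ratio ~ N(0, sigma^2).
    log_ratios = rng.normal(0.0, sigma, size=(n_samples, T))
    seq_ratio = np.exp(log_ratios.sum(axis=1))  # product of per-token ratios
    return seq_ratio.std()

for T in (10, 100, 1000):
    print(f"T={T:4d}  std of cumulative ratio ~ {sequence_ratio_spread(T):.3f}")
```

The monotone growth of the spread with `T` mirrors the claim above: longer responses accumulate more variance under token-level weighting.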

    Existing methods such as PPO and GRPO rely on clipping to handle off-policy learning, where responses are sampled from outdated policies. However, these approaches face limitations due to their ill-posed objectives, particularly for large models on long-response tasks. GRPO’s token-level importance sampling introduces high-variance noise that can trigger irreversible model collapse. Attempts to recover from collapse through hyperparameter tuning or checkpoint restoration fail, pointing to a fundamental design flaw: a mismatch between token-level corrections and sequence-level rewards. This motivates a new approach that optimizes directly at the sequence level to ensure stability and scalability.

    Researchers at Alibaba have proposed Group Sequence Policy Optimization (GSPO), an RL algorithm designed for training large language models (LLMs). GSPO’s main innovation is its theoretically grounded importance ratio, derived from the sequence likelihood, which aligns with the principles of importance sampling. It also computes normalized rewards as advantages across multiple responses to a query, keeping sequence-level rewards consistent with the optimization objective. Empirical evaluations show that GSPO significantly outperforms GRPO in stability, efficiency, and overall performance. By resolving the stability challenges of training large Mixture-of-Experts (MoE) models, GSPO eliminates the need for complex stabilization techniques.
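A minimal sketch of GSPO's core quantities, based on our reading of the description above (function names and the small epsilon constant are illustrative, not the authors' code): the importance ratio is a length-normalized sequence likelihood ratio, and advantages are group-normalized rewards.

```python
import numpy as np

def gspo_importance_ratio(logp_new, logp_old):
    """Sequence-level ratio: (pi_new(y|x) / pi_old(y|x)) ** (1 / len(y)),
    computed from per-token log-probabilities of one response."""
    T = len(logp_new)
    return np.exp((np.sum(logp_new) - np.sum(logp_old)) / T)

def group_advantages(rewards):
    """Normalize rewards within the group of responses to one query."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def gspo_objective(ratios, advantages, eps=3e-4):
    """Clipped sequence-level surrogate, averaged over the group."""
    ratios = np.asarray(ratios)
    clipped = np.clip(ratios, 1 - eps, 1 + eps)
    return np.mean(np.minimum(ratios * advantages, clipped * advantages))
```

The length normalization in the exponent keeps the ratio on a comparable scale across responses of very different lengths, which is one way the sequence-level formulation avoids the variance blow-up of per-token ratios.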

    For their experiments, the researchers fine-tune a cold-start model from Qwen3-30B-A3B-Base, reporting training reward curves and model performance on the AIME’24, LiveCodeBench, and CodeForces benchmarks. During training, the rollout data in each batch is split into four mini-batches for gradient updates. GSPO clips entire responses rather than individual tokens, with clipping ranges set to 3e-4 and 4e-4 in its formulation, which yields a clipped-token fraction that differs from GRPO’s by two orders of magnitude. Despite excluding more tokens from the gradient estimate, GSPO achieves higher training efficiency, underscoring how noisy GRPO’s token-level estimates are.
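The mechanics of response-level clipping can be sketched as follows (our illustration with made-up ratio values): when a response's sequence ratio leaves the trust band, every token of that response is dropped from the gradient estimate at once.

```python
import numpy as np

def response_clip_mask(seq_ratios, lengths, eps=4e-4):
    """Per-token boolean mask: True means the token contributes a gradient.
    A response is kept or dropped as a whole based on its sequence ratio."""
    keep = (seq_ratios >= 1 - eps) & (seq_ratios <= 1 + eps)  # per response
    return np.repeat(keep, lengths)  # broadcast the decision to each token

seq_ratios = np.array([1.0001, 1.0010, 0.9998])  # hypothetical values
lengths = np.array([3, 2, 4])                    # tokens per response
mask = response_clip_mask(seq_ratios, lengths)
# The middle response (ratio 1.0010 > 1 + 4e-4) loses all of its tokens.
```

Because whole responses are excluded together, the surviving gradient signal stays internally consistent, even though the total number of clipped tokens is larger than under per-token clipping.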

    GSPO offers significant advantages for MoE training: it stabilizes the process by keeping expert activations consistent across gradient updates, whereas GRPO suffers from expert-activation volatility. This removes the need for workarounds such as Routing Replay, simplifying the infrastructure and allowing models to use their full capacity. On the infrastructure side, GSPO’s sequence-level optimization reduces dependence on token-level likelihoods, making it more robust to precision mismatches. This enables direct use of inference-engine likelihoods, avoiding costly recomputation and improving efficiency in partial rollouts and multi-turn RL, streamlining RL infrastructure for large-scale language model training overall.

    In conclusion, Group Sequence Policy Optimization (GSPO) is an RL algorithm designed for training LLMs. It builds on the principles of importance sampling and introduces sequence-level clipping, rewarding, and optimization to overcome the instability and inefficiency observed with GRPO. Its superior training stability, efficiency, and scalability, particularly for MoE models, establish it as a strong algorithmic foundation. The advances enabled by GSPO have played a key role in the remarkable performance of the Qwen3 models, and the researchers plan to build on it as they expand their RL methods, opening the door to further progress in AI.


    Check out the Paper.

    The post Alibaba Introduces Group Sequence Policy Optimization (GSPO): An Efficient Reinforcement Learning Algorithm that Powers the Qwen3 Models appeared first on MarkTechPost.
