    Training LLM Agents Just Got More Stable: Researchers Introduce StarPO-S and RAGEN to Tackle Multi-Turn Reasoning and Collapse in Reinforcement Learning

    May 2, 2025

    Large language models (LLMs) face significant challenges when trained as autonomous agents in interactive environments. Unlike static tasks, agent settings require sequential decision-making, cross-turn memory maintenance, and adaptation to stochastic environmental feedback. These capabilities are essential for developing effective planning assistants, robotics applications, and tutoring agents that can self-improve through experience. While reinforcement learning (RL) has been applied to LLMs using rule-based rewards, training self-evolving agents that can reason and adapt remains underexplored. Current approaches suffer from training instability, difficulty interpreting complex reward signals, and limited generalisation across varying prompts or changing environments, particularly during multi-turn interactions with unpredictable feedback. This raises a fundamental question: which design elements are crucial for creating LLM agents that learn effectively and remain stable throughout their evolution?

    Through diverse methodologies, RL has significantly advanced LLMs’ reasoning capabilities. PPO maintains training stability by clipping policy updates, while GRPO enhances systematic problem-solving abilities. SAC employs entropy-regularised objectives for robust exploration, and meta tokens facilitate structured thinking. PRM- and MCTS-based approaches have further improved systematic reasoning. Chain-of-thought techniques like STaR iteratively bootstrap from small sets of rationale examples alongside larger datasets, while DAPO, Dr. GRPO, and Open Reasoner Zero demonstrate that minimalist RL techniques with decoupled clipping and simple reward schemes can substantially enhance reasoning performance.
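
    The clipping idea above is simple enough to show directly. The sketch below is an illustrative PyTorch implementation of a PPO-style clipped surrogate loss over token log-probabilities, with separate lower and upper clip bounds to hint at the decoupled clipping used by DAPO; it is not code taken from any of the cited papers.

    import torch

    def clipped_surrogate(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
        # Probability ratio between the updated and the behaviour policy, per token.
        ratio = torch.exp(logp_new - logp_old)
        unclipped = ratio * advantages
        # Clipping bounds how far a single update can move the policy;
        # eps_low != eps_high gives the decoupled variant.
        clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
        # Take the pessimistic of the two and negate, so minimising this loss
        # performs the clipped policy-gradient ascent step.
        return -torch.min(unclipped, clipped).mean()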

    LLM agent architectures have evolved from basic reasoning-action frameworks to structured planning approaches and complex multi-agent systems. Testing environments range from specialised platforms like Sokoban and FrozenLake to general-purpose frameworks like HuggingGPT, enabling applications from web navigation to coding assistance and embodied tasks. Despite these advances, challenges persist in architectural complexity and self-correction, particularly for diverse multi-step reasoning tasks where maintaining coherence across interactions remains problematic.

    Researchers have approached agent learning through StarPO (State-Thinking-Actions-Reward Policy Optimisation), a unified framework for trajectory-level agent training with flexible control over reasoning processes, reward mechanisms, and prompt structures. Building on this framework, they developed RAGEN, a modular system implementing complete training loops for analysing LLM agent dynamics in multi-turn stochastic environments. To isolate learning factors from confounding variables like pretrained knowledge, evaluation focuses on three controlled gaming environments: Bandit (single-turn, stochastic), Sokoban (multi-turn, deterministic), and Frozen Lake (multi-turn, stochastic). These minimalistic environments require policy learning through interaction rather than relying on pre-existing knowledge. The analysis reveals three critical dimensions of agent learning: gradient stability issues in multi-turn reinforcement learning, the importance of rollout frequency and diversity in shaping agent evolution, and the need for carefully designed reward signals to develop genuine reasoning capabilities rather than shallow action selection or hallucinated thinking processes.
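
    To make the training setup concrete, here is a minimal sketch of the kind of multi-turn rollout collection a system like RAGEN has to perform in environments such as Sokoban or FrozenLake. The env and agent interfaces below are hypothetical placeholders, not the actual RAGEN API.

    def collect_trajectory(env, agent, max_turns=10):
        # One complete multi-turn episode: the agent observes, reasons, acts,
        # and receives stochastic feedback until the episode ends.
        obs = env.reset()
        trajectory = []  # (observation, reasoning, action, reward) per turn
        for _ in range(max_turns):
            reasoning, action = agent.act(obs)        # structured output: think, then act
            next_obs, reward, done = env.step(action)
            trajectory.append((obs, reasoning, action, reward))
            obs = next_obs                            # memory carries across turns
            if done:
                break
        return trajectory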

    StarPO represents a unique framework designed specifically for optimising multi-turn interaction trajectories in LLM agents. Unlike traditional approaches that treat each action independently, StarPO optimises entire trajectories—including observations, reasoning traces, actions, and feedback—as coherent units. This trajectory-level approach is particularly suited for interactive environments where agents must maintain memory across turns and adapt to stochastic feedback. StarPO’s objective function focuses on maximising expected rewards across complete trajectories rather than individual steps, making it directly compatible with autoregressive LLMs through decomposition into token-level likelihoods. The framework integrates reasoning-guided structured outputs that combine both intermediate thinking processes and executable actions, enabling agents to develop more sophisticated decision-making capabilities while maintaining learning stability in complex environments.
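
    As a rough picture of what a trajectory-level objective looks like once it is decomposed into token-level likelihoods, the sketch below weights the summed log-probability of every generated token in a rollout by the trajectory’s return. This is a REINFORCE-style simplification for illustration, not StarPO’s exact loss.

    import torch

    def trajectory_loss(token_logps, trajectory_return):
        # token_logps: log-probabilities of every generated token (reasoning and
        # actions) across all turns of one rollout; trajectory_return: scalar
        # reward for the whole trajectory. The autoregressive decomposition
        # means the trajectory log-probability is just the sum over tokens.
        logp_trajectory = token_logps.sum()
        # Negate so that minimising the loss maximises expected trajectory return.
        return -(trajectory_return * logp_trajectory)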

    Experimental results reveal that StarPO-S significantly outperforms vanilla StarPO across multiple agent tasks. By implementing uncertainty-based instance filtering, KL term removal, and asymmetric clipping, StarPO-S effectively delays performance collapse and enhances final task outcomes. The stabilised approach demonstrates particular effectiveness in complex environments like FrozenLake and Sokoban, where retaining only 25-50% of high-variance rollouts dramatically improves training stability while reducing computational requirements by up to 50%.
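
    A rough sketch of what uncertainty-based instance filtering could look like: for each prompt, several rollouts are sampled, and only the prompts with the highest reward variance (the most “uncertain”, and therefore most informative, instances) are kept for the policy update. The grouping and keep fraction below are assumptions for illustration, not the exact StarPO-S procedure.

    import numpy as np

    def filter_uncertain_prompts(grouped_rewards, keep_fraction=0.25):
        # grouped_rewards: dict mapping prompt id -> list of rewards obtained
        # from several rollouts of that prompt.
        variances = {pid: np.var(rewards) for pid, rewards in grouped_rewards.items()}
        n_keep = max(1, int(len(variances) * keep_fraction))
        # Keep the prompts whose rollouts disagree the most; the rest are
        # dropped before the update, cutting compute while stabilising training.
        ranked = sorted(variances, key=variances.get, reverse=True)
        return ranked[:n_keep]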

    Task diversity and interaction granularity significantly impact performance. Models trained with higher task diversity and 4-6 actions per turn demonstrate superior generalisation capabilities across novel vocabulary and larger environments. Frequent rollout updates also prove critical for maintaining alignment between optimisation targets and policy behaviour: agents trained with fresh rollouts every 1-10 updates achieve faster convergence and higher success rates than those relying on outdated trajectory data.

    Symbolic reasoning benefits vary substantially between single-turn and multi-turn tasks. While reasoning traces significantly improve generalisation in single-turn Bandit environments, they provide limited advantage in complex multi-turn settings like Sokoban and FrozenLake. Analysis shows reasoning length consistently declines during training, suggesting models gradually suppress their thought processes when rewards are sparse and delayed. This highlights the need for reward mechanisms that directly reinforce intermediate reasoning steps rather than relying solely on outcome-based feedback.
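
    One way such a reward mechanism could look, offered purely as an assumption rather than anything proposed in the paper, is to add a small bonus whenever the agent produces a non-trivial reasoning trace, so that sparse outcome rewards do not drive reasoning length to zero during training.

    def shaped_reward(outcome_reward, reasoning_trace, min_length=20, bonus=0.1):
        # Credit the intermediate reasoning step directly: a small bonus is
        # granted only if the agent actually emitted a reasoning trace of
        # non-trivial length (min_length and bonus are illustrative values).
        has_reasoning = reasoning_trace is not None and len(reasoning_trace.strip()) >= min_length
        return outcome_reward + (bonus if has_reasoning else 0.0)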

    This research establishes reinforcement learning as a viable approach for training language agents in complex, stochastic environments. StarPO-S represents a significant advancement in stabilising multi-turn agent training through uncertainty-based sampling and exploration encouragement. By transitioning from human supervision to verifiable outcome-based rewards, this framework creates opportunities for developing more capable AI systems across theorem proving, software engineering, and scientific discovery. Future work should focus on multi-modal inputs, enhanced training efficiency, and applications to increasingly complex domains with verifiable objectives.


    Check out the Paper and GitHub Page.

    The post Training LLM Agents Just Got More Stable: Researchers Introduce StarPO-S and RAGEN to Tackle Multi-Turn Reasoning and Collapse in Reinforcement Learning appeared first on MarkTechPost.
