
    Nebius AI Advances Open-Weight LLMs Through Reinforcement Learning for Capable SWE Agents

    August 13, 2025

    The landscape of software engineering automation is evolving rapidly, driven by advances in Large Language Models (LLMs). However, most approaches to training capable agents rely on proprietary models or costly teacher-based methods, leaving open-weight LLMs with limited capabilities in real-world scenarios. A team of researchers from Nebius AI and Humanoid has introduced a reinforcement learning framework for training long-context, multi-turn software engineering agents using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm. The work shows how reinforcement learning (RL) can be applied to open-weight LLMs for genuine, multi-turn software engineering tasks, moving beyond the single-turn, bandit-style settings that dominate RL for LLMs today.

    Beyond Single-Turn Reinforcement Learning

    Most RL methods for LLMs optimize for tasks such as mathematical reasoning or one-shot code generation, where agent actions are rewarded only at the conclusion and environments do not provide intermediate feedback. However, software engineering (SWE) is fundamentally different: it requires agents to operate over long sequences of actions, interpret rich feedback (compiler errors, test logs), and maintain context over hundreds of thousands of tokens—far exceeding typical single-step interaction loops.

    Core Challenges in RL for SWE

    • Long-Horizon Reasoning: Agents must sustain logical coherence across many steps, often requiring context windows beyond 100k tokens.
    • Stateful Environment Feedback: Actions yield meaningful, non-trivial observations (e.g., shell command outputs, test suite results) that guide subsequent decisions.
    • Sparse/Delayed Rewards: Success signals typically emerge only at the end of complex interactions, complicating credit assignment (see the episode sketch after this list).
    • Evaluation Complexity: Measuring progress requires full trajectory unrolling and can be noisy due to test flakiness.
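
    To make these challenges concrete, the sketch below shows the shape of a single multi-turn episode with a sparse terminal reward. The reset/step/run_test_suite interface is an illustrative assumption, not the paper's actual harness.

    ```python
    # Illustrative only: one multi-turn SWE episode with a sparse terminal reward.
    # The env/agent interfaces (reset, step, run_test_suite) are hypothetical
    # stand-ins, not the paper's actual API.

    def run_episode(env, agent, max_steps=100):
        """Roll out a full trajectory; no intermediate rewards are given."""
        observation = env.reset()                 # repo snapshot + issue text
        trajectory = []
        for _ in range(max_steps):
            action = agent.act(observation)       # shell command, code edit, or submit
            observation, done = env.step(action)  # stateful feedback: logs, test output
            trajectory.append((action, observation))
            if done:                              # agent submitted or budget exhausted
                break
        # Sparse/delayed reward: a single scalar from the hidden test suite,
        # available only after the whole trajectory has been unrolled.
        reward = 1.0 if env.run_test_suite() else 0.0
        return trajectory, reward
    ```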

    The Technical Recipe: Modified DAPO and Agent Design

    The research team demonstrates a two-stage learning pipeline for training a Qwen2.5-72B-Instruct agent:

    1. Rejection Fine-Tuning (RFT)

    The journey begins with supervised fine-tuning. The agent is run across 7,249 rigorously filtered SWE tasks from the SWE-rebench dataset. Successful interaction traces, in which the agent passes the environment's test suite, are used to fine-tune the model, with invalid environment-formatting actions masked out of the loss during training. This alone boosts baseline accuracy from 11% to 20% on the SWE-bench Verified benchmark.
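
    A minimal sketch of this stage is shown below. It assumes a Hugging Face-style causal LM and a simple trajectory schema with a per-token loss mask; both are illustrative assumptions, not the released training code.

    ```python
    # Illustrative sketch of Rejection Fine-Tuning (RFT): keep only trajectories
    # that passed the environment's test suite, then run supervised fine-tuning
    # on them while masking out tokens of malformed or invalid actions.
    # The data schema and model interface are assumptions, not the paper's code.
    import torch
    import torch.nn.functional as F

    def build_rft_dataset(trajectories):
        """Rejection step: keep only rollouts that solved the task."""
        return [t for t in trajectories if t["test_suite_passed"]]

    def rft_loss(model, batch):
        """Next-token loss over agent tokens, with invalid actions masked out."""
        logits = model(batch["input_ids"]).logits[:, :-1]   # (B, T-1, V)
        targets = batch["input_ids"][:, 1:]                 # (B, T-1)
        # loss_mask is 0 for prompt/environment tokens and for tokens belonging
        # to actions the environment rejected as malformed, 1 elsewhere.
        mask = batch["loss_mask"][:, 1:].float()
        per_token = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
            reduction="none",
        ).view_as(targets)
        return (per_token * mask).sum() / mask.sum().clamp(min=1)
    ```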

    2. Reinforcement Learning Using Modified DAPO

    Building on Decoupled Advantage Policy Optimization (DAPO), several key modifications are introduced for scalability and stability (a sketch of the resulting objective follows the list):

    • Asymmetric Clipping: Prevents collapse in policy entropy, maintaining exploration.
    • Dynamic Sample Filtering: Focuses optimization on trajectories with actual learning signal.
    • Length Penalties: Discourages excessive episode length, helping the agent avoid getting stuck in loops.
    • Token-Level Averaging: Every token in every trajectory contributes equally to the gradient, empowering longer trajectories to influence updates.
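
    The snippet below is a compact sketch of what such a modified, token-averaged clipped objective could look like. The hyperparameter values, the way the length penalty enters the advantage, and the tensor layout are assumptions made for exposition, not the authors' implementation.

    ```python
    # Illustrative DAPO-style objective with asymmetric clipping, a length
    # penalty, and token-level averaging. Values and layout are assumptions.
    import torch

    def modified_dapo_loss(
        logp_new,          # (B, T) log-probs of generated tokens, current policy
        logp_old,          # (B, T) log-probs under the rollout policy
        advantages,        # (B,)   one scalar advantage per trajectory
        token_mask,        # (B, T) 1 for agent-generated tokens, 0 otherwise
        lengths,           # (B,)   episode lengths in tokens
        clip_low=0.2,      # tighter lower clip
        clip_high=0.28,    # looser upper clip: asymmetric clipping keeps entropy up
        max_len=131_072,
        length_penalty=0.05,
    ):
        mask = token_mask.float()
        # Length penalty: discourage overly long episodes that tend to loop.
        adv = advantages.unsqueeze(1) - length_penalty * (lengths.float() / max_len).unsqueeze(1)

        ratio = torch.exp(logp_new - logp_old)
        clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
        per_token = torch.minimum(ratio * adv, clipped * adv)

        # Token-level averaging: every unmasked token in the batch carries equal
        # weight, so longer trajectories influence the update proportionally more.
        return -(per_token * mask).sum() / mask.sum().clamp(min=1)
    ```

    Dynamic sample filtering would sit upstream of a loss like this, for example by discarding groups of rollouts whose rewards are all identical and therefore carry no advantage signal.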

    The agent utilizes a ReAct-style loop, which lets it combine reasoning steps with tool usage. Its supported toolkit includes arbitrary shell commands, precise code edits, navigation/search utilities, and a submit action to signal episode completion. Each interaction is grounded in a robust sandboxed environment, initialized from real repository snapshots and backed by a GitHub-style issue prompt.
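
    A hedged sketch of the general shape of such a ReAct-style loop is shown below; the tool names, the toy action parser, and the sandbox interface are placeholders rather than the actual agent scaffold.

    ```python
    # Illustrative ReAct-style loop: interleave model "reasoning + action" turns
    # with tool execution inside a sandboxed repository. Tool names, the parser,
    # and the sandbox interface are hypothetical placeholders.

    TOOLS = {"shell", "edit", "search", "submit"}

    def parse_action(turn: str):
        """Toy parser: expects a final line like 'ACTION shell: pytest -x'."""
        line = turn.strip().splitlines()[-1]
        tool, _, args = line.removeprefix("ACTION ").partition(":")
        return tool.strip(), args.strip()

    def react_loop(llm, sandbox, issue_text, max_turns=40):
        history = [f"Issue:\n{issue_text}"]
        for _ in range(max_turns):
            # The model emits free-form reasoning followed by a structured action.
            turn = llm.generate("\n".join(history))
            tool, args = parse_action(turn)

            if tool == "submit":                        # agent signals completion
                return sandbox.current_diff()
            if tool == "shell":
                observation = sandbox.run(args)         # command output, test logs
            elif tool == "edit":
                observation = sandbox.apply_edit(args)  # precise code edit
            elif tool == "search":
                observation = sandbox.search(args)      # navigation / code search
            else:
                observation = f"Unknown tool: {tool}"   # feedback on malformed actions
            history.extend([turn, f"Observation:\n{observation}"])
        return sandbox.current_diff()                   # fall back on step budget
    ```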

    Scaling to Long Contexts and Real Benchmarks

    The agent is initially trained with a context length of 65k tokens (already double that of many open models), but performance stalls at 32%. A second RL phase expands the context to 131k tokens, doubles the episode-length ceiling, and focuses subsequent training on only the most beneficial tasks from the pool. This enables scaling to the longer stack traces and diff histories inherent to real-world debugging and patching tasks.
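
    Conceptually, the two RL phases differ only in a handful of settings. The sketch below captures that relationship; the field names and the placeholder episode ceiling are assumptions, not the paper's actual configuration.

    ```python
    # Illustrative two-phase schedule: phase 2 doubles the context window and
    # the episode-length ceiling and narrows the task pool. Field names and the
    # placeholder ceiling value are assumptions.
    from dataclasses import dataclass

    EPISODE_CEILING = 16_384  # placeholder; the paper only states that phase 2 doubles it

    @dataclass
    class RLPhase:
        context_length: int      # max tokens visible to the policy
        max_episode_tokens: int  # ceiling on tokens generated per episode
        task_pool: str           # which tasks are sampled for training

    PHASE_1 = RLPhase(context_length=65_536, max_episode_tokens=EPISODE_CEILING,
                      task_pool="full filtered SWE-rebench pool")
    PHASE_2 = RLPhase(context_length=131_072, max_episode_tokens=2 * EPISODE_CEILING,
                      task_pool="most beneficial tasks only")
    ```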

    Results: Closing the Gap with Baselines

    • The final RL-trained agent attains 39% Pass@1 accuracy on the SWE-bench Verified benchmark, nearly doubling the 20% rejection fine-tuning baseline and matching the performance of cutting-edge open-weight models such as DeepSeek-V3-0324, all without teacher-based supervision.
    • On held-out SWE-rebench splits, scores remain competitive (35% for May, 31.7% for June), indicating the method’s robustness.
    • When compared head-to-head with top open baselines and specialized SWE agents, the RL agent matches or outperforms several models, confirming the effectiveness of the RL methodology in this domain.
    Model                              SWE-bench Verified Pass@1   Pass@10   SWE-rebench May Pass@1   Pass@10
    Qwen2.5-72B-Instruct (RL, final)   39.04%                      58.4%     35.0%                    52.5%
    DeepSeek-V3-0324                   39.56%                      62.2%     36.75%                   60.0%
    Qwen3-235B no-thinking             25.84%                      54.4%     27.25%                   57.5%
    Llama4 Maverick                    15.84%                      47.2%     19.0%                    50.0%

    Pass@1 scores are averaged over 10 runs and reported as mean ± standard error.

    Key Insights

    • Credit Assignment: RL in this sparse-reward regime remains fundamentally challenging. The paper suggests future work with reward shaping, step-level critics, or prefix-based rollouts for more granular feedback.
    • Uncertainty Estimation: Real-world agents need to know when to abstain or express confidence. Techniques like output entropy or explicit confidence scoring are natural next steps (a toy entropy-based signal is sketched after this list).
    • Infrastructure: Training utilized context parallelism (splitting long sequences over GPUs) on 16 H200 nodes, with distributed orchestration via Kubernetes and Tracto AI, and vLLM for fast inference.
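
    As a small illustration of the entropy-based direction, the sketch below scores an output by its mean per-token entropy and abstains above a threshold; the threshold value and the abstention rule are assumptions, not something the paper specifies.

    ```python
    # Illustrative only: mean per-token output entropy as a crude confidence
    # signal. The threshold and the abstention rule are assumptions.
    import torch

    def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        """logits: (T, V) over generated steps; mask: (T,) 1 for agent tokens."""
        log_probs = torch.log_softmax(logits, dim=-1)
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # (T,)
        return (entropy * mask).sum() / mask.sum().clamp(min=1)

    def should_abstain(logits, mask, threshold=2.5):
        """Defer to a human when the agent's outputs look too uncertain."""
        return mean_token_entropy(logits, mask.float()) > threshold
    ```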

    Conclusion

    This research validates RL as a potent paradigm for building autonomous software engineers using open-weight LLMs. By conquering long-horizon, multi-turn, real-environment tasks, the methodology paves the way for scalable, teacher-free agent development—directly leveraging the power of interaction rather than static instruction. With further refinements, such RL pipelines promise efficient, reliable, and versatile automation for the future of software engineering.


    Source: MarkTechPost
