
    This AI Paper Introduces GRPO-based Open-RS: A Low-Cost Reinforcement Learning Framework to Enhance Reasoning in Small Language Models

    March 26, 2025

    Improving the logical reasoning and problem-solving skills of large language models has become a major research focus. Reinforcement learning (RL) is increasingly used for this purpose, both in massive models and in compact versions that must perform well in restricted computing environments. A central challenge is improving a model’s reasoning capability without relying on extremely large infrastructure or excessive training time. Leading models require expensive hardware and proprietary data pipelines, putting them out of reach for smaller labs and companies. This raises the question of whether smaller models can be enhanced with cost-efficient approaches and achieve performance comparable to their larger counterparts on challenging tasks such as mathematical reasoning.

    Several methods have been explored to address this. Chain-of-thought prompting guides models through the steps of a problem. Search algorithms such as beam search and Monte Carlo Tree Search are used to improve the logical flow of answers. Reinforcement learning itself has been tested in multiple settings. However, many of these approaches share the same limitations: they depend on massive datasets or become unstable in small-scale setups. Furthermore, their results often fall short of proprietary models such as OpenAI’s o1-preview.

    A team from Knovel Engineering Lab in Singapore and VNU University of Science in Vietnam set out to overcome these problems. The researchers started from DeepSeek-R1-Distill-Qwen-1.5B, a 1.5-billion-parameter model, and trained it with the Group Relative Policy Optimization (GRPO) algorithm on four NVIDIA A40 GPUs with 48 GB of VRAM each, all within a strict 24-hour limit. Their key objective was to enhance the model’s reasoning without a large financial or computational investment. Training cost only $42 in compute, a drastic reduction compared with baselines that require thousands of dollars.
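
GRPO scores each sampled response against the other responses drawn for the same prompt, so no learned value (critic) network is needed. The following is a minimal sketch of that group-relative normalization, assuming one scalar reward per response. The function name, the use of the population standard deviation, and the `eps` guard are illustrative choices, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other rewards in its sampling group.

    Hypothetical helper: a response that scores above the group mean gets a
    positive advantage, one that scores below it gets a negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    # eps avoids division by zero when every response earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four answers sampled for one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal. This is one reason a dataset with mixed difficulty matters for this style of training.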

    To achieve this, the team assembled a dataset of 39,659 mathematics-specific questions by refining two existing datasets, open-s1 and open-deepscaler. Filtering removed trivial or noisy questions with the help of models such as Qwen2.5-7B-Instruct and DeepSeek-R1-Distill-Qwen-1.5B. The reward system was rule-based and had three components: correctness of the answer (checked via boxed notation), structural formatting (enforced with tags), and output length (rewarded with a cosine function to promote concise reasoning). GRPO samples a group of responses for each prompt and optimizes the policy using their scores relative to one another, which avoids the need for a separate critic model and further reduces computational demands.
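
The three rule-based reward components can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the tag name `think`, the exact-string answer comparison, whitespace word counts as a stand-in for tokens, and every numeric value (including the 3500 length cap) are hypothetical.

```python
import math
import re


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never balanced


def format_reward(text, tag="think"):
    """1.0 if the text contains a tag-delimited reasoning span (assumed tag name)."""
    pattern = f"<{tag}>.*?</{tag}>"
    return 1.0 if re.search(pattern, text, re.DOTALL) else 0.0


def cosine_length_scale(length, max_len, short_value=1.0, long_value=0.5):
    """Decay smoothly from short_value at length 0 to long_value at max_len."""
    progress = min(length, max_len) / max_len
    return long_value + 0.5 * (short_value - long_value) * (1.0 + math.cos(math.pi * progress))


def total_reward(completion, reference, max_len=3500, tag="think"):
    """Sum a length-scaled correctness reward and a format reward."""
    answer = extract_boxed(completion)
    correct = answer is not None and answer == reference.strip()
    length = len(completion.split())  # assumption: words approximate tokens
    accuracy = cosine_length_scale(length, max_len) if correct else 0.0
    return accuracy + format_reward(completion, tag)
```

A short, correct, well-formatted completion scores close to 2.0 here, while a correct answer that runs to the length cap earns only 0.5 for accuracy. A production reward would typically compare answers by mathematical equivalence rather than string equality.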

    The approach was evaluated on five benchmarks: AMC23, AIME24, MATH-500, OlympiadBench, and Minerva. In one experiment that used only the open-s1 dataset, the model’s AMC23 accuracy improved from 63% to 70% within the first 100 global steps but later declined. In a second experiment that combined 7,000 samples of mixed difficulty, accuracy on AMC23 rose to 80% and AIME24 reached 46.7%. The model trained in that setup, Open-RS2, also posted competitive scores on OlympiadBench (52.4%) and MATH-500 (85%). In the final experiment, the cosine reward kept output length within a range of 1,000 to 3,500 tokens, and the model maintained 72.5% accuracy on AMC23 and 84.4% on MATH-500.

    This research shows that effective reasoning in small language models is achievable even with limited resources. It addresses the problem of training small models without significant hardware investment through a low-cost, efficient training strategy that combines reinforcement learning with curated data. The best configuration reached 80% on AMC23 and 46.7% on AIME24 with a 1.5-billion-parameter model. With continued improvements in reward design and optimization stability, small models may soon rival their larger counterparts in practical reasoning tasks.


    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

    The post This AI Paper Introduces GRPO-based Open-RS: A Low-Cost Reinforcement Learning Framework to Enhance Reasoning in Small Language Models appeared first on MarkTechPost.
