    Scalable Reinforcement Learning with Verifiable Rewards: Generative Reward Modeling for Unstructured, Multi-Domain Tasks

    April 5, 2025

    Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing LLMs’ reasoning and coding abilities, particularly in domains where structured reference answers allow clear-cut verification. This approach relies on reference-based signals to determine if a model’s response aligns with a known correct answer, typically through binary correctness labels or graded scores. RLVR has mainly been applied to areas like math and coding, where rule-based or tool-assisted verification is straightforward. However, expanding RLVR to more complex and less structured tasks has been difficult due to challenges in verifying open-ended or ambiguous reference responses. Although generative models and closed-source LLMs like GPT-4o have been explored as verifiers, these solutions often remain domain-specific and require extensive annotated datasets for training.
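    For intuition, here is a minimal sketch of the kind of rule-based, reference-based verification that works well in structured domains such as math, and why it breaks down for open-ended answers. The helper names are illustrative, not taken from any specific RLVR implementation.

        import re

        def extract_final_answer(response: str) -> str:
            """Pull the last number-like token out of a model response."""
            matches = re.findall(r"-?\d+(?:\.\d+)?", response)
            return matches[-1] if matches else ""

        def binary_reward(response: str, reference: str) -> float:
            """1.0 if the extracted answer exactly matches the reference, else 0.0."""
            return 1.0 if extract_final_answer(response) == reference.strip() else 0.0

        # Clear-cut verification when the reference is a single number:
        print(binary_reward("So the total is 42.", "42"))  # 1.0
        # The same rule gives no useful signal for a free-form clinical answer:
        print(binary_reward("Start treatment with ACE inhibitors.", "ACE inhibitors"))  # 0.0

    The second call shows the gap that motivates generative verifiers: the candidate answer may be semantically correct, but no string- or rule-based check can confirm it against an unstructured reference.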

    Recent developments aim to broaden RLVR applications by introducing generative reward modeling, where LLMs use their generative abilities to produce judgments and justifications. These models can be trained without detailed rationales, instead relying on the confidence of the verifier’s outputs to generate stable reward signals. This technique supports reinforcement learning in tasks with noisy or ambiguous labels. Furthermore, researchers are exploring RLVR in a wider variety of domains using more free-form reference answers—sourced from expert annotations and pretraining data or generated by LLMs—moving beyond narrowly defined tasks like math and logic puzzles. These efforts mark a significant step toward scalable and domain-general RLVR training.
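    One way to picture this is that the verifier is itself a language model: rather than string matching, it is prompted to judge a candidate answer against the reference, and the probability it assigns to a "Yes" judgment serves as a soft reward. The sketch below illustrates the idea with a Hugging Face-style causal LM; the prompt template and the renormalization over the Yes/No tokens are assumptions for illustration, not the paper's exact recipe.

        import torch

        def soft_reward(verifier, tokenizer, question, reference, response):
            """Soft reward = verifier's probability that the response matches the reference."""
            prompt = (
                f"Question: {question}\n"
                f"Reference answer: {reference}\n"
                f"Candidate answer: {response}\n"
                f"Is the candidate answer correct? Answer Yes or No: "
            )
            inputs = tokenizer(prompt, return_tensors="pt")
            with torch.no_grad():
                logits = verifier(**inputs).logits[0, -1]  # next-token logits
            yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
            no_id = tokenizer.encode("No", add_special_tokens=False)[0]
            # Renormalize over just the two judgment tokens; P("Yes") is the soft reward.
            probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
            return probs[0].item()

        # A hard 0/1 label can be recovered by thresholding the soft reward at 0.5.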

    Researchers from Tencent AI Lab and Soochow University extend RLVR to complex, unstructured domains such as medicine, chemistry, and education. They show that binary correctness judgments remain consistent across LLMs when expert-written references are available. To address the limitations of binary rewards in free-form tasks, they introduce soft, generative model-based reward signals. Using compact 7B models, they train cross-domain reward verifiers without requiring extensive domain-specific annotation. Their RLVR framework significantly outperforms top open-source models in reasoning tasks and scales effectively. They also release a 570k-example dataset to support further research in multi-domain RLVR.

    The method uses expert-written reference answers to guide reward estimation for reinforcement learning. Responses are evaluated using a generative LLM verifier, which outputs binary (0/1) or soft rewards based on the likelihood of correctness. Rewards are normalized using z-score normalization for stable training and better learning dynamics. The authors train a compact (7B) generative reward model using judgments collected during RL exploration to avoid relying solely on large models. These binary labels are obtained from a larger LLM and used to fine-tune the smaller verifier. This approach balances performance and efficiency while increasing robustness to noise and formatting variations.
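    As a concrete illustration of the normalization step, the sketch below applies z-score normalization to the rewards of a batch of sampled responses; treating the batch as the normalization group is an assumption for the example, not a detail stated in the article.

        import numpy as np

        def normalize_rewards(rewards, eps=1e-6):
            """Z-score normalize a batch of scalar rewards for stabler policy updates."""
            r = np.asarray(rewards, dtype=np.float32)
            return (r - r.mean()) / (r.std() + eps)

        # Example: soft rewards for four sampled responses to the same prompt.
        print(normalize_rewards([0.9, 0.7, 0.2, 0.1]))
        # -> approximately [ 1.27,  0.67, -0.82, -1.12]  (zero mean, unit variance)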

    The study uses two large-scale Chinese QA datasets—one with 773k free-form math questions across school levels and another with 638k multi-subject college-level questions from ExamQA. These datasets feature complex, unstructured answers that challenge rule-based reward methods. The researchers trained a 7B reward model (RM-7B) using 160k distilled samples and tested various RL approaches. Results show that RL with model-based rewards outperforms rule-based methods and supervised fine-tuning (SFT), especially in reasoning tasks. Notably, RM-7B achieves performance close to the larger 72B model, highlighting its efficiency. Binary rewards outperform soft rewards in rule-based settings due to semantic mismatch issues.

    In conclusion, the study simplifies reward modeling by training a generative model to output binary scores (1 or 0) without relying on chain-of-thought reasoning. While CoT aids in reasoning, its necessity for verifying semantic similarity remains unclear. Unlike past work that relied on format-based scoring, this approach avoids strict answer formatting, reducing manual effort. The research extends RLVR beyond structured domains to areas like medicine and economics, where reference answers are less defined. Using a 7B model, it shows that soft, model-based rewards enhance performance in free-form tasks, outperforming larger models and improving RLVR’s adaptability and scalability.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post Scalable Reinforcement Learning with Verifiable Rewards: Generative Reward Modeling for Unstructured, Multi-Domain Tasks appeared first on MarkTechPost.
