
    Scalable and Principled Reward Modeling for LLMs: Enhancing Generalist Reward Models (RMs) with SPCT and Inference-Time Optimization

    April 7, 2025

    Reinforcement learning (RL) has become a widely used post-training method for LLMs, improving capabilities such as human alignment, long-term reasoning, and adaptability. A major challenge, however, is generating accurate reward signals in broad, less structured domains: current high-quality reward signals come largely from rule-based systems or verifiable tasks such as math and coding. In general applications, reward criteria are more diverse and subjective and lack clear ground truth. Generalist reward models (RMs) are being explored to cover these cases. Such models must balance input flexibility with inference-time scalability so that they produce reliable, high-quality rewards across varied tasks and domains.

    Existing reward modeling approaches include scalar, semi-scalar, and generative techniques, each trading off input flexibility against inference-time performance. For instance, pairwise models are limited to relative comparisons, while scalar models struggle to produce diverse feedback. Generative reward models (GRMs) offer richer, more flexible outputs, which makes them better suited to evaluating varied responses. Recent work has explored training GRMs through offline RL and integrating tools and external knowledge to improve reward quality. However, few methods directly address how RMs can scale efficiently during inference. This gap has led to research on sampling-based scaling, chain-of-thought prompting, and reward-guided aggregation, with the aim of co-scaling policy models and reward models during inference. These developments hold promise for more robust, general-purpose reward systems in LLMs.
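The difference in input flexibility can be made concrete with a small sketch. The type aliases and toy functions below are hypothetical and exist only to illustrate the shapes involved; the word-count scoring is a placeholder, not how any real reward model works. The point is that a pointwise generative RM can rate one, two, or many responses in a single call, and a pairwise preference can be derived from its scores.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical call signatures, for illustration only.
ScalarRM = Callable[[str, str], float]        # (query, response) -> one score
PairwiseRM = Callable[[str, str, str], int]   # (query, a, b) -> 0 if a is preferred, else 1
PointwiseGRM = Callable[[str, Sequence[str]], Tuple[str, List[float]]]


def toy_pairwise_rm(query: str, a: str, b: str) -> int:
    """Placeholder preference: favors the longer response. Accepts exactly two."""
    return 0 if len(a) >= len(b) else 1


def toy_pointwise_grm(query: str, responses: Sequence[str]) -> Tuple[str, List[float]]:
    """Placeholder generative RM: critique text plus one score per response."""
    # Word count capped at 10 stands in for a real judgment.
    scores = [float(min(10, len(r.split()))) for r in responses]
    critique = "; ".join(f"response {i}: {s:.0f}/10" for i, s in enumerate(scores))
    return critique, scores


def preference_from_pointwise(grm: PointwiseGRM, query: str, a: str, b: str) -> int:
    """Derive a pairwise judgment from pointwise scores."""
    _, (score_a, score_b) = grm(query, [a, b])
    return 0 if score_a >= score_b else 1


# The same pointwise function handles a single response or several.
single_critique, single_scores = toy_pointwise_grm("Explain RL.", ["short"])
multi_critique, multi_scores = toy_pointwise_grm(
    "Explain RL.", ["short", "a longer, more detailed answer", "medium length reply"]
)
```

A pairwise model, by contrast, has no way to score the single-response case, which is the limitation the paragraph above refers to.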

    Researchers from DeepSeek-AI and Tsinghua University explore how to improve reward models (RMs) for general queries by making them scale at inference time, using more inference compute together with better learning methods. They adopt pointwise generative reward modeling (GRM) for flexible input handling and propose a learning method, Self-Principled Critique Tuning (SPCT), which helps GRMs generate adaptive principles and accurate critiques through online reinforcement learning. To scale effectively, they apply parallel sampling and introduce a meta RM that refines the voting process. Their DeepSeek-GRM models outperform existing methods across benchmarks, offering higher reward quality and scalability. The models still face challenges on some complex tasks, and the team plans to open-source them.

    The researchers introduce SPCT, a method designed to enhance pointwise GRMs by enabling them to generate adaptive principles and accurate critiques. SPCT consists of two stages: rejective fine-tuning, which initializes principle and critique generation, and rule-based RL, which refines it. Rather than being fixed in a preprocessing step, principles are generated dynamically during inference. This promotes scalability by improving reward granularity. Inference-time performance is further boosted through parallel sampling and voting, supported by a meta reward model (meta RM) that filters out low-quality outputs. Overall, SPCT improves reward accuracy, robustness, and scalability in GRMs.
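The two training stages can be pictured with a minimal sketch. The article does not give the exact rules, so the code below makes loud assumptions: the rule-based reward is taken to be +1 when the model's scores single out the known best response and -1 otherwise, and rejective fine-tuning is taken to mean discarding sampled outputs that fail that same check. The function names and the sample layout are hypothetical.

```python
from typing import Dict, List, Sequence


def rule_based_reward(scores: Sequence[float], best_index: int) -> float:
    """Assumed rule: +1 if the known best response gets the strictly highest score, else -1."""
    top = max(scores)
    unique_top = list(scores).count(top) == 1  # ties count as incorrect
    return 1.0 if unique_top and scores[best_index] == top else -1.0


def rejective_filter(samples: List[Dict], best_index: int) -> List[Dict]:
    """Assumed rejection rule: keep only samples whose scores agree with the label."""
    return [s for s in samples if rule_based_reward(s["scores"], best_index) > 0]


# Three sampled outputs for one query with two candidate responses;
# the labeled best response is index 0.
sampled = [
    {"critique": "Response 0 is accurate and complete.", "scores": [8.0, 3.0]},
    {"critique": "Both responses look equally good.", "scores": [5.0, 5.0]},
    {"critique": "Response 1 is more fluent.", "scores": [2.0, 9.0]},
]
kept_for_fine_tuning = rejective_filter(sampled, best_index=0)
rl_rewards = [rule_based_reward(s["scores"], best_index=0) for s in sampled]
```

Under these assumptions only the first sample would be kept as fine-tuning data, and the same check would supply the reward signal during the RL stage.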

    Using standard metrics, the study evaluates various RM methods on benchmarks including Reward Bench, PPE, RMB, and ReaLMistake. DeepSeek-GRM-27B consistently outperforms baselines and rivals strong public models such as GPT-4o. Inference-time scaling, especially with voting and meta reward models, significantly boosts performance, achieving results comparable to those of much larger models. Ablation studies highlight the importance of components such as principle generation and non-hinted sampling. Training-time scaling shows diminishing returns compared with inference-time strategies. Overall, DeepSeek-GRM, enhanced with SPCT and a meta RM, offers robust, scalable reward modeling with reduced domain bias and strong generalization.
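The voting step with meta RM filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions: votes are aggregated by summing each response's pointwise scores across samples, and the meta RM is represented by any function that rates a sample's quality. `Sample`, `vote`, `keep`, and `toy_meta_rm` are hypothetical names, and the keyword-based meta RM is a stand-in for a trained model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Sample:
    critique: str        # generated principles and critique text
    scores: List[float]  # one pointwise score per candidate response


def vote(
    samples: List[Sample],
    meta_rm: Optional[Callable[[Sample], float]] = None,
    keep: Optional[int] = None,
) -> List[float]:
    """Sum pointwise scores over samples, optionally keeping only the
    `keep` samples that the meta RM rates highest (assumed aggregation rule)."""
    if not samples:
        raise ValueError("need at least one sample")
    if meta_rm is not None and keep is not None:
        samples = sorted(samples, key=meta_rm, reverse=True)[:keep]
    n_responses = len(samples[0].scores)
    return [sum(s.scores[i] for s in samples) for i in range(n_responses)]


def toy_meta_rm(sample: Sample) -> float:
    """Stand-in for a trained meta RM: trusts critiques tagged as careful."""
    return 1.0 if "careful" in sample.critique else 0.0


samples = [
    Sample("careful: response 0 is correct", [8.0, 3.0]),
    Sample("careful: response 0 is mostly correct", [7.0, 4.0]),
    Sample("sloppy: response 1 sounds confident", [1.0, 10.0]),
]

plain_totals = vote(samples)                                 # [16.0, 17.0]
filtered_totals = vote(samples, meta_rm=toy_meta_rm, keep=2)  # [15.0, 7.0]
```

In this toy case a single low-quality sample flips the plain vote toward response 1, while filtering to the two samples the meta RM trusts restores response 0 as the winner.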

    In conclusion, the study presents SPCT, a method that improves inference-time scalability for GRMs through rule-based online reinforcement learning. SPCT enables adaptive principle and critique generation, enhancing reward quality across diverse tasks. DeepSeek-GRM models outperform several baselines and strong public models, especially when paired with a meta reward model for inference-time scaling. Using parallel sampling and flexible input handling, these GRMs achieve strong performance without relying on larger model sizes. Future work includes integrating GRMs into RL pipelines, co-scaling them with policy models, and using them as reliable offline evaluators.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post Scalable and Principled Reward Modeling for LLMs: Enhancing Generalist Reward Models RMs with SPCT and Inference-Time Optimization appeared first on MarkTechPost.
