
    Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better Alignment

    May 27, 2025

    Reinforcement learning (RL) has emerged as a fundamental approach in LLM post-training, drawing supervision signals from human feedback (RLHF) or from verifiable rewards (RLVR). While RLVR shows promise in mathematical reasoning, it is significantly constrained by its dependence on training queries with verifiable answers. This requirement rules out large-scale training on general-domain queries, where verification is intractable. Further, current reward models, which fall into scalar and generative types, cannot effectively scale test-time compute for reward estimation: existing approaches apply uniform computational resources across all inputs, with no way to allocate extra compute to challenging queries that require nuanced analysis.
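
    To make the verifiability constraint concrete, here is a minimal sketch (our illustration, not code from the paper) of a rule-based RLVR reward. It only yields a signal when the query ships with a checkable gold answer, which is exactly what general-domain queries lack:

    ```python
    def rlvr_reward(model_output: str, gold_answer: str) -> float:
        """Return 1.0 if the final answer matches the verifiable ground
        truth, else 0.0. Assumes outputs end with 'Answer: ...'."""
        marker = "Answer:"
        if marker not in model_output:
            return 0.0
        predicted = model_output.rsplit(marker, 1)[-1].strip()
        return 1.0 if predicted == gold_answer.strip() else 0.0

    # Works for math ("Answer: 42" vs. "42"), but for a general-domain
    # query such as "Summarize this article" there is no gold_answer to
    # compare against -- the gap RRMs target.
    print(rlvr_reward("Reasoning...\nAnswer: 42", "42"))  # 1.0
    ```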

    Reward models are characterized by their formulation strategy and scoring scheme. Scalar approaches assign a numeric score to each query-response pair, while generative methods produce natural-language feedback. Scoring proceeds either by absolute evaluation of individual pairs or by discriminative comparison of candidate responses. Generative reward models, aligned with the LLM-as-a-Judge paradigm, offer interpretable feedback but raise reliability concerns due to biased judgments. Inference-time scaling methods dynamically adjust computational resources, including parallel strategies such as multi-sampling and horizon-based scaling for extended reasoning traces; however, they lack systematic adaptation to input complexity, which limits their effectiveness across diverse query types.
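
    The two formulations imply different interfaces. The hypothetical sketch below (class and method names are ours, not from the paper) contrasts a scalar model's absolute scoring with a generative judge's critique-plus-verdict output:

    ```python
    class ScalarRewardModel:
        """Absolute scoring: one number per (query, response) pair."""

        def score(self, query: str, response: str) -> float:
            # A real model would run a learned regression head here;
            # this placeholder just rewards longer responses.
            return min(len(response) / 100.0, 1.0)

    class GenerativeRewardModel:
        """LLM-as-a-Judge: a natural-language critique plus a discrete
        verdict when discriminatively comparing two candidates."""

        def judge(self, query: str, resp_a: str, resp_b: str) -> tuple[str, str]:
            # A real judge would prompt an LLM; we return a canned critique.
            critique = "Response A addresses the query directly; B digresses."
            verdict = "A"  # "A" or "B"; the RRM setup disallows ties
            return critique, verdict
    ```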

    Researchers from Microsoft Research, Tsinghua University, and Peking University have proposed Reward Reasoning Models (RRMs), which perform explicit reasoning before producing a final reward. This reasoning phase lets RRMs adaptively spend additional computation when evaluating responses to complex tasks. RRMs thus add a new dimension to reward modeling: scaling test-time compute while remaining applicable across diverse evaluation scenarios. Through chain-of-thought reasoning, RRMs devote extra test-time compute to complex queries where the appropriate reward is not immediately apparent, and the training setup encourages them to self-evolve reward-reasoning capabilities without explicit reasoning traces as training data.
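
    A rough sketch of what such reason-then-judge evaluation looks like in practice (the template wording is our illustration, not the paper's exact prompt):

    ```python
    RRM_PROMPT = """You are given a query and two candidate responses.

    Query: {query}

    Response A: {resp_a}

    Response B: {resp_b}

    Think step by step about which response is better before answering.
    Harder comparisons deserve longer reasoning. When you are done, output
    your final verdict on its own line as exactly 'A' or 'B' (no ties).
    """

    def build_rrm_input(query: str, resp_a: str, resp_b: str) -> str:
        """Fill the template; the resulting string is fed to the RRM,
        which autoregressively generates a thinking trace and a verdict."""
        return RRM_PROMPT.format(query=query, resp_a=resp_a, resp_b=resp_b)
    ```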

    RRMs are built on the Qwen2 model with a Transformer-decoder backbone, casting reward modeling as text completion: the model autoregressively generates a thinking process followed by a final judgment. Each input contains a query and two responses, and the model must state a preference with no ties allowed. The researchers use the RewardBench repository to guide systematic analysis across evaluation criteria including instruction fidelity, helpfulness, accuracy, harmlessness, and level of detail. For multi-response evaluation, RRMs support Elo rating systems and knockout tournaments, both of which can be combined with majority voting for greater test-time compute utilization: the RRM is sampled multiple times for each pairwise comparison, and a majority vote yields a robust result.
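
    The sketch below (our reconstruction under stated assumptions, not the authors' code) shows how a knockout tournament over candidate responses can combine with majority voting; `judge` stands in for any sampled RRM verdict function returning "A" or "B":

    ```python
    import random
    from collections import Counter

    def pairwise_vote(judge, query, a, b, n_samples=5):
        """Majority voting: sample the stochastic RRM judge several
        times on one pair and keep the most common verdict."""
        votes = Counter(judge(query, a, b) for _ in range(n_samples))
        return votes.most_common(1)[0][0]

    def knockout_tournament(judge, query, responses, n_samples=5):
        """Single elimination: each round halves the candidate pool
        via majority-voted pairwise comparisons."""
        pool = list(responses)
        while len(pool) > 1:
            random.shuffle(pool)
            winners = []
            if len(pool) % 2 == 1:      # odd candidate gets a bye
                winners.append(pool.pop())
            for a, b in zip(pool[::2], pool[1::2]):
                verdict = pairwise_vote(judge, query, a, b, n_samples)
                winners.append(a if verdict == "A" else b)
            pool = winners
        return pool[0]
    ```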

    Evaluation results show that RRMs achieve competitive performance against strong baselines on the RewardBench and PandaLM Test benchmarks, with RRM-32B attaining 98.6% accuracy in reasoning categories. Comparison with DirectJudge models trained on identical data reveals substantial performance gaps, indicating that RRMs effectively use test-time compute for complex queries. In reward-guided best-of-N inference, RRMs surpass all baseline models even without additional test-time compute, and majority voting provides substantial further improvements across the evaluated subsets. Post-training experiments show steady downstream performance improvements on MMLU-Pro and GPQA. Scaling experiments across 7B, 14B, and 32B models confirm that longer thinking horizons consistently improve accuracy.
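
    For context, reward-guided best-of-N inference reduces to a few lines (names are ours; `generate` stands in for any sampling call, and the selector could be the tournament sketched above):

    ```python
    def best_of_n(generate, select, query, n=8):
        """Sample n candidate responses, then let the reward model's
        selection procedure pick the winner."""
        candidates = [generate(query) for _ in range(n)]
        return select(query, candidates)

    # e.g., plugging in the knockout tournament from the sketch above:
    # best = best_of_n(my_llm_sample,
    #                  lambda q, cs: knockout_tournament(my_judge, q, cs),
    #                  "Explain RLHF in one paragraph.")
    ```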

    In conclusion, the researchers introduced RRMs, which perform an explicit reasoning process before assigning a reward, to address the computational inflexibility of existing reward modeling approaches. Reinforcement learning with rule-based rewards enables RRMs to develop complex reasoning capabilities without requiring explicit reasoning traces as supervision, and RRMs use test-time compute efficiently through both parallel and sequential scaling. Their effectiveness in practical applications, including reward-guided best-of-N inference and post-training feedback, demonstrates their potential as strong alternatives to traditional scalar reward models in alignment pipelines.


    Check out the Paper and Models on Hugging Face. All credit for this research goes to the researchers of this project.

    The post Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better Alignment appeared first on MarkTechPost.
