
    Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better Alignment

    May 27, 2025

Reinforcement learning (RL) has emerged as a fundamental approach in LLM post-training, using supervision signals from human feedback (RLHF) or verifiable rewards (RLVR). While RLVR shows promise in mathematical reasoning, its dependence on training queries with verifiable answers is a significant constraint: it rules out large-scale training on general-domain queries, where verification is intractable. Moreover, current reward models, whether scalar or generative, cannot effectively scale test-time compute for reward estimation. Existing approaches apply uniform computational resources across all inputs, with no way to allocate extra compute to challenging queries that require nuanced analysis.

Reward models are characterized by their formulation strategy and scoring scheme. Scalar approaches assign numeric scores to query-response pairs, while generative methods produce natural-language feedback. Scoring is either absolute, evaluating individual pairs, or discriminative, comparing candidate responses. Generative reward models, aligned with the LLM-as-a-Judge paradigm, offer interpretable feedback but raise reliability concerns due to biased judgments. Inference-time scaling methods dynamically adjust computational resources, including parallel strategies such as multi-sampling and horizon-based scaling for extended reasoning traces. However, they lack systematic adaptation to input complexity, which limits their effectiveness across diverse query types.
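To make the two formulations concrete, here is a minimal sketch of the contrasting interfaces; the class and method names are illustrative assumptions, not APIs from the paper.

```python
# Hypothetical interfaces contrasting the two reward-model formulations;
# names and signatures are illustrative, not from the paper.

class ScalarRewardModel:
    """Absolute scoring: one number per (query, response) pair."""
    def score(self, query: str, response: str) -> float:
        # In practice a value head on an LLM backbone; a stand-in here.
        raise NotImplementedError

class GenerativeRewardModel:
    """LLM-as-a-Judge: natural-language critique plus a discriminative verdict."""
    def judge(self, query: str, response_a: str, response_b: str) -> tuple[str, str]:
        # Returns (critique_text, preferred_label), e.g. ("...", "A").
        raise NotImplementedError
```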

Researchers from Microsoft Research, Tsinghua University, and Peking University have proposed Reward Reasoning Models (RRMs), which perform explicit reasoning before producing final rewards. This reasoning phase lets RRMs adaptively allocate additional compute when evaluating responses to complex tasks. RRMs thus add a new dimension to reward modeling, scaling test-time compute while remaining applicable across diverse evaluation scenarios. Through chain-of-thought reasoning, RRMs spend extra test-time compute on complex queries where the appropriate reward is not immediately apparent, and the training framework encourages them to self-evolve reward-reasoning capabilities without explicit reasoning traces as training data.
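As a concrete illustration, the sketch below shows how a judge can be prompted to reason before emitting a verdict, and how that verdict can be parsed. The prompt wording, the \boxed{} answer format, and the `generate` callable are assumptions for illustration, not the paper's exact template.

```python
import re

# Illustrative judging prompt; not the paper's exact template.
JUDGE_PROMPT = """You are evaluating two responses to a query.
Query: {query}
Response A: {response_a}
Response B: {response_b}
Think step by step about which response is better, then finish with your
final verdict as \\boxed{{A}} or \\boxed{{B}}."""

def judge_pair(generate, query, response_a, response_b):
    """`generate` is any text-completion callable, e.g. an RRM endpoint."""
    output = generate(JUDGE_PROMPT.format(
        query=query, response_a=response_a, response_b=response_b))
    # The reasoning trace precedes the verdict; only the verdict is parsed.
    match = re.search(r"\\boxed\{([AB])\}", output)
    return match.group(1) if match else None
```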

RRMs are built on the Qwen2 model with a Transformer-decoder backbone, casting reward modeling as text completion: the model autoregressively generates a thinking process followed by a final judgment. Each input contains a query and two responses, and the model must state a preference; ties are not allowed. The researchers use the RewardBench repository to guide systematic analysis across evaluation criteria, including instruction fidelity, helpfulness, accuracy, harmlessness, and level of detail. RRMs support multi-response evaluation through ELO rating systems and knockout tournaments, both of which can be combined with majority voting to use additional test-time compute: the RRM is sampled multiple times per pairwise comparison and the votes are aggregated for a more robust result.
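A minimal sketch of the knockout-tournament variant with majority voting follows, reusing `judge_pair` from the sketch above; the `votes` parameter and the loop structure are assumptions, not the paper's implementation.

```python
from collections import Counter

def majority_vote(generate, query, resp_a, resp_b, votes=5):
    """Sample the judge several times on one pair and aggregate the verdicts."""
    tally = Counter(judge_pair(generate, query, resp_a, resp_b)
                    for _ in range(votes))
    return resp_a if tally["A"] >= tally["B"] else resp_b

def knockout_tournament(generate, query, responses, votes=5):
    """Winners of each pairwise comparison advance until one response remains."""
    pool = list(responses)
    while len(pool) > 1:
        next_round = [majority_vote(generate, query, pool[i], pool[i + 1], votes)
                      for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:  # odd candidate out receives a bye
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]
```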

Evaluation results show that RRMs achieve competitive performance against strong baselines on the RewardBench and PandaLM Test benchmarks, with RRM-32B attaining 98.6% accuracy in the reasoning categories. A comparison with DirectJudge models trained on identical data reveals substantial performance gaps, indicating that RRMs use test-time compute effectively on complex queries. In reward-guided best-of-N inference, RRMs surpass all baseline models even without additional test-time compute, and majority voting brings substantial further improvements across the evaluated subsets. Post-training experiments show steady downstream improvements on MMLU-Pro and GPQA, and scaling experiments across 7B, 14B, and 32B models confirm that longer thinking horizons consistently improve accuracy.
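For reference, reward-guided best-of-N inference reduces to sampling N candidates from a policy model and letting the reward model select among them, for example via the tournament above; `policy_generate` and `rrm_generate` are placeholder callables, not APIs from the paper.

```python
def best_of_n(policy_generate, rrm_generate, query, n=8):
    """Sample n candidate answers from the policy, then let the RRM pick one."""
    candidates = [policy_generate(query) for _ in range(n)]
    return knockout_tournament(rrm_generate, query, candidates)
```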

In conclusion, the researchers introduced RRMs, which perform explicit reasoning before reward assignment, to address the computational inflexibility of existing reward modeling approaches. Rule-based-reward reinforcement learning enables RRMs to develop complex reasoning capabilities without requiring explicit reasoning traces as supervision, and RRMs use test-time compute efficiently through both parallel and sequential scaling. Their effectiveness in practical applications, including reward-guided best-of-N inference and post-training feedback, demonstrates their potential as strong alternatives to traditional scalar reward models in alignment pipelines.


Check out the Paper and Models on Hugging Face. All credit for this research goes to the researchers of this project.

    The post Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better Alignment appeared first on MarkTechPost.
