Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better Alignment

Reinforcement learning (RL) has become a fundamental approach in LLM post-training, using supervision signals from human feedback (RLHF) or verifiable rewards (RLVR). While RLVR shows promise in mathematical reasoning, it depends on training queries with verifiable answers. This requirement makes it hard to apply RLVR to large-scale training on general-domain queries, where verification is intractable. Further, current reward models, whether scalar or generative, cannot effectively scale test-time compute for reward estimation. Existing approaches spend the same amount of computation on every input and cannot allocate additional resources to challenging queries that require nuanced analysis.

Reward models are characterized by their formulation strategy and their scoring scheme. Numeric approaches assign scalar scores to query-response pairs, while generative methods produce natural language feedback. Scoring is either an absolute evaluation of an individual query-response pair or a discriminative comparison of candidate responses. Generative reward models, aligned with the LLM-as-a-Judge paradigm, offer interpretable feedback but face reliability concerns due to biased judgments. Inference-time scaling methods dynamically adjust computational resources, through parallel strategies such as multi-sampling and through horizon-based scaling that extends reasoning traces. However, they do not adapt systematically to input complexity, which limits their effectiveness across diverse query types.

Researchers from Microsoft Research, Tsinghua University, and Peking University have proposed Reward Reasoning Models (RRMs), which perform explicit reasoning before producing final rewards. This reasoning phase allows RRMs to adaptively allocate additional computational resources when evaluating responses to complex tasks. RRMs introduce a new dimension for enhancing reward modeling: scaling test-time compute while maintaining general applicability across diverse evaluation scenarios. Through chain-of-thought reasoning, RRMs spend additional test-time compute on complex queries where the appropriate reward is not immediately apparent. RRMs are trained to self-evolve reward reasoning capabilities without requiring explicit reasoning traces as training data.

RRMs use the Qwen2 model with a Transformer-decoder backbone and formulate reward modeling as text completion: the RRM autoregressively generates a thinking process followed by a final judgment. Each input contains a query and two responses, and the RRM must state which response it prefers; ties are not allowed. Researchers use the RewardBench repository to guide systematic analysis across evaluation criteria, including instruction fidelity, helpfulness, accuracy, harmlessness, and detail level. RRMs support multi-response evaluation through Elo rating systems and knockout tournaments, both of which can be combined with majority voting to make further use of test-time compute. With majority voting, the RRM is sampled multiple times for each pairwise comparison, and the majority outcome is taken as the comparison result, which makes it more robust.
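The multi-response schemes described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' implementation: the `Judge` callable, the toy `longer_wins` judge, the round-robin pairing for Elo, and the constants (`k_factor=32`, a starting rating of 1000) are all hypothetical choices made for the example. In practice the judge would be a sampled call to an RRM.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical judge interface (an assumption for this sketch):
# returns 0 if response_a is preferred, 1 if response_b is preferred.
Judge = Callable[[str, str, str], int]


def majority_vote(judge: Judge, query: str, a: str, b: str, k: int = 5) -> int:
    """Sample the judge k times and return the majority verdict (0 or 1).

    Use an odd k to avoid ties; with an even k, a tie goes to response a.
    """
    votes = Counter(judge(query, a, b) for _ in range(k))
    return 0 if votes[0] >= votes[1] else 1


def knockout(judge: Judge, query: str, candidates: Sequence[str], k: int = 5) -> str:
    """Single elimination: pair up candidates, keep winners, repeat until one is left."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if majority_vote(judge, query, a, b, k) == 0 else b)
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # the unpaired candidate gets a bye
        pool = next_round
    return pool[0]


def elo_ratings(judge: Judge, query: str, candidates: Sequence[str],
                k_factor: float = 32.0, base: float = 1000.0) -> List[float]:
    """Round-robin pairwise comparisons with the standard Elo update."""
    ratings = [base] * len(candidates)
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            winner = judge(query, candidates[i], candidates[j])
            expected_i = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / 400.0))
            score_i = 1.0 if winner == 0 else 0.0
            delta = k_factor * (score_i - expected_i)
            ratings[i] += delta
            ratings[j] -= delta  # zero-sum update
    return ratings


def longer_wins(query: str, a: str, b: str) -> int:
    """Toy stand-in for an RRM: prefers the longer response."""
    return 0 if len(a) >= len(b) else 1


if __name__ == "__main__":
    responses = ["a", "bbb", "cc"]
    print(knockout(longer_wins, "q", responses))
    print(elo_ratings(longer_wins, "q", responses))
```

A knockout needs only one comparison per eliminated candidate, while a round-robin Elo pass compares every pair and yields a full ranking; which trade-off suits a given workload is a design choice this sketch does not settle.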

Evaluation results show that RRMs achieve competitive performance against strong baselines on the RewardBench and PandaLM Test benchmarks, with RRM-32B attaining 98.6% accuracy in reasoning categories. A comparison with DirectJudge models trained on identical data reveals substantial performance gaps, indicating that RRMs use test-time compute effectively on complex queries. In reward-guided best-of-N inference, RRMs surpass all baseline models even without additional test-time compute, and majority voting provides substantial improvements across the evaluated subsets. Post-training experiments show steady downstream performance improvements on MMLU-Pro and GPQA. Scaling experiments across 7B, 14B, and 32B models confirm that longer thinking horizons consistently improve accuracy.

In conclusion, researchers introduced RRMs to perform explicit reasoning processes before reward assignment to address computational inflexibility in existing reward modeling approaches. Rule-based-reward RL enables RRMs to develop complex reasoning capabilities without requiring explicit reasoning traces as supervision. RRMs efficiently utilize test-time compute through parallel and sequential scaling approaches. The effectiveness of RRMs in practical applications, including reward-guided best-of-N inference and post-training feedback, demonstrates their potential as strong alternatives to traditional scalar reward models in alignment techniques.
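To make the idea of a rule-based reward concrete, here is a minimal sketch of how such a signal could be computed from a generated judgment. The article does not specify the output format or the reward values, so both are assumptions here: the verdict is assumed to appear as `\boxed{Assistant 1}` or `\boxed{Assistant 2}`, and the reward is assumed to be +1 for a correct preference and -1 otherwise. The function names are hypothetical.

```python
import re
from typing import Optional

# Assumed verdict format (not confirmed by the article):
# the completion ends with "\boxed{Assistant 1}" or "\boxed{Assistant 2}".
_VERDICT = re.compile(r"\\boxed\{Assistant ([12])\}")


def parse_verdict(completion: str) -> Optional[int]:
    """Return 1 or 2 for the last verdict in the completion, or None if absent."""
    matches = _VERDICT.findall(completion)
    return int(matches[-1]) if matches else None


def rule_based_reward(completion: str, preferred: int) -> float:
    """Assumed reward rule: +1 if the verdict matches the label, else -1.

    A completion with no parseable verdict is treated as incorrect.
    """
    return 1.0 if parse_verdict(completion) == preferred else -1.0


if __name__ == "__main__":
    text = "Response 2 follows the instruction more closely. \\boxed{Assistant 2}"
    print(rule_based_reward(text, preferred=2))
```

Because the reward depends only on the final verdict, the reasoning text before it is never directly supervised, which is consistent with the claim that no explicit reasoning traces are needed.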


The post Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better Alignment appeared first on MarkTechPost.
