    This AI Paper Introduces a Machine Learning Framework to Estimate the Inference Budget for Self-Consistency and GenRMs (Generative Reward Models)

    April 10, 2025

    Large Language Models (LLMs) have made significant advances in reasoning across diverse domains, including mathematics and science. Improving these reasoning abilities at test time, however, remains an open challenge. The central question is how to scale test-time compute effectively while maximizing reasoning performance. Current methods generate multiple chain-of-thought (CoT) solutions for a problem and then apply a voting or selection mechanism to identify the best one. Although these approaches have shown promise, they often require considerable computational resources and may fail to identify the correct solution when incorrect reasoning paths dominate the samples. Finding efficient ways to enhance LLM reasoning while minimizing computational overhead is therefore a critical challenge for the field.
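
    As a minimal sketch of the voting mechanism described above, the snippet below applies a majority vote to the final answers of several sampled solutions. The function name `self_consistency` and the sample answers are hypothetical illustrations, not code or data from the paper.

```python
from collections import Counter


def self_consistency(final_answers):
    """Return the most frequent final answer and its share of the votes.

    Ties are broken in favor of the answer that was sampled first.
    """
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(final_answers)
    # most_common keeps first-encountered order among equal counts.
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(final_answers)


# Hypothetical final answers extracted from five sampled chains of thought.
samples = ["42", "41", "42", "42", "17"]
answer, share = self_consistency(samples)
print(answer, share)
```

    The same function also shows the failure mode mentioned above: if most samples share the same wrong answer, the vote returns that wrong answer.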

    Previous research has explored various approaches to enhance LLM reasoning. Generative Reward Models (GenRMs) have emerged as a promising technique that frames verification as a next-token prediction task. These models enable test-time scaling by generating multiple verification chains of thought for each candidate solution and aggregating their verdicts into a score. Initial comparisons between GenRM with Best-of-N (BoN) selection and Self-Consistency (SC) suggested that GenRM was more efficient, achieving comparable performance with fewer solution candidates. However, these evaluations fixed the number of solutions rather than the computational budget. Such comparisons can be misleading in practical settings where inference compute is limited, because they ignore the substantial cost of generating multiple verifications for each candidate solution. The key limitation of existing work is that it does not account for true computational cost when comparing verification-based methods with simpler majority voting.
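
    The selection step described above can be sketched as follows, assuming each verification chain yields a score in [0, 1] and that verdicts are aggregated by a simple mean. The aggregation rule, the function name `best_of_n`, and the example scores are assumptions for illustration; the source does not specify them.

```python
from statistics import mean


def best_of_n(candidates, verdicts):
    """Pick the candidate whose verification verdicts have the highest mean.

    candidates: final answers, one per sampled solution.
    verdicts:   verdicts[i] holds the scores in [0, 1] produced by the
                verification chains run on candidates[i].
    """
    if not candidates or len(candidates) != len(verdicts):
        raise ValueError("need one verdict list per candidate")
    scores = [mean(v) for v in verdicts]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


# Hypothetical example: three solutions, two verifications each.
candidates = ["41", "42", "41"]
verdicts = [[0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
answer, score = best_of_n(candidates, verdicts)
print(answer, score)
```

    In this toy example a majority vote would pick "41", while verification picks "42". The price is six extra verification generations on top of the three solutions, which is exactly the cost that a fixed-solution-count comparison leaves out.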

    The proposed method introduces a comprehensive framework for accurately estimating the inference computational budget required by Self-Consistency and GenRMs. This framework enables a fair, compute-matched analysis that compares these test-time scaling strategies under fixed computational constraints. The approach assumes a single Large Language Model serves dual functions as both the solution generator and generative verifier, with verification capabilities activated either through specialized prompting or task-specific fine-tuning. By establishing this unified framework, researchers can systematically analyze the performance trade-offs between generating more solution candidates for Self-Consistency versus allocating compute resources to verification processes in GenRMs. The comparative analysis focuses on measuring effectiveness based on the total number of solutions and verifications generated by the LLM, providing clear metrics for computational efficiency across different reasoning approaches.

    The methodology uses a compute-matched analysis to compare test-time scaling strategies. An autoregressive LLM with P parameters performs roughly 2P FLOPs per output token, so inference compute grows in proportion to the number of tokens generated. With S solutions, V verifications per solution, and λ the ratio of tokens per verification to tokens per solution, the total token count is proportional to S + λSV, which gives the compute measure C(S, V) = S(1+λV). This measure allows Self-Consistency and Generative Reward Models to be evaluated under equivalent computational constraints. SC is scaled across solution counts S ∈ {2^0, 2^1, …, 2^N}, while GenRM is evaluated across combinations of solution and verification counts (S, V). The research also introduces inference scaling laws for GenRM through a six-step methodology that determines the optimal allocation between solutions and verifications. This process involves computing success rates across increasing verification counts, plotting the results against compute budgets, and fitting power laws that relate the optimal solution count (S_opt ∝ C^a) and the optimal verification count (V_opt ∝ C^b) to the budget C.
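
    The compute measure and the power-law fit can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: λ = 1 is an arbitrary choice, the helper names are hypothetical, and the data passed to the fit is synthetic rather than taken from the paper.

```python
import math


def inference_compute(S, V, lam):
    """Relative inference compute C(S, V) = S * (1 + lam * V)."""
    return S * (1 + lam * V)


def compute_matched_solutions(budget, V, lam):
    """Largest solution count S such that C(S, V) <= budget."""
    return int(budget // (1 + lam * V))


def fit_power_law(xs, ys):
    """Least-squares fit of y = k * x**a in log-log space; returns (a, k)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    slope_num = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    slope_den = sum((u - mx) ** 2 for u in lx)
    a = slope_num / slope_den
    k = math.exp(my - a * mx)
    return a, k


lam = 1.0  # assumed: a verification is as long as a solution

# GenRM with 4 solutions and 8 verifications each.
budget = inference_compute(4, 8, lam)

# At the same budget, SC (V = 0) can afford this many solutions.
sc_solutions = compute_matched_solutions(budget, 0, lam)

# Synthetic (budget, optimal S) pairs generated from S_opt = 2 * C**0.5.
budgets = [2 ** i for i in range(1, 7)]
s_opt = [2.0 * c ** 0.5 for c in budgets]
a, k = fit_power_law(budgets, s_opt)
print(budget, sc_solutions, round(a, 3), round(k, 3))
```

    With λ = 1, four solutions verified eight times each cost as much as 36 unverified solutions, which is the comparison a compute-matched analysis makes explicit. The fit recovers the exponent used to generate the synthetic points; with real measurements, the inputs would be the optimal S (or V) found at each budget.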

    The results show a clear pattern when GenRM and SC are compared across compute budgets. SC performs better in low-compute settings, making it the more efficient choice when inference resources are limited. GenRM begins to outperform SC only at roughly 8× the compute budget, and it needs about 128× the inference compute to achieve a modest 3.8% improvement over SC. These findings hold across diverse experimental conditions, including model families such as Llama and Qwen, model sizes from 7B to 70B parameters, specialized thinking models like QwQ-32B, and different reasoning tasks, including mathematics. Because the pattern is consistent regardless of the specific LLM used, the conclusions appear to apply broadly across language models and reasoning tasks.

    The study examines GenRMs as an approach to scaling test-time compute through verification. Previous research showed that scaling both solutions and verifications could outperform SC, but it often neglected the computational cost of verification. This investigation reveals a clear pattern: SC is more effective at lower compute budgets, while GenRMs deliver better performance when more compute is available. These findings are consistent across multiple model families, including specialized thinking models, parameter sizes from 7B to 70B, and diverse reasoning tasks. In addition, the research establishes inference scaling laws that guide how to split a budget between solution generation and verification within GenRM. These insights offer practical guidance for researchers and practitioners who want compute-efficient scaling strategies that maximize reasoning performance in large language models.


    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

    The post This AI Paper Introduces a Machine Learning Framework to Estimate the Inference Budget for Self-Consistency and GenRMs (Generative Reward Models) appeared first on MarkTechPost.
