Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      Top 15 Enterprise Use Cases That Justify Hiring Node.js Developers in 2025

      July 31, 2025

      The Core Model: Start FROM The Answer, Not WITH The Solution

      July 31, 2025

      AI-Generated Code Poses Major Security Risks in Nearly Half of All Development Tasks, Veracode Research Reveals   

      July 31, 2025

      Understanding the code modernization conundrum

      July 31, 2025

      Onboarding your AI peer programmer: Setting up GitHub Copilot coding agent for success

      July 31, 2025

      Quality Over Speed: A Case for Perfectionism

      July 31, 2025

      UK Quantum computing is going universal through scaling

      July 31, 2025

      CodeSOD: What a CAD

      July 31, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      The details of TC39’s last meeting

      July 31, 2025
      Recent

      The details of TC39’s last meeting

      July 31, 2025

      Time-Controlled Data Processing with Laravel LazyCollection Methods

      July 30, 2025

      Create Apple Wallet Passes in Laravel

      July 30, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      Ubuntu 25.10 Snapshot 3 is Available to Download

      July 31, 2025
      Recent

      Ubuntu 25.10 Snapshot 3 is Available to Download

      July 31, 2025

      Proton’s New 2FA Authenticator App Supports Ubuntu

      July 31, 2025

      TUXEDO Computers Presenta l’Ultrabook InfinityBook Pro 15 Gen10

      July 31, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Machine Learning»ThinkPRM: A Generative Process Reward Models for Scalable Reasoning Verification

    ThinkPRM: A Generative Process Reward Models for Scalable Reasoning Verification

    April 29, 2025

    Reasoning with LLMs can benefit from utilizing more test compute, which depends on high-quality process reward models (PRMs) to select promising paths for search or ranking. PRMs score problem-solution pairs to indicate whether the solution is correct, and have been implemented as discriminative classifiers. However, these models require extensive resources, including human annotation, gold step-by-step solutions, or computationally intensive rollouts. LLM-as-a-judge approaches offer advantages in data efficiency and interpretability, but they perform poorly compared to specialized reward models for complex reasoning tasks, failing to recognize incorrect reasoning. This creates a challenge to maintain data-efficiency and interpretability advantages while achieving the superior performance of discriminative PRMs.

    Research approaches to solve process verification challenges have followed three main paths. Discriminative PRMs function as classifiers that predict numerical correctness scores for each reasoning step, requiring extensive step-level annotations. Generative PRMs frame verification as a language-generation task, producing correctness decisions as natural language tokens accompanied by verification chain-of-thought (CoT). These models compute correctness scores through conditional token probabilities like P(“correct”), making them inherently interpretable and scalable. Test-time scaling techniques like Best-of-N selection and tree-based search improve reasoning performance using additional inference-time compute. The effectiveness of these approaches depends heavily on verifier quality for scoring solutions.

    Researchers from the University of Michigan, Mila, LG AI Research, and the University of Illinois Urbana-Champaign have proposed THINKPRM, a long CoT verifier fine-tuned on significantly fewer process labels than those required by discriminative PRMs. It uses the inherent reasoning abilities of long CoT models to outperform both LLM-as-a-Judge and discriminative verifiers while using only 1% of process labels in PRM800K across several challenging benchmarks. Under equal token budgets, THINKPRM scales verification compute more effectively than LLM-as-a-Judge, outperforming it by 7.2% on a ProcessBench subset, highlighting the value of generative, long CoT PRMs for scaling test-time verification compute with minimal supervision.

    The THINKPRM is evaluated against DiscPRM, the same base model finetuned with binary cross-entropy on the entire PRM800K dataset containing 712K process labels from 98K problem-solution pairs. Additional comparisons include unweighted majority voting and verifier-weighted majority for best-of-N experiments. The results are shown on three math reasoning tasks: 100 problems from MATH-500 covering all difficulty levels, 2024 American Invitational Mathematics Examination (AIME) problems, and out-of-domain tasks including physics problems from GPQA-Diamond and a 200-problem subset from LiveCodeBench v5. For MATH-500, researchers used THINKPRM-1.5B and THINKPRM-14B with two different generator models.

    On best-of-N selection with MATH500, THINKPRM achieves higher or comparable reasoning accuracy to DiscPRM across all sampling budgets. Under verifier-guided search on MATH-500, THINKPRM-1.5B outperforms discPRM by approximately 5 percentage points and surpasses LLM-as-a-judge using the same base model (R1-Qwen-1.5B). THINKPRM-1.5B’s scaling curve exceeds all baselines when compared to strong off-the-shelf PRMs like RLHFFlow-Deepseek-PRM and MATH-Shepherd-PRM, outperforming RLHFFlow-Deepseek-PRM by over 7% at 16 beams. For out-of-domain evaluation, THINKPRM shows better scaling than DiscPRM on GPQA-physics, outperforming it by 8%, while on LiveCodeBench, THINKPRM surpasses DiscPRM by 4.5%.

    In conclusion, researchers introduced THINKPRM, a generative process reward model trained with minimal supervision on synthetic data, allowing efficient and scalable verification of step-by-step reasoning. Researchers show that lightweight fine-tuning of generative PRMs on as few as 8K process labels can improve upon zero-shot LLM-as-a-judge baselines. THINKPRM also surpasses discriminative PRMs trained with orders of magnitude more process labels, highlighting the advantages of utilizing generative language-modeling objectives for interpretability, scalability, and data efficiency. The results underscore the potential of generative PRMs to scale verification compute at test-time effectively, benefiting challenging domains such as mathematical and scientific reasoning.


    Check out the Paper. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 90k+ ML SubReddit.

    🔥 [Register Now] miniCON Virtual Conference on AGENTIC AI: FREE REGISTRATION + Certificate of Attendance + 4 Hour Short Event (May 21, 9 am- 1 pm PST) + Hands on Workshop

    The post ThinkPRM: A Generative Process Reward Models for Scalable Reasoning Verification appeared first on MarkTechPost.

    Source: Read More 

    Facebook Twitter Reddit Email Copy Link
    Previous ArticleUniME: A Two-Stage Framework for Enhancing Multimodal Representation Learning with MLLMs
    Next Article WhatsApp Launches Private Processing to Enable AI Features While Protecting Message Privacy

    Related Posts

    Machine Learning

    How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

    July 29, 2025
    Machine Learning

    Amazon Develops an AI Architecture that Cuts Inference Time 30% by Activating Only Relevant Neurons

    July 29, 2025
    Leave A Reply Cancel Reply

    For security, use of Google's reCAPTCHA service is required which is subject to the Google Privacy Policy and Terms of Use.

    Continue Reading

    Linus Torvalds critica duramente i file system che non distinguono tra maiuscole e minuscole!

    Linux

    How to Audit Android Accessibility with the Accessibility Scanner App

    Development

    CVE-2025-4339 – WordPress TheGem Theme Unauthenticated Theme Option Update Vulnerability

    Common Vulnerabilities and Exposures (CVEs)

    CVE-2025-22886 – Apache OpenHarmony Memory Leak Denial of Service

    Common Vulnerabilities and Exposures (CVEs)

    Highlights

    CVE-2025-49599 – Huawei EG8141A5 EG8145V5 EG8145V5-V2 Firewall Bypass Vulnerability

    June 6, 2025

    CVE ID : CVE-2025-49599

    Published : June 6, 2025, 5:15 p.m. | 2 hours, 33 minutes ago

    Description : Huawei EG8141A5 devices through V5R019C00S100, EG8145V5 devices through V5R019C00S100, and EG8145V5-V2 devices through V5R021C00S184 allow the Epuser account to disable ONT firewall functionality, e.g., to remove the default blocking of the SSH and TELNET TCP ports, aka HWNO-56Q3.

    Severity: 4.1 | MEDIUM

    Visit the link for more details, such as CVSS details, affected products, timeline, and more…

    Placemark is a web-based tool for geospatial data

    May 9, 2025

    Pinot is a real-time analytics platform

    May 3, 2025

    Microsoft 365 Web Apps Get Simple Edit Access Request Option

    July 5, 2025
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.