
    ThinkPRM: A Generative Process Reward Model for Scalable Reasoning Verification

    April 29, 2025

    Reasoning with LLMs can benefit from additional test-time compute, but that benefit depends on high-quality process reward models (PRMs) to select promising paths for search or ranking. PRMs score problem-solution pairs to indicate whether the solution is correct, and they have typically been implemented as discriminative classifiers. However, training these models requires extensive resources, including human annotation, gold step-by-step solutions, or computationally intensive rollouts. LLM-as-a-judge approaches offer advantages in data efficiency and interpretability, but on complex reasoning tasks they perform poorly compared to specialized reward models and fail to recognize incorrect reasoning. The challenge is to keep the data efficiency and interpretability of LLM-as-a-judge while matching the performance of discriminative PRMs.

    Research on process verification has followed three main paths. Discriminative PRMs function as classifiers that predict a numerical correctness score for each reasoning step, which requires extensive step-level annotations. Generative PRMs frame verification as a language-generation task, producing correctness decisions as natural language tokens accompanied by a verification chain-of-thought (CoT). These models compute correctness scores from conditional token probabilities such as P(“correct”), making them inherently interpretable and scalable. Test-time scaling techniques such as Best-of-N selection and tree-based search improve reasoning performance by spending additional inference-time compute. The effectiveness of these techniques depends heavily on the quality of the verifier that scores the solutions.
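
    The scoring and selection described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and renormalizing over only two decision tokens ("correct" and "incorrect") is an assumption, since the article says only that scores come from conditional token probabilities like P(“correct”).

```python
import math
from typing import List, Sequence


def correctness_score(logit_correct: float, logit_incorrect: float) -> float:
    """Return P("correct") renormalized over the two decision tokens.

    Assumption: the verifier exposes a logit for each decision token and
    we apply a numerically stable two-way softmax.
    """
    m = max(logit_correct, logit_incorrect)
    e_correct = math.exp(logit_correct - m)
    e_incorrect = math.exp(logit_incorrect - m)
    return e_correct / (e_correct + e_incorrect)


def best_of_n(candidates: Sequence[str], scores: List[float]) -> str:
    """Best-of-N selection: return the candidate with the highest verifier score."""
    if not candidates or len(candidates) != len(scores):
        raise ValueError("candidates and scores must be non-empty and equal in length")
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]
```

    In practice the two logits would be read from the verifier's output distribution at the position where it emits its decision token; here they are plain floats so the sketch stays self-contained.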

    Researchers from the University of Michigan, Mila, LG AI Research, and the University of Illinois Urbana-Champaign have proposed THINKPRM, a long CoT verifier fine-tuned on significantly fewer process labels than discriminative PRMs require. It uses the inherent reasoning abilities of long CoT models to outperform both LLM-as-a-judge and discriminative verifiers across several challenging benchmarks while using only 1% of the process labels in PRM800K. Under equal token budgets, THINKPRM scales verification compute more effectively than LLM-as-a-judge, outperforming it by 7.2% on a ProcessBench subset. This highlights the value of generative, long CoT PRMs for scaling test-time verification compute with minimal supervision.

    THINKPRM is evaluated against DiscPRM, the same base model fine-tuned with binary cross-entropy on the entire PRM800K dataset, which contains 712K process labels from 98K problem-solution pairs. Additional baselines for the best-of-N experiments are unweighted majority voting and verifier-weighted majority voting. Results are reported on math reasoning tasks (100 problems from MATH-500 covering all difficulty levels, and 2024 American Invitational Mathematics Examination (AIME) problems) and on out-of-domain tasks (physics problems from GPQA-Diamond and a 200-problem subset of LiveCodeBench v5). For MATH-500, the researchers used THINKPRM-1.5B and THINKPRM-14B with two different generator models.
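
    The two voting baselines mentioned above differ only in how each sampled answer is counted. The following is a minimal sketch under the usual reading of these terms (one vote per sample versus votes weighted by verifier score); the function names are hypothetical and the article does not spell out the exact aggregation rule.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Unweighted majority voting: every sampled answer counts once."""
    return Counter(answers).most_common(1)[0][0]


def weighted_majority_vote(answers: Sequence[str], scores: Sequence[float]) -> str:
    """Verifier-weighted voting: each answer's votes are summed verifier scores."""
    if not answers or len(answers) != len(scores):
        raise ValueError("answers and scores must be non-empty and equal in length")
    totals: Dict[str, float] = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)
```

    With answers ["42", "42", "41"] and verifier scores [0.2, 0.2, 0.9], unweighted voting picks "42", while weighted voting picks "41" because its single vote carries more verifier mass (0.9 versus 0.4).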

    On best-of-N selection with MATH-500, THINKPRM achieves higher or comparable reasoning accuracy to DiscPRM across all sampling budgets. Under verifier-guided search on MATH-500, THINKPRM-1.5B outperforms DiscPRM by approximately 5 percentage points and surpasses LLM-as-a-judge using the same base model (R1-Qwen-1.5B). Compared with strong off-the-shelf PRMs such as RLHFFlow-Deepseek-PRM and MATH-Shepherd-PRM, THINKPRM-1.5B’s scaling curve exceeds all baselines, outperforming RLHFFlow-Deepseek-PRM by over 7% at 16 beams. In out-of-domain evaluation, THINKPRM scales better than DiscPRM on GPQA-physics, outperforming it by 8%, and on LiveCodeBench it surpasses DiscPRM by 4.5%.
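
    Verifier-guided search can be pictured as a beam search in which the PRM ranks partial solutions. The sketch below is one plausible form, not the procedure used in the paper: `expand`, `score`, and `is_complete` are hypothetical callables standing in for the generator, the verifier, and a termination check.

```python
from typing import Callable, List

Steps = List[str]


def verifier_guided_search(
    expand: Callable[[Steps], List[str]],   # hypothetical: proposes next steps
    score: Callable[[Steps], float],        # hypothetical: verifier score of a prefix
    is_complete: Callable[[Steps], bool],   # hypothetical: solution finished?
    beam_width: int,
    max_depth: int,
) -> Steps:
    """Keep the `beam_width` highest-scoring partial solutions at each depth."""
    beams: List[Steps] = [[]]
    for _ in range(max_depth):
        candidates: List[Steps] = []
        for prefix in beams:
            if is_complete(prefix):
                candidates.append(prefix)  # carry finished solutions forward
            else:
                candidates.extend(prefix + [step] for step in expand(prefix))
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
        if all(is_complete(beam) for beam in beams):
            break
    return max(beams, key=score)
```

    A larger `beam_width` spends more verifier calls per depth, which is the sense in which the number of beams acts as a compute budget in the comparison above.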

    In conclusion, the researchers introduced THINKPRM, a generative process reward model trained with minimal supervision on synthetic data, allowing efficient and scalable verification of step-by-step reasoning. They show that lightweight fine-tuning of generative PRMs on as few as 8K process labels can improve upon zero-shot LLM-as-a-judge baselines. THINKPRM also surpasses discriminative PRMs trained with orders of magnitude more process labels, highlighting the advantages of generative language-modeling objectives for interpretability, scalability, and data efficiency. The results underscore the potential of generative PRMs to scale verification compute effectively at test time, benefiting challenging domains such as mathematical and scientific reasoning.



    The post ThinkPRM: A Generative Process Reward Models for Scalable Reasoning Verification appeared first on MarkTechPost.
