
    Test-Time Preference Optimization: A Novel AI Framework that Optimizes LLM Outputs During Inference with an Iterative Textual Reward Policy

    January 28, 2025

    Large Language Models (LLMs) have become an indispensable part of contemporary life, shaping the future of nearly every conceivable domain. They are widely acknowledged for their impressive performance across tasks of varying complexity. However, LLMs have also been criticized for generating unexpected and unsafe responses. Consequently, ongoing research aims to align LLMs more closely with human preferences while fully leveraging their extensive training data.

    Methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have proven effective, but they require iterative training of model weights, which is costly and often impractical. Researchers are therefore exploring inference-time approaches that can match the performance of training-based optimization methods. This article explores recent research that enhances human preference alignment at inference time.

    Researchers from Shanghai AI Laboratory have introduced Test-Time Preference Optimization (TPO), a novel framework designed to align LLM outputs with human preferences during inference. This framework can be conceptualized as an online, on-policy learning paradigm, where the policy model continuously interacts with a novel reward model to refine its outputs.

    TPO leverages interpretable textual feedback for preference optimization instead of conventional numerical scoring. To achieve this, the authors translate reward signals into textual rewards in the form of critiques. The model then generates suggestions based on the transformed rewards and updates its outputs to align with those signals at test time.
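
    For illustration only, the translation from a scalar reward into a textual reward might look like the sketch below. The `generate` and `score` callables are hypothetical stand-ins for a critic LLM and a scalar reward model, and the prompt wording is invented here rather than taken from the paper.

    ```python
    from typing import Callable, Tuple

    def textual_reward(
        query: str,
        response: str,
        generate: Callable[[str], str],       # critic LLM call (hypothetical stand-in)
        score: Callable[[str, str], float],   # scalar reward model (hypothetical stand-in)
    ) -> Tuple[float, str]:
        """Turn a numerical reward into an interpretable textual critique."""
        numeric = score(query, response)
        critique_prompt = (
            "You are reviewing a candidate answer.\n"
            f"Question:\n{query}\n\n"
            f"Candidate answer:\n{response}\n\n"
            f"A reward model rated this answer {numeric:.2f}.\n"
            "Briefly explain what the answer does well and what should be improved."
        )
        critique = generate(critique_prompt)
        return numeric, critique
    ```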

    At test time, the newly generated responses are scored at each inference-time optimization step, and the responses at the extremes of quality are classified as "chosen" and "rejected" outputs. The model then contrasts the strengths of the "chosen" outputs with the shortfalls of the "rejected" ones to compile a "textual loss," and from it generates suggestions, or "textual gradients," for the next iteration. TPO thus improves the output iteratively through interaction with textual rewards.
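
    Putting these pieces together, one plausible reading of the loop described above is sketched below: sample several candidates, score them, keep the best and worst as "chosen" and "rejected", derive a "textual loss" contrasting the two, and turn it into "textual gradients" that guide the next round of generation. Again, `generate` and `score` are hypothetical stand-ins, and the prompts are illustrative, not the paper's templates.

    ```python
    from typing import Callable

    def tpo_inference(
        query: str,
        generate: Callable[[str], str],       # policy/critic LLM call (hypothetical stand-in)
        score: Callable[[str, str], float],   # scalar reward model (hypothetical stand-in)
        num_candidates: int = 4,
        num_steps: int = 3,
    ) -> str:
        """Minimal sketch of an inference-time preference-optimization loop."""
        hint = ""  # accumulated "textual gradient" guiding later generations
        best_response, best_score = "", float("-inf")

        for _ in range(num_steps):
            # 1. Sample candidate responses, conditioned on any prior suggestions.
            candidates = [
                generate(f"{query}\n\nRevision guidance (if any): {hint}")
                for _ in range(num_candidates)
            ]

            # 2. Score candidates and keep the extremes as chosen / rejected.
            ranked = sorted(candidates, key=lambda r: score(query, r))
            rejected, chosen = ranked[0], ranked[-1]
            chosen_score = score(query, chosen)
            if chosen_score > best_score:
                best_response, best_score = chosen, chosen_score

            # 3. "Textual loss": a critique contrasting chosen vs. rejected outputs.
            textual_loss = generate(
                f"Question:\n{query}\n\nBetter answer:\n{chosen}\n\n"
                f"Worse answer:\n{rejected}\n\n"
                "Explain why the better answer wins and what both still get wrong."
            )

            # 4. "Textual gradient": concrete suggestions for the next iteration.
            hint = generate(
                f"Given this critique:\n{textual_loss}\n\n"
                "List concrete edits that would further improve the better answer."
            )

        return best_response
    ```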

    To validate the concept, the authors evaluated both aligned and unaligned policy models, i.e., models with and without preference optimization during training. Two key models in the study were Llama-3.1-70B-SFT, an unaligned model that did not undergo preference optimization during training, and Llama-3.1-70B-Instruct, an aligned model trained with preference optimization. Experiments spanned several datasets to evaluate instruction following, preference alignment, safety, and mathematical reasoning.

    Results from these experiments confirmed that a few TPO optimization steps significantly improved performance in both aligned and unaligned models. When comparing TPO-based inference optimization with traditional training-based optimization, the researchers found that the unaligned Llama-3.1-70B-SFT model outperformed its aligned counterpart Llama-3.1-70B-Instruct after a few TPO iterations. Furthermore, applying TPO to an aligned model with as few as 22 billion parameters achieved a length-controlled (LC) win-rate score of 53.4% and a win-rate (WR) score of 72.2%.

    Conclusion: The research team introduced TPO, an online, on-policy learning framework that aligns LLM outputs with human preferences. The framework optimizes responses at inference time, eliminating the need for retraining and weight updates. Additionally, TPO offers high scalability and flexibility, making it a promising approach for future LLM applications.


    Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


    The post Test-Time Preference Optimization: A Novel AI Framework that Optimizes LLM Outputs During Inference with an Iterative Textual Reward Policy appeared first on MarkTechPost.
