    Test-Time Preference Optimization: A Novel AI Framework that Optimizes LLM Outputs During Inference with an Iterative Textual Reward Policy

    January 28, 2025

Large Language Models (LLMs) have become an indispensable part of contemporary life, shaping the future of nearly every conceivable domain. They are widely acknowledged for their impressive performance across tasks of varying complexity. However, LLMs have also drawn criticism for generating unexpected and unsafe responses. Consequently, ongoing research aims to align LLMs more closely with human preferences while fully leveraging their extensive training data.

Methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have proven effective, but they require iterative training and weight updates, which is often impractical. Researchers are therefore exploring inference-time approaches that can match the performance of training-based optimization methods. This article explores recent research that improves human preference alignment at inference time.

    Researchers from Shanghai AI Laboratory have introduced Test-Time Preference Optimization (TPO), a novel framework designed to align LLM outputs with human preferences during inference. This framework can be conceptualized as an online, on-policy learning paradigm, where the policy model continuously interacts with a novel reward model to refine its outputs.

TPO leverages interpretable textual feedback for preference optimization instead of conventional numerical scoring. To achieve this, the authors translate reward signals into textual rewards through critiques. The model then generates suggestions guided by these transformed rewards and updates its outputs to align with the signals during testing.
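To make this translation step concrete, below is a minimal sketch of how a preference pair might be turned into a textual critique. The prompt wording and the `llm_generate` callable are illustrative placeholders, not the authors' implementation.

```python
def textual_reward(query: str, chosen: str, rejected: str, llm_generate) -> str:
    """Turn a preference pair into a textual critique (a sketch, not the paper's code).

    llm_generate(prompt) -> str is an assumed helper that queries an LLM and
    returns a single text completion.
    """
    prompt = (
        f"Query:\n{query}\n\n"
        f"Preferred response:\n{chosen}\n\n"
        f"Less preferred response:\n{rejected}\n\n"
        "Explain the strengths of the preferred response and the shortcomings of "
        "the less preferred one, as concrete, actionable feedback."
    )
    # The critique text plays the role of the reward signal: readable feedback
    # instead of a scalar score.
    return llm_generate(prompt)
```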

At test time, the newly generated responses are scored at each inference-time optimization step, and the best and worst responses are labeled as “chosen” and “rejected” outputs. The model then draws on the strengths of the chosen outputs and the shortfalls of the rejected ones to compose a “textual loss,” and from that loss it generates suggestions, or “textual gradients,” for the next iteration. TPO thus improves the output iteratively through interaction with textual rewards.
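Read this way, one plausible rendering of the loop is the following Python sketch. The `policy_generate` and `score` callables, the prompt wording, and the default step counts are assumptions for illustration; the paper's actual prompts and scoring differ.

```python
def tpo_loop(query, policy_generate, score, num_steps=3, num_samples=4):
    """A sketch of the TPO iteration under assumed interfaces (not the authors' code).

    policy_generate(prompt, n) -> list[str]   # sample n responses from the policy LLM
    score(query, response)     -> float       # reward-model score for a response
    """
    # Initial candidate responses for the query.
    responses = policy_generate(query, num_samples)

    for _ in range(num_steps):
        # 1. Score candidates; the extremes become the "chosen" and "rejected" outputs.
        ranked = sorted(responses, key=lambda r: score(query, r))
        rejected, chosen = ranked[0], ranked[-1]

        # 2. "Textual loss": a critique contrasting the chosen and rejected responses
        #    (cf. the textual_reward sketch above).
        loss_text = policy_generate(
            f"Query: {query}\nBetter response: {chosen}\nWorse response: {rejected}\n"
            "Describe what the better response does well and where the worse one falls short.",
            1,
        )[0]

        # 3. "Textual gradient": concrete suggestions derived from the critique.
        gradient_text = policy_generate(
            f"Feedback on previous answers:\n{loss_text}\n"
            "List concrete edits that would improve the next answer.",
            1,
        )[0]

        # 4. Update step: regenerate candidates conditioned on the suggestions.
        responses = policy_generate(
            f"Query: {query}\nApply these suggestions when answering:\n{gradient_text}",
            num_samples,
        )

    # Return the highest-scoring response after the final iteration.
    return max(responses, key=lambda r: score(query, r))
```

The key design choice is that nothing in this loop touches model weights: the “loss” and “gradient” are pieces of text fed back into the prompt, which is what makes the optimization a purely inference-time procedure.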

To validate the concept, the authors evaluated both aligned and unaligned policy models, that is, models that had or had not undergone preference optimization during training. Two key models in the study were Llama-3.1-70B-SFT, an unaligned model that did not undergo preference optimization during training, and Llama-3.1-70B-Instruct, an aligned model trained with preference optimization. Additionally, experiments spanned several datasets to evaluate instruction following, preference alignment, safety, and mathematical reasoning.

Results from these experiments confirmed that a few TPO optimization steps significantly improved performance in both aligned and unaligned models. When comparing TPO-based inference optimization with traditional training-time optimization, the researchers found that the unaligned Llama-3.1-70B-SFT model outperformed its aligned counterpart, Llama-3.1-70B-Instruct, after several TPO steps. Furthermore, applying TPO to an aligned model with as few as 22 billion parameters achieved an LC (length-controlled win rate) score of 53.4% and a WR (win rate) score of 72.2%.

Conclusion: The research team introduced TPO, an online, on-policy learning framework that aligns LLM outputs with human preferences. The framework optimizes responses at inference time, eliminating the need for retraining and weight updates, and it offers high scalability and flexibility, making it a promising approach for future LLM applications.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


    The post Test-Time Preference Optimization: A Novel AI Framework that Optimizes LLM Outputs During Inference with an Iterative Textual Reward Policy appeared first on MarkTechPost.
