
    This Paper from Google DeepMind Presents Conditioned Language Policies (CLP): A Machine Learning Framework for Finetuning Language Models on Multiple Objectives

    July 28, 2024

Reinforcement learning (RL) finetuning is an important step in training language models (LMs) to behave in specific ways and follow human preferences. In practice, RL finetuning involves multiple goals, reflecting diverse human preferences and use cases. Multi-objective finetuning (MOFT) trains a single LM to serve many objectives at once, overcoming the limitations of single-objective finetuning (SOFT). For LMs, MOFT has been explored through prompt-based and parameter-based methods. Prompt-based methods finetune an LM by including the reward weightings in the prompt; however, this approach can be less effective at steering the model and is sensitive to how the weightings are presented. Further, zero-shot MOFT may perform poorly on intermediate weightings that are not encountered during training.
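
To make the single- versus multi-objective distinction concrete, here is a minimal sketch of linear reward scalarization, the standard way several reward signals are collapsed into one scalar for RL finetuning. The reward names and values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def scalarize(rewards: np.ndarray, w: np.ndarray) -> float:
    """Collapse per-objective rewards into one scalar via a weighting w."""
    assert np.isclose(w.sum(), 1.0), "weightings are assumed to lie on the simplex"
    return float(rewards @ w)

# Illustrative scores for one response: helpfulness = 0.8, harmlessness = 0.6.
# A 70/30 weighting yields the single scalar an RL finetuner would maximize.
print(scalarize(np.array([0.8, 0.6]), np.array([0.7, 0.3])))  # 0.74
```

SOFT fixes one such weighting before training; MOFT has to serve every weighting with a single model, which is what makes intermediate weightings the hard case.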

The two main techniques for multi-reward alignment (or MOFT) are prompt-based and parameter-based conditioning. Prompt-based conditioning includes approaches like Personalized Soups (PS), which uses custom prompts to personalize LMs based on binary weights for different rewards. Rewarded Soups (RS) offers a zero-shot method that averages, at inference time, the parameters of LMs trained independently per reward. A recent paper embeds reward weightings as singular values within the AdaLoRA framework. For KL realignment, decoding-time realignment (DeRa) linearly mixes logits between π_ref and an LM learned through SOFT with the minimum KL weight.
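
As a rough illustration of parameter-space conditioning, the sketch below averages the weights of per-reward finetuned models at inference time, in the spirit of Rewarded Soups. The helper name and the toy two-model setup are assumptions for the example, not the published implementation.

```python
import torch

def soup(state_dicts: list[dict], weights: list[float]) -> dict:
    """Weighted average of parameter tensors from per-reward finetuned LMs."""
    return {
        name: sum(w * sd[name] for w, sd in zip(weights, state_dicts))
        for name in state_dicts[0]
    }

# Toy "models": one tensor each, finetuned for reward A and for reward B.
model_a = {"layer.weight": torch.tensor([1.0, 0.0])}
model_b = {"layer.weight": torch.tensor([0.0, 1.0])}
blended = soup([model_a, model_b], [0.7, 0.3])
print(blended["layer.weight"])  # tensor([0.7000, 0.3000])
```

Because no training ever happens at the blended weighting, intermediate mixtures can land off the Pareto front, which is exactly the gap CLP targets.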

A team from Google has proposed a general MOFT framework called Conditioned Language Policy (CLP) that uses parameter-space conditioning and multi-task training. The method is more steerable than purely prompt-based techniques because it builds on the parameter-conditioning idea of RS. Moreover, CLP produces higher-quality responses than zero-shot methods like RS by finetuning on different reward weightings, while offering the same or better steerability. In a series of experiments, the team found that CLP Pareto-dominates RS and is more controllable than prompt-based MOFT, and that it maintains these advantages across conditions, including different reward choices and model sizes.

The proposed method, CLP, learns a set of parameters that can be combined, via parameter averaging, into a conditioned LM for any given weighting across rewards and the KL regularizer. The learning algorithm samples a range of weightings during training, improving the Pareto front for all weightings at once; this amounts to multi-task learning across weightings while maximizing the MOFT objective. An automated evaluation with Gemini 1.0 Ultra shows that CLP is more steerable and generates better responses than existing baselines. The team also proved a new theoretical result showing that zero-shot methods can be nearly Pareto-optimal when the optimal policies for the individual rewards are aligned.
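
The sketch below gives a hypothetical, heavily simplified picture of that training loop: sample a weighting, form the conditioned parameters by weighting-dependent averaging of learnable anchors, and take a gradient step on the scalarized objective (KL regularization omitted for brevity). The anchor parameterization and toy reward functions are assumptions for illustration, not the paper's actual architecture.

```python
import torch

torch.manual_seed(0)

# Two learnable "anchor" parameter vectors; the conditioned policy for a
# weighting w is their w-weighted average (a toy stand-in for CLP's
# parameter-space conditioning).
anchors = torch.nn.Parameter(torch.randn(2, 4))

def rewards(theta: torch.Tensor) -> torch.Tensor:
    """Toy per-objective rewards: closeness to two conflicting targets."""
    targets = torch.stack([torch.ones(4), -torch.ones(4)])
    return -((theta - targets) ** 2).mean(dim=1)  # shape (2,), higher is better

opt = torch.optim.Adam([anchors], lr=0.05)
for step in range(300):
    # Multi-task training: sample a fresh weighting on the simplex each step.
    w = torch.distributions.Dirichlet(torch.ones(2)).sample()
    theta = w @ anchors           # conditioned parameters for this weighting
    loss = -(w @ rewards(theta))  # maximize the scalarized reward
    opt.zero_grad()
    loss.backward()
    opt.step()

# At inference, any weighting yields a conditioned policy in a single pass.
print((torch.tensor([0.5, 0.5]) @ anchors).detach())
```

Sampling weightings every step is what lets one parameter set cover the whole simplex, rather than one model per weighting as in RS.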

Benchmarking results were obtained for three setups: single reward with multiple KL regularizers, two rewards with a fixed KL regularizer, and three rewards with a fixed KL regularizer. In the single-reward setup, CLP is twice as computationally efficient as DeRa at inference, because DeRa makes two LM calls per token while CLP makes one. Multi-task training helps CLP outperform the zero-shot RS baseline, and both full-CLP and attn-CLP maintain a more spread-out, steerable Pareto front than logit-CLP and the prompting baseline. Overall, attn-CLP offers a good balance between Pareto-front quality and steerability while using fewer parameters than current baselines.
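
The factor-of-two claim follows from the shape of DeRa-style decoding: logits from the reference model and the SOFT-trained model must both be computed at every step before mixing, whereas a conditioned CLP model needs one forward pass. The sketch below is a hypothetical interface written to show the call count, not a real decoding implementation.

```python
import torch

def dera_next_logits(ref_model, tuned_model, ids: torch.Tensor, lam: float):
    """Decoding-time realignment: mix logits per token (two LM calls)."""
    return (1 - lam) * ref_model(ids) + lam * tuned_model(ids)  # calls 1 and 2

def clp_next_logits(conditioned_model, ids: torch.Tensor):
    """CLP: the weighting is baked into the parameters (one LM call)."""
    return conditioned_model(ids)

# Toy stand-ins for LMs mapping token ids to vocabulary logits.
vocab = 8
ref = torch.nn.Embedding(vocab, vocab)
tuned = torch.nn.Embedding(vocab, vocab)
ids = torch.tensor([1, 2, 3])
print(dera_next_logits(ref, tuned, ids, lam=0.5).shape)  # torch.Size([3, 8])
```

Varying lam at decode time is how DeRa trades reward against KL; CLP pays that cost once, at parameter-combination time, instead of at every token.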

In this paper, a team from Google introduced Conditioned Language Policy (CLP), a flexible framework for MOFT that uses multi-task training and parameter-efficient finetuning to create adaptable LMs that balance multiple individual rewards at inference time. The paper includes extensive benchmarking and ablation studies to identify the factors that yield steerable LMs within the CLP framework. The team also presented theoretical results characterizing when zero-shot approaches work and why multi-task training is needed for near-optimal behavior. Future directions include other conditioning mechanisms such as soft tokens, automated tuning of the weight-sampling distribution, and non-linear reward scalarization.

Check out the Paper. All credit for this research goes to the researchers of this project.

Source: MarkTechPost