Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) by encouraging generations that score highly under a reward model trained on human preferences. However, RLHF has several unresolved issues. First, fine-tuning is often limited to small datasets, causing the model to over-specialize and forget much of the broad knowledge acquired during pre-training, which can degrade the LLM's reasoning abilities and performance on NLP benchmarks. Second, maximizing an imperfect reward model (RM) invites reward hacking, where the LLM exploits flaws in the RM rather than genuinely improving. Lastly, RLHF can reduce output diversity, causing the model to collapse onto similar responses.
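For reference, methods in this family typically maximize a KL-regularized reward of the following standard form (a sketch of the usual objective, not a formula quoted from the paper; β sets the regularization strength and π_ref is the reference policy, which WARP, described below, replaces with an exponential moving average of the policy itself):

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D}}\!\left[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big]
  \;-\; \beta\, \mathrm{KL}\!\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)
\right]
```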
This paper touches on two related topics. The first is how to merge models. Recently, the idea of merging deep models in the weight space, rather than in the prediction space as traditionally done in ensembling, has gained considerable attention. This approach is called weight averaging (WA), and its most common form is linear interpolation (LERP), which was initially used to average checkpoints from a single run, either uniformly or with an exponential moving average (EMA). The second topic is the benefits of model merging: WA improves generalization by reducing variance and memorization and by flattening the loss landscape. Moreover, merging weights combines their strengths, which is useful in multi-task setups.
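As a concrete illustration, here is a minimal sketch of both operations over PyTorch-style state dicts, assuming the checkpoints share the same architecture; the interpolation weight `lam` and decay `decay` are illustrative values, not taken from the paper:

```python
import torch

def lerp(state_a, state_b, lam=0.5):
    """Linear interpolation (LERP) of two checkpoints with matching keys."""
    return {k: (1 - lam) * state_a[k] + lam * state_b[k] for k in state_a}

def ema_update(ema_state, new_state, decay=0.99):
    """Exponential moving average (EMA) of checkpoints along a single run."""
    return {k: decay * ema_state[k] + (1 - decay) * new_state[k] for k in ema_state}

# Toy usage with two random "checkpoints" of a single linear layer.
ckpt_a = {"w": torch.randn(4, 4)}
ckpt_b = {"w": torch.randn(4, 4)}
merged = lerp(ckpt_a, ckpt_b, lam=0.5)
ema = ema_update(ckpt_a, ckpt_b, decay=0.99)
```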
A team from Google DeepMind has proposed Weight Averaged Rewarded Policies (WARP), a method to align LLMs and optimize the Kullback-Leibler (KL)-reward Pareto front of solutions. WARP uses three types of WA at three stages of the alignment process, each for a distinct reason. First, it uses an exponential moving average of the policy as a flexible reference point in the KL regularization. Second, it merges independently fine-tuned policies into an improved policy through spherical interpolation. Third, it linearly interpolates between the merged model and the initialization to recover features from pre-training. The process is then repeated, with each final model serving as the starting point for the next iteration, progressively improving the KL-reward Pareto front and obtaining better rewards at a fixed KL.
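A minimal sketch of one such iteration is shown below, under illustrative simplifications: spherical interpolation (SLERP) is applied to the task vectors (the weight deltas from the shared initialization), only two policies are merged pairwise, and `rl_finetune` is a placeholder for the REINFORCE loop with the EMA anchor:

```python
import torch

def slerp(a, b, t=0.5, eps=1e-8):
    """Spherical linear interpolation (SLERP) between two weight tensors."""
    a_flat, b_flat = a.flatten(), b.flatten()
    cos_omega = torch.dot(a_flat, b_flat) / (a_flat.norm() * b_flat.norm() + eps)
    omega = torch.acos(cos_omega.clamp(-1 + eps, 1 - eps))
    sin_omega = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / sin_omega) * a + (torch.sin(t * omega) / sin_omega) * b

def warp_iteration(theta_init, rl_finetune, eta=0.5):
    """One WARP iteration (sketch): RL-fine-tune two policies, SLERP-merge
    their task vectors, then interpolate back toward the initialization."""
    deltas = []
    for _ in range(2):
        # (a) RL fine-tuning with an EMA of the policy as the KL anchor;
        #     `rl_finetune` stands in for that training loop.
        theta = rl_finetune(theta_init)
        deltas.append({k: theta[k] - theta_init[k] for k in theta_init})
    # (b) SLERP-merge the two task vectors, layer by layer.
    merged = {k: slerp(deltas[0][k], deltas[1][k], t=0.5) for k in theta_init}
    # (c) Linearly interpolate between the merged model and the initialization.
    return {k: theta_init[k] + eta * merged[k] for k in theta_init}

# Toy usage: a dummy "RL fine-tune" that just perturbs the weights.
init = {"w": torch.randn(4, 4)}
dummy_rl = lambda theta: {k: v + 0.1 * torch.randn_like(v) for k, v in theta.items()}
next_init = warp_iteration(init, dummy_rl, eta=0.5)
```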
In the team's experiment, the Gemma 7B LLM is fine-tuned with RLHF into a better conversational agent. The REINFORCE policy gradient is used to optimize the KL-regularized reward. On-policy samples are generated from a dataset of conversation prompts with a temperature of 0.9, a batch size of 128, the Adam optimizer with a learning rate of 1e-6 and a warmup of 100 steps, and SLERP is applied to each of the 28 layers separately. Notably, the experiment relies on the highest-capacity reward model available, which prevents the use of an oracle control RM.
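As a rough illustration of the training objective only (the exact estimator, baseline, and placement of the KL term are implementation details not specified here), one common way to combine REINFORCE with a KL penalty toward the EMA anchor looks like this:

```python
import torch

def reinforce_kl_loss(logprobs, anchor_logprobs, rewards, beta=0.1):
    """REINFORCE surrogate loss on a KL-regularized reward (illustrative).

    logprobs:        per-token log pi(y_t | x, y_<t) under the current policy
                     (with gradients), shape [batch, seq_len]
    anchor_logprobs: the same quantities under the EMA anchor (no gradients)
    rewards:         per-sequence scalar rewards from the RM, shape [batch]
    beta:            strength of the KL regularization
    """
    seq_logprob = logprobs.sum(dim=-1)
    # Monte-Carlo estimate of the sequence-level KL to the anchor.
    kl = (logprobs - anchor_logprobs).sum(dim=-1)
    # KL-regularized reward, treated as a fixed return for the policy gradient.
    regularized_reward = (rewards - beta * kl).detach()
    # Minimizing this surrogate maximizes the expected regularized reward.
    return -(regularized_reward * seq_logprob).mean()

# Toy usage with random tensors standing in for model outputs.
logprobs = torch.randn(4, 16, requires_grad=True)
anchor_logprobs = torch.randn(4, 16)
rewards = torch.randn(4)
loss = reinforce_kl_loss(logprobs, anchor_logprobs, rewards)
loss.backward()
```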
Side-by-side comparisons were made between the trained policies and the Mistral and Mixtral LLMs. Each policy generated a candidate answer for a set of prompts, as described in the Gemma tech report. Similar to Gemini 1.5, side-by-side preference rates were computed with "much better", "better", and "slightly better" receiving scores of ±1.5, ±1, and ±0.5, respectively, and ties receiving a score of 0; a positive score indicates that the evaluated policy is preferred. The results show that WARP is effective: the proposed policies were preferred over the Mistral variants and outperformed the previous Gemma 7B releases.
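For concreteness, here is a minimal sketch of how such signed preference scores could be aggregated; the label strings and data structure are illustrative, not taken from the report:

```python
# Signed scores for side-by-side ratings; the sign encodes which policy won.
SCORES = {"much better": 1.5, "better": 1.0, "slightly better": 0.5, "tie": 0.0}

def side_by_side_score(ratings):
    """Average signed preference score over (label, sign) pairs, where sign is
    +1 if the evaluated policy was preferred and -1 if the baseline was."""
    return sum(sign * SCORES[label] for label, sign in ratings) / len(ratings)

# Example: two wins for the evaluated policy, one loss, one tie -> 0.25.
print(side_by_side_score([("much better", +1), ("slightly better", +1),
                          ("better", -1), ("tie", +1)]))
```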
In conclusion, a team from Google DeepMind has introduced Weight Averaged Rewarded Policies (WARP), a novel RLHF method to align LLMs and optimize the KL-reward Pareto front of solutions. It relies on three distinct stages of model merging: (a) an exponential moving average as a dynamic anchor during RL, (b) spherical interpolation to combine multiple independently rewarded policies, and (c) interpolation towards the shared initialization. Applied iteratively, WARP improves the KL-reward Pareto front, aligning the LLM while protecting the knowledge from pre-training, and compares favorably against state-of-the-art baselines. In the future, WARP could help create safe and powerful AI systems by improving alignment and encouraging further study of model merging techniques.