
    This Machine Learning Research from Microsoft Introduces an Active Preference Elicitation Method for the Online Alignment of Large Language Models

    June 4, 2024

Large Language Models (LLMs) have advanced significantly in recent years, largely because of their improved ability to follow human instructions. Reinforcement Learning from Human Feedback (RLHF) is the main technique for aligning LLMs with human intent. It works by optimizing a reward function, which can either be reparameterized within the LLM’s policy or kept as a separate model.

This reward function is derived from data on human preferences over prompt-response pairs. The diversity of responses in the preference data is critical to the alignment’s effectiveness: it prevents reward models from becoming trapped in local optima, which in turn supports more adaptable and capable language models.
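Reward models in RLHF are typically fit to pairwise preference data with a Bradley-Terry objective, which models the probability that the chosen response beats the rejected one as a sigmoid of their reward difference. A minimal sketch of that per-pair loss (function and variable names here are illustrative, not from the paper):

```python
import math

def bradley_terry_nll(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of one preference pair under the Bradley-Terry
    model: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)), written in a numerically stable form
    return math.log1p(math.exp(-margin))

# The loss shrinks as the reward model scores the chosen response higher.
print(bradley_terry_nll(2.0, 0.0))  # small loss: margin already positive
print(bradley_terry_nll(0.0, 2.0))  # large loss: preference violated
```

Minimizing this loss over many pairs pushes the reward function to rank preferred responses above rejected ones.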

Alignment can be performed either offline or online. Offline alignment attempts to manually curate a variety of responses for a fixed set of prompts, but this approach struggles to cover the wide range of natural-language possibilities. Online alignment, by contrast, uses an iterative procedure: responses are sampled from the LLM, feedback on those responses generates new preference data, and that data is used to train the reward model.

Because sampling is random, out-of-distribution (OOD) regions can in principle be explored. In most online RLHF setups, however, the LLM’s only objective is to maximize the expected reward on the data gathered so far. This passive exploration frequently yields responses clustered around local optima, which can cause overfitting and premature convergence while leaving high-reward regions unexplored.
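The iterative online loop can be illustrated with a toy example: a preference oracle stands in for human feedback, and a tabular reward estimate is updated from each pairwise comparison. This is a deliberately simplified sketch of the feedback cycle, not the paper’s training procedure:

```python
from itertools import combinations

# Toy "prompt" with a hidden quality score per candidate response.
HIDDEN_QUALITY = {"a": 0.1, "b": 0.5, "c": 0.9}

# Learned reward estimates, updated online from pairwise feedback.
reward = {resp: 0.0 for resp in HIDDEN_QUALITY}
LR = 0.1  # update step size

for _round in range(50):
    # In real online RLHF, response pairs are sampled from the LLM;
    # here we simply cycle through every candidate pair.
    for y1, y2 in combinations(HIDDEN_QUALITY, 2):
        # A preference oracle stands in for human (or AI) feedback.
        winner, loser = (y1, y2) if HIDDEN_QUALITY[y1] >= HIDDEN_QUALITY[y2] else (y2, y1)
        # Nudge the reward estimates toward the observed preference.
        reward[winner] += LR
        reward[loser] -= LR

best = max(reward, key=reward.get)
print(best)  # -> c
```

Even this toy version shows the key dependence: the reward model only learns about responses that actually get sampled and compared, which is why passive sampling can leave high-reward regions invisible.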

Preference optimization has proven highly effective at aligning LLMs with human goals, especially when combined with Reinforcement Learning from Human Feedback. Collecting online feedback, from humans or AI, on model outputs typically produces more capable reward models and better-aligned LLMs than offline alignment, which depends on a fixed dataset. However, building a globally accurate reward model requires systematic exploration to elicit diverse responses across the vast space of natural language, a requirement that random sampling from an ordinary reward-maximizing LLM cannot meet.

To address this issue, the researchers propose a bilevel objective that is optimistically biased toward potentially high-reward responses, so that OOD regions are explored actively rather than incidentally. The resulting approach, called Self-Exploring Language Models (SELM), solves the inner-level problem with a reparameterized reward function, eliminating the need for a separate reward model, and updates the LLM iteratively with a simple objective.
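The reparameterized reward mentioned above follows the DPO construction, where the implicit reward is beta * log(pi(y|x) / pi_ref(y|x)) and no separate reward model is trained. A minimal numeric sketch is below; note that the alpha-weighted optimism bonus in `selm_style_loss` is a schematic stand-in for SELM’s self-exploration term, a hypothetical simplification rather than the paper’s exact formulation:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss: the reward is reparameterized inside the policy as
    beta * log(pi(y|x) / pi_ref(y|x)), so no separate reward model is needed."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # -log(sigmoid(margin))

def selm_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                    beta=0.1, alpha=0.01):
    """Schematic SELM-style objective: the DPO loss plus an optimism bonus
    (here simply alpha * logp_w, an illustrative simplification) that biases
    the update toward potentially high-reward responses."""
    base = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    return base - alpha * logp_w

# The loss falls as the policy prefers the chosen response over the rejected one.
low = dpo_loss(-5.0, -9.0, -7.0, -7.0)   # policy ranks the winner higher
high = dpo_loss(-9.0, -5.0, -7.0, -7.0)  # policy ranks the loser higher
print(low < high)  # -> True
```

The log-probabilities here are scalar placeholders; in practice they would be summed token log-probabilities under the current policy and the frozen reference model.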

Compared with Direct Preference Optimization (DPO), SELM aims to improve exploration efficiency and reduce the indiscriminate favoring of unseen extrapolations. Experimental results show that SELM substantially improves performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0 when fine-tuned from the Zephyr-7B-SFT and Llama-3-8B-Instruct models, and it also performs well across a range of standard academic benchmarks in diverse settings.

In conclusion, by ensuring that LLMs not only follow instructions precisely but also consider a broad range of possible responses, this approach marks a substantial advance in aligning LLMs with human intent and should ultimately yield more capable and reliable language models.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


    The post This Machine Learning Research from Microsoft Introduces an Active Preference Elicitation Method for the Online Alignment of Large Language Models appeared first on MarkTechPost.
