
    Self-Play Preference Optimization (SPPO): An Innovative Machine Learning Approach to Finetuning Large Language Models (LLMs) from Human/AI Feedback

    May 7, 2024

    Large Language Models (LLMs) have demonstrated remarkable abilities in generating human-like text, answering questions, and writing code. However, they face hurdles in applications that demand high reliability, safety, and ethical adherence. Reinforcement Learning from Human Feedback (RLHF), also known as Preference-based Reinforcement Learning (PbRL), has emerged as a promising solution: the framework has shown significant success in fine-tuning LLMs to align with human preferences, enhancing their usefulness.

    Existing RLHF approaches, like InstructGPT, rely on explicit or implicit reward models such as the Bradley-Terry model. Recent research explores direct preference probabilities to better represent human preferences. Some researchers formulate RLHF as finding Nash equilibria in two-player constant-sum games, proposing mirror-descent and Self-play Preference Optimization (SPO) methods. Direct Nash Optimization (DNO) was also introduced based on win-rate gaps, yet its practical implementation still relies on iterative DPO frameworks.
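
    For context, the Bradley-Terry model turns a gap between scalar rewards into a pairwise preference probability. A minimal sketch in Python (the reward values are illustrative, not drawn from any trained reward model):

        import torch

        def bradley_terry_prob(reward_a: torch.Tensor, reward_b: torch.Tensor) -> torch.Tensor:
            """P(response A is preferred over response B) = sigmoid of the reward gap."""
            return torch.sigmoid(reward_a - reward_b)

        # A reward gap of 1.5 corresponds to roughly an 82% preference probability.
        print(bradley_terry_prob(torch.tensor(2.0), torch.tensor(0.5)))  # ~0.8176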

    Researchers from the University of California, Los Angeles and Carnegie Mellon University introduce Self-Play Preference Optimization (SPPO), a robust self-play framework for language model alignment that addresses these RLHF challenges. It offers provable guarantees for solving two-player constant-sum games and scales to large language models. Formulating RLHF as such a game, the objective is to identify the Nash equilibrium policy, which consistently produces preferred responses. The authors propose an adaptive algorithm based on multiplicative weights, employing a self-play mechanism where the policy fine-tunes itself on synthetic data annotated by the preference model.
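
    The multiplicative-weights idea can be sketched over a finite set of candidate responses, assuming win-rate estimates from the preference model are available. The function name and toy numbers below are illustrative, not taken from the paper's implementation:

        import numpy as np

        def multiplicative_weight_step(policy_probs: np.ndarray,
                                       win_rates: np.ndarray,
                                       eta: float = 1.0) -> np.ndarray:
            """One exponential-weights update: candidates that the preference model
            expects to beat a sample from the current policy gain probability mass.

            policy_probs: current policy probabilities over candidate responses.
            win_rates:    estimated P(candidate beats the current policy | prompt).
            """
            new_probs = policy_probs * np.exp(eta * win_rates)
            return new_probs / new_probs.sum()

        # Toy example with three candidate responses.
        probs = np.array([0.5, 0.3, 0.2])
        wins = np.array([0.4, 0.7, 0.5])   # hypothetical preference-model estimates
        print(multiplicative_weight_step(probs, wins, eta=2.0))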

    The framework aims to solve this two-player constant-sum game efficiently and at scale for large language models. It proceeds iteratively, combining multiplicative weight updates with the self-play mechanism, and the policy asymptotically converges to the optimal policy, i.e., the Nash equilibrium. Theoretical analysis provides provable convergence guarantees. Compared to existing methods like DPO and IPO, SPPO demonstrates improved convergence and efficiently addresses data-sparsity issues.
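
    As a toy, runnable illustration of this iteration on a single prompt with a fixed candidate set (the preference matrix is made up; in practice the candidates are sampled from the LLM itself and the win rates come from the preference model):

        import numpy as np

        # pref[i, j] is a hypothetical P(candidate i beats candidate j) for one prompt.
        pref = np.array([[0.5, 0.6, 0.3],
                         [0.4, 0.5, 0.2],
                         [0.7, 0.8, 0.5]])

        policy = np.full(3, 1.0 / 3.0)   # start from a uniform policy over the candidates
        eta = 2.0

        for t in range(5):
            # Self-play: each candidate's win rate against a response drawn from the current policy.
            win_rates = pref @ policy
            # Multiplicative-weights update; probability mass concentrates on the equilibrium response.
            policy = policy * np.exp(eta * win_rates)
            policy /= policy.sum()
            print(t, np.round(policy, 3))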

    The researchers evaluate models with GPT-4 as an automatic judge, presenting results on AlpacaEval 2.0 and MT-Bench. SPPO models improve consistently across iterations, with SPPO Iter3 showing the highest win rate. Compared to DPO and IPO, SPPO achieves superior performance and effectively controls output length. Test-time reranking with the PairRM reward model consistently improves model performance without over-optimization. SPPO outperforms many state-of-the-art chatbots on AlpacaEval 2.0 and remains competitive with GPT-4 on MT-Bench.
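
    Test-time reranking of this kind amounts to best-of-n selection with a pairwise preference model. A minimal sketch, assuming a hypothetical pairwise_score(prompt, a, b) that returns P(a is preferred over b); PairRM provides equivalent ranking capability, but the interface below is illustrative:

        def rerank_best_of_n(prompt, candidates, pairwise_score):
            """Pick the candidate with the highest average win probability against the rest."""
            def avg_win(i):
                others = [c for j, c in enumerate(candidates) if j != i]
                return sum(pairwise_score(prompt, candidates[i], c) for c in others) / len(others)
            return candidates[max(range(len(candidates)), key=avg_win)]

        # Example with a dummy scorer that prefers longer responses (illustration only).
        dummy_score = lambda p, a, b: 1.0 if len(a) > len(b) else 0.0
        print(rerank_best_of_n("prompt", ["short", "a much longer answer", "a mid one"], dummy_score))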

    To conclude, the paper introduces Self-Play Preference Optimization (SPPO), a robust method for fine-tuning LLMs from human/AI feedback. By employing self-play in a two-player game with a preference-based learning objective, SPPO significantly improves over existing methods like DPO and IPO across various benchmarks. By integrating a preference model and batched estimation, SPPO aligns LLMs closely with human preferences and mitigates issues such as “length bias” reward hacking. These findings suggest SPPO’s potential for enhancing the alignment of generative AI systems and support its broader adoption in LLMs and beyond.

    Check out the Paper. All credit for this research goes to the researchers of this project.
