Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      Sunshine And March Vibes (2025 Wallpapers Edition)

      May 13, 2025

      The Case For Minimal WordPress Setups: A Contrarian View On Theme Frameworks

      May 13, 2025

      How To Fix Largest Contentful Paint Issues With Subpart Analysis

      May 13, 2025

      How To Prevent WordPress SQL Injection Attacks

      May 13, 2025

      This $4 Steam Deck game includes the most-played classics from my childhood — and it will save you paper

      May 13, 2025

      Microsoft shares rare look at radical Windows 11 Start menu designs it explored before settling on the least interesting one of the bunch

      May 13, 2025

      NVIDIA’s new GPU driver adds DOOM: The Dark Ages support and improves DLSS in Microsoft Flight Simulator 2024

      May 13, 2025

      How to install and use Ollama to run AI LLMs on your Windows 11 PC

      May 13, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      Community News: Latest PECL Releases (05.13.2025)

      May 13, 2025
      Recent

      Community News: Latest PECL Releases (05.13.2025)

      May 13, 2025

      How We Use Epic Branches. Without Breaking Our Flow.

      May 13, 2025

      I think the ergonomics of generators is growing on me.

      May 13, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      This $4 Steam Deck game includes the most-played classics from my childhood — and it will save you paper

      May 13, 2025
      Recent

      This $4 Steam Deck game includes the most-played classics from my childhood — and it will save you paper

      May 13, 2025

      Microsoft shares rare look at radical Windows 11 Start menu designs it explored before settling on the least interesting one of the bunch

      May 13, 2025

      NVIDIA’s new GPU driver adds DOOM: The Dark Ages support and improves DLSS in Microsoft Flight Simulator 2024

      May 13, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Artificial Intelligence»AI models can cheat, lie, and game the system for rewards

    AI models can cheat, lie, and game the system for rewards

    June 19, 2024

    Anthropic researchers found that misspecified training goals and tolerance of sycophancy can lead AI models to game the system to increase rewards.

    Reinforcement learning through reward functions helps an AI model learn when it has done a good job. When you click the thumbs-up on ChatGPT, the model learns that the output it generated was aligned with your prompt.

    The researchers found that when a model is presented with poorly defined objectives, it can engage in “specification gaming” to cheat the system in pursuit of the reward.

    Specification gaming could be as simple as sycophancy, where the model agrees with you even when it knows you’re wrong.

    When an AI model chases poorly thought-out reward functions, it can lead to unexpected behavior.

    In 2016, OpenAI found that an AI playing a boat-racing game called CoastRunners, learned it could earn more points by moving in a tight circle to hit targets rather than completing the course like a human would.

    The Anthropic researchers found that when models learned low-level specification gaming, the models could eventually generalize to more serious reward tampering.

    Their paper describes how they set up a “curriculum” of training environments where an LLM was given the opportunity to cheat the system starting with relatively benign scenarios like sycophancy.

    For example, early on in the curriculum, the LLM could respond positively to a user’s political views, even if they were inaccurate or inappropriate, to obtain the training reward.

    In the next stage, the model learned that it could alter a checklist to cover up that it hadn’t completed a task.

    After progressing through increasingly more difficult training environments, the model eventually learned a generalized ability to lie and cheat to obtain the reward.

    The experiment culminated in a disturbing scenario where the model edited the training code defining its reward function so it would always obtain the maximum reward, regardless of its output, even though it had never been trained to do that.

    It also edited the code that checked if the reward function had been altered.

    An AI model starts with insincere flattery but eventually learns to tamper with its reward function while hiding its actions from the operator during training. Source: Anthropic

    Some models even edited their event logs to hide their deception from the human operator. The model’s internal monologue chain of thought process was captured in a scratchpad.

    Reward tampering only happened 45 times out of 32,768 trials, and the model only covered its tracks seven times, but that’s still significant.

    The AI models demonstrated an emergent ability to edit their reward function and to hide that they had done so. The real-world implications of this could be disastrous.

    Can it be fixed?

    Could this behavior be unlearned? The researchers found that even when they used Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI approaches, these “did not significantly change the observed rates of reward tampering.”

    When the model was trained to remove its tendency to be sycophantic, it substantially reduced the rate of reward tampering, but not to zero.

    This behavior was elicited in a test environment, and Anthropic said, “Current frontier models almost certainly do not pose a risk of reward tampering.”

    “Almost certainly” isn’t the most comforting odds and the possibility of this emergent behavior developing outside the lab is cause for concern.

    Anthropic said, “The risk of serious misalignment emerging from benign misbehavior will increase as models grow more capable and training pipelines become more complex.”

    The post AI models can cheat, lie, and game the system for rewards appeared first on DailyAI.

    Source: Read More 

    Facebook Twitter Reddit Email Copy Link
    Previous ArticleCopilot+ certified hardware Galaxy Book4 Edge has finally arrived in US, UK, & more for $1,349
    Next Article Vendor reconciliation process in accounts payable

    Related Posts

    Development

    February 2025 Baseline monthly digest

    May 13, 2025
    Development

    How to Become an Analytical Programmer – Solve the “Rock, Paper, Scissors” Game 5 Ways Using JavaScript & Mermaid.js

    May 13, 2025
    Leave A Reply Cancel Reply

    Continue Reading

    CodeSOD: The Wrong Kind of Character

    News & Updates

    CVE-2025-4023 – iSourcecode Placement Management System SQL Injection

    Common Vulnerabilities and Exposures (CVEs)

    Extio_rtl2832.dll: What Is It & How to Remove

    Operating Systems

    Microsoft AI Tour is coming to Sydney on December 11 after stops in London & other European cities

    Development
    GetResponse

    Highlights

    How cyber-secure is your business? | Unlocked 403 cybersecurity podcast (ep. 8)

    December 20, 2024

    As cybersecurity is a make-or-break proposition for businesses of all sizes, can your organization’s security…

    Unlock Creative Potential at Frontrow 2025

    February 4, 2025

    Ticketmaster Data Breach Confirmed; Stolen Data Hosted on Snowflake’s Cloud Storage

    June 3, 2024

    Samsung MagicINFO 9-servers doelwit van botnet, update niet beschikbaar

    May 8, 2025
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.