AI models can cheat, lie, and game the system for rewards

Anthropic researchers found that misspecified training goals and tolerance of sycophancy can lead AI models to game the system to increase rewards.

Reinforcement learning through reward functions helps an AI model learn when it has done a good job. When you click the thumbs-up on ChatGPT, the model learns that the output it generated was aligned with your prompt.

The researchers found that when a model is presented with poorly defined objectives, it can engage in â€œspecification gamingâ€ to cheat the system in pursuit of the reward.

Specification gaming could be as simple as sycophancy, where the model agrees with you even when it knows youâ€™re wrong.

When an AI model chases poorly thought-out reward functions, it can lead to unexpected behavior.

In 2016, OpenAI found that an AI playing a boat-racing game called CoastRunners, learned it could earn more points by moving in a tight circle to hit targets rather than completing the course like a human would.

The Anthropic researchers found that when models learned low-level specification gaming, the models could eventually generalize to more serious reward tampering.

Their paper describes how they set up a â€œcurriculumâ€ of training environments where an LLM was given the opportunity to cheat the system starting with relatively benign scenarios like sycophancy.

For example, early on in the curriculum, the LLM could respond positively to a userâ€™s political views, even if they were inaccurate or inappropriate, to obtain the training reward.

In the next stage, the model learned that it could alter a checklist to cover up that it hadnâ€™t completed a task.

After progressing through increasingly more difficult training environments, the model eventually learned a generalized ability to lie and cheat to obtain the reward.

The experiment culminated in a disturbing scenario where the model edited the training code defining its reward function so it would always obtain the maximum reward, regardless of its output, even though it had never been trained to do that.

It also edited the code that checked if the reward function had been altered.

An AI model starts with insincere flattery but eventually learns to tamper with its reward function while hiding its actions from the operator during training. Source: Anthropic

Some models even edited their event logs to hide their deception from the human operator. The modelâ€™s internal monologue chain of thought process was captured in a scratchpad.

Reward tampering only happened 45 times out of 32,768 trials, and the model only covered its tracks seven times, but thatâ€™s still significant.

The AI models demonstrated an emergent ability to edit their reward function and to hide that they had done so. The real-world implications of this could be disastrous.

Can it be fixed?

Could this behavior be unlearned? The researchers found that even when they used Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI approaches, these â€œdid not significantly change the observed rates of reward tampering.â€

When the model was trained to remove its tendency to be sycophantic, it substantially reduced the rate of reward tampering, but not to zero.

This behavior was elicited in a test environment, and Anthropic said, â€œCurrent frontier models almost certainly do not pose a risk of reward tampering.â€

â€œAlmost certainlyâ€ isnâ€™t the most comforting odds and the possibility of this emergent behavior developing outside the lab is cause for concern.

Anthropic said, â€œThe risk of serious misalignment emerging from benign misbehavior will increase as models grow more capable and training pipelines become more complex.â€

The post AI models can cheat, lie, and game the system for rewards appeared first on DailyAI.

Source: Read MoreÂ

IBM’s next generation Granite models are now available

The Human Element: Using Research And Psychology To Elevate Data Storytelling

Google to offer free version of Gemini Code Assist

MongoDB acquires Voyage AI for its embedding and reranking models

AI-generated content in games is here to stay — the bigger issue is the outright deception and what the future may look like

Razer and Minecraft just announced a limited-edition collection, and I’m surprised it took so long

Panos Panay’s Amazon AI move: A bold bet or another Surface Duo?

OpenAI expands ‘Deep Reseach’ to those paying $20 a month or more, a day after Microsoft made OpenAI’s ‘Think Deeper’ free for all Copilot users with no usage caps

Rethink State💡 Why You Should Model Your Frontend Around Events

Rethink State💡 Why You Should Model Your Frontend Around Events

What To Expect When Migrating Your Site To A New Platform

Kotlin Multiplatform vs. React Native vs. Flutter: Building Your First App

AI-generated content in games is here to stay — the bigger issue is the outright deception and what the future may look like

AI-generated content in games is here to stay — the bigger issue is the outright deception and what the future may look like

Razer and Minecraft just announced a limited-edition collection, and I’m surprised it took so long

Panos Panay’s Amazon AI move: A bold bet or another Surface Duo?

AI models can cheat, lie, and game the system for rewards

Can it be fixed?

ANDI Accessibility Testing Tool Tutorial

How Data Analytics in Insurance is Driving Smarter Decisions

Stylish Range Sliders with Pure CSS and Animation

How to add delay between opening websocket connection and sending requests (not between requests) in Jmeter

AI Performance Metrics: Insights from Experts

Sorting Lists with Vue.js Composition API Computed Properties

Copilot+ Recall is â€˜Dumbest Cybersecurity Move in a Decadeâ€™: Researcher

This pocket camera has fully replaced my iPhone for video shooting – and it’s a must for traveling

Optimize your database storage for Oracle workloads on AWS, Part 1: Using ADO and ILM data compression policies

â€˜Big-game huntingâ€™ â€“ Ransomware gangs are focusing on more lucrative attacks

AI models can cheat, lie, and game the system for rewards

Can it be fixed?

Related Posts