    LLMs Struggle to Act on What They Know: Google DeepMind Researchers Use Reinforcement Learning Fine-Tuning to Bridge the Knowing-Doing Gap

    May 19, 2025

    Language models trained on vast internet-scale datasets have become prominent language understanding and generation tools. Their potential extends beyond language tasks to functioning as decision-making agents in interactive environments. When applied to environments requiring action choices, these models are expected to leverage their internal knowledge and reasoning to act effectively. Their ability to consider context, weigh options, and choose actions opens new possibilities for their integration into agentic systems that interact with dynamic environments.

    Despite this promise, these models exhibit critical limitations in decision-making. While capable of forming accurate chains of reasoning, they often fail to act upon them. This issue is identified as the knowing-doing gap, where models recognize correct strategies but do not implement them in practice. Another significant concern is greediness, where models repeatedly select high-reward options prematurely, ignoring alternative strategies that could lead to better outcomes. Moreover, smaller models display frequency bias, favoring commonly seen actions regardless of reward, which impairs exploration and hinders learning from diverse scenarios.
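To make the frequency-bias failure mode concrete, one can measure how often an agent repeats its most frequent past action even when another action has earned a higher average reward. The metric below is an illustrative sketch, not the exact definition used in the paper:

```python
from collections import Counter

def frequency_bias(actions, rewards):
    """Fraction of evaluated steps where the agent repeats its most
    frequent past action even though a different action has the higher
    observed mean reward. Illustrative metric, not the paper's exact one.

    `actions` is a list of chosen arm indices; `rewards` is parallel.
    """
    biased = 0
    evaluated = 0
    counts = Counter()            # arm -> times chosen so far
    reward_stats = {}             # arm -> (reward total, pull count)
    for a, r in zip(actions, rewards):
        if counts:                # need at least one past step to judge
            most_frequent = counts.most_common(1)[0][0]
            best_arm = max(reward_stats,
                           key=lambda k: reward_stats[k][0] / reward_stats[k][1])
            if a == most_frequent and most_frequent != best_arm:
                biased += 1       # repeated the habit despite better evidence
            evaluated += 1
        counts[a] += 1
        total, n = reward_stats.get(a, (0.0, 0))
        reward_stats[a] = (total + r, n + 1)
    return biased / evaluated if evaluated else 0.0
```

For example, an agent that keeps pressing arm 0 after arm 1 has clearly paid out more scores 0.5 on this metric over the trace `[0, 0, 1, 0, 0]` with rewards `[0.1, 0.1, 1.0, 0.1, 0.1]`.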

    To address these challenges, researchers have experimented with various strategies. Traditional reinforcement learning methods, including bandit algorithms like the Upper-Confidence Bound (UCB), aim to manage exploration-exploitation trade-offs. Meanwhile, in-context learning and behavior cloning imitate expert trajectories but often reinforce the same decision biases. While some exploration strategies have improved performance marginally, these approaches lack a mechanism to reliably convert internal reasoning into optimal actions, especially in complex or stochastic environments.
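For reference, the classic UCB1 rule mentioned above balances exploitation (high observed mean) against exploration (an uncertainty bonus that shrinks as an arm is pulled). A minimal sketch on a toy bandit:

```python
import math
import random

def ucb1(n_arms, pull, horizon, c=2.0):
    """UCB1 bandit: pull each arm once, then always pick the arm
    maximizing (mean reward + exploration bonus)."""
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                      # initialization: try every arm once
        else:
            arm = max(range(n_arms),
                      key=lambda a: totals[a] / counts[a]
                      + math.sqrt(c * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        totals[arm] += r
    return counts, totals

# Toy usage (hypothetical setup): arm 2 pays 0.9 on average, the rest 0.1.
random.seed(0)
means = [0.1, 0.1, 0.9, 0.1, 0.1]
counts, totals = ucb1(5, lambda a: means[a] + random.gauss(0, 0.05), horizon=500)
```

After 500 pulls, the count vector concentrates on arm 2 while still sampling the others occasionally, which is the exploration-exploitation behavior the paper contrasts with LLM greediness.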

    Researchers from Google DeepMind and the LIT AI Lab at JKU Linz focused on refining language model behavior through Reinforcement Learning Fine-Tuning (RLFT). Their approach employs self-generated Chain-of-Thought (CoT) rationales as training signals. By evaluating the rewards of actions following specific reasoning steps, the model learns to favor decisions that sound logical and yield high returns in practice. This reinforcement links model reasoning to environmental feedback, promoting improved decision alignment and reducing gaps between thought and behavior.
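The core training signal is policy-gradient style: the model samples a rationale-plus-action sequence, the environment scores it, and that score reinforces the sampled output. The toy REINFORCE loop below compresses the "sequence" to a single action choice to show the mechanism; `env_reward` is a hypothetical stand-in for environmental feedback, not the paper's setup:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def rlft_sketch(logits, env_reward, lr=0.1, episodes=2000):
    """Toy REINFORCE loop standing in for RLFT: sample an output,
    observe the environment's reward, and push probability mass toward
    rewarded outputs. In the real method the sampled output is a full
    CoT rationale followed by an action token."""
    for _ in range(episodes):
        probs = softmax(logits)
        a = random.choices(range(len(logits)), weights=probs)[0]
        r = env_reward(a)
        # REINFORCE gradient of log pi(a): one_hot(a) - probs
        for i in range(len(logits)):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * r * grad
    return logits

random.seed(1)
trained = rlft_sketch([0.0, 0.0, 0.0],
                      env_reward=lambda a: 1.0 if a == 1 else 0.0)
```

After training, the policy concentrates on the rewarded choice; in RLFT the analogous effect is that rationales whose follow-up actions earn high reward become more likely to be generated.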

    The methodology centers on token-based fine-tuning using environment interactions. At each step, the model receives an input instruction and a recent action-reward history, and it generates a sequence containing the rationale and the selected action. These outputs are evaluated based on environmental rewards and whether the action conforms to the desired format. A penalty is applied when the model fails to generate a valid action. Over time, reward shaping encourages consistent output formatting while preserving exploration. The process includes Monte Carlo baseline estimates and generalized advantage estimation for variable-length tasks like Tic-tac-toe, allowing the model to learn from diverse decision sequences.
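The format-penalty step described above can be sketched as a small reward-shaping wrapper. The `<action>...</action>` tag convention and the penalty value here are illustrative assumptions, not the paper's exact protocol:

```python
import re

VALID_ACTIONS = {"press_button_1", "press_button_2"}

def shaped_reward(output, env_reward, penalty=-5.0):
    """Parse a generated rationale + action sequence and apply a format
    penalty when no valid action can be extracted. Tag format and
    penalty magnitude are illustrative, not the paper's exact choices."""
    m = re.search(r"<action>(.*?)</action>", output, re.DOTALL)
    if m is None:
        return penalty            # no parsable action in the output
    action = m.group(1).strip()
    if action not in VALID_ACTIONS:
        return penalty            # parsable but illegal action
    return env_reward(action)     # valid action: pass through env reward

reward = shaped_reward(
    "The history suggests button 2 pays more. <action>press_button_2</action>",
    env_reward=lambda a: 1.0 if a == "press_button_2" else 0.0,
)
```

Outputs that reason well but never commit to a legal action are penalized, which is exactly the pressure that pushes the model to close the knowing-doing gap while keeping its rationales.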

    Performance results show that RLFT considerably improves the model’s decision-making abilities. In a button-based multi-armed bandit setting with 10 arms, the action coverage for a 2B parameter model increased from 40% to over 52% after 30,000 gradient updates. In environments with 20 choices, coverage remained suboptimal but showed meaningful improvement. The frequency bias in the 2B model decreased from 70% to 35% in early repetitions after RLFT. Moreover, in Tic-tac-toe, the 2B model’s win rate against a random opponent rose from 15% to 75%, and its average return against an optimal Monte Carlo Tree Search agent improved from -0.95 to 0.0, i.e., consistent draws. Furthermore, larger models like the 27B variant generated correct rationales 87% of the time, yet chose the optimal action only 21% of the time without RLFT. This gap was significantly reduced after fine-tuning.

    The research shows that refining large language models through reinforcement on their reasoning processes enhances their ability to act according to their knowledge. This connection between thought and action is vital in creating reliable decision-making agents. The proposed method offers a practical path forward for developing more capable and autonomous LLM-based agents by directly addressing common decision errors and reinforcing successful behaviors.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post LLMs Struggle to Act on What They Know: Google DeepMind Researchers Use Reinforcement Learning Fine-Tuning to Bridge the Knowing-Doing Gap appeared first on MarkTechPost.
