    LLMs Struggle to Act on What They Know: Google DeepMind Researchers Use Reinforcement Learning Fine-Tuning to Bridge the Knowing-Doing Gap

    May 19, 2025

    Language models trained on vast internet-scale datasets have become prominent language understanding and generation tools. Their potential extends beyond language tasks to functioning as decision-making agents in interactive environments. When applied to environments requiring action choices, these models are expected to leverage their internal knowledge and reasoning to act effectively. Their ability to consider context, weigh options, and choose actions opens new possibilities for their integration into agentic systems that interact with dynamic environments.

    Despite this promise, these models exhibit critical limitations in decision-making. While capable of forming accurate chains of reasoning, they often fail to act upon them. This issue is identified as the knowing-doing gap, where models recognize correct strategies but do not implement them in practice. Another significant concern is greediness, where models repeatedly select high-reward options prematurely, ignoring alternative strategies that could lead to better outcomes. Moreover, smaller models display frequency bias, favoring commonly seen actions regardless of reward, which impairs exploration and hinders learning from diverse scenarios.
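To make the frequency-bias failure mode concrete, one can measure how often an agent repeats its most frequent past action even when another action has earned a higher average reward. The metric below is an illustrative sketch, not the exact definition used in the paper:

```python
from collections import Counter

def frequency_bias(actions, rewards):
    """Fraction of evaluated steps where the agent repeats its most
    frequent past action even though a different action has the higher
    observed mean reward. Illustrative metric, not the paper's exact one.

    `actions` is a list of chosen arm indices; `rewards` is parallel.
    """
    biased = 0
    evaluated = 0
    counts = Counter()            # arm -> times chosen so far
    reward_stats = {}             # arm -> (reward total, pull count)
    for a, r in zip(actions, rewards):
        if counts:                # need at least one past step to judge
            most_frequent = counts.most_common(1)[0][0]
            best_arm = max(reward_stats,
                           key=lambda k: reward_stats[k][0] / reward_stats[k][1])
            if a == most_frequent and most_frequent != best_arm:
                biased += 1       # repeated the habit despite better evidence
            evaluated += 1
        counts[a] += 1
        total, n = reward_stats.get(a, (0.0, 0))
        reward_stats[a] = (total + r, n + 1)
    return biased / evaluated if evaluated else 0.0
```

For example, an agent that keeps pressing arm 0 after arm 1 has clearly paid out more scores 0.5 on this metric over the trace `[0, 0, 1, 0, 0]` with rewards `[0.1, 0.1, 1.0, 0.1, 0.1]`.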

    To address these challenges, researchers have experimented with various strategies. Traditional reinforcement learning methods, including bandit algorithms like the Upper-Confidence Bound (UCB), aim to manage exploration-exploitation trade-offs. Meanwhile, in-context learning and behavior cloning imitate expert trajectories but often reinforce the same decision biases. While some exploration strategies have improved performance marginally, these approaches lack a mechanism to reliably convert internal reasoning into optimal actions, especially in complex or stochastic environments.
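For reference, the classic UCB1 rule mentioned above balances exploitation (high observed mean) against exploration (an uncertainty bonus that shrinks as an arm is pulled). A minimal sketch on a toy bandit:

```python
import math
import random

def ucb1(n_arms, pull, horizon, c=2.0):
    """UCB1 bandit: pull each arm once, then always pick the arm
    maximizing (mean reward + exploration bonus)."""
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                      # initialization: try every arm once
        else:
            arm = max(range(n_arms),
                      key=lambda a: totals[a] / counts[a]
                      + math.sqrt(c * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        totals[arm] += r
    return counts, totals

# Toy usage (hypothetical setup): arm 2 pays 0.9 on average, the rest 0.1.
random.seed(0)
means = [0.1, 0.1, 0.9, 0.1, 0.1]
counts, totals = ucb1(5, lambda a: means[a] + random.gauss(0, 0.05), horizon=500)
```

After 500 pulls, the count vector concentrates on arm 2 while still sampling the others occasionally, which is the exploration-exploitation behavior the paper contrasts with LLM greediness.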

    Researchers from Google DeepMind and the LIT AI Lab at JKU Linz focused on refining language model behavior through Reinforcement Learning Fine-Tuning (RLFT). Their approach employs self-generated Chain-of-Thought (CoT) rationales as training signals. By evaluating the rewards of actions following specific reasoning steps, the model learns to favor decisions that sound logical and yield high returns in practice. This reinforcement links model reasoning to environmental feedback, promoting improved decision alignment and reducing gaps between thought and behavior.
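The core training signal is policy-gradient style: the model samples a rationale-plus-action sequence, the environment scores it, and that score reinforces the sampled output. The toy REINFORCE loop below compresses the "sequence" to a single action choice to show the mechanism; `env_reward` is a hypothetical stand-in for environmental feedback, not the paper's setup:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def rlft_sketch(logits, env_reward, lr=0.1, episodes=2000):
    """Toy REINFORCE loop standing in for RLFT: sample an output,
    observe the environment's reward, and push probability mass toward
    rewarded outputs. In the real method the sampled output is a full
    CoT rationale followed by an action token."""
    for _ in range(episodes):
        probs = softmax(logits)
        a = random.choices(range(len(logits)), weights=probs)[0]
        r = env_reward(a)
        # REINFORCE gradient of log pi(a): one_hot(a) - probs
        for i in range(len(logits)):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * r * grad
    return logits

random.seed(1)
trained = rlft_sketch([0.0, 0.0, 0.0],
                      env_reward=lambda a: 1.0 if a == 1 else 0.0)
```

After training, the policy concentrates on the rewarded choice; in RLFT the analogous effect is that rationales whose follow-up actions earn high reward become more likely to be generated.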

    The methodology centers on token-based fine-tuning using environment interactions. At each step, the model receives an input instruction and a recent action-reward history, and it generates a sequence containing the rationale and the selected action. These outputs are evaluated based on environmental rewards and whether the action conforms to the desired format. A penalty is applied when the model fails to generate a valid action. Over time, reward shaping encourages consistent output formatting while preserving exploration. The process includes Monte Carlo baseline estimates and generalized advantage estimation for variable-length tasks like Tic-tac-toe, allowing the model to learn from diverse decision sequences.
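The format-penalty step described above can be sketched as a small reward-shaping wrapper. The `<action>...</action>` tag convention and the penalty value here are illustrative assumptions, not the paper's exact protocol:

```python
import re

VALID_ACTIONS = {"press_button_1", "press_button_2"}

def shaped_reward(output, env_reward, penalty=-5.0):
    """Parse a generated rationale + action sequence and apply a format
    penalty when no valid action can be extracted. Tag format and
    penalty magnitude are illustrative, not the paper's exact choices."""
    m = re.search(r"<action>(.*?)</action>", output, re.DOTALL)
    if m is None:
        return penalty            # no parsable action in the output
    action = m.group(1).strip()
    if action not in VALID_ACTIONS:
        return penalty            # parsable but illegal action
    return env_reward(action)     # valid action: pass through env reward

reward = shaped_reward(
    "The history suggests button 2 pays more. <action>press_button_2</action>",
    env_reward=lambda a: 1.0 if a == "press_button_2" else 0.0,
)
```

Outputs that reason well but never commit to a legal action are penalized, which is exactly the pressure that pushes the model to close the knowing-doing gap while keeping its rationales.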

    Performance results show that RLFT considerably improves the model’s decision-making abilities. In a button-based multi-armed bandit setting with 10 arms, the action coverage for a 2B parameter model increased from 40% to over 52% after 30,000 gradient updates. In environments with 20 choices, coverage remained suboptimal but showed meaningful improvement. The frequency bias in the 2B model decreased from 70% to 35% in early repetitions after RLFT. Moreover, in Tic-tac-toe, the 2B model’s win rate against a random opponent rose from 15% to 75%, and its average return against an optimal Monte Carlo Tree Search agent improved from -0.95 to 0.0, i.e., consistent draws. Furthermore, larger models like the 27B variant generated correct rationales 87% of the time, yet chose the optimal action only 21% of the time without RLFT. This gap was significantly reduced after fine-tuning.

    The research shows that refining large language models through reinforcement on their reasoning processes enhances their ability to act according to their knowledge. This connection between thought and action is vital in creating reliable decision-making agents. The proposed method offers a practical path forward for developing more capable and autonomous LLM-based agents by directly addressing common decision errors and reinforcing successful behaviors.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post LLMs Struggle to Act on What They Know: Google DeepMind Researchers Use Reinforcement Learning Fine-Tuning to Bridge the Knowing-Doing Gap appeared first on MarkTechPost.
