
    LLMs Struggle to Act on What They Know: Google DeepMind Researchers Use Reinforcement Learning Fine-Tuning to Bridge the Knowing-Doing Gap

    May 19, 2025

    Language models trained on vast internet-scale datasets have become prominent language understanding and generation tools. Their potential extends beyond language tasks to functioning as decision-making agents in interactive environments. When applied to environments requiring action choices, these models are expected to leverage their internal knowledge and reasoning to act effectively. Their ability to consider context, weigh options, and choose actions opens new possibilities for their integration into agentic systems that interact with dynamic environments.

Despite this promise, these models exhibit critical limitations in decision-making. While capable of forming accurate chains of reasoning, they often fail to act upon them. This issue is identified as the knowing-doing gap, where models recognize correct strategies but do not implement them in practice. Another significant concern is greediness, where models repeatedly select high-reward options prematurely, ignoring alternative strategies that could lead to better outcomes. Moreover, smaller models display frequency bias, favoring commonly seen actions regardless of reward, which impairs exploration and hinders learning from diverse scenarios.

To address these challenges, researchers have experimented with various strategies. Traditional reinforcement learning methods, including bandit algorithms like the Upper-Confidence Bound (UCB), aim to manage exploration-exploitation trade-offs. In contrast, in-context learning and behavior cloning imitate expert trajectories but often reinforce the same decision biases. While some exploration strategies have improved performance marginally, these approaches lack a mechanism to reliably convert internal reasoning into optimal actions, especially in complex or stochastic environments.

    Researchers from Google DeepMind and the LIT AI Lab at JKU Linz focused on refining language model behavior through Reinforcement Learning Fine-Tuning (RLFT). Their approach employs self-generated Chain-of-Thought (CoT) rationales as training signals. By evaluating the rewards of actions following specific reasoning steps, the model learns to favor decisions that sound logical and yield high returns in practice. This reinforcement links model reasoning to environmental feedback, promoting improved decision alignment and reducing gaps between thought and behavior.
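The core reinforcement signal can be illustrated with a toy policy-gradient step. This is a minimal REINFORCE-style update on a softmax policy over discrete actions, not the paper's token-level training code; it only shows how an advantage (reward minus baseline) shifts probability toward actions that yielded high returns after a given rationale:

```python
import numpy as np

def reinforce_step(logits, action, advantage, lr=0.1):
    """One REINFORCE-style update on a softmax policy.

    Increases the log-probability of `action` in proportion to
    `advantage` (reward minus a baseline) -- the kind of signal RLFT
    applies to the tokens of a sampled rationale-plus-action sequence.
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Gradient of log pi(action) w.r.t. logits: one_hot(action) - probs.
    grad = -probs
    grad[action] += 1.0
    return logits + lr * advantage * grad
```

With a positive advantage the chosen action's probability rises; with a negative one it falls, so reasoning steps followed by poor actions are discouraged rather than merely observed.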

    The methodology centers on token-based fine-tuning using environment interactions. At each step, the model receives an input instruction and a recent action-reward history, and it generates a sequence containing the rationale and the selected action. These outputs are evaluated based on environmental rewards and whether the action conforms to the desired format. A penalty is applied when the model fails to generate a valid action. Over time, reward shaping encourages consistent output formatting while preserving exploration. The process includes Monte Carlo baseline estimates and generalized advantage estimation for variable-length tasks like Tic-tac-toe, allowing the model to learn from diverse decision sequences.
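The reward-shaping step above can be sketched as a parser that returns either the environment reward or a penalty. The `<action>...</action>` tag format and the penalty value here are illustrative assumptions; the paper's exact output format may differ:

```python
import re

# Hypothetical output format: CoT rationale followed by a tagged action.
ACTION_RE = re.compile(r"<action>(.+?)</action>", re.DOTALL)

def shaped_reward(completion, env_reward_fn, valid_actions, format_penalty=-1.0):
    """Return (reward, action) for a model completion.

    Applies `format_penalty` when no valid action can be parsed,
    mirroring the penalty for invalid actions described above.
    """
    match = ACTION_RE.search(completion)
    if match is None:
        return format_penalty, None  # no parsable action at all
    action = match.group(1).strip()
    if action not in valid_actions:
        return format_penalty, None  # action outside the allowed set
    return env_reward_fn(action), action
```

Penalizing malformed outputs while rewarding valid ones pushes the model toward consistent formatting without constraining which action it explores.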

Performance results show that RLFT considerably improves the model’s decision-making abilities. In a button-based multi-armed bandit setting with 10 arms, the action coverage for a 2B parameter model increased from 40% to over 52% after 30,000 gradient updates. In environments with 20 choices, coverage remained suboptimal but showed meaningful improvement. The frequency bias in the 2B model decreased from 70% to 35% in early repetitions after RLFT. Moreover, in Tic-tac-toe, the 2B model’s win rate against a random opponent rose from 15% to 75%, and against an optimal Monte Carlo Tree Search agent the model reached draws, with average return improving from -0.95 to 0.0. Furthermore, larger models like the 27B variant generated correct rationales 87% of the time, yet chose the optimal action only 21% of the time without RLFT. This gap was significantly reduced after fine-tuning.
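The two metrics in these results are straightforward to compute from an action trace. A small sketch (the repeat-rate definition is an illustrative proxy for the paper's repetition measurement, not its exact protocol):

```python
def action_coverage(actions, n_arms):
    """Fraction of distinct arms tried at least once."""
    return len(set(actions)) / n_arms

def repeat_rate(actions):
    """Fraction of steps that repeat the immediately preceding action --
    a rough proxy for the frequency/greediness biases discussed above."""
    if len(actions) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(actions, actions[1:]) if prev == cur)
    return repeats / (len(actions) - 1)
```

A greedy or frequency-biased policy shows low coverage and a high repeat rate; RLFT's reported gains correspond to both metrics moving in the healthy direction.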

    The research shows that refining large language models through reinforcement on their reasoning processes enhances their ability to act according to their knowledge. This connection between thought and action is vital in creating reliable decision-making agents. The proposed method offers a practical path forward for developing more capable and autonomous LLM-based agents by directly addressing common decision errors and reinforcing successful behaviors.


Check out the Paper. All credit for this research goes to the researchers of this project.

    The post LLMs Struggle to Act on What They Know: Google DeepMind Researchers Use Reinforcement Learning Fine-Tuning to Bridge the Knowing-Doing Gap appeared first on MarkTechPost.

