
    This “smart coach” helps LLMs switch between text and code

    July 17, 2025

    Large language models (LLMs) excel at using textual reasoning to understand the context of a document and provide a logical answer about its contents. But these same LLMs often struggle to correctly answer even the simplest math problems.

    Textual reasoning is usually a less-than-ideal way to deliberate over computational or algorithmic tasks. While some LLMs can generate code in languages such as Python to handle symbolic queries, the models don’t always know when to use code, or what kind of code would work best.

    LLMs, it seems, may need a coach to steer them toward the best technique.

    Enter CodeSteer, a smart assistant developed by MIT researchers that guides an LLM to switch between code and text generation until it correctly answers a query.

    CodeSteer, itself a smaller LLM, automatically generates a series of prompts to iteratively steer a larger LLM. It reviews the model’s current and previous answers after each round and provides guidance for how it can fix or refine that solution until it deems the answer correct.

    The researchers found that augmenting a larger LLM with CodeSteer boosted its accuracy on symbolic tasks, like multiplying numbers, playing Sudoku, and stacking blocks, by more than 30 percent. It also enabled less sophisticated models to outperform more advanced models with enhanced reasoning skills.

    This advance could improve the problem-solving capabilities of LLMs for complex tasks that are especially difficult to solve with textual reasoning alone, such as generating paths for robots in uncertain environments or scheduling shipments in an international supply chain.

    “There is a race to develop better and better models that are capable of doing everything, but we’ve taken a complementary approach. Researchers have spent years developing effective technologies and tools to tackle problems in many domains. We want to enable LLMs to select the right tools and methods, and make use of others’ expertise to enhance their own capabilities,” says Chuchu Fan, an associate professor of aeronautics and astronautics (AeroAstro) and principal investigator in the MIT Laboratory for Information and Decision Systems (LIDS).

    Fan, the senior author of the study, is joined on a paper about the work by LIDS graduate student Yongchao Chen; AeroAstro graduate student Yilun Hao; University of Illinois at Urbana-Champaign graduate student Yueying Liu; and MIT-IBM Watson AI Lab Research Scientist Yang Zhang. The research will be presented at the International Conference on Machine Learning.

    An LLM “trainer”  

    Ask an LLM which number is bigger, 9.11 or 9.9, and it will often give the wrong answer by using textual reasoning. But ask it to use code to answer the same question, and it can generate and execute a Python script to compare the two numbers, easily solving the problem.
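
    For a concrete sense of what that looks like, the script a code-steered model emits for this comparison can be as short as the following sketch (illustrative only; it is not taken from the paper):

        # Hypothetical example of the short comparison script a model might generate.
        a, b = 9.11, 9.9
        larger = a if a > b else b
        print(f"The larger number is {larger}")  # prints 9.9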

    Initially trained to understand and predict human language, LLMs are more likely to answer queries using text, even when code would be more effective. And while they have learned to generate code through fine-tuning, these models often generate an incorrect or less efficient version of the code.

    Rather than trying to retrain a powerful LLM like GPT-4 or Claude to improve these capabilities, the MIT researchers fine-tune a smaller, lightweight LLM to guide a larger model between text and code. Fine-tuning a smaller model doesn’t change the larger LLM, so there is no risk it would undermine the larger model’s other abilities.

    “We were also inspired by humans. In sports, a trainer may not be better than the star athlete on the team, but the trainer can still give helpful suggestions to guide the athlete. This steering method works for LLMs, too,” Chen says.

    This trainer, CodeSteer, works in conjunction with the larger LLM. It first reviews a query and determines whether text or code is suitable for this problem, and which sort of code would be best.

    Then it generates a prompt for the larger LLM, telling it to use a coding method or textual reasoning to answer the query. The larger model follows this prompt to answer the query and sends the result back to CodeSteer, which reviews it.

    If the answer is not correct, CodeSteer will continue prompting the LLM to try different things that might fix the problem, such as incorporating a search algorithm or constraint into its Python code, until the answer is correct.
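
    Put together, the steering process described above has roughly the shape of the following sketch; the function names and the round limit are placeholders chosen for illustration, not the authors’ actual interfaces:

        # Minimal sketch of the steering loop described above (assumed interfaces).
        def codesteer_loop(query, steer_model, solver_model, is_correct, max_rounds=5):
            history = []
            for _ in range(max_rounds):
                # The smaller "coach" model chooses text or code and writes a prompt,
                # taking the query and all earlier attempts into account.
                guidance = steer_model(query, history)
                # The larger model follows that prompt and returns an answer.
                answer = solver_model(guidance)
                history.append((guidance, answer))
                # Stop once the checkers accept the answer; otherwise steer again.
                if is_correct(query, answer):
                    return answer
            return answer  # best effort after the round limit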

    “We found that oftentimes, the larger LLM will try to be lazy and use a shorter, less efficient code that will not carry the correct symbolic calculation. We’ve designed CodeSteer to avoid this phenomenon,” Chen says.

    A symbolic checker evaluates the code’s complexity and sends a signal to CodeSteer if it is too simple or inefficient. The researchers also incorporate a self-answer checker into CodeSteer, which prompts the LLM to generate code that calculates the answer to verify it is correct.
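
    The complexity check in particular can be approximated with a simple heuristic like the one below; the AST-node threshold is invented for this sketch and is not the metric CodeSteer actually uses. The self-answer check would then separately prompt the model for verification code and run it.

        # Illustrative-only heuristic for flagging "lazy" generated code (assumed threshold).
        import ast

        def code_too_simple(code: str, min_ast_nodes: int = 20) -> bool:
            """Flag generated code whose syntax tree is so small it probably
            hard-codes an answer instead of doing the symbolic calculation."""
            try:
                tree = ast.parse(code)
            except SyntaxError:
                return True  # unparseable code certainly needs another round
            return sum(1 for _ in ast.walk(tree)) < min_ast_nodes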

    Tackling complex tasks

    As the researchers designed CodeSteer, they couldn’t find suitable symbolic datasets to fine-tune and test the model, since many existing benchmarks don’t point out whether a certain query could be best solved with text or code.

    So, they gathered a corpus of 37 complex symbolic tasks, including spatial reasoning, mathematics, order reasoning, and optimization, and built their own dataset, called SymBench. They implemented a fine-tuning approach that leverages SymBench to maximize the performance of CodeSteer.

    In their experiments, CodeSteer outperformed all nine baseline methods they evaluated and boosted average accuracy from 53.3 percent to 86.4 percent. It maintained similar performance even on unseen tasks and across a variety of LLMs.

    In addition, a general-purpose model augmented with CodeSteer can achieve higher accuracy than state-of-the-art models designed to focus on complex reasoning and planning, while requiring much less computation.

    “Our method uses an LLM’s own capabilities. By augmenting an LLM with the ability to smartly use coding, we can take a model that is already very strong and improve its performance even more,” Chen says.

    In the future, the researchers want to streamline CodeSteer to speed up its iterative prompting process. In addition, they are studying how to effectively fine-tune a unified model with the ability to switch between textual reasoning and code generation, rather than relying on a separate assistant.

    “The authors present an elegant solution to the critical challenge of tool utilization in LLMs. This simple yet impactful method enables state-of-the-art LLMs to achieve significant performance improvements without requiring direct fine-tuning,” says Jinsung Yoon, a staff research scientist at Google Cloud AI, who was not involved with this work. “This research represents a substantial contribution that promises to significantly enhance the application of LLMs to a diverse range of tasks with which they currently struggle.”

    “Their success in training a smaller, specialized model to strategically guide larger, advanced models is particularly impactful,” adds Chi Wang, a senior staff scientist at Google DeepMind who was not involved with this work. “This intelligent collaboration among diverse AI ‘agents’ paves the way for more robust and versatile applications in complex real-world scenarios.”

    This research is supported, in part, by the U.S. Office of Naval Research and the MIT-IBM Watson AI Lab.
