
    NATURAL PLAN: Benchmarking LLMs on natural language planning

    June 10, 2024

    Google DeepMind researchers developed NATURAL PLAN, a benchmark for evaluating the capability of LLMs to plan real-world tasks based on natural language prompts.

    The next evolution of AI is to have it leave the confines of a chat platform and take on agentic roles to complete tasks across platforms on our behalf. But that’s harder than it sounds.

    Planning tasks like scheduling a meeting or compiling a holiday itinerary might seem simple for us. Humans are good at reasoning through multiple steps and predicting whether a course of action will accomplish the desired objective or not.

    You might find that easy, but even the best AI models struggle with planning. Could we benchmark them to see which LLM is best at planning?

    The NATURAL PLAN benchmark tests LLMs on 3 planning tasks:

    Trip planning – Planning a trip itinerary under flight and destination constraints
    Meeting planning – Scheduling meetings with multiple friends in different locations
    Calendar scheduling – Scheduling work meetings between multiple people given existing schedules and various constraints
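
To make the calendar scheduling task concrete, here is a minimal Python sketch of the underlying problem: finding the earliest slot in which everyone is free. The function name, the minutes-since-midnight representation, the 9:00 to 17:00 working day, and the sample schedules are all illustrative assumptions, not taken from the benchmark.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end) in minutes since midnight


def earliest_common_slot(
    busy: Dict[str, List[Interval]],
    duration: int,
    day_start: int = 9 * 60,
    day_end: int = 17 * 60,
) -> Optional[Interval]:
    """Return the earliest slot of `duration` minutes in which nobody is busy."""
    # Pool everyone's busy intervals and sort them by start time.
    all_busy = sorted(iv for intervals in busy.values() for iv in intervals)

    cursor = day_start  # earliest time that has not been ruled out yet
    for start, end in all_busy:
        if start - cursor >= duration:
            break  # the gap before this busy block is long enough
        cursor = max(cursor, end)

    if cursor + duration <= day_end:
        return (cursor, cursor + duration)
    return None


# Hypothetical schedules: Alice is busy 9:00-10:00 and 12:00-13:00, Bob 10:00-11:00.
busy = {"Alice": [(540, 600), (720, 780)], "Bob": [(600, 660)]}
print(earliest_common_slot(busy, duration=30))  # (660, 690), i.e. 11:00-11:30
```

A conventional program solves this with a sort and a single pass. The benchmark asks the model to reach the same kind of answer from a natural language description alone.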

    The experiment began with few-shot prompting where the models were provided with 5 examples of prompts and corresponding correct answers. They were then prompted with planning prompts of varying difficulty.
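
As a rough illustration of few-shot prompting, the sketch below joins five solved examples and one unsolved task into a single prompt. The `TASK:`/`SOLUTION:` labels, the function name, and the placeholder examples are assumptions made for this sketch; the paper's actual prompt format may differ.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], task: str) -> str:
    """Join solved (task, solution) pairs, then append the unsolved task."""
    parts = []
    for example_task, solution in examples:
        parts.append(f"TASK: {example_task}\nSOLUTION: {solution}")
    # The final block leaves SOLUTION empty for the model to complete.
    parts.append(f"TASK: {task}\nSOLUTION:")
    return "\n\n".join(parts)


# Placeholder examples; real planning prompts are much longer.
examples = [(f"example task {i}", f"example solution {i}") for i in range(1, 6)]
prompt = build_few_shot_prompt(examples, "new planning task")
print(prompt)
```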

    Here’s one of the prompt-and-solution pairs given to the models:

    An example of a prompt and solution used in the Trip Planning experiment. Source: arXiv

    Results

    The researchers tested GPT-3.5, GPT-4, GPT-4o, Gemini 1.5 Flash, and Gemini 1.5 Pro, none of which performed very well on these tests.

    The results must have gone down well in the DeepMind office, though, as Gemini 1.5 Pro came out on top.

    NATURAL PLAN benchmark results. Source: arXiv

    As expected, accuracy dropped sharply with more complex prompts where the number of people or cities was increased. For example, look at how quickly the accuracy suffered as more people were added to the meeting planning test.

    The accuracy of results in the Meeting Planning test degraded exponentially as the prompts became more complex. Source: arXiv

    Could multi-shot prompting result in improved accuracy? The results of the research indicate that it can, but only if the model has a large enough context window.

    Gemini 1.5 Pro’s larger context window enables it to leverage more in-context examples than the GPT models.

    The researchers found that in Trip Planning, increasing the number of shots from 1 to 800 improved the accuracy of Gemini 1.5 Pro from 2.7% to 39.9%.

    The paper noted, “These results show the promise of in-context planning where the long-context capabilities enable LLMs to leverage further context to improve Planning.”
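
The link between context window size and the number of usable shots can be sketched as a simple budget check. Everything here is hypothetical: the word-count tokenizer is a crude stand-in for a real one, and the two budgets are arbitrary numbers chosen for illustration, not the limits of any actual model.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def select_shots(shots: List[str], query: str, context_budget: int) -> List[str]:
    """Keep adding examples in order until the next one would exceed the budget."""
    used = count_tokens(query)  # the task itself must always fit
    selected = []
    for shot in shots:
        cost = count_tokens(shot)
        if used + cost > context_budget:
            break
        selected.append(shot)
        used += cost
    return selected


shots = ["solved example " * 50] * 800  # 800 placeholder shots of 100 words each
query = "new planning task"
small = select_shots(shots, query, context_budget=8_000)
large = select_shots(shots, query, context_budget=1_000_000)
print(len(small), len(large))  # 79 800
```

With the smaller budget only 79 of the 800 examples fit, so a model limited in this way cannot benefit from the remaining shots however useful they might be.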

    A strange result was that GPT-4o was really bad at Trip Planning. The researchers found that it struggled “to understand and respect the flight connectivity and travel date constraints.”

    Another strange outcome was that self-correction led to a significant performance drop across all models. When the models were prompted to check their work and make corrections, they made more mistakes.

    Interestingly, the stronger models, such as GPT-4 and Gemini 1.5 Pro, suffered bigger losses than GPT-3.5 when self-correcting.
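
A self-correction pass of this kind can be sketched as a second call to the model. The review wording, the function name, and the `model` callable are assumptions for illustration; the article does not show the prompt the researchers used.

```python
from typing import Callable


def answer_with_self_correction(model: Callable[[str], str], prompt: str) -> str:
    """Get a first answer, then ask the model to review and revise it."""
    first_answer = model(prompt)
    review_prompt = (
        f"{prompt}\n\nProposed solution:\n{first_answer}\n\n"
        "Check whether this solution is correct. "
        "If it is not, give a corrected solution."
    )
    # The second answer replaces the first one, even if the first was right.
    return model(review_prompt)


def stub_model(text: str) -> str:
    """Hypothetical stand-in for an LLM call, used only to run the sketch."""
    return "revised plan" if "Proposed solution:" in text else "first plan"


print(answer_with_self_correction(stub_model, "new planning task"))  # revised plan
```

The sketch shows only the mechanics: because the second pass overwrites the first, a correct first answer can be replaced by a wrong one.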

    Agentic AI is an exciting prospect and we’re already seeing some practical use cases in Microsoft Copilot agents.

    But the results of the NATURAL PLAN benchmark tests show that we’ve got some way to go before AI can handle more complex planning.

    The DeepMind researchers concluded that “NATURAL PLAN is very hard for state-of-the-art models to solve.”

    It seems AI won’t be replacing travel agents and personal assistants quite yet.

    The post NATURAL PLAN: Benchmarking LLMs on natural language planning appeared first on DailyAI.
