
    AI model performance: Is it reasoning or simply reciting?

    July 14, 2024

    When ChatGPT gives you the right answer to your prompt, does it reason through the request or simply remember the answer from its training data?

    MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers designed a series of tests to see if AI models “think” or just have good memories.

When you prompt an AI model to solve a math problem like “What is 27+62?” it comes back quickly with the correct answer: 89. How can we tell whether it understands the underlying arithmetic or simply saw the problem in its training data?

    In their paper, the researchers tested GPT-4, GPT-3.5 Turbo, Claude 1.3, and PaLM2 to see if they could “generalize not only to unseen instances of known tasks, but to new tasks.”

They designed a series of 11 tasks that differed slightly from the standard tasks on which LLMs generally perform well.

If the LLMs employ general, transferable task-solving procedures, they should perform equally well on these “counterfactual tasks.”

If an LLM “understands” math, for example, it should give the correct answer to a problem posed in base-10 and in the seldom-used base-9 alike.
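To make the idea concrete, here is a small, hypothetical illustration (not code from the paper) of what a default versus counterfactual arithmetic pair looks like: the same digit strings added in base 10 and then in base 9.

```python
def add_in_base(a_digits: str, b_digits: str, base: int) -> str:
    """Add two numbers written as digit strings in the given base,
    returning the sum as a digit string in that same base."""
    total = int(a_digits, base) + int(b_digits, base)
    if total == 0:
        return "0"
    digits = []
    while total:
        digits.append(str(total % base))
        total //= base
    return "".join(reversed(digits))

# Default task (base 10): the familiar "27 + 62 = 89".
print(add_in_base("27", "62", 10))  # 89

# Counterfactual task (base 9): "27" is 25 and "62" is 56 in base 9,
# so the sum is 81, written "100" in base 9.
print(add_in_base("27", "62", 9))   # 100
```

A model that has genuinely learned place-value arithmetic should handle both; a model that memorized base-10 answers will tend to reply “89” in either case.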

    Here’s a look at examples of the tasks and GPT-4’s performance.

Figure: GPT-4’s performance on standard default tasks (blue) and slightly altered counterfactual tasks (orange), with examples of the tasks and correct answers. Source: arXiv

GPT-4 performs well on the standard tasks (blue), but its math, logical reasoning, spatial reasoning, and other abilities (orange) degrade significantly when the task is slightly altered.

The other models displayed similar degradation, with GPT-4 coming out on top.

    Despite the degradation, the performance on counterfactual tasks was still better than chance. The AI models try to reason through these tasks but aren’t very good at it.

The results suggest that the impressive performance of AI models on tasks like college exams relies heavily on recall of training data rather than reasoning, and further highlight how poorly AI models generalize to unseen tasks.

    Zhaofeng Wu, an MIT PhD student in electrical engineering and computer science, CSAIL affiliate, and the lead author of the paper said, “We’ve uncovered a fascinating aspect of large language models: they excel in familiar scenarios, almost like a well-worn path, but struggle when the terrain gets unfamiliar. This insight is crucial as we strive to enhance these models’ adaptability and broaden their application horizons.”

    We saw a similar demonstration of this inability to generalize when we explored how bad AI models are at solving a simplified river crossing puzzle.

    The researchers concluded that when developers analyze their models, they should “consider abstract task ability as detached from observed task performance.”

The “train-to-test” approach may move a model up the benchmarks, but it doesn’t offer a true measure of how the model will fare when presented with a new task to reason through.
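One way to separate benchmark performance from abstract task ability, in the spirit of the paper’s evaluation, is to score a model on paired default and counterfactual variants and report the gap. This is a toy sketch with a stubbed-in “model” (the function names and tasks here are illustrative assumptions, not the paper’s code):

```python
def accuracy(model_answer, tasks):
    """Fraction of (prompt, expected) pairs the model answers correctly."""
    correct = sum(1 for prompt, expected in tasks if model_answer(prompt) == expected)
    return correct / len(tasks)

# Paired variants of the same underlying task.
default_tasks = [("What is 27+62 in base 10?", "89")]
counterfactual_tasks = [("What is 27+62 in base 9?", "100")]

def toy_model(prompt):
    # A memorizing "model" that only knows the familiar base-10 answer.
    return "89"

gap = accuracy(toy_model, default_tasks) - accuracy(toy_model, counterfactual_tasks)
print(gap)  # 1.0: perfect on the default variant, zero on the counterfactual
```

A small gap would indicate transferable task-solving; a large gap, as seen for GPT-4 and the other models tested, points to recall of familiar task formats.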

The researchers suggest that part of the problem is that these models are trained only on surface-form text.

If LLMs were exposed to more real-world, contextualized data and semantic representations, they might generalize better when presented with task variations.

    The post AI model performance: Is it reasoning or simply reciting? appeared first on DailyAI.

