AI model performance: Is it reasoning or simply reciting?

When ChatGPT gives you the right answer to your prompt, does it reason through the request or simply remember the answer from its training data?

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) designed a series of tests to see if AI models “think” or just have good memories.

When you prompt an AI model to solve a math problem like “What is 27+62?” it comes back quickly with the correct answer: 89. How could we tell if it understands the underlying arithmetic or simply saw the problem in its training data?

In their paper, the researchers tested GPT-4, GPT-3.5 Turbo, Claude 1.3, and PaLM2 to see if they could “generalize not only to unseen instances of known tasks, but to new tasks.”

They designed a series of 11 tasks that differed slightly from the standard tasks on which the LLMs generally perform well.

The LLMs should perform equally well with the “counterfactual tasks” if they employ general and transferable task-solving procedures.

If an LLM “understands” math, then it should provide the correct answer to a math problem in base-10 and the seldom-used base-9, for example.
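To make the base-9 variation concrete, here is a minimal Python sketch of the same addition carried out in both bases. The helper names (`to_int`, `to_digits`, `add_in_base`) are illustrative and do not come from the paper, and the sketch assumes bases of 10 or lower so that every digit is a single character.

```python
def to_int(digits: str, base: int) -> int:
    """Interpret a string of digits 0-9 as a number in the given base."""
    value = 0
    for ch in digits:
        d = int(ch)
        if d >= base:
            raise ValueError(f"digit {d} is not valid in base {base}")
        value = value * base + d
    return value


def to_digits(value: int, base: int) -> str:
    """Render a non-negative integer as a digit string in the given base."""
    if value == 0:
        return "0"
    out = []
    while value > 0:
        value, rem = divmod(value, base)
        out.append(str(rem))
    return "".join(reversed(out))


def add_in_base(a: str, b: str, base: int) -> str:
    """Add two digit strings that are both written in the given base."""
    return to_digits(to_int(a, base) + to_int(b, base), base)


print(add_in_base("27", "62", 10))  # 89
print(add_in_base("27", "62", 9))   # 100
```

The same digits give a different answer in base-9 because 7 + 2 already triggers a carry, so an answer of 89 would be a recited base-10 result rather than a correct base-9 one.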

Here’s a look at examples of the tasks and GPT-4’s performance.

GPT-4’s performance with standard default tasks (blue) and slightly altered counterfactual tasks (orange). Examples of the tasks and correct answers are shown here. Source: arXiv

GPT-4’s performance in standard tests (blue line) is good, but its math, logic reasoning, spatial reasoning, and other abilities (orange line) degrade significantly when the task is slightly altered.

The other models displayed similar degradation, with GPT-4 coming out on top.

Despite the degradation, the performance on counterfactual tasks was still better than chance. The AI models try to reason through these tasks but aren’t very good at it.

The results show that the impressive performance of AI models in tasks like college exams relies largely on excellent recall of training data rather than reasoning. This further highlights how poorly AI models generalize to unseen tasks.

Zhaofeng Wu, an MIT PhD student in electrical engineering and computer science, CSAIL affiliate, and the lead author of the paper, said, “We’ve uncovered a fascinating aspect of large language models: they excel in familiar scenarios, almost like a well-worn path, but struggle when the terrain gets unfamiliar. This insight is crucial as we strive to enhance these models’ adaptability and broaden their application horizons.”

We saw a similar demonstration of this inability to generalize when we explored how bad AI models are at solving a simplified river crossing puzzle.

The researchers concluded that when developers analyze their models, they should “consider abstract task ability as detached from observed task performance.”

The “train-to-test” approach may move a model up the benchmarks but doesn’t offer a true measure of how the model will fare when presented with a new task to reason through.
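One way to picture the gap between observed performance and abstract ability is to score the same solver on a default task and on its counterfactual variant. The Python sketch below is a toy illustration only: `reciting_solver` is a hypothetical stand-in that always answers in base-10, the four problems are made up for the example, and none of it reproduces the paper’s evaluation setup or any real model’s results.

```python
def add_in_base(a: str, b: str, base: int) -> str:
    """Ground truth: add two digit strings written in the given base (base <= 10)."""
    total = int(a, base) + int(b, base)
    digits = ""
    while total > 0:
        total, rem = divmod(total, base)
        digits = str(rem) + digits
    return digits or "0"


def reciting_solver(a: str, b: str, base: int) -> str:
    """Hypothetical solver that ignores the stated base and recites base-10 sums."""
    return str(int(a) + int(b))


def accuracy(solver, problems, base: int) -> float:
    """Fraction of problems where the solver matches the ground truth."""
    correct = sum(solver(a, b, base) == add_in_base(a, b, base) for a, b in problems)
    return correct / len(problems)


# Illustrative problems; every digit is valid in both base-10 and base-9.
problems = [("27", "62"), ("13", "24"), ("45", "38"), ("80", "17")]

print("default (base-10):", accuracy(reciting_solver, problems, 10))       # 1.0
print("counterfactual (base-9):", accuracy(reciting_solver, problems, 9))  # 0.25
```

The toy solver looks perfect on the default task and still scores above zero on the counterfactual one, because sums with no carry happen to read the same in both bases. A single benchmark number would hide that difference, which is why the two scores are reported side by side.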

The researchers suggest that part of the problem is that these models are trained only on surface form text.

If LLMs are exposed to more real-world contextualized data and semantic representation, they might be able to generalize when presented with task variations.

The post AI model performance: Is it reasoning or simply reciting? appeared first on DailyAI.
