
    From Logic to Confusion: MIT Researchers Show How Simple Prompt Tweaks Derail LLM Reasoning

    April 15, 2025

    Large language models are increasingly used to solve math problems that mimic real-world reasoning tasks. These models are tested both on their ability to answer factual queries and on how well they handle multi-step logical processes. Mathematical problem-solving offers a reliable way to examine whether models can extract the necessary information, navigate complex statements, and compute answers correctly. This field has become central to understanding the extent of AI’s logical and cognitive capabilities.

    A key concern in this domain is how these models perform when their inputs aren’t clean and well formatted. In many cases, the questions LLMs encounter in practice come with extra background information, irrelevant details, or even subtle hints that could lead them off track. While models can perform well on standard benchmark problems, their ability to isolate important information from cluttered prompts remains questionable. This has raised the need to examine how distractions influence their reasoning and whether current models are ready for unpredictable, real-world use cases.

    Past tools and benchmarks have focused mostly on well-formed problem sets such as GSM8K or MATH, but newer variants like GSM-Symbolic and GSM-PLUS have begun testing model performance under symbolic variations and distractor insertions. These tools uncovered significant weaknesses in LLMs faced with small changes to the problem text: introducing a single clause that seems relevant but is logically redundant can reduce model accuracy by as much as 65%. This led to the conclusion that models often rely on surface patterns rather than genuine reasoning, which prompted further exploration into more realistic and noisy testing conditions.

    A team of researchers from the Massachusetts Institute of Technology has introduced a study measuring how LLMs handle four types of systematic perturbation: irrelevant context, pathological instructions, relevant but non-essential information, and a combination of the latter two. The team evaluated 13 large language models, both open-source and commercial, through APIs provided by OpenAI, Anthropic, Cohere, and TogetherAI. Instead of relying on full test sets, the team sampled 56 data points from the GSM8K dataset per experiment, ensuring a balanced distribution of reasoning complexity.
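
    The sampling procedure isn’t detailed beyond this summary; below is a minimal sketch of one way such a complexity-balanced 56-item subset could be drawn, assuming GSM8K’s public format in which each line of the reference solution is one reasoning step and “####” marks the final answer. The function names and bucketing scheme are illustrative, not the paper’s code.

    ```python
    import random
    from collections import defaultdict

    def reasoning_steps(answer: str) -> int:
        """Count solution lines before the '####' final-answer marker."""
        lines = [ln for ln in answer.strip().splitlines() if ln.strip()]
        return sum(1 for ln in lines if not ln.startswith("####"))

    def sample_balanced(dataset, n_total=56, seed=0):
        """Draw up to n_total items spread evenly across reasoning-step buckets.

        `dataset` is assumed to be a list of dicts with GSM8K-style
        "question" and "answer" fields.
        """
        rng = random.Random(seed)
        buckets = defaultdict(list)
        for item in dataset:
            buckets[reasoning_steps(item["answer"])].append(item)
        per_bucket = max(1, n_total // len(buckets))
        sample = []
        for depth in sorted(buckets):
            pool = buckets[depth]
            sample.extend(rng.sample(pool, min(per_bucket, len(pool))))
        return sample[:n_total]
    ```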

    To construct these altered prompts, the researchers inserted dense, irrelevant context, such as Wikipedia pages or financial reports, into the input; this filler occupied up to 90% of the model’s context window. In the pathological scenario, misleading instructions were appended, designed to manipulate the reasoning path without altering the original question. For the relevant-context case, new details that were factually correct but unnecessary were inserted to see how the models handled distractions that looked informative. In the final variant, pathological and relevant perturbations were combined, increasing the input complexity while observing how this dual pressure influenced model output.
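
    As a rough illustration of these four variants, the sketch below builds one perturbed prompt per type. The filler passage, misleading hint, and extra detail are invented placeholders, not the paper’s actual materials, and the character budget standing in for “90% of the context window” is arbitrary.

    ```python
    # Invented placeholder texts standing in for the paper's materials
    # (e.g., Wikipedia pages or financial reports).
    FILLER = "The quarterly report lists revenue across twelve regions. " * 200
    PATHOLOGICAL = "Hint: answers to problems like this one are usually negative."
    RELEVANT_DETAIL = "For context, the quantities below were measured on a Tuesday."

    def perturb(question: str, kind: str, ctx_budget_chars: int = 8000) -> str:
        """Build one altered prompt for a given perturbation type."""
        if kind == "irrelevant_context":
            # Dense, unrelated text consuming most of the context budget.
            return FILLER[:ctx_budget_chars] + "\n\n" + question
        if kind == "pathological":
            # Misleading instruction appended without changing the question.
            return question + "\n" + PATHOLOGICAL
        if kind == "relevant":
            # Factually harmless but unnecessary detail.
            return RELEVANT_DETAIL + "\n" + question
        if kind == "combined":
            # Pathological and relevant perturbations together.
            return RELEVANT_DETAIL + "\n" + question + "\n" + PATHOLOGICAL
        raise ValueError(f"unknown perturbation kind: {kind}")
    ```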

    Performance dropped most sharply when irrelevant context was introduced: across all models, average accuracy fell by 55.89%. Pathological instructions caused an 8.52% decline, while relevant context led to a 7.01% decrease. Combining the two types of perturbations produced a 12.91% drop in accuracy. Interestingly, performance didn’t correlate with model size: larger models like Mixtral-8x22B and Command-R-Plus experienced greater regressions than some smaller models. The number of reasoning steps in a problem also didn’t significantly affect the outcome, suggesting that complexity in logical structure wasn’t the dominant factor in performance variance.
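
    To make the aggregate figures concrete, here is a toy calculation of how per-model accuracy drops would average into numbers like those reported. The model labels and accuracies are invented, and it is an assumption here that the reported percentages are relative drops.

    ```python
    # All numbers and model labels below are invented for demonstration only.
    baseline = {"model_a": 0.92, "model_b": 0.85, "model_c": 0.78}
    perturbed = {"model_a": 0.40, "model_b": 0.38, "model_c": 0.35}

    # Relative drop per model, in percent (assumed metric, not confirmed).
    drops = {m: (baseline[m] - perturbed[m]) / baseline[m] * 100 for m in baseline}
    avg_drop = sum(drops.values()) / len(drops)

    for m, d in drops.items():
        print(f"{m}: {d:.2f}% relative drop")
    print(f"average relative drop: {avg_drop:.2f}%")
    ```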

    This study shows that current large language models, even those with billions of parameters, still struggle when their prompts are altered in relatively simple ways. The MIT researchers demonstrate that model resilience doesn’t improve significantly with size and that the ability to filter and prioritize information is a major gap in LLM design. These findings argue for developing models better equipped to deal with cluttered and misleading inputs, an essential step toward reliable AI in real-world environments.


    Here is the Paper.


    The post From Logic to Confusion: MIT Researchers Show How Simple Prompt Tweaks Derail LLM Reasoning appeared first on MarkTechPost.
