Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      The Psychology Of Color In UX Design And Digital Products

      August 15, 2025

      This week in AI dev tools: Claude Sonnet 4’s larger context window, ChatGPT updates, and more (August 15, 2025)

      August 15, 2025

      Sentry launches MCP monitoring tool

      August 14, 2025

      10 Benefits of Hiring a React.js Development Company (2025–2026 Edition)

      August 13, 2025

      14 secret phone codes that unlock hidden features on your Android and iPhone

      August 17, 2025

      Stop using AI for these 9 work tasks – here’s why

      August 17, 2025

      A smart sensor assessed my home’s risk of electrical fires, and I was impressed

      August 17, 2025

      I brought Samsung’s rugged Galaxy tablet on a hiking trip, and it weathered everything

      August 17, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      AI’s Hidden Thirst: The Water Behind Tech

      August 16, 2025
      Recent

      AI’s Hidden Thirst: The Water Behind Tech

      August 16, 2025

      Minesweeper game in 100 lines of pure JavaScript – easy tutorial

      August 16, 2025

      Maintaining Data Consistency with Laravel Database Transactions

      August 16, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      5 Best VPN for Lenovo Laptops to Enjoy the Web Safely

      August 16, 2025
      Recent

      5 Best VPN for Lenovo Laptops to Enjoy the Web Safely

      August 16, 2025

      3 Best Antivirus and Malware Protection Software

      August 16, 2025

      11 Best Antivirus Without Ads

      August 16, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Machine Learning»LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks

    LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks

    May 17, 2025

    Conversational artificial intelligence is centered on enabling large language models (LLMs) to engage in dynamic interactions where user needs are revealed progressively. These systems are widely deployed in tools that assist with coding, writing, and research by interpreting and responding to natural language instructions. The aspiration is for these models to flexibly adjust to changing user inputs over multiple turns, adapting their understanding with each new piece of information. This contrasts with static, single-turn responses and highlights a major design goal: sustaining contextual coherence and delivering accurate outcomes in extended dialogues.

    A persistent problem in conversational AI is the model’s inability to handle user instructions distributed across multiple conversation turns. Rather than receiving all necessary information simultaneously, LLMs must extract and integrate key details incrementally. However, when the task is not specified upfront, models tend to make early assumptions about what is being asked and attempt final solutions prematurely. This leads to errors that persist through the conversation, as the models often stick to their earlier interpretations. The result is that once an LLM makes a misstep in understanding, it struggles to recover, resulting in incomplete or misguided answers.

    Most current tools evaluate LLMs using single-turn, fully-specified prompts, where all task requirements are presented in one go. Even in research claiming multi-turn analysis, the conversations are typically episodic, treated as isolated subtasks rather than an evolving flow. These evaluations fail to account for how models behave when the information is fragmented and context must be actively constructed from multiple exchanges. Consequently, evaluations often miss the core difficulty models face: integrating underspecified inputs over several conversational turns without explicit direction.

    Researchers from Microsoft Research and Salesforce Research introduced a simulation setup that mimics how users reveal information in real conversations. Their “sharded simulation” method takes complete instructions from high-quality benchmarks and splits them into smaller, logically connected parts or “shards.” Each shard delivers a single element of the original instruction, which is then revealed sequentially over multiple turns. This simulates the progressive disclosure of information that happens in practice. The setup includes a simulated user powered by an LLM that decides which shard to reveal next and reformulates it naturally to fit the ongoing context. This setup also uses classification mechanisms to evaluate whether the assistant’s responses attempt a solution or require clarification, further refining the simulation of genuine interaction.

    The technology developed simulates five types of conversations, including single-turn full instructions and multiple multi-turn setups. In SHARDED simulations, LLMs received instructions one shard at a time, forcing them to wait before proposing a complete answer. This setup evaluated 15 LLMs across six generation tasks: coding, SQL queries, API actions, math problems, data-to-text descriptions, and document summaries. Each task drew from established datasets such as GSM8K, Spider, and ToTTo. For every LLM and instruction, 10 simulations were conducted, totaling over 200,000 simulations. Aptitude, unreliability, and average performance were computed using a percentile-based scoring system, allowing direct comparison of best and worst-case outcomes per model.

    Across all tasks and models, a consistent decline in performance was observed in the SHARDED setting. On average, performance dropped from 90% in single-turn to 65% in multi-turn scenarios—a 25-point decline. The main cause was not reduced capability but a dramatic rise in unreliability. While aptitude dropped by 16%, unreliability increased by 112%, revealing that models varied wildly in how they performed when information was presented gradually. For example, even top-performing models like GPT-4.1 and Gemini 2.5 Pro exhibited 30-40% average degradations. Additional compute at generation time or lowering randomness (temperature settings) offered only minor improvements in consistency.

    This research clarifies that even state-of-the-art LLMs are not yet equipped to manage complex conversations where task requirements unfold gradually. The sharded simulation methodology effectively exposes how models falter in adapting to evolving instructions, highlighting the urgent need to improve reliability in multi-turn settings. Enhancing the ability of LLMs to process incomplete instructions over time is essential for real-world applications where conversations are naturally unstructured and incremental.


    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.

    The post LLMs Struggle with Real Conversations: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop in Multi-Turn Underspecified Tasks appeared first on MarkTechPost.

    Source: Read More 

    Facebook Twitter Reddit Email Copy Link
    Previous ArticleCVE-2025-4610 – WordPress WP-Members Membership Plugin Stored Cross-Site Scripting Vulnerability
    Next Article 18 Best Free and Open Source Linux Earth Science Software

    Related Posts

    Machine Learning

    How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

    August 16, 2025
    Machine Learning

    Introducing Amazon Bedrock AgentCore Identity: Securing agentic AI at scale

    August 15, 2025
    Leave A Reply Cancel Reply

    For security, use of Google's reCAPTCHA service is required which is subject to the Google Privacy Policy and Terms of Use.

    Continue Reading

    Meta AI One Click AI Video Edits Are Here on Instagram, Facebook and Official Website

    Operating Systems

    CVE-2025-31177 – Gnuplot Heap Buffer Overflow Vulnerability

    Common Vulnerabilities and Exposures (CVEs)

    QAQ-QQ-AI-QUEST

    Development

    CVE-2025-27525 – Hitachi JP1/IT Desktop Management 2 – Smart Device Manager Windows Information Exposure Vulnerability

    Common Vulnerabilities and Exposures (CVEs)

    Highlights

    Raspberry Pi OS – Debian-based distro

    May 18, 2025

    Raspberry Pi OS is the official supported operating system for the Raspberry Pi series of…

    All Pixel Watch 3s are on sale at Best Buy – save $70 on the LTE model

    June 13, 2025

    How to Access Oracle Fusion Cloud Apps Data from Snowflake

    April 16, 2025

    Brisa 0.2.12 – Near 0.3 🔜

    May 3, 2025
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.