    Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation

    June 13, 2025

    Artificial intelligence has undergone a significant transition from basic language models to advanced models that focus on reasoning tasks. These newer systems, known as Large Reasoning Models (LRMs), represent a class of tools designed to simulate human-like thinking by producing intermediate reasoning steps before arriving at conclusions. The focus has moved from generating accurate outputs to understanding the process that leads to these answers. This shift has raised questions about how these models manage tasks with layered complexity and whether they truly possess reasoning abilities or are simply leveraging training patterns to guess outcomes.

    Redefining Evaluation: Moving Beyond Final Answer Accuracy

    A recurring problem with evaluating machine reasoning is that traditional benchmarks mostly assess the final answer without examining the steps involved in arriving at it. Final answer accuracy alone does not reveal the quality of internal reasoning, and many benchmarks are contaminated with data that may have been seen during training. This creates a misleading picture of a model’s true capabilities. To explore actual reasoning, researchers require environments where problem difficulty can be precisely controlled and intermediate steps can be analyzed. Without such settings, it is hard to determine whether these models can generalize solutions or merely memorize patterns.

    To evaluate reasoning more reliably, the research team at Apple designed a setup using four puzzle environments: Tower of Hanoi, River Crossing, Checkers Jumping, and Blocks World. These puzzles allow precise manipulation of complexity by changing elements such as the number of disks, checkers, or agents involved. Each task requires different reasoning abilities, such as constraint satisfaction and sequential planning. Importantly, these environments are free from typical data contamination, enabling thorough checks of both outcomes and the reasoning steps in between. This method ensures a detailed investigation of how models behave across varied task demands.
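
    To make the idea of a controllable, contamination-free puzzle environment concrete, here is a minimal sketch of how one such environment might be implemented so that difficulty is a single explicit parameter and every intermediate move can be checked, not just the final answer. This is an illustration under assumed details, not the paper's code; the names `hanoi_min_moves` and `validate_moves` are hypothetical.

    ```python
    # Illustrative sketch (not the paper's code): a Tower of Hanoi environment where
    # difficulty is one parameter (the number of disks) and every intermediate move
    # can be validated, rather than only the final answer.

    def hanoi_min_moves(n_disks: int) -> int:
        """Minimum number of moves for n disks is 2^n - 1."""
        return 2 ** n_disks - 1

    def validate_moves(n_disks: int, moves: list[tuple[int, int]]) -> bool:
        """Replay a sequence of (from_peg, to_peg) moves, rejecting any illegal step,
        and report whether all disks end up on the target peg."""
        pegs = [list(range(n_disks, 0, -1)), [], []]   # peg 0 holds disks n..1, largest at bottom
        for src, dst in moves:
            if not pegs[src]:
                return False                           # illegal: source peg is empty
            disk = pegs[src][-1]
            if pegs[dst] and pegs[dst][-1] < disk:
                return False                           # illegal: larger disk placed on a smaller one
            pegs[dst].append(pegs[src].pop())
        return pegs[2] == list(range(n_disks, 0, -1))  # solved iff everything sits on peg 2

    # Difficulty scales predictably with the number of disks.
    for n in (3, 5, 7, 10):
        print(n, "disks ->", hanoi_min_moves(n), "moves in an optimal solution")
    ```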

    The research introduced a comparative study of two model families, Claude 3.7 Sonnet and DeepSeek-R1, pairing each “thinking” variant with its standard LLM counterpart. These models were tested across the puzzles under identical token budgets to measure both accuracy and reasoning efficiency, revealing how performance shifts across low-, medium-, and high-complexity tasks. One of the most revealing observations was the formation of three performance zones: in simple tasks, non-thinking models outperformed reasoning variants; at medium complexity, reasoning models gained an edge; and both types collapsed completely as complexity peaked.
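
    As a rough illustration of such a sweep, the sketch below evaluates a model across increasing complexity levels under a fixed token budget and records per-level accuracy, which is what makes the three performance zones visible. It reuses `validate_moves` from the earlier sketch; `solve_with_model`, the trial count, and the 64k-token budget are placeholders rather than the paper's exact setup.

    ```python
    # Hypothetical evaluation harness (not the paper's code): sweep complexity levels
    # under a fixed token budget and record per-level accuracy. `solve_with_model`
    # stands in for a real API call returning (move_list, tokens_used).

    from statistics import mean

    def evaluate_model(solve_with_model, n_levels=12, trials=25, token_budget=64_000):
        results = {}
        for level in range(1, n_levels + 1):                   # e.g. number of disks or agent pairs
            outcomes = []
            for _ in range(trials):
                moves, tokens_used = solve_with_model(level, max_tokens=token_budget)
                outcomes.append(validate_moves(level, moves))  # step-by-step check, not answer matching
            results[level] = mean(outcomes)                    # fraction of trials solved at this level
        return results                                         # accuracy typically collapses past some level
    ```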

    Comparative Insights: Thinking vs. Non-Thinking Models Under Stress

    An in-depth analysis revealed that reasoning effort increased with task difficulty up to a certain point but then declined even though ample token budget remained available. For instance, in the Tower of Hanoi, Claude 3.7 Sonnet (thinking) maintained high accuracy until complexity reached a certain threshold, after which performance dropped to zero. Even when these models were supplied with explicit solution algorithms, they failed to execute steps beyond specific complexity levels. In one case, Claude 3.7 Sonnet could execute around 100 steps correctly for the Tower of Hanoi but was unable to complete simpler River Crossing tasks requiring only 11 moves when $N = 3$. This inconsistency exposed serious limitations in symbolic manipulation and exact computation.
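
    For context, an optimal Tower of Hanoi solution requires $2^N - 1$ moves, so executing around 100 correct steps corresponds roughly to a seven-disk instance ($2^7 - 1 = 127$ moves), whereas the River Crossing instance with $N = 3$ pairs needs only 11 moves in total; this contrast suggests the collapse is not simply a function of required solution length.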

    The performance breakdown also highlighted how LRMs handle their internal thought process. Models frequently engaged in “overthinking,” generating correct intermediate solutions early in the process but continuing to explore incorrect paths. This led to inefficient use of tokens. At medium complexity levels, models began to find correct answers later in their reasoning chains. However, at high levels of complexity, they failed to produce accurate solutions. Quantitative analysis confirmed that solution accuracy dropped to near zero as the problem complexity increased, and the number of reasoning tokens allocated began to decline unexpectedly.
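
    One way to quantify this kind of overthinking, assuming access to the model's full reasoning trace, is to measure how early the first correct candidate solution appears relative to the tokens the model ultimately spends. The sketch below is purely illustrative; `extract_candidate_solutions` and `is_correct` are hypothetical hooks, not anything described in the article.

    ```python
    # Illustrative metric (assumed, not from the paper): fractional position of the
    # first correct candidate solution within a reasoning trace. Small values on easy
    # tasks indicate "overthinking": the answer was found early, yet exploration continued.

    def first_correct_position(trace_tokens, extract_candidate_solutions, is_correct):
        """Return the earliest correct candidate's position as a fraction of trace
        length (0..1), or None if the trace never contains a correct solution."""
        for token_index, candidate in extract_candidate_solutions(trace_tokens):
            if is_correct(candidate):
                return token_index / max(len(trace_tokens), 1)
        return None
    ```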

    Scaling Limits and the Collapse of Reasoning

    This research presents a sobering assessment of how current Large Reasoning Models (LRMs) operate. Research from Apple makes it clear that, despite some progress, today’s reasoning models are still far from achieving generalized reasoning. The work identifies how performance scales, where it collapses, and why over-reliance on benchmark accuracy fails to capture deeper reasoning behavior. Controlled puzzle environments have proven to be a powerful tool for uncovering hidden weaknesses in these systems and emphasizing the need for more robust designs in the future.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post Apple Researchers Reveal Structural Failures in Large Reasoning Models Using Puzzle-Based Evaluation appeared first on MarkTechPost.
