
    Why Apple’s Critique of AI Reasoning Is Premature

    June 22, 2025

    The debate around the reasoning capabilities of Large Reasoning Models (LRMs) has recently been reinvigorated by two prominent yet conflicting papers: Apple’s “Illusion of Thinking” and Anthropic’s rebuttal, “The Illusion of the Illusion of Thinking.” Apple’s paper claims fundamental limits in LRMs’ reasoning abilities, while Anthropic argues these claims stem from evaluation shortcomings rather than model failures.

    Apple’s study systematically tested LRMs on controlled puzzle environments and observed an “accuracy collapse” beyond specific complexity thresholds. Models such as Claude 3.7 Sonnet and DeepSeek-R1 reportedly failed to solve puzzles like Tower of Hanoi and River Crossing as complexity increased, and even expended less reasoning effort (measured in tokens) at higher complexities. Apple identified three distinct complexity regimes: standard LLMs outperform LRMs at low complexity, LRMs excel at medium complexity, and both collapse at high complexity. Critically, Apple concluded that these limitations stem from an inability to apply exact computation and consistent algorithmic reasoning across puzzles.
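    For a concrete sense of scale, the short Python sketch below (an illustration, not code from either paper) generates the optimal Tower of Hanoi move sequence; the full move list a model must write out grows as 2^n - 1 with the number of disks, which is why output length balloons at higher complexities.

        # Reference solver, for illustration only: the minimal solution for
        # n disks is 2**n - 1 moves, every one of which a model must emit
        # if asked for an exhaustive move list.
        def hanoi_moves(n, src="A", dst="C", aux="B"):
            """Yield the optimal sequence of (disk, from_peg, to_peg) moves."""
            if n == 0:
                return
            yield from hanoi_moves(n - 1, src, aux, dst)
            yield (n, src, dst)
            yield from hanoi_moves(n - 1, aux, dst, src)

        for n in (3, 7, 10, 15):
            print(f"{n} disks -> {len(list(hanoi_moves(n)))} moves")
        # 3 -> 7, 7 -> 127, 10 -> 1023, 15 -> 32767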

    Anthropic, however, sharply challenges Apple’s conclusions, identifying critical flaws in the experimental design rather than the models themselves. They highlight three major issues:

    1. Token Limitations vs. Logical Failures: Anthropic emphasizes that the failures observed in Apple’s Tower of Hanoi experiments were primarily due to output token limits rather than reasoning deficits. Models explicitly noted their token constraints and deliberately truncated their outputs, so what appeared to be a “reasoning collapse” was essentially a practical limitation, not a cognitive failure.
    2. Misclassification of Reasoning Breakdown: Anthropic identifies that Apple’s automated evaluation framework misinterpreted these intentional truncations as reasoning failures. The rigid scoring method did not account for the models’ awareness of, and decisions about, output length, and so unjustly penalized the LRMs.
    3. Unsolvable Problems Misinterpreted: Perhaps most significantly, Anthropic demonstrates that some of Apple’s River Crossing benchmarks were mathematically impossible to solve (e.g., instances with six or more actor/agent pairs and a boat capacity of three). Scoring these unsolvable instances as failures drastically skewed the results, making the models appear to fail at puzzles that no solver could complete; a brute-force solvability check is sketched after this list.
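    On the third point, checking whether an instance is solvable at all is cheap to do by brute force. The Python sketch below is a hypothetical illustration, not code from either paper: it assumes the usual pairing constraint for this puzzle family (an actor may not be on a bank with another actor’s agent unless their own agent is also present) and applies it to the banks only, so the exact rule set may differ from the benchmark’s.

        # Hypothetical solvability check via breadth-first search over bank states.
        from collections import deque
        from itertools import combinations

        def safe(group, n_pairs):
            """Assumed constraint: an actor may not share a bank with another
            agent unless their own agent is also present."""
            actors = {i for i in range(n_pairs) if ("actor", i) in group}
            agents = {i for i in range(n_pairs) if ("agent", i) in group}
            return all(i in agents or not (agents - {i}) for i in actors)

        def solvable(n_pairs, boat_capacity):
            everyone = frozenset(
                (role, i) for role in ("actor", "agent") for i in range(n_pairs)
            )
            start = (everyone, "left")  # all people and the boat start on the left bank
            seen, queue = {start}, deque([start])
            while queue:
                left, boat_side = queue.popleft()
                if not left:
                    return True         # everyone has crossed
                bank = left if boat_side == "left" else everyone - left
                for size in range(1, boat_capacity + 1):
                    for movers in combinations(bank, size):
                        movers = frozenset(movers)
                        new_left = left - movers if boat_side == "left" else left | movers
                        if safe(new_left, n_pairs) and safe(everyone - new_left, n_pairs):
                            state = (new_left, "right" if boat_side == "left" else "left")
                            if state not in seen:
                                seen.add(state)
                                queue.append(state)
            return False

        for n in (3, 4, 5, 6):
            print(n, "pairs, boat capacity 3:", solvable(n, 3))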

    Anthropic further tested an alternative representation method—asking models to provide concise solutions (like Lua functions)—and found high accuracy even on complex puzzles previously labeled as failures. This outcome clearly indicates the issue was with evaluation methods rather than reasoning capabilities.

    Another key point raised by Anthropic pertains to the complexity metric used by Apple—compositional depth (number of required moves). They argue this metric conflates mechanical execution with genuine cognitive difficulty. For example, while Tower of Hanoi puzzles require exponentially more moves, each decision step is trivial, whereas puzzles like River Crossing involve fewer steps but significantly higher cognitive complexity due to constraint satisfaction and search requirements.

    Both papers significantly contribute to understanding LRMs, but the tension between their findings underscores a critical gap in current AI evaluation practices. Apple’s conclusion—that LRMs inherently lack robust, generalizable reasoning—is substantially weakened by Anthropic’s critique. Instead, Anthropic’s findings suggest LRMs are constrained by their testing environments and evaluation frameworks rather than their intrinsic reasoning capacities.

    Given these insights, future research and practical evaluations of LRMs must:

    • Differentiate Clearly Between Reasoning and Practical Constraints: Tests should accommodate the practical realities of token limits and model decision-making; a rough scoring sketch follows this list.
    • Validate Problem Solvability: Ensuring puzzles or problems tested are solvable is essential for fair evaluation.
    • Refine Complexity Metrics: Metrics must reflect genuine cognitive challenges, not merely the volume of mechanical execution steps.
    • Explore Diverse Solution Formats: Assessing LRMs’ capabilities across various solution representations can better reveal their underlying reasoning strengths.
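    As a rough illustration of the first two bullets, the hypothetical grader below replays whatever Tower of Hanoi moves a model actually emitted and separates an illegal move from an output that was merely cut short by a token cap while still on a valid path. This is a sketch of the idea, not a reproduction of either paper’s harness.

        # Hypothetical grader: distinguish "wrong move" from "ran out of
        # output budget while still correct so far".
        def replay_hanoi(n_disks, moves, hit_token_cap):
            """moves: list of (disk, src_peg, dst_peg); pegs are 'A', 'B', 'C'."""
            pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}
            for disk, src, dst in moves:
                if not pegs[src] or pegs[src][-1] != disk:
                    return "invalid-move"     # a genuine reasoning error
                if pegs[dst] and pegs[dst][-1] < disk:
                    return "invalid-move"     # larger disk placed on a smaller one
                pegs[dst].append(pegs[src].pop())
            if len(pegs["C"]) == n_disks:
                return "solved"
            return "truncated-valid" if hit_token_cap else "incomplete"

        # A correct-but-truncated transcript should not be scored as a failure:
        print(replay_hanoi(3, [(1, "A", "C"), (2, "A", "B")], hit_token_cap=True))
        # -> truncated-valid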

    Ultimately, Apple’s claim that LRMs “can’t really reason” appears premature. Anthropic’s rebuttal demonstrates that LRMs possess sophisticated reasoning capabilities and can handle substantial cognitive tasks when evaluated appropriately. At the same time, the exchange underscores the importance of careful, nuanced evaluation methods for understanding both the capabilities and the limitations of emerging AI models.


    Check out the Apple paper and the Anthropic rebuttal. All credit for this research goes to the researchers of this project.

    The post Why Apple’s Critique of AI Reasoning Is Premature appeared first on MarkTechPost.
