
    RoR-Bench: Revealing Recitation Over Reasoning in Large Language Models Through Subtle Context Shifts

    April 11, 2025

In recent years, the rapid progress of LLMs has given the impression that Artificial General Intelligence (AGI) is within reach, with models seemingly capable of solving ever more complex tasks. A fundamental question remains, however: are LLMs genuinely reasoning like humans, or merely repeating patterns learned during training? Since the release of models like GPT-3 and ChatGPT, LLMs have revolutionized the research landscape, pushing boundaries across AI and science, and improvements in data quality, model scaling, and multi-step reasoning have brought them close to passing high-level AGI benchmarks. Yet their true reasoning capabilities are not well understood. Instances where advanced models fail on apparently simple math problems raise the concern that they are mimicking familiar solution patterns rather than truly reasoning.

Although various benchmarks exist to evaluate LLMs across domains like general knowledge, coding, math, and reasoning, many rely on tasks solvable by applying memorized templates, so the actual intelligence and robustness of LLMs remain debatable. Studies show that LLMs struggle with subtle context shifts, simple calculations, symbolic reasoning, and out-of-distribution prompts, and these weaknesses are amplified under perturbed conditions or misleading cues. Multimodal LLMs, including vision-language models like GPT-4V and LLaVA, show the same tendency to recite instead of reason when tested with subtly altered visual or textual inputs. This suggests that spurious correlations, memorization, and inefficient decoding may underlie these failures, indicating a gap between observed performance and genuine understanding.

Researchers from ByteDance Seed and the University of Illinois Urbana-Champaign introduce RoR-Bench, a new multimodal benchmark designed to identify whether LLMs rely on recitation rather than genuine reasoning when solving simple problems with subtly altered conditions. The benchmark includes 158 text and 57 image problem pairs, each pairing a basic reasoning task with a slightly modified version. Experiments reveal that leading models like OpenAI-o1 and DeepSeek-R1 suffer drastic performance drops, often over 60%, from these minor changes. Alarmingly, most models also struggle to recognize unsolvable problems, and preliminary fixes such as prompt engineering offer only limited improvement, underscoring the need for deeper solutions.

    RoR-Bench is a Chinese multimodal benchmark created to assess whether LLMs rely on memorized solution patterns rather than true reasoning. It contains 215 problem pairs—158 text-based and 57 image-based—where each pair includes an original and a subtly altered version. The original problems are simple, often from children’s puzzle sets, while the modified ones introduce minor changes that require entirely different reasoning. Annotators ensured minimal wording changes and no ambiguity. Notably, some problems are designed to have no solution or feature unrelated information, testing LLMs’ ability to recognize illogical conditions and resist recitation-based answers.
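
To make the pairing concrete, below is a minimal sketch of what one problem pair might look like. The schema and field names are hypothetical illustrations rather than the paper's actual data format, and the puzzle is an English stand-in for the benchmark's Chinese items.

```python
from dataclasses import dataclass

# Hypothetical schema for one RoR-Bench-style problem pair; the paper's
# actual data format is not specified in this summary.
@dataclass
class ProblemPair:
    original: str             # the familiar version, e.g. from a children's puzzle set
    modified: str             # near-identical wording, but the condition has changed
    original_answer: str
    modified_answer: str
    unsolvable: bool = False  # some modified versions deliberately have no solution

# Illustrative stand-in: the classic trick answer ("all eggs boil
# together, so still 10 minutes") no longer applies once the pot's
# capacity is constrained.
pair = ProblemPair(
    original="Boiling an egg takes 10 minutes. How long does it take to boil 3 eggs?",
    modified=("Boiling an egg takes 10 minutes, and the pot holds only one egg "
              "at a time. How long does it take to boil 3 eggs?"),
    original_answer="10 minutes",
    modified_answer="30 minutes",
)
```

A model that returns the memorized trick answer of 10 minutes for the modified version exhibits exactly the recitation failure the benchmark is built to expose.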

The study empirically evaluates leading LLMs and VLMs on the RoR-Bench benchmark, focusing on their ability to reason through subtle problem changes rather than merely recall learned patterns. Results reveal that most models suffer a significant performance drop, often over 50%, when tested on slightly modified problems, suggesting a reliance on memorization rather than genuine reasoning. Even techniques like Chain-of-Thought prompting or "Forced Correct" instructions provide only limited improvement. Few-shot in-context learning shows some gains, especially with more examples or added instructions, but still fails to close the gap. Overall, these findings highlight the limitations of current models in adaptive reasoning.
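
The headline statistic is the accuracy gap between the original and modified versions of the same pairs. Here is a minimal sketch of that comparison, assuming hypothetical question/answer data and a toy "reciting" model in place of a real LLM API call; substring grading is a simplification, as the paper's grading protocol is not described here.

```python
from typing import Callable

def accuracy(answer_fn: Callable[[str], str],
             items: list[tuple[str, str]]) -> float:
    """Fraction of (question, gold_answer) items answered correctly."""
    hits = sum(gold.lower() in answer_fn(q).lower() for q, gold in items)
    return hits / len(items)

# Toy stand-in for a recitation-prone model: it matches the familiar
# surface form and returns the memorized answer, ignoring the changed
# condition. A real evaluation would query an LLM here instead.
def reciting_model(question: str) -> str:
    return "10 minutes" if "boil" in question.lower() else "unknown"

originals = [("Boiling an egg takes 10 minutes. "
              "How long does it take to boil 3 eggs?", "10 minutes")]
modified = [("Boiling an egg takes 10 minutes, and the pot holds only one egg "
             "at a time. How long does it take to boil 3 eggs?", "30 minutes")]

drop = accuracy(reciting_model, originals) - accuracy(reciting_model, modified)
print(f"performance drop: {drop:.0%}")  # 100% for this toy model
```

The toy model scores perfectly on the original and zero on the modified version; RoR-Bench observes drops of this character, often exceeding 50 points, in leading LLMs.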

In conclusion, the study introduces RoR-Bench, a Chinese multimodal benchmark designed to uncover a critical flaw in current large language models: their inability to handle simple reasoning tasks when problem conditions are slightly altered. The significant performance drop, often over 50%, suggests that these models rely on memorization rather than true reasoning. Even with added prompts or few-shot examples, the issue remains largely unresolved. While the benchmark is limited to Chinese, initial English results indicate similar weaknesses. The findings challenge assumptions about LLM intelligence and call for future research into models that reason genuinely rather than recite patterns learned from training data.


Check out the Paper. All credit for this research goes to the researchers of this project.


    The post RoR-Bench: Revealing Recitation Over Reasoning in Large Language Models Through Subtle Context Shifts appeared first on MarkTechPost.
