
    RoR-Bench: Revealing Recitation Over Reasoning in Large Language Models Through Subtle Context Shifts

    April 11, 2025

In recent years, the rapid progress of LLMs has given the impression that we are nearing Artificial General Intelligence (AGI), with models seemingly capable of solving increasingly complex tasks. A fundamental question remains, however: are LLMs genuinely reasoning like humans, or merely repeating patterns learned during training? Since the release of models like GPT-3 and ChatGPT, LLMs have reshaped the research landscape, pushing boundaries across AI and science. Improvements in data quality, model scaling, and multi-step reasoning have brought LLMs close to passing high-level AGI benchmarks. Yet their true reasoning capabilities are not fully understood. Cases where advanced models fail on simple math problems raise concerns about whether they are truly reasoning or just mimicking familiar solution patterns.

Although various benchmarks exist to evaluate LLMs across domains like general knowledge, coding, math, and reasoning, many rely on tasks solvable by applying memorized templates. As a result, the actual intelligence and robustness of LLMs remain debatable. Studies show LLMs struggle with subtle context shifts, simple calculations, symbolic reasoning, and out-of-distribution prompts. These weaknesses are amplified under perturbed conditions or misleading cues. Similarly, multimodal LLMs, including vision-language models like GPT-4V and LLaVA, show the same tendency to recite instead of reason when tested with subtly altered visual or textual inputs. This suggests that issues like spurious correlations, memorization, and inefficient decoding might underlie these failures, indicating a gap between observed performance and genuine understanding.

Researchers from ByteDance Seed and the University of Illinois Urbana-Champaign introduce RoR-Bench, a new multimodal benchmark designed to identify whether LLMs rely on recitation rather than genuine reasoning when solving simple problems with subtly altered conditions. The benchmark includes 158 text and 57 image problem pairs, each featuring a basic reasoning task alongside a slightly modified version. Experiments reveal that leading models like OpenAI-o1 and DeepSeek-R1 suffer drastic performance drops, often over 60%, from these minor changes. Alarmingly, most models also struggle to recognize unsolvable problems, and preliminary fixes like prompt engineering offer limited improvement, underscoring the need for deeper solutions.

    RoR-Bench is a Chinese multimodal benchmark created to assess whether LLMs rely on memorized solution patterns rather than true reasoning. It contains 215 problem pairs—158 text-based and 57 image-based—where each pair includes an original and a subtly altered version. The original problems are simple, often from children’s puzzle sets, while the modified ones introduce minor changes that require entirely different reasoning. Annotators ensured minimal wording changes and no ambiguity. Notably, some problems are designed to have no solution or feature unrelated information, testing LLMs’ ability to recognize illogical conditions and resist recitation-based answers.
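The pair structure described above can be sketched as a minimal data model. This is an illustrative assumption about how such a benchmark might be represented, not the project's actual schema; all field and function names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ProblemPair:
    original: str         # simple problem, e.g. from a children's puzzle set
    modified: str         # subtly altered version requiring different reasoning
    original_answer: str
    modified_answer: str  # may be "unsolvable" for the trap problems
    modality: str         # "text" or "image"

def summarize(pairs):
    """Count pairs per modality, mirroring the 158 text / 57 image split."""
    counts = {}
    for p in pairs:
        counts[p.modality] = counts.get(p.modality, 0) + 1
    return counts
```

Keeping the original and modified versions in one record makes it straightforward to score a model on both halves of every pair and compare the results directly.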

The study empirically evaluates leading LLMs and VLMs on RoR-Bench, focusing on their ability to reason through subtle problem changes rather than merely recall learned patterns. Results show that most models suffer a significant performance drop, often over 50%, when tested on slightly modified problems, suggesting a reliance on memorization rather than genuine reasoning. Even techniques like Chain-of-Thought prompting or "Forced Correct" instructions provide limited improvement. Few-shot in-context learning shows some gains, especially with more examples or added instructions, but still fails to close the gap. Overall, these findings highlight the limitations of current models in adaptive reasoning.
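The headline metric, the relative accuracy drop from original to modified problems, can be computed with a short sketch. Function names are hypothetical and exact-match scoring is an assumption; the paper's actual scoring protocol may differ:

```python
def accuracy(predictions, answers):
    """Fraction of exact-match correct answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def recitation_drop(orig_preds, orig_answers, mod_preds, mod_answers):
    """Relative performance drop (%) from original to modified problems.

    A large drop suggests the model recited the memorized solution to the
    original problem instead of reasoning about the altered one.
    """
    acc_orig = accuracy(orig_preds, orig_answers)
    acc_mod = accuracy(mod_preds, mod_answers)
    if acc_orig == 0:
        return 0.0
    return 100.0 * (acc_orig - acc_mod) / acc_orig
```

For example, a model that answers all five original problems correctly but only two of the five modified versions shows a 60% relative drop, in the range the paper reports for leading models.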


In conclusion, the study introduces RoR-Bench, a Chinese multimodal benchmark designed to uncover a critical flaw in current large language models: their inability to handle simple reasoning tasks when problem conditions are slightly altered. The significant performance drop (often over 50%) suggests that these models rely on memorization rather than true reasoning. Even with added prompts or few-shot examples, the issue remains largely unresolved. While the benchmark is limited to Chinese, initial English results indicate similar weaknesses. The findings challenge assumptions about LLM intelligence and call for future research into models that reason genuinely rather than recite patterns learned from training data.


Check out the Paper. All credit for this research goes to the researchers of this project.


    The post RoR-Bench: Revealing Recitation Over Reasoning in Large Language Models Through Subtle Context Shifts appeared first on MarkTechPost.


