    This AI Paper from Anthropic Introduces Attribution Graphs: A New Interpretability Method to Trace Internal Reasoning in Claude 3.5 Haiku

    April 6, 2025

    While the outputs of large language models (LLMs) appear coherent and useful, the underlying mechanisms guiding these behaviors remain largely unknown. As these models are increasingly deployed in sensitive and high-stakes environments, it has become crucial to understand what they do and how they do it.

    The main challenge lies in uncovering the internal steps that lead a model to a specific response. The computations happen across hundreds of layers and billions of parameters, making it difficult to isolate the processes involved. Without a clear understanding of these steps, it becomes harder to trust or debug model behavior, especially in tasks that require reasoning, planning, or factual reliability. Researchers are thus focused on reverse-engineering these models to identify how information flows and how decisions are made internally.

    Existing interpretability methods like attention maps and feature attribution offer partial views into model behavior. While these tools help highlight which input tokens contribute to outputs, they often fail to trace the full chain of reasoning or identify intermediate steps. Moreover, these tools usually focus on surface-level behaviors and do not provide consistent insight into deeper computational structures. This has created the need for more structured, fine-grained methods to trace logic through internal representations over multiple steps.
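
    To make that limitation concrete, the sketch below shows the kind of token-level signal these existing tools produce. It is a minimal, hypothetical Python example of gradient-times-input feature attribution on a toy classifier; the model, input, and scores are stand-ins, not anything from the paper.

        import torch
        import torch.nn as nn

        torch.manual_seed(0)

        # Toy stand-in for a language model: embedding -> mean pool -> classifier.
        embed = nn.Embedding(100, 16)
        head = nn.Linear(16, 2)

        tokens = torch.tensor([[5, 17, 42, 8]])            # one hypothetical input
        emb = embed(tokens).detach().requires_grad_(True)  # attribute w.r.t. embeddings

        logits = head(emb.mean(dim=1))
        logits[0, int(logits.argmax())].backward()         # gradient of the winning logit

        # Gradient x input: one relevance score per token. This says *which*
        # tokens mattered, but nothing about the intermediate steps between them.
        saliency = (emb.grad * emb).sum(dim=-1).squeeze(0)
        print(saliency)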

    To address this, researchers from Anthropic introduced a new technique called attribution graphs. These graphs allow researchers to trace the internal flow of information between features within a model during a single forward pass. By doing so, they attempt to identify intermediate concepts or reasoning steps that are not visible from the model’s outputs alone. The attribution graphs generate hypotheses about the computational pathways a model follows, which are then tested using perturbation experiments. This approach marks a significant step toward revealing the “wiring diagram” of large models, much like how neuroscientists map brain activity.
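
    A minimal sketch of the graph-building idea, in Python and under loudly stated assumptions: nodes are features, and each edge weights the direct effect of an upstream feature on a downstream one, approximated here as activation times local sensitivity. The per-layer activations and linear maps below are random placeholders; in the actual work, interpretable features are extracted from the model itself.

        import torch

        torch.manual_seed(0)

        n_layers, n_feats = 3, 4
        # Hypothetical per-layer feature activations and layer-to-layer linear maps.
        acts = [torch.randn(n_feats).abs() for _ in range(n_layers)]
        maps = [torch.randn(n_feats, n_feats) for _ in range(n_layers - 1)]

        # Edge weight = upstream activation x local sensitivity (here simply the
        # Jacobian entry of the linear map). Weak edges are pruned for readability.
        graph = {}
        for l in range(n_layers - 1):
            for i in range(n_feats):
                for j in range(n_feats):
                    w = (acts[l][i] * maps[l][j, i]).item()
                    if abs(w) > 0.5:
                        graph.setdefault((l, i), []).append(((l + 1, j), w))

        for src, edges in sorted(graph.items()):
            for dst, w in edges:
                print(f"feature {src} -> feature {dst}: {w:+.2f}")

    The surviving edges are hypotheses about which features feed which; the perturbation experiments then test whether suppressing a node actually changes the output.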

    The researchers applied attribution graphs to Claude 3.5 Haiku, a lightweight language model released by Anthropic in October 2024. The method begins by identifying interpretable features activated by a specific input. These features are then traced to determine their influence on the final output. For example, when prompted with a riddle or poem, the model selects a set of rhyming words before writing its lines, a form of planning. In another example, the model identifies “Texas” as an intermediate step in answering the question “What’s the capital of the state containing Dallas?”, which it correctly resolves as “Austin.” The graphs reveal not only what the model outputs but also how it internally represents and transitions between ideas.
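
    That multi-hop path can be pictured as a small graph. The node names and weights below are hand-written for illustration, not extracted from the model:

        # Hand-written illustration of the Dallas -> Texas -> Austin path; the
        # node labels and edge weights are invented for readability.
        attribution_graph = {
            "token: Dallas": [("feature: Texas", 0.9)],
            "feature: Texas": [("feature: capital of Texas", 0.8)],
            "feature: capital of Texas": [("output: Austin", 0.95)],
        }

        def trace(node, depth=0):
            """Walk from an input node to the output, printing the chain."""
            for child, weight in attribution_graph.get(node, []):
                print("  " * depth + f"{node} --({weight:.2f})--> {child}")
                trace(child, depth + 1)

        trace("token: Dallas")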

    Applying attribution graphs uncovered several advanced behaviors within Claude 3.5 Haiku. In poetry tasks, the model pre-plans rhyming words before composing each line, showing anticipatory reasoning. In multi-hop questions, it forms internal intermediate representations, such as associating Dallas with Texas before determining Austin as the answer. For multilingual inputs, it leverages both language-specific and abstract circuits, with the latter more prominent in Claude 3.5 Haiku than in earlier models. Further, in medical reasoning tasks, the model generates diagnoses internally and uses them to inform follow-up questions. These findings suggest that the model can perform abstract planning, internal goal-setting, and stepwise logical deduction without explicit instruction.
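
    Such hypotheses are validated by perturbation: suppress the candidate intermediate feature and check whether the downstream answer changes. The sketch below is a toy geometric stand-in for that test, with invented feature directions and readout:

        import torch

        torch.manual_seed(0)

        d = 8
        # Hypothetical, orthogonal feature directions in a toy hidden state.
        texas = torch.randn(d); texas /= texas.norm()
        other = torch.randn(d); other -= (other @ texas) * texas; other /= other.norm()

        readout = torch.stack([texas, other])  # toy logits for ["Austin", "other"]
        hidden = 3.0 * texas + 1.0 * other     # state with the "Texas" feature active

        def answer(h):
            return ["Austin", "other"][int((readout @ h).argmax())]

        print("baseline:", answer(hidden))     # -> Austin

        # Ablate the hypothesized feature by projecting it out of the state.
        ablated = hidden - (hidden @ texas) * texas
        print("ablated: ", answer(ablated))    # -> other: the answer flips,
                                               # supporting the graph hypothesis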

    This research presents attribution graphs as a valuable interpretability tool that reveals the hidden layers of reasoning in language models. By applying this method, the team from Anthropic has shown that models like Claude 3.5 Haiku don’t merely mimic human responses—they compute through layered, structured steps. This opens the door to deeper audits of model behavior, allowing more transparent and responsible deployment of advanced AI systems.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post This AI Paper from Anthropic Introduces Attribution Graphs: A New Interpretability Method to Trace Internal Reasoning in Claude 3.5 Haiku appeared first on MarkTechPost.
