Anthropic researchers have successfully identified millions of concepts within Claude 3 Sonnet, one of the company’s advanced LLMs.
AI models are often considered black boxes, meaning you can’t ‘see’ inside them to understand exactly how they work.
When you provide an LLM with an input, it generates a response, but the reasoning behind its choices isn’t clear.
Your input goes in, and the output comes out – and even the AI developers themselves don’t truly understand what happens inside that ‘box.’
Neural networks create their own internal representations of information as they learn to map inputs to outputs during training. The building blocks of this process, called “neuron activations,” are represented by numerical values.
Each concept is distributed across multiple neurons, and each neuron contributes to representing multiple concepts, making it tricky to map concepts directly to individual neurons.
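To make that concrete, here is a minimal, hypothetical NumPy sketch (not Anthropic’s code) of how a concept can be smeared across many neurons, with each neuron contributing to many concepts at once:

```python
import numpy as np

rng = np.random.default_rng(0)

n_neurons = 512        # size of a hypothetical activation vector
n_concepts = 2000      # more concepts than neurons, so they must share

# Each concept is a direction spread across all neurons;
# no single neuron "owns" a concept.
concept_directions = rng.normal(size=(n_concepts, n_neurons))
concept_directions /= np.linalg.norm(concept_directions, axis=1, keepdims=True)

# An activation vector for an input that mixes a few concepts.
active = [3, 17, 1200]                               # concepts present in this input
activation = concept_directions[active].sum(axis=0)

print(activation.shape)  # (512,): every neuron carries a bit of each active concept
```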
This is broadly analogous to the human brain. Our brains process sensory inputs and generate thoughts, behaviors, and memories, yet the billions, even trillions, of processes behind those functions remain largely unknown to science.
Anthropic’s study attempts to see inside AI’s black box with a technique called “dictionary learning.”
This involves decomposing complex patterns in an AI model into linear building blocks or “atoms” that make intuitive sense to humans.
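In rough terms, the idea is to approximate each activation vector as a sparse combination of shared, human-interpretable directions. The NumPy sketch below illustrates the decomposition under made-up dimensions; it is not Anthropic’s implementation:

```python
import numpy as np

# Hypothetical shapes: d-dimensional activations, a dictionary of k "atoms".
d, k = 512, 4096
dictionary = np.random.randn(k, d)   # each row is a candidate feature direction
activation = np.random.randn(d)      # one activation vector taken from the model

# The goal is to find sparse coefficients s such that
#   activation ≈ s @ dictionary
# with only a handful of non-zero entries in s.
# Here we only show the reconstruction step, assuming s has already been learned.
s = np.zeros(k)
s[[10, 77, 301]] = [1.3, 0.8, 2.1]   # only three features "fire" on this input
reconstruction = s @ dictionary
print(reconstruction.shape)          # (512,)
```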
Mapping LLMs with Dictionary Learning
In October 2023, Anthropic applied this method to a tiny “toy” language model and found coherent features corresponding to concepts like uppercase text, DNA sequences, surnames in citations, mathematical nouns, or function arguments in Python code.
This latest study scales up the technique to work for today’s larger AI language models, in this case, Anthropic’s Claude 3 Sonnet.
Here’s a step-by-step look at how the study worked:
Identifying patterns with dictionary learning
Anthropic used dictionary learning to analyze neuron activations across various contexts and identify common patterns.
Dictionary learning groups these activations into a smaller set of meaningful “features,” representing higher-level concepts learned by the model.
By identifying these features, researchers can better understand how the model processes and represents information.
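Dictionary learning over LLM activations is commonly implemented as a sparse autoencoder. The following generic PyTorch sketch, with assumed sizes and variable names, shows the basic recipe rather than Anthropic’s actual setup:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose model activations into a large set of sparsely active features."""
    def __init__(self, d_model=1024, n_features=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activation -> feature strengths
        self.decoder = nn.Linear(n_features, d_model)  # features -> reconstructed activation

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # non-negative, mostly-zero feature activations
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

# One training step over a batch of stored activations (random stand-ins here).
sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(32, 1024)
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
loss.backward()
optimizer.step()
```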
Extracting features from the middle layer
The researchers focused on the middle layer of Claude 3.0 Sonnet, which serves as a critical point in the model’s processing pipeline.
Applying dictionary learning to this layer extracts millions of features that capture the model’s internal representations and learned concepts at this stage.
Extracting features from the middle layer allows researchers to examine the model’s understanding of information after it has processed the input before generating the final output.
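Since Claude’s weights are not publicly accessible, the sketch below uses a small open model (GPT-2) and a PyTorch forward hook to show what reading off a middle layer’s activations looks like in practice; the model and layer choice are illustrative assumptions:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Claude's internals are not public, so this uses a small open model (GPT-2)
# purely to illustrate capturing a middle layer's activations.
name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

captured = {}

def save_hidden_state(module, inputs, output):
    # output[0] is the hidden-state tensor produced by this transformer block
    captured["middle"] = output[0].detach()

middle_idx = len(model.h) // 2                      # pick the middle transformer block
model.h[middle_idx].register_forward_hook(save_hidden_state)

inputs = tokenizer("The Golden Gate Bridge is in", return_tensors="pt")
with torch.no_grad():
    model(**inputs)

print(captured["middle"].shape)                     # (batch, seq_len, hidden_size)
```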
Discovering diverse and abstract concepts
The extracted features revealed an expansive range of concepts learned by Claude, from concrete entities like cities and people to abstract notions related to scientific fields and programming syntax.
Interestingly, the features were found to be multimodal, responding to both textual and visual inputs, indicating that the model can learn and represent concepts across different modalities.
Additionally, many features were multilingual, suggesting that the model can grasp the same concept expressed in various languages.
Analyzing the organization of concepts
To understand how the model organizes and relates different concepts, the researchers analyzed the similarity between features based on their activation patterns.
They discovered that features representing related concepts tended to cluster together. For example, features associated with cities or scientific disciplines exhibited higher similarity to each other than to features representing unrelated concepts.
This suggests that the model’s internal organization of concepts aligns, to some extent, with human intuitions about conceptual relationships.
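A simple, hedged illustration of this kind of analysis: represent each feature by its activation profile across a set of inputs and compare profiles with cosine similarity (the data here is a random stand-in, not Anthropic’s):

```python
import numpy as np

# Hypothetical matrix: how strongly each of 5,000 features fires on 1,000 inputs.
n_features, n_inputs = 5000, 1000
feature_acts = np.abs(np.random.randn(n_features, n_inputs))

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

# Features with similar activation profiles tend to represent related concepts,
# e.g. two city features both firing on travel-related text.
sim = cosine_similarity(feature_acts[10], feature_acts[42])
print(f"similarity between feature 10 and feature 42: {sim:.3f}")
```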
Anthropic managed to map abstract concepts like “inner conflict.” Source: Anthropic.
Verifying the features
To confirm that the identified features directly influence the model’s behavior and outputs, the researchers conducted “feature steering” experiments.
This involved selectively amplifying or suppressing the activation of specific features during the model’s processing and observing the impact on its responses.
By manipulating individual features, the researchers could establish a direct link between those features and the model’s behavior. For instance, amplifying a feature related to a specific city caused the model to generate city-biased outputs, even in irrelevant contexts.
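Conceptually, steering amounts to adding a scaled copy of a feature’s direction to the model’s hidden states during the forward pass. The PyTorch sketch below is a generic, hypothetical version of that operation, not Anthropic’s API:

```python
import torch

def steer(activations, feature_direction, strength=10.0):
    """Nudge hidden states along a chosen feature direction.

    activations:       (batch, seq_len, d_model) hidden states at some layer
    feature_direction: (d_model,) direction for the feature being steered
    strength:          positive values amplify the concept, negative suppress it
    """
    direction = feature_direction / feature_direction.norm()
    return activations + strength * direction

# Toy usage with random tensors standing in for real model internals.
acts = torch.randn(1, 8, 1024)
city_feature = torch.randn(1024)   # hypothetical feature for a specific city
steered_acts = steer(acts, city_feature, strength=10.0)
print(steered_acts.shape)          # (1, 8, 1024)
```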
Why interpretability is critical for AI safety
Anthropic’s research is fundamentally relevant to AI interpretability and, by extension, safety.
Understanding how LLMs process and represent information helps researchers identify and mitigate risks. It lays the foundation for developing more transparent and explainable AI systems.
As Anthropic explains, “We hope that we and others can use these discoveries to make models safer. For example, it might be possible to use the techniques described here to monitor AI systems for certain dangerous behaviors (such as deceiving the user), to steer them towards desirable outcomes (debiasing), or to remove certain dangerous subject matter entirely.”
A deeper understanding of AI behavior becomes paramount as these models become ubiquitous in critical decision-making in fields such as healthcare, finance, and criminal justice. It also helps uncover the root causes of bias, hallucinations, and other unwanted or unpredictable behaviors.
For example, a recent study from the University of Bonn uncovered how graph neural networks (GNNs) used for drug discovery rely heavily on recalling similarities from training data rather than truly learning complex new chemical interactions. This makes it tough to understand how exactly these models determine new compounds of interest.
Last year, the UK government negotiated with leading AI companies, including OpenAI and DeepMind, seeking access to their AI systems’ internal decision-making processes.
Regulation like the EU’s AI Act will pressure AI companies to be more transparent, though commercial secrets seem sure to remain under lock and key.
Anthropic’s research offers a glimpse of what’s inside the box by ‘mapping’ information across the model.
However, the truth is that these models are so vast that, by Anthropic’s own admission, “We think it’s quite likely that we’re orders of magnitude short, and that if we wanted to get all the features – in all layers! – we would need to use much more compute than the total compute needed to train the underlying models.”
That’s an interesting point – reverse-engineering a model may be more computationally demanding than building the model in the first place.
It’s reminiscent of hugely expensive neuroscience projects like the Human Brain Project (HBP), which poured billions into mapping the human brain only to ultimately fail.
Never underestimate how much lies inside the black box.