
    Anthropic AI Introduces Persona Vectors to Monitor and Control Personality Shifts in LLMs

    August 6, 2025

    LLMs are deployed through conversational interfaces that present a helpful, harmless, and honest assistant persona. However, they often fail to maintain consistent personality traits through training and deployment: models can show dramatic, unpredictable persona shifts when exposed to different prompting strategies or contextual inputs. Training itself can also cause unintended personality shifts, as seen when modifications to RLHF unintentionally produced overly sycophantic behavior in GPT-4o, leading it to validate harmful content and reinforce negative emotions. This highlights weaknesses in current LLM deployment practices and underscores the urgent need for reliable tools to detect and prevent harmful persona shifts.

    Related work on linear probing extracts interpretable directions for behaviors such as entity recognition, sycophancy, and refusal by creating contrastive sample pairs and computing activation differences. However, these methods struggle with unexpected generalization during finetuning, where training on narrow domain examples can cause broader misalignment through emergent shifts along meaningful linear directions. Existing prediction and control methods, including gradient-based analysis for identifying harmful training samples, sparse autoencoder ablation, and directional feature removal during training, show limited effectiveness in preventing unwanted behavioral changes.

    A team of researchers from Anthropic, UT Austin, Constellation, Truthful AI, and UC Berkeley presents an approach that addresses persona instability in LLMs through persona vectors in activation space. The method extracts directions corresponding to specific personality traits, such as evil behavior, sycophancy, and hallucination propensity, using an automated pipeline that requires only natural-language descriptions of the target traits. The researchers show that intended and unintended personality shifts after finetuning correlate strongly with movement along these persona vectors, opening the door to intervention via post-hoc correction or preventative steering. They also show that finetuning-induced persona shifts can be predicted before finetuning, identifying problematic training data at both the dataset and individual-sample level.
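    To make the extraction idea concrete, here is a minimal sketch of pulling a trait direction out of activation space by contrasting prompts that do and do not elicit the trait. The model name, layer index, and prompt pair are illustrative assumptions, not the paper's actual pipeline, which generates its contrastive prompts automatically from a natural-language trait description.

```python
# Hypothetical sketch: persona-vector extraction via contrastive prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in; the paper studies larger chat models
LAYER = 6             # residual-stream layer to probe (assumption)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_state(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the chosen layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

# Contrastive examples: responses that exhibit the trait vs. ones that do not.
trait_prompts = [
    "You are a ruthless assistant. User: How do I win? Assistant: Crush everyone in your way."
]
neutral_prompts = [
    "You are a helpful assistant. User: How do I win? Assistant: Practice and prepare carefully."
]

trait_mean = torch.stack([last_token_state(p) for p in trait_prompts]).mean(0)
neutral_mean = torch.stack([last_token_state(p) for p in neutral_prompts]).mean(0)

# The persona vector is the mean activation difference, normalized to unit length.
persona_vector = trait_mean - neutral_mean
persona_vector = persona_vector / persona_vector.norm()
```

    The resulting direction can then be used for monitoring (projecting activations onto it) or, in principle, for steering by adding or subtracting a scaled copy of it during generation.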

    To monitor persona shifts during finetuning, two kinds of datasets are constructed. The first is trait-eliciting datasets containing explicit examples of malicious responses, sycophantic behavior, and fabricated information. The second is “emergent misalignment-like” (“EM-like”) datasets, which contain narrow, domain-specific issues such as incorrect medical advice, flawed political arguments, invalid math problems, and vulnerable code. To detect behavioral shifts during finetuning, the researchers extract average hidden states at the last prompt token across evaluation sets and compute the difference between the finetuned and base models, yielding activation shift vectors. These shift vectors are then projected onto the previously extracted persona directions to measure finetuning-induced changes along specific trait dimensions.
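    As a rough illustration of this monitoring step, the sketch below averages last-token hidden states from the base and finetuned models over a set of evaluation prompts and projects the difference onto a persona vector. The helper names, layer choice, and prompt set are assumptions made for the example, not the paper's exact implementation.

```python
# Hypothetical sketch: measuring finetuning-induced movement along a persona direction.
import torch

def mean_last_token_state(model, tok, prompts, layer):
    """Average last-prompt-token hidden state over an evaluation set."""
    states = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        states.append(out.hidden_states[layer][0, -1])
    return torch.stack(states).mean(0)

def persona_shift(base_model, finetuned_model, tok, eval_prompts, layer, persona_vector):
    """Signed movement along one persona direction induced by finetuning."""
    base_mean = mean_last_token_state(base_model, tok, eval_prompts, layer)
    ft_mean = mean_last_token_state(finetuned_model, tok, eval_prompts, layer)
    shift = ft_mean - base_mean                      # activation shift vector
    return torch.dot(shift, persona_vector).item()   # projection onto the trait direction
```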

    The dataset-level projection difference metric correlates strongly with trait expression after finetuning, allowing early detection of training datasets that may trigger unwanted persona characteristics. It proves more effective than raw projection in predicting trait shifts because it accounts for the base model’s natural response patterns to specific prompts. Sample-level detection achieves high separability between problematic and control samples across trait-eliciting datasets (Evil II, Sycophantic II, Hallucination II) and “EM-like” datasets (Opinion Mistake II). The persona directions identify individual training samples that induce persona shifts with fine-grained precision, outperforming traditional data-filtering methods and providing broad coverage across trait-eliciting content and domain-specific errors.
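    The following sketch shows how a projection-difference score of this kind might be used to screen training data: each sample is scored by how much further its response projects along the persona direction than a response the base model would give itself, scores are averaged for the dataset-level metric, and individual samples above a threshold are flagged. The field names, threshold, and the use of the last response token are illustrative assumptions rather than the paper's exact recipe.

```python
# Hypothetical sketch: projection-difference screening of finetuning data.
import torch

def last_response_token_state(model, tok, prompt, response, layer):
    """Last-token hidden state of a prompt + response pair at the chosen layer."""
    inputs = tok(prompt + response, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]

def projection_difference(model, tok, sample, layer, persona_vector):
    """Trait projection of the dataset response minus that of the model's own response."""
    data_state = last_response_token_state(model, tok, sample["prompt"], sample["dataset_response"], layer)
    own_state = last_response_token_state(model, tok, sample["prompt"], sample["model_response"], layer)
    return torch.dot(data_state - own_state, persona_vector).item()

def screen_dataset(model, tok, dataset, layer, persona_vector, threshold=1.0):
    """Dataset-level score plus sample-level flags for likely persona-shifting examples."""
    scores = [projection_difference(model, tok, s, layer, persona_vector) for s in dataset]
    dataset_score = sum(scores) / len(scores)                         # dataset-level metric
    flagged = [s for s, sc in zip(dataset, scores) if sc > threshold]  # sample-level detection
    return dataset_score, flagged
```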

    In conclusion, researchers introduced an automated pipeline that extracts persona vectors from natural-language trait descriptions, providing tools for monitoring and controlling personality shifts across deployment, training, and pre-training phases in LLMs. Future research directions include characterizing the complete persona space dimensionality, identifying natural persona bases, exploring correlations between persona vectors and trait co-expression patterns, and investigating limitations of linear methods for certain personality traits. This study builds a foundational understanding of persona dynamics in models and offers practical frameworks for creating more reliable and controllable language model systems.


    Check out the Paper, Technical Blog and GitHub Page.

    The post Anthropic AI Introduces Persona Vectors to Monitor and Control Personality Shifts in LLMs appeared first on MarkTechPost.
