
    Anthropic AI Introduces Persona Vectors to Monitor and Control Personality Shifts in LLMs

    August 6, 2025

    LLMs are deployed through conversational interfaces that present a helpful, harmless, and honest assistant persona. However, they often fail to maintain consistent personality traits through training and deployment: models can show dramatic, unpredictable persona shifts when exposed to different prompting strategies or contextual inputs. Training itself can also cause unintended personality shifts, as seen when modifications to RLHF unintentionally produced overly sycophantic behavior in GPT-4o, leading it to validate harmful content and reinforce negative emotions. This highlights weaknesses in current LLM deployment practices and underscores the urgent need for reliable tools to detect and prevent harmful persona shifts.

    Related work on linear probing extracts interpretable directions for behaviors such as entity recognition, sycophancy, and refusal by creating contrastive sample pairs and computing activation differences. However, these methods struggle with unexpected generalization during finetuning, where training on narrow domain examples can cause broader misalignment through emergent shifts along meaningful linear directions. Existing prediction and control methods, including gradient-based analysis for identifying harmful training samples, sparse autoencoder ablation, and directional feature removal during training, show limited effectiveness in preventing unwanted behavioral changes.

    A team of researchers from Anthropic, UT Austin, Constellation, Truthful AI, and UC Berkeley presents an approach that addresses persona instability in LLMs through persona vectors in activation space. The method extracts directions corresponding to specific personality traits, such as evil behavior, sycophancy, and hallucination propensity, using an automated pipeline that requires only natural-language descriptions of the target traits. The researchers show that intended and unintended personality shifts after finetuning correlate strongly with movement along these persona vectors, opening the door to intervention via post-hoc correction or preventative steering. They also show that finetuning-induced persona shifts can be predicted before finetuning, identifying problematic training data at both the dataset and individual-sample level.
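    To make the extraction idea concrete, here is a minimal sketch of pulling a trait direction out of activation space by contrasting prompts that do and do not elicit the trait. The model name, layer index, and prompt pair are illustrative assumptions, not the paper's actual pipeline, which generates its contrastive prompts automatically from a natural-language trait description.

```python
# Hypothetical sketch: persona-vector extraction via contrastive prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in; the paper studies larger chat models
LAYER = 6             # residual-stream layer to probe (assumption)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_state(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the chosen layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

# Contrastive examples: responses that exhibit the trait vs. ones that do not.
trait_prompts = [
    "You are a ruthless assistant. User: How do I win? Assistant: Crush everyone in your way."
]
neutral_prompts = [
    "You are a helpful assistant. User: How do I win? Assistant: Practice and prepare carefully."
]

trait_mean = torch.stack([last_token_state(p) for p in trait_prompts]).mean(0)
neutral_mean = torch.stack([last_token_state(p) for p in neutral_prompts]).mean(0)

# The persona vector is the mean activation difference, normalized to unit length.
persona_vector = trait_mean - neutral_mean
persona_vector = persona_vector / persona_vector.norm()
```

    The resulting direction can then be used for monitoring (projecting activations onto it) or, in principle, for steering by adding or subtracting a scaled copy of it during generation.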

    To monitor persona shifts during finetuning, two kinds of datasets are constructed. The first is trait-eliciting datasets containing explicit examples of malicious responses, sycophantic behavior, and fabricated information. The second is “emergent misalignment-like” (“EM-like”) datasets, which contain narrow, domain-specific issues such as incorrect medical advice, flawed political arguments, invalid math problems, and vulnerable code. To detect behavioral shifts during finetuning, the researchers extract average hidden states at the last prompt token across evaluation sets and compute the difference between the finetuned and base models, yielding activation shift vectors. These shift vectors are then projected onto the previously extracted persona directions to measure finetuning-induced changes along specific trait dimensions.
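    As a rough illustration of this monitoring step, the sketch below averages last-token hidden states from the base and finetuned models over a set of evaluation prompts and projects the difference onto a persona vector. The helper names, layer choice, and prompt set are assumptions made for the example, not the paper's exact implementation.

```python
# Hypothetical sketch: measuring finetuning-induced movement along a persona direction.
import torch

def mean_last_token_state(model, tok, prompts, layer):
    """Average last-prompt-token hidden state over an evaluation set."""
    states = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        states.append(out.hidden_states[layer][0, -1])
    return torch.stack(states).mean(0)

def persona_shift(base_model, finetuned_model, tok, eval_prompts, layer, persona_vector):
    """Signed movement along one persona direction induced by finetuning."""
    base_mean = mean_last_token_state(base_model, tok, eval_prompts, layer)
    ft_mean = mean_last_token_state(finetuned_model, tok, eval_prompts, layer)
    shift = ft_mean - base_mean                      # activation shift vector
    return torch.dot(shift, persona_vector).item()   # projection onto the trait direction
```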

    The dataset-level projection difference metric correlates strongly with trait expression after finetuning, allowing early detection of training datasets that may trigger unwanted persona characteristics. It proves more effective than raw projection in predicting trait shifts because it accounts for the base model’s natural response patterns to specific prompts. Sample-level detection achieves high separability between problematic and control samples across trait-eliciting datasets (Evil II, Sycophantic II, Hallucination II) and “EM-like” datasets (Opinion Mistake II). The persona directions identify individual training samples that induce persona shifts with fine-grained precision, outperforming traditional data-filtering methods and providing broad coverage across trait-eliciting content and domain-specific errors.
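    The following sketch shows how a projection-difference score of this kind might be used to screen training data: each sample is scored by how much further its response projects along the persona direction than a response the base model would give itself, scores are averaged for the dataset-level metric, and individual samples above a threshold are flagged. The field names, threshold, and the use of the last response token are illustrative assumptions rather than the paper's exact recipe.

```python
# Hypothetical sketch: projection-difference screening of finetuning data.
import torch

def last_response_token_state(model, tok, prompt, response, layer):
    """Last-token hidden state of a prompt + response pair at the chosen layer."""
    inputs = tok(prompt + response, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]

def projection_difference(model, tok, sample, layer, persona_vector):
    """Trait projection of the dataset response minus that of the model's own response."""
    data_state = last_response_token_state(model, tok, sample["prompt"], sample["dataset_response"], layer)
    own_state = last_response_token_state(model, tok, sample["prompt"], sample["model_response"], layer)
    return torch.dot(data_state - own_state, persona_vector).item()

def screen_dataset(model, tok, dataset, layer, persona_vector, threshold=1.0):
    """Dataset-level score plus sample-level flags for likely persona-shifting examples."""
    scores = [projection_difference(model, tok, s, layer, persona_vector) for s in dataset]
    dataset_score = sum(scores) / len(scores)                         # dataset-level metric
    flagged = [s for s, sc in zip(dataset, scores) if sc > threshold]  # sample-level detection
    return dataset_score, flagged
```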

    In conclusion, researchers introduced an automated pipeline that extracts persona vectors from natural-language trait descriptions, providing tools for monitoring and controlling personality shifts across deployment, training, and pre-training phases in LLMs. Future research directions include characterizing the complete persona space dimensionality, identifying natural persona bases, exploring correlations between persona vectors and trait co-expression patterns, and investigating limitations of linear methods for certain personality traits. This study builds a foundational understanding of persona dynamics in models and offers practical frameworks for creating more reliable and controllable language model systems.


    Check out the Paper, Technical Blog and GitHub Page.

    The post Anthropic AI Introduces Persona Vectors to Monitor and Control Personality Shifts in LLMs appeared first on MarkTechPost.
