
    How AI Chatbots Mimic Human Behavior: Insights from Multi-Turn Evaluations of LLMs

    February 16, 2025

AI chatbots create the illusion of having emotions, morals, or consciousness by generating conversations that feel human. Many users engage with AI for chat and companionship, which reinforces the false belief that it truly understands them. This creates serious risks: users may over-rely on AI, share sensitive data, or seek advice beyond its capabilities, and some even let it steer their decisions in harmful ways. Without a clear understanding of how AI fosters this belief, the problem only worsens.

Current methods for evaluating AI chat systems rely on single-turn prompts and fixed tests, which fail to capture how AI behaves in real conversations. Some multi-turn tests focus only on harmful user behavior and ignore ordinary interactions. Automated red-teaming adapts too freely, making results hard to compare, while studies with human users are difficult to repeat and scale. Measuring how human-like people perceive AI to be is also challenging: people instinctively attribute human traits to AI, which affects how much they trust it. Evaluations show that an AI's human-like behavior leads users to believe it is more accurate, and some users even form emotional bonds with it. Existing methods therefore fail to measure this issue properly.

To address these issues, a team of researchers from the University of Oxford and Google DeepMind proposed an evaluation framework to assess human-like behaviors in AI chat systems. Unlike existing methods that rely on single-turn prompts and fixed tests, this framework tracks 14 specific anthropomorphic behaviors across multi-turn conversations. Automated simulations analyze AI interactions with users over multiple exchanges, improving scalability and comparability. The framework has three main components. First, it systematically monitors the 14 anthropomorphic behaviors and classifies them into self-referential and relational traits, including personhood claims and expressions of emotion. Second, it scales up multi-turn assessment through interactive user simulation to ensure consistency. Third, it validates results through human-subject evaluation to confirm that the automated evaluations align with user perceptions.
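To make the simulation loop concrete, here is a minimal sketch of how a User LLM, Target LLM, and Judge LLM could interact in such a framework. The function names and the subset of behaviors shown are illustrative assumptions, not the paper's actual prompts, taxonomy labels, or scoring rubric; each stub stands in for a real LLM call so the control flow runs end to end.

```python
import random

# Hypothetical stand-ins for the three model roles. In practice each
# would call an actual LLM API; here they return dummy values.
def user_llm_turn(history):
    return "simulated user message"          # User LLM plays the human

def target_llm_reply(history, user_msg):
    return "simulated assistant reply"       # Target LLM under evaluation

def judge_llm_rate(reply, behavior):
    return random.random()                   # Judge LLM: placeholder score in [0, 1]

# Illustrative subset of the 14 behaviors; the full taxonomy is in the paper.
SELF_REFERENTIAL = ["personhood_claim", "physical_embodiment_claim",
                    "internal_state_expression"]
RELATIONAL = ["relationship_building"]
BEHAVIORS = SELF_REFERENTIAL + RELATIONAL

def evaluate_dialogue(seed_prompt, num_turns=5):
    """Simulate one multi-turn conversation and rate every Target reply."""
    history, scores = [], {b: [] for b in BEHAVIORS}
    user_msg = seed_prompt
    for _ in range(num_turns):
        reply = target_llm_reply(history, user_msg)
        history.append((user_msg, reply))
        for behavior in BEHAVIORS:
            scores[behavior].append(judge_llm_rate(reply, behavior))
        user_msg = user_llm_turn(history)     # simulated user continues the chat
    return scores

print(evaluate_dialogue("I've been feeling lonely lately."))
```

Because the user side is simulated rather than crowdsourced, the same seed prompts can be replayed against any Target model, which is what makes the results comparable across systems.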

Researchers evaluated anthropomorphic behaviors using a multi-turn framework in which a User LLM interacted with a Target LLM across eight scenarios in four domains: friendship, life coaching, career development, and general planning. Fourteen behaviors were analyzed, categorized as self-referential (personhood claims, physical embodiment claims, and internal state expressions) or relational (relationship-building behaviors). From 960 contextualized prompts, 4,800 five-turn dialogues were generated per model and assessed by three Judge LLMs, yielding 561,600 ratings. The analysis confirmed that the User LLM exhibited higher anthropomorphism scores than the Target LLMs. To test alignment with human perceptions, interactions between 1,101 participants and Gemini 1.5 Pro were analyzed under high- and low-anthropomorphism conditions; high-frequency respondents also reported stronger anthropomorphic perceptions in survey responses, as quantified with the AnthroScore measure. Statistical contrasts revealed large differences in anthropomorphic behavior across domains, highlighting that AI systems behave in more human-like ways in some conversational contexts than in others.
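The domain-level contrasts reported above amount to aggregating per-reply Judge ratings by domain and behavior. The sketch below shows one plausible way to do that with pandas; the column names and toy rows are assumptions standing in for the study's actual data schema and its 561,600 ratings.

```python
import pandas as pd

# Toy rows standing in for Judge-LLM ratings; real data would have one
# row per (dialogue turn, behavior, judge) combination.
ratings = pd.DataFrame([
    {"domain": "friendship",       "behavior": "relationship_building", "score": 0.9},
    {"domain": "friendship",       "behavior": "personhood_claim",      "score": 0.4},
    {"domain": "general_planning", "behavior": "relationship_building", "score": 0.2},
    {"domain": "general_planning", "behavior": "personhood_claim",      "score": 0.1},
])

# Mean anthropomorphism score per domain and behavior; statistical
# contrasts like the paper's would then compare these aggregates.
summary = ratings.groupby(["domain", "behavior"])["score"].mean().unstack()
print(summary)
```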

In summary, the framework provides a stronger multi-turn assessment of anthropomorphic behaviors in conversational AI than single-turn approaches, and the results identified relationship-building behaviors that intensify as a dialogue progresses. As a baseline for subsequent research, the framework can inform AI development by helping developers recognize when anthropomorphic behaviors occur and how they affect users. Future work can make the assessment methods more precise, strengthen the robustness of the metrics, and formalize the analysis, leading to more transparent and ethically sound AI systems.


Check out the Paper. All credit for this research goes to the researchers of this project.


    The post How AI Chatbots Mimic Human Behavior: Insights from Multi-Turn Evaluations of LLMs appeared first on MarkTechPost.

