
    Allie: A Human-Aligned Chess Bot

    April 21, 2025

    Play against Allie on lichess!

    Introduction

In 1948, Alan Turing designed what might be the first chess-playing AI: a paper program that Turing himself acted as the computer for. Since then, chess has been a testbed for nearly every generation of AI advancement. After decades of improvement, today’s top chess engines like Stockfish and AlphaZero have far surpassed the capabilities of even the strongest human grandmasters.

However, most chess players are not grandmasters, and these state-of-the-art chess AIs have been described as playing more like aliens than fellow humans.

    The core problem here is that strong AI systems are not human-aligned; they are unable to match the diversity of skill levels of human partners and unable to model human-like behaviors beyond piece movement. Understanding how to make AI systems that can effectively collaborate with and be overseen by humans is a key challenge in AI alignment. Chess provides an ideal testbed for trying out new ideas towards this goal – while modern chess engines far surpass human ability, they are completely incapable of playing in a human-like way or adapting to match their human opponents’ skill levels. In this paper, we introduce Allie, a chess-playing AI designed to bridge the gap between artificial and human intelligence in this classic game.

    What is Human-aligned Chess?

When we talk about “human-aligned” chess AI, what exactly do we mean? At its core, we want a system that is both humanlike, defined as making moves that feel natural to human players, and skill-calibrated, defined as capable of playing at a similar level against human opponents across the skill spectrum.

    Our goal here is quite different from traditional chess engines like Stockfish or AlphaZero, which are optimized solely to play the strongest moves possible. While these engines achieve superhuman performance, their play can feel alien to humans. They may instantly make moves in complex positions where humans would need time to think, or continue playing in completely lost positions where humans would normally resign.

    Building Allie

Figure 1: (a) A game state is represented as the sequence of moves that produced it, plus some metadata. This sequence is input to a Transformer, which predicts the next move, the pondering time for that move, and a value assessment of the position. (b) At inference time, we employ Monte-Carlo Tree Search with the value predictions from the model. The number of rollouts \(N_\mathrm{sim}\) is chosen dynamically based on the predicted pondering time.

    A Transformer model trained on transcripts of real games

While most prior deep learning approaches build models that take a board state as input and output a distribution over possible moves, we instead approach chess as a language modeling task. We use a Transformer architecture that inputs a sequence of moves rather than a single board state. Just as large language models learn to generate human-like text by training on vast text corpora, we hypothesized that a similar architecture could learn human-like chess by training on human game records. We train our chess “language” model on transcripts of over 93M games, encompassing a total of 6.6 billion moves, played on the chess website Lichess.
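To make the “chess as language modeling” framing concrete, here is a minimal sketch, not the authors’ implementation, of a decoder-only Transformer that reads a game as a sequence of move tokens and predicts the next move at every position. The vocabulary, layer sizes, and class names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MoveLM(nn.Module):
    """Decoder-only Transformer over move tokens (one token per half-move).
    Sizes are illustrative, not the paper's actual hyperparameters."""
    def __init__(self, vocab_size: int, d_model: int = 512,
                 n_heads: int = 8, n_layers: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, move_ids: torch.Tensor) -> torch.Tensor:
        # move_ids: (batch, seq_len) integer move tokens.
        seq_len = move_ids.size(1)
        # Causal mask so each position attends only to earlier moves.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")),
                          diagonal=1)
        h = self.backbone(self.embed(move_ids), mask=mask)
        return self.lm_head(h)  # next-move logits at every position
```

Training then mirrors ordinary language modeling: shift the move sequence by one position and minimize cross-entropy, exactly as one would for text.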

    Conditioning on Elo score

In chess, Elo scores typically range from 500 (beginner players) to 3000 (top chess professionals). To calibrate Allie’s playing strength to different levels of players, we model gameplay under a conditional generation framework, where encodings of the Elo ratings of both players are prepended to the game sequence. Specifically, we prefix each game with soft control tokens, which interpolate between a weak token, representing 500 Elo, and a strong token, representing 3000 Elo.

For a player with Elo rating \(k\), we compute a soft token \(e_k\) by linearly interpolating between the weak and strong tokens:

$$e_k = \gamma e_\text{weak} + (1-\gamma) e_\text{strong}$$

where \(\gamma = \frac{3000-k}{2500}\). During training, we prefix each game with two soft tokens corresponding to the two players’ strengths.
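As a small illustration of the soft-token computation (variable names like e_weak and e_strong are ours, not from the paper’s code):

```python
import torch

def elo_soft_token(k: float, e_weak: torch.Tensor,
                   e_strong: torch.Tensor) -> torch.Tensor:
    """Linearly interpolate between the weak (500 Elo) and strong
    (3000 Elo) control-token embeddings, per the formula above."""
    gamma = (3000.0 - k) / 2500.0
    return gamma * e_weak + (1.0 - gamma) * e_strong

# Example: a 1500-rated player gets gamma = 0.6, i.e. a token that is
# 60% weak and 40% strong. Embeddings here are random placeholders.
d_model = 512
e_weak, e_strong = torch.randn(d_model), torch.randn(d_model)
e_1500 = elo_soft_token(1500, e_weak, e_strong)
```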

    Learning objectives

    On top of the base Transformer model, Allie has three prediction objectives:

1. A policy head \(p_\theta\) that outputs a probability distribution over possible next moves
2. A pondering-time head \(t_\theta\) that outputs the number of seconds a human player would take to come up with the move
3. A value assessment head \(v_\theta\) that outputs a scalar estimate of who is expected to win the game

All three heads are individually parametrized as linear layers applied to the final hidden state of the decoder. Given a dataset \(\mathcal{D}\) of chess games, each represented as a sequence of moves \(\mathbf{m}\), the human ponder time before each move \(\mathbf{t}\), and the game outcome \(v\), we train Allie to minimize the negative log-likelihood of next moves and the mean squared error of the time and value predictions:

$$\mathcal{L}(\theta) = \sum_{(\mathbf{m}, \mathbf{t}, v) \in \mathcal{D}} \sum_{1 \le i \le N} \left( -\log p_\theta(m_i \mid \mathbf{m}_{<i}) + \left(t_\theta(\mathbf{m}_{<i}) - t_i\right)^2 + \left(v_\theta(\mathbf{m}_{<i}) - v\right)^2 \right).$$
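A hedged sketch of how the three heads and the combined loss might look in code; the Transformer backbone and batching are omitted, and all sizes are illustrative rather than the paper’s actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 512, 2048  # illustrative sizes

policy_head = nn.Linear(d_model, vocab_size)  # p_theta: next-move logits
time_head = nn.Linear(d_model, 1)             # t_theta: ponder seconds
value_head = nn.Linear(d_model, 1)            # v_theta: outcome estimate

def allie_loss(hidden: torch.Tensor, next_moves: torch.Tensor,
               ponder_times: torch.Tensor, outcome: torch.Tensor):
    """hidden: (N, d_model) final decoder states for the N positions of one
    game; next_moves: (N,) move-token targets; ponder_times: (N,) seconds;
    outcome: scalar game result, broadcast over all positions."""
    nll = F.cross_entropy(policy_head(hidden), next_moves, reduction="sum")
    time_mse = ((time_head(hidden).squeeze(-1) - ponder_times) ** 2).sum()
    value_mse = ((value_head(hidden).squeeze(-1) - outcome) ** 2).sum()
    return nll + time_mse + value_mse  # the three terms of the loss above
```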

    Adaptive Monte-Carlo Tree Search

At play-time, traditional chess engines like AlphaZero use search algorithms such as Monte-Carlo Tree Search (MCTS) to anticipate many moves into the future, evaluating different possibilities for how the game might go. The search budget \(N_\mathrm{sim}\) is almost always fixed: the engine spends the same amount of compute on search regardless of whether the best next move is extremely obvious or pivotal to the outcome of the game.

This fixed budget doesn’t match human behavior; humans naturally spend more time analyzing critical or complex positions compared to simple ones. In Allie, we introduce a time-adaptive MCTS procedure that varies the amount of search based on Allie’s prediction of how long a human would think in each position. If Allie predicts a human would spend more time on a position, it performs more search iterations to better match human depth of analysis. To keep things simple, we just set the number of search iterations \(N_\mathrm{sim}\) in proportion to the predicted pondering time.
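A minimal sketch of this budget rule; the proportionality constant and the clamping bounds are our illustrative assumptions, not the paper’s exact settings:

```python
def search_budget(predicted_ponder_seconds: float,
                  sims_per_second: int = 10,
                  min_sims: int = 1,
                  max_sims: int = 800) -> int:
    """Scale the MCTS rollout count N_sim with the model's predicted
    human pondering time, clamped to a sensible range."""
    n_sim = round(predicted_ponder_seconds * sims_per_second)
    return max(min_sims, min(max_sims, n_sim))

# A position where a human would think for 12s gets 120 rollouts;
# an "obvious" 0.5s move gets only 5.
```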

    How does Allie Play?

To evaluate whether Allie is human-aligned, we assess its performance both on an offline dataset and online against real human players.

Figure 2: Allie significantly outperforms previous state-of-the-art methods. Adaptive search enables matching human moves at expert levels.

In offline games, Allie achieves state-of-the-art move-matching accuracy (defined as the percentage of moves made that match real human moves). It also models human resignation behavior and pondering time very well.

Figure 3: Allie’s time predictions are strongly correlated with ground-truth human time usage. In the figure, we show the median and IQR of Allie’s think time for different amounts of time spent by humans.
Figure 4: Allie learns to assign reliable value estimates to board states by observing game outcomes alone. We report the Pearson’s r correlation of value estimates by Allie and Stockfish with game outcomes.

Another main insight of our paper is that adaptive search enables remarkable skill calibration against players across the skill spectrum. Against players rated from 1100 to 2500 Elo, the adaptive-search variant of Allie has an average skill gap of only 49 Elo points. In other words, Allie (with adaptive search) wins about 50% of games against opponents ranging from beginner to expert level. Notably, none of the other methods (not even the non-adaptive MCTS baseline) can match the strength of 2500-Elo players.

Table 1: Adaptive search enables remarkable skill calibration. Mean and maximum skill calibration errors are computed by binning human players into 200-Elo groups. We also report each system’s estimated performance against players at the lower and upper ends of the Elo spectrum.

    Limitations and Future Work

Despite strong offline evaluation metrics and generally positive player feedback, Allie still exhibits occasional behaviors that feel non-humanlike. Players specifically noted Allie’s propensity toward late-game blunders and its tendency to spend too much time pondering positions where there is only one reasonable move. These observations suggest there is still room to improve our understanding of how humans allocate cognitive resources during chess play.

    For future work, we identify several promising directions. First, our approach heavily relies on available human data, which is plentiful for fast time controls but more limited for classical chess with longer thinking time. Extending our approach to model human reasoning in slower games, where players make more accurate moves with deeper calculation, represents a significant challenge. With the recent interest in reasoning models that make use of test-time compute, we hope that our adaptive search technique can be applied to improving the efficiency of allocating a limited compute budget.

If you are interested in learning more about this work, please check out our ICLR paper, Human-Aligned Chess With a Bit of Search.
