Trained transformer-based chess models to play like humans (including thinking time) [P]

Reddit r/MachineLearning Models

Summary

Trained transformer-based chess models for rating buckets from 800 to 2500+, predicting moves, thinking time, and outcome. Achieves strong accuracy with only 9M parameters, and includes a novel thinking-time prediction component.

I trained a set of deep learning (transformer-based) chess models to play like humans (inspired by MAIA and Grandmaster Chess Without Search). There's a separate model for each 100-point rating bucket from ~800 to 2500+. I started by training a mid-strength model from scratch on an 8xH100 cluster, then fine-tuned models for the other rating ranges on my local 5090 GPU. The training set was nearly a year of Lichess data, about 1B games in total.

Each rating range actually has 3 models: a move model, a thinking time model, and a white win / draw / black win model. Despite being quite small (only 9M parameters!) the move models achieve better accuracy than MAIA-2 and are approximately on par with MAIA-3 (see [here](https://github.com/thomasj02/1e4_ai/blob/master/experiments/maia2_benchmark/RESULTS.md) for the MAIA-2 comparison). AFAIK this is the only attempt to train on thinking times in chess, so I don't have a benchmark to compare against for that.

Likely because of the network size, the models aren't quite as good as they could be at high ratings. They see short tactical motifs but can't do deep calculation; a bigger model would probably help here.

The move and win models take player ratings and clock times into account. For instance, under extreme time pressure a much stronger player's win probability drops, even against a weaker opponent. The models blunder more under time pressure as well.

The data pipeline is C++ via nanobind, then training with PyTorch. Getting this right was actually the thing I spent the most time on. Pre-shuffling the dataset once, then reading the shuffled dataset sequentially at training time, kept GPU utilization high. Without this, training spent a huge percentage of time on I/O while the GPU sat idle.

Happy to answer questions about the rating-conditioning, the clock model, or the data pipeline.

Code (including training code and model weights) is at [https://github.com/thomasj02/1e4_ai/](https://github.com/thomasj02/1e4_ai/). A demo is at [https://1e4.ai/](https://1e4.ai/), but all the frontend code is also in the repo if you want to self-host.
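
For anyone curious what "pre-shuffle once, read sequentially" looks like in practice, here is a minimal Python sketch of the idea. It is not the repo's actual pipeline (that one is C++ via nanobind). The record layout (`position_id`, `move_id`, `think_centiseconds`) and the function names `write_records`, `preshuffle`, and `iter_batches` are hypothetical and chosen only for illustration. The in-memory shuffle is also an assumption that only works for small data; at ~1B games you would need a chunked, on-disk shuffle.

```python
import random
import struct

# Hypothetical fixed-size record: (position_id, move_id, think_centiseconds).
# Fixed-size records let the reader slice a byte chunk without any parsing.
RECORD = struct.Struct("<IHH")


def write_records(path, records):
    """Write records back to back into a flat binary file."""
    with open(path, "wb") as f:
        for rec in records:
            f.write(RECORD.pack(*rec))


def preshuffle(src_path, dst_path, seed=0):
    """One-off offline pass: load all records, shuffle them, rewrite them.

    Assumption: the dataset fits in memory. A real large-scale dataset
    would need an external (chunked) shuffle instead.
    """
    with open(src_path, "rb") as f:
        data = f.read()
    size = RECORD.size
    records = [data[i:i + size] for i in range(0, len(data), size)]
    random.Random(seed).shuffle(records)
    with open(dst_path, "wb") as f:
        f.write(b"".join(records))


def iter_batches(path, batch_size):
    """Training-time reader: sequential reads only, no random seeks."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size * batch_size)
            if not chunk:
                break
            yield [
                RECORD.unpack_from(chunk, offset)
                for offset in range(0, len(chunk), RECORD.size)
            ]
```

The point of the design is that the expensive random access happens once, offline. At training time every batch is one contiguous read, so the loader can keep up with the GPU without seeking around the disk.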
