Building AlphaGo from scratch – Eric Jang

Summary

Eric Jang rebuilt AlphaGo from scratch and explained in detail how Monte Carlo tree search and deep learning apply to Go, demonstrating that a strong Go AI can now be reproduced at low cost.

Original Article
TL;DR: Eric Jang rebuilt AlphaGo from scratch during his sabbatical, using Monte Carlo tree search and deep learning to make Go search feasible, and explained the underlying search algorithm, the action selection strategy, and why low-cost replication is now within reach.

## Why rebuild AlphaGo?

Eric Jang (former VP of AI at 1X Technologies, former Senior Research Scientist at Google DeepMind Robotics) chose to rebuild AlphaGo during his sabbatical instead of going to the beach. AlphaGo and Go AI were what first drew him into the field. Watching AI solve Go, a game long considered impossible to search, through deep learning in 2014–2016 was mind-blowing for him. He had always been curious: how could a roughly ten-layer neural network approximate such deep computation within the game tree?

In 2020, David Wu developed KataGo at Jane Street, reducing the compute required to train a strong Go AI by a factor of 40. Thanks to this, what once required a full team and millions of dollars at DeepMind can now be done with a few thousand dollars of rented compute.

## How to play Go?

The goal of Go is to place black and white stones on the board and surround as much territory as possible. Black moves first. A stone or connected group is captured when the opponent occupies all of its orthogonally adjacent empty points, i.e., when it runs out of "liberties."

Computer Go uses the **Tromp-Taylor rules**, which are completely unambiguous. For example, human rules forbid suicide moves, while Tromp-Taylor rules allow them; the suicided stones are simply removed immediately, which leads to the same practical result. The game ends when both players pass consecutively or when someone resigns.

### Scoring differences

- **Human rules (e.g., Chinese rules):** After the game, players negotiate to agree on territory, which can be ambiguous.
- **Tromp-Taylor scoring:** Fully algorithmic (see the sketch after this list). A player's score is the number of points occupied by their stones plus the number of empty points that reach only their stones; empty points that reach both colors count for neither side. Because dead stones are never removed automatically, results can differ from human intuition: in a region surrounded by black that still contains a few white stones, humans would consider white dead, but Tromp-Taylor still awards points to white for those stones and denies black the empty points that reach them.

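Tromp-Taylor scoring is simple enough to state directly in code. Here is a minimal Python sketch, assuming the board is stored as a dict mapping coordinates to 'B', 'W', or None; the representation and function name are illustrative, not taken from the talk:

```python
def tromp_taylor_score(board: dict) -> tuple:
    """Score a finished position under Tromp-Taylor rules.

    board maps (row, col) -> 'B', 'W', or None (empty); off-board
    coordinates are simply absent. Komi is omitted for brevity.
    Illustrative sketch, not Eric's actual implementation.
    """
    score = {"B": 0, "W": 0}
    for color in board.values():
        if color is not None:
            score[color] += 1  # every stone still on the board counts for its owner

    seen = set()
    for start, color in board.items():
        if color is not None or start in seen:
            continue
        # Flood-fill one empty region, recording which colors it reaches.
        region, borders, stack = 0, set(), [start]
        seen.add(start)
        while stack:
            r, c = stack.pop()
            region += 1
            for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if nb not in board:
                    continue  # off the board
                if board[nb] is None:
                    if nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
                else:
                    borders.add(board[nb])
        if borders == {"B"}:
            score["B"] += region  # empty region reaching only black
        elif borders == {"W"}:
            score["W"] += region  # empty region reaching only white
        # A region reaching both colors counts for neither side.
    return score["B"], score["W"]
```
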
## Cracking Go: from brute-force search to Monte Carlo tree search

Go has a huge branching factor: on a 19×19 board there are about 361 choices for the first move, and a game runs roughly 250–300 moves. The naive search tree therefore has on the order of 361^300 ≈ 10^767 leaves, far more than the roughly 10^80 atoms in the observable universe.

AlphaGo uses **Monte Carlo tree search (MCTS)** to make the problem tractable. The core idea is to maintain a tree whose nodes represent states (board configurations) and whose edges represent actions. The search iteratively expands the tree, evaluating which leaves are worth exploring further.

### Data structure

Each node stores (see the sketch below):

- **Visit count N(s,a):** the number of times this node has been reached from its parent via action a.
- **Average action value Q(s,a):** the proportion of wins among all simulated games that passed through this node.
- **Action selection probability P(s,a):** introduced in the next section.
- **Child dictionary:** references to child nodes, giving the tree a linked-list-like structure.

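A minimal Python sketch of such a node; the class and field names (`MCTSNode`, `visit_count`, and so on) are assumptions for illustration, not Eric's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class MCTSNode:
    """One node of the MCTS tree (illustrative sketch)."""
    prior: float                # P(s,a): network prior for the action leading here
    visit_count: int = 0        # N(s,a): times reached from the parent
    value_sum: float = 0.0      # total outcome of simulations backed up through here
    children: dict = field(default_factory=dict)  # action -> MCTSNode

    @property
    def q_value(self) -> float:
        """Q(s,a): average outcome of simulations through this node."""
        return self.value_sum / self.visit_count if self.visit_count else 0.0
```
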
### Action selection: UCB and PUCT

In tree search, the choice of which child to descend into is governed by a scoring criterion. Early game theory used **UCB1**:

```
select argmax_a [ Q(s,a) + sqrt(ln N_parent / N(s,a)) ]
```

Q is the "exploitation" term (average win probability), and the second term is the "exploration" term, which boosts actions with fewer visits. AlphaGo uses an improved criterion, **PUCT** (Predictor + Upper Confidence bounds applied to Trees):

```
select argmax_a [ Q(s,a) + c * P(s,a) * sqrt(N_parent) / (1 + N(s,a)) ]
```

Here c is the exploration constant and P(s,a) is the neural network's prior probability estimate for the action, which steers the search toward moves the network already considers promising.

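Putting the two pieces together, PUCT selection over the node structure sketched above might look like this; the default `c_puct` value and the use of the children's total visit count as N_parent are assumptions, not details from the talk:

```python
import math

def select_child(node: MCTSNode, c_puct: float = 1.5):
    """Return the (action, child) pair maximizing the PUCT score.

    score = Q(s,a) + c_puct * P(s,a) * sqrt(N_parent) / (1 + N(s,a))
    Illustrative sketch; c_puct and tie-breaking are assumptions.
    """
    # N_parent: total visits distributed among this node's children.
    sqrt_parent = math.sqrt(sum(ch.visit_count for ch in node.children.values()))

    def puct(child: MCTSNode) -> float:
        exploration = c_puct * child.prior * sqrt_parent / (1 + child.visit_count)
        return child.q_value + exploration

    return max(node.children.items(), key=lambda item: puct(item[1]))
```

In a full MCTS loop, this selection step repeats from the root to a leaf, the leaf is expanded using the network's priors, and the simulation outcome is backed up along the visited path, incrementing each node's visit count and value sum.
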
## Why it's easier to reproduce now

KataGo cut the training compute requirement by a factor of 40, and today's LLMs can help write MCTS implementations (Eric used Claude 4.6 on the fly to generate reasonable code). An individual developer can now rent a few thousand dollars' worth of GPUs and train a strong Go AI from scratch. This is why Eric could finish the project during his sabbatical: both the technical barrier and the cost have dropped dramatically.

## Key points from the conversation

- Players sometimes intentionally let the opponent capture stones to gain a larger advantage (losing a battle to win the war).
- Sensible mid-game evaluation depends on a shared human "value function," whereas computers must rely on algorithms.
- Go is a deterministic perfect-information game; in principle its optimal strategy could be enumerated exhaustively, but the combinatorial explosion forces the use of neural networks as guides.

---

Source: Building AlphaGo from scratch – Eric Jang (https://youtu.be/X_ZVSPcZhtw?si=TnOB7lF2rbpYpLdn)