Building AlphaGo from scratch – Eric Jang

Summary

Eric Jang rebuilt AlphaGo from scratch and explained in detail how Monte Carlo tree search and deep learning apply to Go, demonstrating that a strong Go AI can now be reproduced at low cost.

Original Article
TL;DR: Eric Jang rebuilt AlphaGo from scratch during his sabbatical, using Monte Carlo tree search and deep learning to make Go search feasible, and explained the underlying search algorithm, the action selection strategy, and why low-cost replication is now within reach.

## Why rebuild AlphaGo?

Eric Jang (former VP of AI at 1X Technologies, former Senior Research Scientist at Google DeepMind Robotics) chose to rebuild AlphaGo during his sabbatical instead of going to the beach. AlphaGo and Go AI were what first drew him into the field. Watching AI solve Go, a game long considered impossible to search, through deep learning in 2014–2016 was mind-blowing for him. He had always been curious: how could a roughly ten-layer neural network approximate such deep computation within the game tree?

In 2020, David Wu developed KataGo at Jane Street, reducing the compute required to train a strong Go AI by a factor of 40. Thanks to this, what once required a full team and millions of dollars at DeepMind can now be done with a few thousand dollars of rented compute.

## How to play Go?

The goal of Go is to place black and white stones on the board and surround as much territory as possible. Black moves first. A stone or connected group is captured when the opponent occupies all of its orthogonally adjacent empty points, i.e., when it runs out of "liberties."

Computer Go uses the **Tromp-Taylor rules**, which are completely unambiguous. For example, human rules forbid suicide moves, while Tromp-Taylor rules allow them; the suicided stones are simply removed immediately, which leads to the same practical result. The game ends when both players pass consecutively or when someone resigns.

### Scoring differences

- **Human rules (e.g., Chinese rules):** After the game, players negotiate to agree on territory, which can be ambiguous.
- **Tromp-Taylor scoring:** Fully algorithmic (see the sketch after this list). A player's score is the number of points occupied by their stones plus the number of empty points that reach only their stones; empty points that reach both colors count for neither side. Because dead stones are never removed automatically, results can differ from human intuition: in a region surrounded by black that still contains a few white stones, humans would consider white dead, but Tromp-Taylor still awards points to white for those stones and denies black the empty points that reach them.

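Tromp-Taylor scoring is simple enough to state directly in code. Here is a minimal Python sketch, assuming the board is stored as a dict mapping coordinates to 'B', 'W', or None; the representation and function name are illustrative, not taken from the talk:

```python
def tromp_taylor_score(board: dict) -> tuple:
    """Score a finished position under Tromp-Taylor rules.

    board maps (row, col) -> 'B', 'W', or None (empty); off-board
    coordinates are simply absent. Komi is omitted for brevity.
    Illustrative sketch, not Eric's actual implementation.
    """
    score = {"B": 0, "W": 0}
    for color in board.values():
        if color is not None:
            score[color] += 1  # every stone still on the board counts for its owner

    seen = set()
    for start, color in board.items():
        if color is not None or start in seen:
            continue
        # Flood-fill one empty region, recording which colors it reaches.
        region, borders, stack = 0, set(), [start]
        seen.add(start)
        while stack:
            r, c = stack.pop()
            region += 1
            for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if nb not in board:
                    continue  # off the board
                if board[nb] is None:
                    if nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
                else:
                    borders.add(board[nb])
        if borders == {"B"}:
            score["B"] += region  # empty region reaching only black
        elif borders == {"W"}:
            score["W"] += region  # empty region reaching only white
        # A region reaching both colors counts for neither side.
    return score["B"], score["W"]
```
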
## Cracking Go: from brute-force search to Monte Carlo tree search

Go has a huge branching factor: on a 19×19 board there are about 361 choices for the first move, and a game runs roughly 250–300 moves. The naive search tree therefore has on the order of 361^300 ≈ 10^767 leaves, far more than the roughly 10^80 atoms in the observable universe.

AlphaGo uses **Monte Carlo tree search (MCTS)** to make the problem tractable. The core idea is to maintain a tree whose nodes represent states (board configurations) and whose edges represent actions. The search iteratively expands the tree, evaluating which leaves are worth exploring further.

### Data structure

Each node stores (see the sketch below):

- **Visit count N(s,a):** the number of times this node has been reached from its parent via action a.
- **Average action value Q(s,a):** the proportion of wins among all simulated games that passed through this node.
- **Action selection probability P(s,a):** introduced in the next section.
- **Child dictionary:** references to child nodes, giving the tree a linked-list-like structure.

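A minimal Python sketch of such a node; the class and field names (`MCTSNode`, `visit_count`, and so on) are assumptions for illustration, not Eric's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class MCTSNode:
    """One node of the MCTS tree (illustrative sketch)."""
    prior: float                # P(s,a): network prior for the action leading here
    visit_count: int = 0        # N(s,a): times reached from the parent
    value_sum: float = 0.0      # total outcome of simulations backed up through here
    children: dict = field(default_factory=dict)  # action -> MCTSNode

    @property
    def q_value(self) -> float:
        """Q(s,a): average outcome of simulations through this node."""
        return self.value_sum / self.visit_count if self.visit_count else 0.0
```
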
### Action selection: UCB and PUCT

In tree search, the choice of which child to descend into is governed by a scoring criterion. Early game theory used **UCB1**:

```
select argmax_a [ Q(s,a) + sqrt(ln N_parent / N(s,a)) ]
```

Q is the "exploitation" term (average win probability), and the second term is the "exploration" term, which boosts actions with fewer visits. AlphaGo uses an improved criterion, **PUCT** (Predictor + Upper Confidence bounds applied to Trees):

```
select argmax_a [ Q(s,a) + c * P(s,a) * sqrt(N_parent) / (1 + N(s,a)) ]
```

Here c is the exploration constant and P(s,a) is the neural network's prior probability estimate for the action, which steers the search toward moves the network already considers promising.

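Putting the two pieces together, PUCT selection over the node structure sketched above might look like this; the default `c_puct` value and the use of the children's total visit count as N_parent are assumptions, not details from the talk:

```python
import math

def select_child(node: MCTSNode, c_puct: float = 1.5):
    """Return the (action, child) pair maximizing the PUCT score.

    score = Q(s,a) + c_puct * P(s,a) * sqrt(N_parent) / (1 + N(s,a))
    Illustrative sketch; c_puct and tie-breaking are assumptions.
    """
    # N_parent: total visits distributed among this node's children.
    sqrt_parent = math.sqrt(sum(ch.visit_count for ch in node.children.values()))

    def puct(child: MCTSNode) -> float:
        exploration = c_puct * child.prior * sqrt_parent / (1 + child.visit_count)
        return child.q_value + exploration

    return max(node.children.items(), key=lambda item: puct(item[1]))
```

In a full MCTS loop, this selection step repeats from the root to a leaf, the leaf is expanded using the network's priors, and the simulation outcome is backed up along the visited path, incrementing each node's visit count and value sum.
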
## Why it's easier to reproduce now

KataGo cut the training compute requirement by a factor of 40, and today's LLMs can help write MCTS implementations (Eric used Claude 4.6 on the fly to generate reasonable code). An individual developer can now rent a few thousand dollars' worth of GPUs and train a strong Go AI from scratch. This is why Eric could finish the project during his sabbatical: both the technical barrier and the cost have dropped dramatically.

## Key points from the conversation

- Players sometimes intentionally let the opponent capture stones to gain a larger advantage (losing a battle to win the war).
- Sensible mid-game evaluation depends on a shared human "value function," whereas computers must rely on algorithms.
- Go is a deterministic perfect-information game; in principle its optimal strategy could be enumerated exhaustively, but the combinatorial explosion forces the use of neural networks as guides.

---

Source: Building AlphaGo from scratch – Eric Jang (https://youtu.be/X_ZVSPcZhtw?si=TnOB7lF2rbpYpLdn)