MuZero: Checkmate For Software 1.0?

ML at Berkeley News

Summary

This article discusses Google DeepMind's MuZero algorithm as an example of 'Software 2.0,' arguing that while deep learning surpasses traditional software, it still relies on classical computational techniques like game tree search.

<p><em>By <a href="https://ashwinreddy.github.io/">Ashwin Reddy</a></em></p><p>Deep learning differs from mainstream software so starkly that Andrej Karpathy calls it <strong><a href="https://karpathy.medium.com/software-2-0-a64152b37c35">Software 2.0</a></strong>. The name points to deep learning&#8217;s superiority in domains as complex as protein folding prediction. But I want to argue that deep learning, though it surpasses Software 1.0, still relies on classical techniques.</p><p>MuZero, an algorithm developed by Google DeepMind, serves as an excellent example of Software 2.0&#8217;s advancement. Consider its applications.</p><ul><li><p>MuZero&#8217;s predecessor AlphaGo defeated champion Lee Sedol in a five-game Go match (Silver et al., 2016).</p></li><li><p>YouTube found promising results in <a href="https://deepmind.com/blog/article/MuZeros-first-step-from-research-into-the-real-world">compressing videos</a> with MuZero.</p></li><li><p>Tesla&#8217;s self-driving cars plan paths in the real world using a MuZero-style algorithm.</p></li></ul><p>Yet MuZero and AlphaGo are at heart extensions of chess algorithms. Computing pioneers like Alan Turing and Claude Shannon worked on such chess algorithms, which remain fundamentally the same today. Broadly, they follow this procedure.</p><ol><li><p>Consider all the moves you can make. For each move, determine where the game is likely to end. Concretely, build a game tree.</p></li><li><p>Choose the move ending in the highest score for the computer. To do this, use a heuristic, which assigns a score to each chess configuration. 
The heuristic might come from a human expert, or, as we&#8217;ll see later, be learned from experience.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fAWK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb2eb2e-0d6f-4413-8b09-79b3e00917dc_539x219.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fAWK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb2eb2e-0d6f-4413-8b09-79b3e00917dc_539x219.png 424w, https://substackcdn.com/image/fetch/$s_!fAWK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb2eb2e-0d6f-4413-8b09-79b3e00917dc_539x219.png 848w, https://substackcdn.com/image/fetch/$s_!fAWK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb2eb2e-0d6f-4413-8b09-79b3e00917dc_539x219.png 1272w, https://substackcdn.com/image/fetch/$s_!fAWK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb2eb2e-0d6f-4413-8b09-79b3e00917dc_539x219.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fAWK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb2eb2e-0d6f-4413-8b09-79b3e00917dc_539x219.png" width="539" height="219" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eeb2eb2e-0d6f-4413-8b09-79b3e00917dc_539x219.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:219,&quot;width&quot;:539,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fAWK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb2eb2e-0d6f-4413-8b09-79b3e00917dc_539x219.png 424w, https://substackcdn.com/image/fetch/$s_!fAWK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb2eb2e-0d6f-4413-8b09-79b3e00917dc_539x219.png 848w, https://substackcdn.com/image/fetch/$s_!fAWK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb2eb2e-0d6f-4413-8b09-79b3e00917dc_539x219.png 1272w, https://substackcdn.com/image/fetch/$s_!fAWK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feeb2eb2e-0d6f-4413-8b09-79b3e00917dc_539x219.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">A game tree, where edges represent actions so that a child node is a potential next state from its parent. Leaf nodes are terminal states, namely win, draw, or loss. 
(<a href="http://garabedyan.files.wordpress.com/2011/04/chess-shannon-type-a.png">Source</a>)</figcaption></figure></div><p>Deep Blue defeated world chess champion Garry Kasparov with this approach in 1997. Because Go and chess are similar, you might think this algorithm could also play Go competently. But there are two problems.</p><ol><li><p>Go uses a bigger board and allows many more moves per turn than chess. As a result, the game tree is far larger, and naive tree searches become too slow in practice.</p></li><li><p>Handcrafted heuristics haven&#8217;t been able to achieve human-level Go performance.</p></li></ol><p>In the ideal world of Software 2.0, we would train a single network that predicts the optimal move for a given Go board. The network learns a heuristic that runs in constant time per evaluation, solving both problems. Collect or generate a dataset, train the model, and voil&#224; &#8212; problem solved.</p><p>If it were that simple, we might expect MuZero to simply be a novel network architecture. Instead, DeepMind kept the tree structure. Rather than exhaustively searching the game tree, MuZero uses Monte Carlo Tree Search (MCTS), first used in Go programs by Br&#252;gmann (1993), to focus on the most promising moves. 
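The classical procedure described above (build a game tree, score positions with a heuristic, pick the best-scoring move) can be sketched in a few lines of minimax search. This is a minimal illustration, not any engine's actual code: `legal_moves`, `apply_move`, and `heuristic` are hypothetical stand-ins for a real game implementation.

```python
# A minimal minimax sketch: expand the game tree to a fixed depth, score
# leaf positions with a heuristic, and back the scores up the tree.

def minimax(state, depth, maximizing, legal_moves, apply_move, heuristic):
    """Best heuristic score reachable from `state`, alternating players."""
    moves = legal_moves(state)
    if depth == 0 or not moves:          # leaf: let the heuristic judge
        return heuristic(state)
    scores = [
        minimax(apply_move(state, m), depth - 1, not maximizing,
                legal_moves, apply_move, heuristic)
        for m in moves
    ]
    return max(scores) if maximizing else min(scores)


def best_move(state, depth, legal_moves, apply_move, heuristic):
    """Step 2 of the procedure: pick the move whose subtree scores highest."""
    return max(
        legal_moves(state),
        key=lambda m: minimax(apply_move(state, m), depth - 1, False,
                              legal_moves, apply_move, heuristic),
    )
```

A real engine like Deep Blue layers alpha-beta pruning and an elaborate hand-tuned evaluation function on top, but the skeleton is the same.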
Deep learning powers the heuristic, which in this case is a prediction function <em>f</em>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{policy}, \\text{value} = f(\\text{state}) \\tag{Prediction Function}&quot;,&quot;id&quot;:&quot;JDFKCFGKCW&quot;}" data-component-name="LatexBlockToDOM"></div><p>The value measures how good the state is, and the policy recommends the next action to take.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zKJQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f5ff759-1817-4997-bcd2-4ded39622bdc_3360x2100.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zKJQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f5ff759-1817-4997-bcd2-4ded39622bdc_3360x2100.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zKJQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f5ff759-1817-4997-bcd2-4ded39622bdc_3360x2100.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zKJQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f5ff759-1817-4997-bcd2-4ded39622bdc_3360x2100.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!zKJQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f5ff759-1817-4997-bcd2-4ded39622bdc_3360x2100.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!zKJQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f5ff759-1817-4997-bcd2-4ded39622bdc_3360x2100.jpeg" width="1456" height="910" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f5ff759-1817-4997-bcd2-4ded39622bdc_3360x2100.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:910,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zKJQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f5ff759-1817-4997-bcd2-4ded39622bdc_3360x2100.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zKJQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f5ff759-1817-4997-bcd2-4ded39622bdc_3360x2100.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zKJQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f5ff759-1817-4997-bcd2-4ded39622bdc_3360x2100.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!zKJQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f5ff759-1817-4997-bcd2-4ded39622bdc_3360x2100.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"></button></div></div></div></a><figcaption class="image-caption">A slide from <a href="https://www.youtube.com/watch?v=j0z4FweCy4M">Tesla&#8217;s AI day presentation</a>, explaining their MuZero-style planner for path-planning. The blue trapezoids on the left (labeled <em>f</em>, <em>g</em>, <em>h</em>) represent neural network models.</figcaption></figure></div><p>As the name Monte Carlo suggests, MCTS runs simulations to estimate how good a move is. For MuZero, the estimate is a visit count <em>n(s,a)</em>, which counts how many times we have simulated action <em>a</em> in state <em>s</em>. To run a simulation, we do the following.</p><ol><li><p>Use the value and policy from <em>f(s)</em> to pick the most promising move <em>a</em>.</p></li><li><p>Make move <em>a</em> and increment visit count <em>n(s,a)</em>. 
Go to the next state.</p></li><li><p>Repeat steps (1) and (2) until we find a state we&#8217;ve never seen (i.e. with visit count 0).</p></li></ol><p>To choose each move in a game of Go, we run many such simulations (800 per move in Schrittwieser et al. (2020)) and then pick the action with the highest visit count.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> We then update the policy to reflect the new visit counts. MCTS has more foresight than the policy because it sees more than a single step ahead. Schrittwieser et al. (2020) explain that &#8220;MCTS may therefore be viewed as a powerful policy improvement operator.&#8221;</p><p>Notice MCTS makes its decision from the visit counts; the policy and value only help to build the tree. If you want to encourage or discourage specific behaviors, you incorporate them into the tree, not the model. For example, Tesla prioritizes passenger comfort and penalizes the need for driver intervention through manual heuristics in the tree.</p><p>Software 2.0 promises neural networks that replace standard algorithmic approaches. But MuZero represents a continuation of the rich history of chess and Go algorithms, not a break from it. Because a neural network alone can&#8217;t handle the enormous state space or incorporate explicit behavioral constraints, MuZero keeps the tree search approach, showing that a combination of deep learning and carefully chosen classical methods can dramatically outperform either one alone. In that regard, principled engineering and lucid algorithmic thinking still matter. For the time being, Software 2.0 will have to wait.</p><h1>Citations</h1><p>Schrittwieser, J., Antonoglou, I., Hubert, T. et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588, 604&#8211;609 (2020). <a href="https://doi.org/10.1038/s41586-020-03051-4">https://doi.org/10.1038/s41586-020-03051-4</a></p><p>Silver, D., Huang, A., Maddison, C.J. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484&#8211;489 (2016). <a href="https://doi.org/10.1038/nature16961">https://doi.org/10.1038/nature16961</a></p><p>Br&#252;gmann, B. (1993). Monte Carlo Go (Vol. 44). 
Syracuse, NY: Technical report, Physics Department, Syracuse University.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The functions <em>h</em> and <em>g </em>allow MuZero to capture the most relevant parts of the observation (the full chess or Go board or the car&#8217;s sensor information) and predict what will happen when different actions are taken.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This count is meant to correlate with how good a state is. A high visit count indicates high value with high confidence. We know it&#8217;s high value because we try to visit only promising states. Our confidence is high because we&#8217;ve seen it often.</p></div></div>
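The simulation loop described in steps (1)&#8211;(3) can be sketched in code. This is a simplified illustration, not MuZero's actual implementation: the prior-weighted exploration score stands in for MuZero's full PUCT rule, value backup is omitted (as it is in the three-step description), and `f` and `step` are hypothetical stand-ins for the learned prediction and dynamics networks.

```python
import collections
import math

def choose_move(root, f, step, num_simulations=50):
    """Pick an action by running simulations and comparing visit counts n(s, a).

    `f(state)` returns (policy, value), where policy maps actions to priors;
    `step(state, action)` returns the next state. States must be hashable.
    """
    n = collections.defaultdict(int)       # visit counts n(s, a)
    expanded = set()                       # states we have seen before
    for _ in range(num_simulations):
        s = root
        while s in expanded:               # step 3: stop at an unseen state
            policy, _value = f(s)
            total = sum(n[(s, a)] for a in policy)
            # Step 1: pick the most promising move (simplified PUCT score).
            a = max(policy,
                    key=lambda a: policy[a] * math.sqrt(total + 1) / (1 + n[(s, a)]))
            n[(s, a)] += 1                 # step 2: bump the visit count...
            s = step(s, a)                 # ...and go to the next state
        expanded.add(s)                    # first visit: expand and stop
    policy, _value = f(root)
    return max(policy, key=lambda a: n[(root, a)])
```

With a prior that favors one root action, the visit counts concentrate on it, so `choose_move` returns that action. Real MuZero also backs the value estimates up the tree after each simulation and adds exploration noise at the root, which this sketch omits.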