Thoughts on WebAssembly as a stack machine

Eli Bendersky News

Summary

This blog post responds to the claim that WebAssembly is not a pure stack machine by discussing its design with locals and comparing it to Forth, arguing that it still fits the definition of a stack machine and that its register-like locals improve readability at no cost in performance.

<p>This week the article <a class="reference external" href="https://purplesyringa.moe/blog/wasm-is-not-quite-a-stack-machine/">Wasm is not quite a stack machine</a> has been making the rounds and caught my eye. The post claims that WASM is not a pure stack machine because it has locals and is missing some stack manipulation operations like <tt class="docutils literal">dup</tt> and <tt class="docutils literal">swap</tt>.</p> <p>While I don't necessarily disagree, IMHO it's a bit of a semantic discussion because - to the best of my knowledge - there is no <em>formal</em> definition of what a stack machine is. Wikipedia, for example, says:</p> <blockquote> [...], a stack machine is a computer processor or a process virtual machine in which the primary interaction is moving short-lived temporary values to and from a push-down stack.</blockquote> <p>WASM certainly fits this definition; the <em>primary</em> interaction is through the stack, though WASM is augmented with an infinite register file (locals). The more purist stack machines like Forth are limited to the stack and memory (pointers into which are managed on the stack); WASM has these too, plus the registers.</p> <p>Speaking of Forth, the mention of <tt class="docutils literal">dup</tt> reminded me of my own impressions of programming in that language, documented in my post about <a class="reference external" href="https://eli.thegreenplace.net/2025/implementing-forth-in-go-and-c/">implementing Forth in Go and C</a>. 
There, I highlighted the following essential library function for Forth; it adds an addend to a value stored in memory.</p> <div class="highlight"><pre><span></span><span class="kn">:</span><span class="w"> </span><span class="nc">+!</span><span class="w"> </span><span class="c1">( addend addr -- )</span><span class="w"></span> <span class="w"> </span><span class="k">tuck</span><span class="w"> </span><span class="c1">( addr addend addr )</span><span class="w"></span> <span class="w"> </span><span class="k">@</span><span class="w"> </span><span class="c1">( addr addend value-at-addr )</span><span class="w"></span> <span class="w"> </span><span class="k">+</span><span class="w"> </span><span class="c1">( addr updated-value )</span><span class="w"></span> <span class="w"> </span><span class="k">swap</span><span class="w"> </span><span class="c1">( updated-value addr )</span><span class="w"></span> <span class="w"> </span><span class="k">!</span><span class="w"> </span><span class="k">;</span><span class="w"></span> </pre></div> <p>And lamented how difficult it is to understand such code without the detailed stack view in comments alongside it.</p> <p>I find it much simpler to reason about this WASM code:</p> <div class="highlight"><pre><span></span><span class="p">(</span><span class="k">func</span> <span class="p">(</span><span class="k">export</span> <span class="s2">&quot;add_to_byte&quot;</span><span class="p">)</span> <span class="p">(</span><span class="k">param</span> <span class="nv">$addr</span> <span class="kt">i32</span><span class="p">)</span> <span class="p">(</span><span class="k">param</span> <span class="nv">$delta</span> <span class="kt">i32</span><span class="p">)</span> <span class="p">(</span><span class="nb">i32.store8</span> <span class="p">(</span><span class="nb">local.get</span> <span class="nv">$addr</span><span class="p">)</span> <span class="p">(</span><span class="nb">i32.add</span> <span class="p">(</span><span 
class="nb">i32.load8_u</span> <span class="p">(</span><span class="nb">local.get</span> <span class="nv">$addr</span><span class="p">))</span> <span class="p">(</span><span class="nb">local.get</span> <span class="nv">$delta</span><span class="p">)))</span> <span class="p">)</span> </pre></div> <p>You may say this is cheating because folded WASM instructions help readability and they're just syntactic sugar; OK, here's the linear code:</p> <div class="highlight"><pre><span></span><span class="nb">local.get</span> <span class="nv">$addr</span> <span class="nb">local.get</span> <span class="nv">$addr</span> <span class="nb">i32.load8_u</span> <span class="nb">local.get</span> <span class="nv">$delta</span> <span class="nb">i32.add</span> <span class="nb">i32.store8</span> </pre></div> <p>It's still very readable, because - while the stack is used for all the calculations and actual commands - some of the data lives in named &quot;registers&quot; instead of on the stack. So we don't need all those tuck-swap contortions to get things into the right order.</p> <p>One might worry about the duplicated <tt class="docutils literal">local.get $addr</tt>; wouldn't a real <tt class="docutils literal">dup</tt> be better? Well, not in terms of readability, as we've already discussed. How about performance? Since the stack VM is just an abstraction and the underlying CPUs executing this code are register machines anyway, the answer is no - it doesn't matter at all.</p> <p>Modern compiler engineers were forged in the fires of C and its descendants; arbitrary control flow, arbitrary register and memory access, anything goes. Compilers are quite sophisticated. 
Let's see how <tt class="docutils literal">wasmtime</tt> compiles our <tt class="docutils literal">add_to_byte</tt> to native code (using <tt class="docutils literal">wasmtime explore</tt> with its default <tt class="docutils literal"><span class="pre">opt-level=2</span></tt>); comments are added by me:</p> <div class="highlight"><pre><span></span><span class="c1">// Prologue</span> <span class="n">push</span><span class="w"> </span><span class="n">rbp</span><span class="w"></span> <span class="n">mov</span><span class="w"> </span><span class="n">rbp</span><span class="p">,</span><span class="w"> </span><span class="n">rsp</span><span class="w"></span> <span class="c1">// wasmtime&#39;s VM context pointer lives in rdi; 0x38 is likely its offset</span> <span class="c1">// to the default linear memory. Therefore, r10 will hold the base address</span> <span class="c1">// of the linear memory buffer</span> <span class="n">mov</span><span class="w"> </span><span class="n">r10</span><span class="p">,</span><span class="w"> </span><span class="n">qword</span><span class="w"> </span><span class="n">ptr</span><span class="w"> </span><span class="p">[</span><span class="n">rdi</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mh">0x38</span><span class="p">]</span><span class="w"></span> <span class="c1">// The first parameter ($addr) is in edx; since WASM values are i32, it&#39;s</span> <span class="c1">// zero-extended into the 64-bit r11 by copying into r11d</span> <span class="n">mov</span><span class="w"> </span><span class="n">r11d</span><span class="p">,</span><span class="w"> </span><span class="n">edx</span><span class="w"></span> <span class="c1">// r10+r11 is memory[$addr]; this loads the current value into rsi</span> <span class="c1">// (zero-extending from 8 bits)</span> <span class="n">movzx</span><span class="w"> </span><span class="n">rsi</span><span class="p">,</span><span class="w"> </span><span 
class="n">byte</span><span class="w"> </span><span class="n">ptr</span><span class="w"> </span><span class="p">[</span><span class="n">r10</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">r11</span><span class="p">]</span><span class="w"></span> <span class="c1">// ecx is the second parameter ($delta); this adds the addend to the</span> <span class="c1">// current value</span> <span class="n">add</span><span class="w"> </span><span class="n">esi</span><span class="p">,</span><span class="w"> </span><span class="n">ecx</span><span class="w"></span> <span class="c1">// Store cur_value+addend back into memory[$addr]</span> <span class="n">mov</span><span class="w"> </span><span class="n">byte</span><span class="w"> </span><span class="n">ptr</span><span class="w"> </span><span class="p">[</span><span class="n">r10</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">r11</span><span class="p">],</span><span class="w"> </span><span class="n">sil</span><span class="w"></span> <span class="c1">// Epilogue</span> <span class="n">mov</span><span class="w"> </span><span class="n">rsp</span><span class="p">,</span><span class="w"> </span><span class="n">rbp</span><span class="w"></span> <span class="n">pop</span><span class="w"> </span><span class="n">rbp</span><span class="w"></span> <span class="n">ret</span><span class="w"></span> </pre></div> <p>This is pretty much the code we'd expect to be emitted for the C statement <tt class="docutils literal">mem[addr] += addend</tt>, or if we were writing x86-64 assembly by hand. The compiler had no difficulty figuring out that two consecutive loads from the same WASM local produce the same value and do not - in fact - have to be duplicated. 
The WASM model makes it rather easy, because you can't alias locals; as long as there are no intervening writes into the same local, multiple reads are known to produce the same value (redundant load elimination).</p>
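<p>The point about repeated reads of a local can be illustrated with a minimal Python sketch that lowers the linear stack code into register-style operations. This is a toy covering only the <tt class="docutils literal">add_to_byte</tt> body: the <tt class="docutils literal">lower</tt> function, the <tt class="docutils literal">t0</tt>/<tt class="docutils literal">t1</tt> temporaries and the textual output format are all hypothetical, and it is not a description of how wasmtime works internally. It also assumes there is no <tt class="docutils literal">local.set</tt> between the reads; a real compiler would have to track writes to locals as well.</p>

```python
# Toy lowering pass (hypothetical, for illustration only): symbolically
# executes linear stack code and emits register-style operations.

def lower(instrs):
    """Return register-style ops for a list of linear WASM-like instructions."""
    stack = []  # compile-time stack: holds *names* of values, not values
    ops = []    # emitted register-style operations
    tmp = 0     # counter for fresh temporaries t0, t1, ...

    for instr in instrs:
        op, *args = instr.split()
        if op == "local.get":
            # Nothing is emitted: the value already lives in the local's
            # "register", so we only remember its name. Valid here because
            # this sketch assumes no intervening local.set.
            stack.append(args[0])
        elif op == "i32.load8_u":
            addr = stack.pop()
            dst = f"t{tmp}"
            tmp += 1
            ops.append(f"{dst} = load8_u mem[{addr}]")
            stack.append(dst)
        elif op == "i32.add":
            rhs = stack.pop()
            lhs = stack.pop()
            dst = f"t{tmp}"
            tmp += 1
            ops.append(f"{dst} = add {lhs}, {rhs}")
            stack.append(dst)
        elif op == "i32.store8":
            value = stack.pop()
            addr = stack.pop()
            ops.append(f"store8 mem[{addr}], {value}")
        else:
            raise ValueError(f"unsupported instruction: {instr}")

    # The function has no results, so nothing may be left on the stack.
    assert not stack, "stack must be empty at the end"
    return ops


# The linear body of add_to_byte from the article.
ADD_TO_BYTE = [
    "local.get $addr",
    "local.get $addr",
    "i32.load8_u",
    "local.get $delta",
    "i32.add",
    "i32.store8",
]

for line in lower(ADD_TO_BYTE):
    print(line)
```

<p>Running it prints three operations: a load from <tt class="docutils literal">mem[$addr]</tt>, an add, and a store to <tt class="docutils literal">mem[$addr]</tt>. Each <tt class="docutils literal">local.get</tt> only pushes a name onto the compile-time stack, so the second <tt class="docutils literal">local.get $addr</tt> emits nothing, which is the same effect a <tt class="docutils literal">dup</tt> would have had.</p>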
Source: https://eli.thegreenplace.net/2026/thoughts-on-webassembly-as-a-stack-machine
