Self-Programmed Execution for Language-Model Agents

arXiv cs.AI 05/11/26, 04:00 AM Papers
ai-agents self-programmed-execution lisp orchestration large-language-models agent-architecture
Summary
This paper introduces Self-Programmed Execution (SPE), an agent architecture where the language model generates its own orchestration program rather than relying on a fixed external harness. It presents 'Spell', a Lisp-based language enabling this self-editing and re-evaluation, demonstrating that frontier models can successfully perform agentic tasks using this method.
arXiv:2605.06898v1 Announce Type: new Abstract: At the heart of existing language model agents is a fixed orchestrator program responsible for the state transition between consecutive turns. This paper introduces self-programmed execution (SPE), an agent architecture in which the model completion is itself the orchestrator program, and the harness evaluates this program but does not impose its own orchestration policy. I formalize this idea using agentic machines: an SPE state is one from which a model completion can load any state of an embedded copy of the machine, meaning that it is subject to no fixed turn-to-turn orchestration policy. Realizing SPE in practice is nontrivial because the same data is both model context and executable program. I therefore introduce Spell, a Lisp-based language in which programs can edit and re-evaluate themselves, and effectful expressions like model invocations are structured such that re-evaluating an edited program does not replay its side effects. Experiments with existing models, not trained for SPE or Spell, show that frontier models can operate in this regime and accomplish challenging agentic tasks. These results demonstrate how an LM can act as an agent without any fixed orchestration policy, and they raise the question of what self-orchestration strategies might be learned by a model trained for self-programmed execution. Code is available at https://github.com/lukejoconnor/spell .
Cached at: 05/11/26, 07:09 AM
# Self-Programmed Execution for Language-Model Agents
Source: [https://arxiv.org/html/2605.06898](https://arxiv.org/html/2605.06898)
###### Abstract

At the heart of existing language model agents is a fixed orchestrator program which is responsible for the state transition between consecutive turns\. This paper introduces *self\-programmed execution* \(SPE\), an agent architecture in which the model completion is itself the orchestrator program, and the harness evaluates this program but does not impose its own orchestration policy\. I formalize this idea using agentic machines: an SPE state is one from which a model completion can load any state of an embedded copy of the machine, meaning that it is subject to no fixed turn\-to\-turn orchestration policy\. Realizing SPE in practice is nontrivial because the same data is both model context and executable program\. I therefore introduce Spell, a Lisp\-based language in which programs can edit and re\-evaluate themselves, and effectful expressions like model invocations are structured such that re\-evaluating an edited program does not replay its side effects\. Experiments with existing models, not trained for SPE or Spell, show that frontier models can operate in this regime and accomplish challenging agentic tasks\. These results demonstrate how an LM can act as an agent without any fixed orchestration policy, and they raise the question of what self\-orchestration strategies might be learned by a model trained for self\-programmed execution\. Code is available at [https://github\.com/lukejoconnor/spell](https://github.com/lukejoconnor/spell)\.

## 1 Introduction

The harness of a language model \(LM\) agent acts both as an execution layer, allowing the LM to interact with an environment, and as an orchestrator program, specifying policy for what state persists across turns, what context is provided to the model, and what actions are available\. As models have become more capable, the harness and all aspects of orchestration policy have become a major target of engineering effort, particularly the policy for what context is provided to the LM \(Horthy, [2025](https://arxiv.org/html/2605.06898#bib.bib13)\)\. \(In this paper, I use *orchestration* to describe multi\-turn, not necessarily multi\-agent, logic or policy\.\) Practitioner accounts emphasize the significance of orchestration policy \(Anthropic Applied AI Team, [2025](https://arxiv.org/html/2605.06898#bib.bib12); Lopopolo, [2026](https://arxiv.org/html/2605.06898#bib.bib66)\), and commercial frameworks have been developed for bespoke agentic workflows \(Trivedy, [2026](https://arxiv.org/html/2605.06898#bib.bib67); Hong et al\., [2024](https://arxiv.org/html/2605.06898#bib.bib16)\)\.

All of this human effort naturally suggests automation\. One recent trend is to engineer self\-evolving harnesses, for example by adding an outer reflection loop; another is to enable partial self\-orchestration, for example by providing tools for subagent delegation \(see Section [5](https://arxiv.org/html/2605.06898#S5)\)\. This paper asks whether orchestration policy must belong to the harness at all\. What if the harness were only an execution layer, and not also an orchestrator?

In *self\-programmed execution* \(SPE\), the harness evaluates a model\-written program which is solely responsible for orchestration\. This program might write a prompt, append to it the result of a tool call, and then make a recursive self\-call\. The self\-call repeats the process: it inputs a prefix, invokes the LM to complete it, and evaluates the completion:

$$\texttt{self-call}(\text{prefix})\;\coloneqq\;\operatorname{eval}(\text{complete}(\text{prefix})). \tag{1}$$
The first turn is produced by evaluating an initial program with a prompt and a self\-call\. Subsequently, every turn\-to\-turn transition is specified by a model\-written self\-call expression, and context passes through the turn boundary as the argument to this expression\. The harness still exists as an execution layer, but the program it executes is model\-written\.
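
Equation (1) can be illustrated with a minimal Python sketch. This is not Spell and not the paper's implementation: `make_self_call` and `scripted_model` are hypothetical names, the "model" is a deterministic stub, and Python's built-in `eval` stands in for the harness evaluator.

```python
from typing import Callable


def make_self_call(complete: Callable[[str], str]) -> Callable[[str], object]:
    """Build self_call(prefix) := eval(complete(prefix)) around a completion function."""

    def self_call(prefix: str) -> object:
        completion = complete(prefix)  # the model writes the program for this turn
        # The harness only evaluates; the single capability it exposes is self_call.
        env = {"__builtins__": {}, "self_call": self_call}
        return eval(completion, env)

    return self_call


def scripted_model(prefix: str) -> str:
    """Hypothetical stand-in for an LM: counts down, then stops making self-calls."""
    remaining = int(prefix.rsplit("=", 1)[1])
    if remaining == 0:
        return "'done'"
    # The next turn exists only because the completion itself asks for it.
    return f"self_call('remaining={remaining - 1}')"


self_call = make_self_call(scripted_model)
result = self_call("remaining=3")
print(result)
```

In this sketch there is no outer loop in the harness: each turn happens only because the previous completion contained a `self_call` expression, and the argument of that expression is the only context that crosses the turn boundary.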

Existing agent architectures alternate between a step which is controlled by the model and one which is controlled by the fixed orchestration policy of the harness \(Figure [1](https://arxiv.org/html/2605.06898#S1.F1)\)\. The archetypal example is the ReAct loop \(Yao et al\., [2023](https://arxiv.org/html/2605.06898#bib.bib34)\)\. Recursive language models \(RLMs\) \(Zhang et al\., [2026](https://arxiv.org/html/2605.06898#bib.bib20)\) are the closest prior architecture: they execute arbitrary model\-written code within a REPL, and this code is able to spawn subagents recursively\. However, after the model\-written program runs, the harness issues the next main\-agent turn, while maintaining the conversation history and REPL state; the transition between main\-agent turns is subject to fixed orchestration policy\. SPE shares with RLMs that a model\-written program is evaluated by an external runtime\. The difference is that this program is fully responsible for orchestration policy, not only for subagent delegation, and does not run inside a fixed agent loop\.

![Refer to caption](https://arxiv.org/html/2605.06898v1/x1.png)

Figure 1: Three agent architectures\. Colored boxes distinguish program logic that is implemented in the harness \(blue\) vs\. written by the model \(yellow\)\. In all cases, these programs are executed by a harness runtime which is external to the model\. \(a\) In ReAct \(Yao et al\., [2023](https://arxiv.org/html/2605.06898#bib.bib34)\), the model selects from a prescribed action space\. An orchestrator program runs the agent loop, maintaining state \(e\.g\., conversation history\) across turns\. \(b\) Recursive language models \(RLMs\) \(Zhang et al\., [2026](https://arxiv.org/html/2605.06898#bib.bib20)\) execute model\-written orchestration code inside a harness\-managed main\-agent loop, which performs handoff between the main agent’s own turns\. \(c\) In SPE, a model\-written program specifies the orchestration policy, free of any outer agent loop\. The model\-written program is still executed by an external runtime\.

This paper makes four contributions\. First, it introduces a simple agent architecture, SPE, which can be seen as the limiting case of a strong recent trend\. Second, it introduces formal abstractions for orchestration policy and SPE, defining an SPE state of an agentic machine and proving that the proposed architecture suffices\. Third, it introduces Self\-Programmed Execution Language for LMs \(Spell\), a language for SPE which locates and addresses the challenges that arise when implementing SPE in practice\. Fourth, it empirically studies how existing models, without SPE\- or Spell\-specific training, behave when challenged to solve hard agentic tasks using Spell\.

## 2 Formalizing self\-programmed execution

What is meant by “fixed orchestration policy”, what does it mean to ablate it, and why does doing so lead to the specific agent architecture, “evaluate the model completion as a program”, which is termed SPE? This section introduces formal abstractions\. Its main contribution is that it defines an “SPE state” of an agentic machine, without specifying any particular procedure\. The main theorem shows that, in a CEK\-style evaluator with model calls and $\operatorname{eval}$, the seed program $\texttt{let }y=\operatorname{lm}\,q\texttt{ in }\operatorname{eval}(y)$ reaches such a state\. Proofs and additional results are provided in Appendix [A](https://arxiv.org/html/2605.06898#A1)\.

###### Definition 2\.1 \(Agentic machine\)\.

Fix a prompt space $P$ and completion space $C$\. An *agentic machine* over $(P,C)$ is a tuple $X=(S,p,h)$, where $S$ is a state space, $p\colon S\to P$ is a prompt function mapping states to prompts, and $h\colon S\times C\to S\cup\{\mathbf{1},\uparrow\}$ is a harness function mapping a state/completion pair to the next state, either an element of $S$, a special halting state $\mathbf{1}$, or a divergence state $\uparrow$\. The machine is paired with a language model which maps prompts in $P$ to completions in $C$, possibly non\-deterministically\.

An agentic machine makes exactly one LM call per state transition, and this is the appropriate level of granularity because it mirrors what the LM observes\. An embedding of one machine within another is an injection which preserves model\-observable prompts, such that the model cannot distinguish a state from its embedding \(formally, see Theorem [A\.4](https://arxiv.org/html/2605.06898#A1.Thmdefinition4)\), and the model will thus behave equivalently on both\.

###### Definition 2\.2 \(Embedding\)\.

An *embedding* of $X'=(S',p',h')$ into $X=(S,p,h)$ is an injective map $e\colon S'\to S$, extended by $e(\mathbf{1})=\mathbf{1}$ and $e(\uparrow)=\uparrow$, such that $p(e(x'))=p'(x')$ and $h(e(x'),c)=e(h'(x',c))$ for every $x'\in S'$ and completion $c\in C$\.

Intuitively, an SPE state is one where, by choosing the completion, the model can select the next state rather than merely choose among prescribed actions\. Formally, it is defined by a self\-embedding which is completion\-generated:

###### Definition 2\.3 \(SPE state\)\.

For a state $x$ of an agentic machine $X$, $(X,x)$ *completion\-generates* $X'$ if there exists an embedding $e$ of $X'$ into $X$ such that for all $y\in\operatorname{im}(e)$, some completion $c$ generates the transition from $x$ to $y$: $h(x,c)=y$\. An *SPE state* of $X$ is a state $x_0$ such that $(X,x_0)$ completion\-generates $X$\.

These abstractions map cleanly onto the semantics of “orchestration” and “fixed orchestration policy\.” For a state $x$ of $X$, the map $c\mapsto h(x,c)$ is the orchestration policy exposed at that state: after the model emits completion $c$, this policy produces the next state\. In an ordinary agent loop, the image of this function is restricted, with states that cannot be reached by any model completion; we can say that the policy is fixed\. In an SPE state, by contrast, the model can access any successor state up to an embedding by emitting the appropriate completion\. Thus “no fixed orchestration policy” does not mean that the external runtime is absent or that $h$ disappears\. It means that the model chooses the successor state freely\.
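
To make Definitions 2.1 to 2.3 concrete, the following Python sketch checks them by brute force on tiny finite machines. It is an illustration under strong assumptions: the machines `free` and `loop` are invented toy examples, the strings `HALT` and `DIVERGE` stand in for the special states, and both machines share one completion set. The paper's SPE state lives in an evaluator with an unbounded state space, which this finite check does not model.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, Hashable, Sequence

HALT, DIVERGE = "HALT", "DIVERGE"  # stand-ins for the special halting and divergence states


@dataclass(frozen=True)
class Machine:
    states: Sequence[Hashable]
    completions: Sequence[str]
    prompt: Callable[[Hashable], str]              # p: S -> P
    harness: Callable[[Hashable, str], Hashable]   # h: S x C -> S, HALT, or DIVERGE


def is_embedding(e: dict, inner: Machine, outer: Machine) -> bool:
    """Check Definition 2.2: e is injective and commutes with prompts and transitions."""
    if len(set(e.values())) != len(e):
        return False
    extended = {**e, HALT: HALT, DIVERGE: DIVERGE}
    return all(
        outer.prompt(e[x]) == inner.prompt(x)
        and all(
            outer.harness(e[x], c) == extended[inner.harness(x, c)]
            for c in inner.completions
        )
        for x in inner.states
    )


def completion_generates(outer: Machine, x: Hashable, inner: Machine) -> bool:
    """Check Definition 2.3 by trying every injective map from inner states to outer states."""
    reachable = {outer.harness(x, c) for c in outer.completions}
    for image in permutations(outer.states, len(inner.states)):
        e = dict(zip(inner.states, image))
        if is_embedding(e, inner, outer) and set(image) <= reachable:
            return True
    return False


def is_spe_state(machine: Machine, x: Hashable) -> bool:
    return completion_generates(machine, x, machine)


# Toy machine where the completion picks the next state directly.
free = Machine((0, 1), ("a", "b"), lambda s: f"state {s}", lambda s, c: 0 if c == "a" else 1)
# Toy fixed loop: the harness always advances one step, whatever the completion says.
loop = Machine((0, 1, 2), ("a", "b"), lambda s: f"step {s}", lambda s, c: s + 1 if s < 2 else HALT)

print(is_spe_state(free, 0), is_spe_state(loop, 0))
```

In the `loop` machine the map from completions to successor states has a single-element image, which is the sense in which the text above calls a policy fixed.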

It remains to show that this definition is satisfied by the architecture this paper terms “SPE\.” This architecture can be realized in the following agentic machine:

###### Definition 2\.4 \(Agentic evaluator\)\.

An *agentic evaluator* is an agentic machine which wraps a standard evaluator, here the CEK machine, around LM calls\. The states of the agentic evaluator are the subset of evaluator states at which the next step of computation is to make an LM call via the term $(\operatorname{lm}\,v)$ for some value $v$; its harness function replaces that term with the sampled completion and continues ordinary evaluation until the next boundary state\.

The procedure “prompt the model and evaluate its completion” is encoded in the CEK agentic evaluator by the following program\. We write $x\xleftarrow{\partial}E$ to mean that $x$ is the first boundary state reached by evaluating program $E$:

$$x^{*}\;\xleftarrow{\partial}\;\texttt{let }y=\operatorname{lm}\,q\texttt{ in }\operatorname{eval}(y). \tag{2}$$

###### Theorem 2\.5 \(SPE\)\.

The state $x^{*}$ is an SPE state of the CEK agentic evaluator\.

At a lower level of abstraction, the program $\operatorname{eval}$ which evaluates the model completion could be complex\. For example, it might include API transport logic and error\-recovery logic\. Such logic counts as orchestration policy only when it constrains what successor states of the agentic machine, at the granularity of turns, can be reached by the model\.

###### Corollary 2\.6 \(Universality\)\.

Because the underlying CEK machine is Turing complete, $x^{*}$ completion\-generates all agentic machines over the same LM interface whose prompt and harness functions are computable\.

The universality corollary means that in principle, any fixed orchestration policy \(if computable\) could be installed by a model\-written program\. However, it cuts both ways: any orchestration program written by the model can just as well be written by a human or otherwise fixed externally\. It also should not be read as a performance guarantee, as it says nothing about what programs a given model will actually write\. Thus, the distinguishing feature of SPE is not what orchestration policy can be expressed, and especially not what policy is actually expressed by a given model, but rather what entity is responsible for expressing it\.

## 3 A practical language for SPE

Making SPE practical requires a language with unusual properties, motivating the development of Spell \(see Appendix [B](https://arxiv.org/html/2605.06898#A2) for language details\)\. In SPE, the same data, a model completion, is both the content of the model’s context window and the program specifying what context becomes the prefix or prompt for a subsequent turn\. Consider the following attempt to express one iteration of a ReAct\-style loop as a model\-written program, particularly the appearance of this\_entire\_completion\(\):

    prompt := "Summarize README.md."
    // LM turn 1 begins here
    file_contents := read_file("README.md")
    next_prefix :=
        concatenate(this_entire_completion(), "observation := ", file_contents)
    self_call(next_prefix)

Challenge 1: context persistence as code\. The expression this\_entire\_completion\(\) serializes the source code of the running program\. That source code is edited in order to produce a new prefix, here by concatenating one additional line\. In order to generalize such logic, an SPE language should make it easy for the model to manipulate code as data\. Spell therefore uses the syntax of Lisp, specifically Clojure \(McCarthy, [1960](https://arxiv.org/html/2605.06898#bib.bib39); Hickey, [2020](https://arxiv.org/html/2605.06898#bib.bib45)\): programs are written as nested expressions whose syntax matches the procedure of evaluating them, and this syntax greatly simplifies the logic of programs which generate or transform source code\. In order to support the creation of programs or expressions which reference their own source code specifically, Spell also adds a self\-referential quine form\. The expression \(quine name expr\) binds name to the source form \(quine name expr\) as data before evaluating expr\.
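
The behaviour of the quine form described above can be mimicked in a few lines of Python. This is a sketch of the idea only, assuming S-expressions encoded as nested Python lists with strings as symbols; the `evaluate` function and its `quote` and `list` forms are hypothetical and are not Spell's evaluator.

```python
def evaluate(form, env):
    """Evaluate a tiny S-expression language encoded as nested Python lists."""
    if isinstance(form, str):       # a bare string is a symbol: look it up
        return env[form]
    if not isinstance(form, list):  # numbers and other literals evaluate to themselves
        return form
    head, *args = form
    if head == "quote":             # (quote x) returns x as data, unevaluated
        return args[0]
    if head == "quine":             # (quine name expr): bind name to the whole form, then evaluate expr
        name, expr = args
        return evaluate(expr, {**env, name: form})
    if head == "list":              # (list a b ...) evaluates each argument
        return [evaluate(arg, env) for arg in args]
    raise ValueError(f"unknown form: {head!r}")


# A program that returns a labelled copy of its own source.
program = ["quine", "me", ["list", ["quote", "source:"], "me"]]
result = evaluate(program, {})
print(result[1] is program)
```

Because code and data share one representation, the bound source can be edited with ordinary list operations before being passed on, which is the property the paragraph above attributes to Lisp syntax.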

Natural language can be embedded in code, and a Spell program will often define long string literals containing the model’s visible reasoning or planning\. The motivation for making reasoning traces a part of the program is that it lets the model control whether they are retained as context on subsequent turns\.

Challenge 2: replaying effects\. A second problem with the example above appears on turn 2\. The turn 2 program contains the turn 1 program as a prefix; evaluating it replays the self\-call, producing an inescapable loop, and also replays the IO operation\. Spell resolves this problem using a trailing\-expression pattern\. The body of a Spell program performs ordinary local computation but cannot directly trigger effects, including self\-calls\. Instead, it computes an effectful expression as data\. A wrapper around the program body \(see below\) evaluates only the last such expression, the *trailing expression*, and this special evaluator has access to effect functions\. When an old trailing expression is followed by newly appended source code, it is no longer the trailing expression; it is reconstructed as inert data but not evaluated\. This makes it safe to replay prior source as the prefix of a later turn\.
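
The replay\-safety argument can be sketched in Python under simplifying assumptions: a program body is reduced to a list of effect expressions held as tuples, `read_file` is a hypothetical logging stub, and `run` plays the role of the wrapper. Real Spell bodies also contain local computation, which this sketch leaves out.

```python
from typing import Callable

effects_log: list = []


def read_file(path: str) -> str:
    """Hypothetical effect function; logs every invocation so replays would be visible."""
    effects_log.append(f"read_file({path})")
    return f"<contents of {path}>"


EFFECTS: dict = {"read_file": read_file}


def run(body: list) -> object:
    """Execute only the trailing expression; earlier expressions remain inert tuples."""
    inert, trailing = body[:-1], body[-1]
    # `inert` is carried along as data and never dispatched to an effect function.
    name, *args = trailing
    effect: Callable = EFFECTS[name]
    return effect(*args)


turn1 = [("read_file", "README.md")]
run(turn1)

# Turn 2 contains turn 1 as a prefix and appends a new trailing expression.
turn2 = turn1 + [("read_file", "LICENSE")]
run(turn2)

print(effects_log)
```

After both turns the log holds one entry per file: the README read from turn 1 is still present in the turn 2 source, but it is no longer trailing and so is not executed again.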

Challenge 3: turn\-boundary interference\. A third problem is more subtle\. The turn 2 program is evaluated inside of, not subsequent to, the turn 1 program\. If the inner program inherited an opaque runtime environment from the outer program, or could mutate that environment, then the model could not reliably reason about its behavior; for example, the inner program could overwrite a binding defined by the outer program\. Spell therefore evaluates self\-calls in fresh local environments, such that a child program neither inherits nor mutates the environment of its parent\.
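
The isolation property can be shown with Python namespaces. This is only an analogy, assuming Python's `exec` with a fresh dictionary as a stand\-in for a fresh Spell environment; `eval_child` is a hypothetical helper.

```python
def eval_child(source: str) -> dict:
    """Run child source in a brand-new namespace and return the bindings it created."""
    child_env: dict = {}
    exec(source, {"__builtins__": {}}, child_env)
    return child_env


parent_env = {"plan": "read README first"}

# The child rebinds `plan`, but only inside its own namespace.
child_env = eval_child("plan = 'rewritten by child'")

# The child also cannot read the parent's binding: the name is simply undefined.
try:
    eval_child("copied = plan")
    child_saw_parent = True
except NameError:
    child_saw_parent = False

print(parent_env["plan"], child_env["plan"], child_saw_parent)
```

Anything the child needs must therefore arrive explicitly, which matches the earlier statement that context passes through the turn boundary as the argument of the self\-call.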

The Spell wrapper\. A normal Spell program has the following structure, which combines self\-reference with replay\-safe effects:

    (quine completion
      (eval
        (do
          ...
          '(effectful-expression))))

The outer quine form binds the symbol completion to the source code of the entire program\. The inner do block returns the value of its last expression, which is the quoted trailing expression\. Because it is quoted, this expression is not evaluated until it is passed to eval, at which time effect functions, including LM self\-calls, are made available\. If the model appends an expression after the current quoted trailing expression, the old expression is no longer passed to eval and becomes inert data\. The trailing expression can make use of bindings defined upstream, including the completion binding\.

An iteration of a ReAct\-style loop is replicated in Spell by writing a trailing expression which makes a tool call, concatenates the result with completion inside of the do block, and makes a self\-call with this prefix \(Spell provides a convenience function for this\)\. This pattern is easily generalized\. Instead of retaining its entire completion, the model can choose which items to include and which to prune \(see Appendix [B\.6](https://arxiv.org/html/2605.06898#A2.SS6)\)\. Tool calls can be chained together, packaged into reusable functions, and composed with arbitrary control flow\. Tool calls can also be composed with self\-calls; for example, an agent could list files in a directory and assign each file to a subagent\.
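
The choice between retaining the whole completion and pruning it can be sketched as a small Python helper. The function `build_next_prefix`, the `keep` predicate, and the item strings are all hypothetical; Spell's own convenience function and pruning features are described in the appendix and are not reproduced here.

```python
from typing import Callable, Sequence


def build_next_prefix(
    items: Sequence[str],
    observation: str,
    keep: Callable[[str], bool] = lambda item: True,
) -> str:
    """Assemble the prefix for the next turn from retained items plus a new observation."""
    retained = [item for item in items if keep(item)]
    return "\n".join(retained + [f"observation := {observation}"])


turn_items = [
    "goal := summarize README.md",
    "scratch := long reasoning about which file to open",
]

# ReAct-style: keep everything and append the tool result.
full = build_next_prefix(turn_items, "README.md has 3 sections")

# Pruned: drop scratch reasoning before the next self-call.
pruned = build_next_prefix(
    turn_items,
    "README.md has 3 sections",
    keep=lambda item: not item.startswith("scratch"),
)

print(pruned)
```

In a fixed harness the `keep` policy would be chosen by the harness author; in SPE the equivalent decision is part of the program the model writes each turn.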

No model has yet been trained to use Spell, so an important feature of the language is its built\-in prompts\. The Spell system prompt explains semantics and gives positive and negative examples \(Appendix [B\.11\.1](https://arxiv.org/html/2605.06898#A2.SS11.SSS1)\)\. Each namespace has a documentation prompt that the model can read via tool calls\. Besides prompting, Spell exposes two surfaces for harness engineering\. First, the LM’s first turn is produced by evaluating an initial program, which defaults to a prompt and a self\-call\. Users may customize this program, although then the initial agent turn is SPE only if that program does not fix a policy which applies to subsequent turns\. Second, the user specifies the effect namespaces with which the model is provided\. Spell functions with possibly\-dangerous effects are packaged into namespaces, like io\-read/ and io\-write/, the availability of which is configurable \(Appendix [B\.14\.2](https://arxiv.org/html/2605.06898#A2.SS14.SSS2)\)\.
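
One way to picture configurable effect namespaces is an allowlist checked at lookup time. The Python sketch below is an assumption about the general shape of such a mechanism, not Spell's configuration format: the `EffectRegistry` class and the function names `io-read/file` and `io-write/file` are invented, and only the namespace names `io-read/` and `io-write/` come from the text.

```python
from typing import Callable


class EffectRegistry:
    """Hypothetical registry that exposes effect functions only from enabled namespaces."""

    def __init__(self, enabled_namespaces: set) -> None:
        self.enabled = set(enabled_namespaces)
        self.functions: dict = {}

    def register(self, qualified_name: str, fn: Callable) -> None:
        self.functions[qualified_name] = fn

    def resolve(self, qualified_name: str) -> Callable:
        # "io-read/file" belongs to the namespace "io-read/".
        namespace = qualified_name.split("/", 1)[0] + "/"
        if namespace not in self.enabled:
            raise PermissionError(f"namespace {namespace} is not enabled")
        return self.functions[qualified_name]


# The user enables read effects only.
registry = EffectRegistry(enabled_namespaces={"io-read/"})
registry.register("io-read/file", lambda path: f"<contents of {path}>")
registry.register("io-write/file", lambda path, text: None)

contents = registry.resolve("io-read/file")("README.md")
try:
    registry.resolve("io-write/file")
    write_allowed = True
except PermissionError:
    write_allowed = False

print(contents, write_allowed)
```

A restriction of this kind limits which actions exist at all; it does not dictate how the model sequences the actions that remain.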

Spell is not the only possible solution to the practical challenges of SPE\. However, any implementation will have to solve the same challenges, and the approach taken by Spell, in particular Lisp and the trailing\-expression pattern, is carefully considered\. Appendix [B\.2](https://arxiv.org/html/2605.06898#A2.SS2) articulates the high\-level principles which motivate the implementation of Spell\.

## 4 Empirical evaluation

Using Spell in practice could be cognitively demanding, particularly for a model not trained to do so: the model must understand SPE conceptually, learn Spell itself in\-context, transfer agentic behaviors to an unfamiliar action space, and additionally solve the assigned task\. I sought to address three questions\. First, can existing models, without SPE\- or Spell\-specific training, use Spell to solve challenging agentic tasks? Second, what kind of Spell programs can these models write, and what programs do they write in practice? Third, to what extent does self\-orchestration with Spell handicap \(or benefit\) existing models on standard benchmarks, versus a traditional harness? These experiments evaluate Spell as a realization of SPE, as opposed to either Spell or SPE independently; there is no natural reference implementation of SPE, and a deliberately naive implementation would not usefully represent the bare concept\. Configuration details for harnesses and models are provided in Appendix [C\.1](https://arxiv.org/html/2605.06898#A3.SS1)\. Protocol details for each benchmark are provided in Appendix [C\.3](https://arxiv.org/html/2605.06898#A3.SS3)\.

Observation 1: strong existing models can use Spell to solve challenging agentic tasks\. I applied five strong models \(GPT\-5\.4, Claude Opus 4\.6, GLM\-5\.1, Kimi\-K2\.6, and Qwen3\.6 Plus\) as Spell agents to two coding benchmarks, Terminal\-Bench 1\.1 and SWE\-bench Lite \(32\-task subset each\) \(Figure [2](https://arxiv.org/html/2605.06898#S4.F2); Appendix [C\.4](https://arxiv.org/html/2605.06898#A3.SS4)\)\. I counted fatal/unrecovered Spell errors, defined as Spell errors after which the model failed to produce an error\-free turn, as well as success rates\. All of the models except for the smallest, Qwen3\.6 Plus, had success rates of at least 10/32\. GPT\-5\.4 was the only model to produce zero fatal errors, and it also had the highest number of successes on both tasks; Opus 4\.6 had the next fewest fatal errors and the next\-largest number of successes, quite similar to GLM\-5\.1 and Kimi\-K2\.6; Qwen3\.6 Plus had the most fatal errors and the fewest successes\.

These results suggest that Spell itself is a viable substrate with which to solve challenging agentic tasks\. Moreover, existing models have the capability to learn Spell in context and generalize agentic behaviors to this action space\. On the other hand, the cognitive demands imposed by SPE and Spell are close enough to capability thresholds that every model except GPT\-5\.4 sometimes fails by producing invalid Spell programs\. Subsequent analyses primarily target GPT\-5\.4\.

![Refer to caption](https://arxiv.org/html/2605.06898v1/x2.png)

Figure 2: Accuracy and fatal Spell\-error rate by model\. Each model was run on a set of 32 Terminal\-Bench 1\.1 and SWE\-bench Lite tasks\. A task was counted as a fatal error if its final turn produced an unrecovered Spell/runtime error\. GPT\-5\.4 and Opus 4\.6 were configured with medium reasoning effort, GLM\-5\.1 and Qwen3\.6 Plus with high effort, and Kimi\-K2\.6 with default effort\.

Observation 2: current models use Spell for context management and programmatic tool calling\. I examined what orchestration policies were expressed in Spell programs produced by GPT\-5\.4\. One important component of orchestration policy is context management, and Spell provides powerful features for this, particularly the \!peek function, which creates ephemeral tool call expressions \(Appendix [B\.6\.2](https://arxiv.org/html/2605.06898#A2.SS6.SSS2)\)\. GPT\-5\.4 used \!peek heavily in both Terminal\-Bench and SWE\-bench Lite, and this had a substantial effect on context utilization; compared with Codex CLI \(see below\), total token count \(input and output\) was about $4\times$ smaller for Spell, although the difference in API cost was less dramatic \(Appendix [C\.6](https://arxiv.org/html/2605.06898#A3.SS6)\)\. Agents also used Spell as a substrate for programmatic tool calling \(Wang et al\., [2024b](https://arxiv.org/html/2605.06898#bib.bib31)\): tool calls were often chained together, conditioned on the success or status of earlier calls, or batched several per turn \(Appendix [C\.8](https://arxiv.org/html/2605.06898#A3.SS8)\)\. By contrast, Spell agents rarely used multi\-agent features or composed self\-calls with control flow\. No Spell agent was observed to use multi\-agent delegation successfully; two attempts were made, both likely harmful \(Appendix [C\.8](https://arxiv.org/html/2605.06898#A3.SS8)\)\. Moreover, instructing the agent to attempt multi\-agent orchestration was ineffective \(Appendix [C\.8](https://arxiv.org/html/2605.06898#A3.SS8)\)\.

Observation 3: orchestration games elicit nontrivial Spell programs\. I sought to distinguish whether the simplicity of observed Spell programs was due to behavior or capability\. I tasked a Spell agent running GPT\-5\.4 \(medium effort\) with orchestrating one of three “orchestration games” designed to mock plausibly useful multi\-agent orchestration motifs \(Appendix [C\.2](https://arxiv.org/html/2605.06898#A3.SS2)\):

- Auction: the orchestrator makes three self\-calls asking for sealed bids and selects the highest bid; this emulates a best\-of\-$K$ motif\.
- Telephone: at each iteration of a loop, a self\-call instance receives a sentence and is asked to rephrase it, passing the rephrased sentence to the next iteration; this emulates a deterministic workflow\.
- Twenty questions: the initial agent chooses a secret, and self\-call instances ask yes\-or\-no questions, accumulating a public transcript, until guessing the answer; this emulates a worker\-checker motif\.
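
The auction game in the list above has a particularly simple control structure, sketched here in Python. The sketch is not taken from the paper's traces: `auction` and `stub_self_call` are hypothetical, the bids are fixed numbers rather than model outputs, and the prompt wording is invented.

```python
from typing import Callable


def auction(self_call: Callable[[str], int], k: int = 3) -> dict:
    """Best-of-k motif: ask k self-call instances for sealed bids and pick the highest."""
    bids = [self_call(f"Bidder {i}: submit one sealed integer bid.") for i in range(1, k + 1)]
    winner = max(range(k), key=lambda i: bids[i])
    return {"bids": bids, "winner": winner + 1, "winning_bid": bids[winner]}


def stub_self_call(prompt: str) -> int:
    """Hypothetical stand-in for an LM self-call; returns a fixed bid per bidder."""
    fixed_bids = {"Bidder 1": 12, "Bidder 2": 30, "Bidder 3": 7}
    return fixed_bids[prompt.split(":", 1)[0]]


outcome = auction(stub_self_call)
print(outcome)
```

Each bidder receives only its own prompt, so the bids are sealed in the sense that no instance sees another instance's answer before the orchestrator compares them.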

GPT\-5\.4 succeeded according to program trace audits in 8/8 auction trials, 4/8 telephone trials, and 7/8 twenty\-questions trials \(Appendix [C\.2](https://arxiv.org/html/2605.06898#A3.SS2)\)\. The following was the trailing expression of a successful telephone implementation \(comments are added\):

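    '(let [relays
           ;; Loop over relays 1..8, threading the current wording through each self-call.
           (loop [k 1 current initial-wording acc []]
             (if (> k 8)
               acc
               ;; Each relay is a self-call that receives the current wording.
               (let [next-wording (!llm-self (relay-program k current))]
                 (recur (+ k 1) next-wording
                        (conj acc {:relay k :wording next-wording})))))
           ;; The final wording is whatever the last relay produced.
           final-wording (:wording (last relays))]
       {:initial initial-wording :relays relays :final final-wording})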

These results are modest but important: Spell supports flexible orchestration policies, and GPT\-5\.4 is capable of implementing them despite not usually choosing to do so\.

Observation 4: Spell is competitive on coding benchmarks, but not uniformly\. I compared Spell against Codex CLI on Terminal\-Bench and SWE\-bench Lite using the same GPT\-5\.4 model at low, medium, and high reasoning effort\. This bar is high: Codex CLI is a mature harness presumably well\-aligned with the model’s post\-training, whereas Spell is a new implementation of an unfamiliar architecture\. Nevertheless, coding benchmark results were encouraging\. On Terminal\-Bench 1\.1 \(Figure [3](https://arxiv.org/html/2605.06898#S4.F3), left\), Codex at high effort achieved the highest accuracy, resolving 43/80 tasks, but at $\sim 2\times$ higher cost than the Spell agent at high effort \(40/80\)\. Codex at low and medium effort had similar accuracy and cost as the Spell agent at high effort; the Spell agent at low and medium effort had lower accuracy and commensurately lower cost\. Adding a coding\-task prompt, which suggests an iterative workflow, improved high\-effort Spell on Terminal\-Bench to 43/80 tasks at $37\.88 total cost \(Appendix [C\.5](https://arxiv.org/html/2605.06898#A3.SS5)\)\. In a comparison between a Spell agent running Claude Opus 4\.6 at medium effort and Claude Code running the same model, both agents solved 42/80 tasks with similar costs \(Appendix [C\.5](https://arxiv.org/html/2605.06898#A3.SS5)\)\. On SWE\-bench Lite, similarly, the Spell agent had lower accuracy but commensurately lower cost compared with Codex\. The Spell agent resolved 171/300 tasks at medium effort, nearly matching Codex \(172/300\) on accuracy at lower total cost; however, high effort benefited the Codex agent \(185/300 resolved\) but not the Spell agent \(171/300\)\.

Although individual datapoints are noisy, these estimates suggest a genuine cost–accuracy tradeoff\. The cost reduction is due at least in part to context management features, which produce large reductions in the number of input tokens \(Appendix [C\.6](https://arxiv.org/html/2605.06898#A3.SS6)\)\. The accuracy reduction is more difficult to attribute; possible contributors include overuse of context pruning, prompting differences, and the cognitive overhead of learning Spell itself\.
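
For readers who prefer percentages, the high\-effort resolved counts quoted under Observation 4 convert as follows. The Python snippet only restates those counts; it does not include costs or the low\- and medium\-effort runs, and the dictionary layout is an arbitrary choice.

```python
# Resolved/total counts quoted in the text for GPT-5.4 at high reasoning effort.
runs = {
    "Terminal-Bench 1.1": {"Codex CLI": (43, 80), "Spell": (40, 80)},
    "SWE-bench Lite": {"Codex CLI": (185, 300), "Spell": (171, 300)},
}


def accuracy(resolved: int, total: int) -> float:
    """Accuracy as a percentage of tasks resolved."""
    return 100.0 * resolved / total


gaps = {}
for benchmark, by_harness in runs.items():
    codex = accuracy(*by_harness["Codex CLI"])
    spell = accuracy(*by_harness["Spell"])
    gaps[benchmark] = round(codex - spell, 2)
    print(f"{benchmark}: Codex CLI {codex:.2f}%, Spell {spell:.2f}%, gap {gaps[benchmark]:.2f} points")
```

On these counts the high\-effort accuracy gap is under five percentage points on both coding benchmarks, which is the size of difference the preceding paragraph weighs against the reduction in cost.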

I additionally evaluated Spell agents on LongBench v2, a long\-context QA benchmark, and AppWorld, a computer\-use benchmark\. The Spell agent \(with GPT\-5\.4 medium effort\) underperformed Codex CLI on LongBench at similar cost \(61\.0% vs\. 67\.5%\) and badly underperformed on AppWorld \(42\.1% vs\. 63\.2%\), at higher cost \(Figure [4](https://arxiv.org/html/2605.06898#S4.F4.5)\)\. Trace inspection suggested assorted failure modes: query formulation and data\-source selection errors, final\-submission mistakes, one refusal on a payment task, possible evaluator mismatches, and mismatch between the real run date and fixture dates \(Appendix [C\.9](https://arxiv.org/html/2605.06898#A3.SS9)\)\. It is not clear what about Spell makes it less competitive on these benchmarks\. One possibility is that its context management features are more appropriate for coding tasks; another is prompting; a third is that GPT\-5\.4 is more extensively trained for such tasks, such that effective behaviors are more deeply ingrained and thus generalize more readily\.

![Refer to caption](https://arxiv.org/html/2605.06898v1/x3.png)Figure 3:Comparison with Codex CLI on coding benchmarks\.Left: Terminal\-Bench 1\.1\. Right: SWE\-bench Lite\. Each point is one full benchmark run with GPT\-5\.4 at low, medium, or high reasoning effort\. For numerical results, see Appendix[C\.5](https://arxiv.org/html/2605.06898#A3.SS5)\.![Refer to caption](https://arxiv.org/html/2605.06898v1/x4.png)

Figure 4:Comparison betweenSpelland Codex CLI across benchmarks\.Labels give the total API cost of each run\. Numerical results and run\-configuration details are reported in Appendix[C\.5](https://arxiv.org/html/2605.06898#A3.SS5)and Appendix[C\.9](https://arxiv.org/html/2605.06898#A3.SS9)\. \* Binomialp<0\.05p<0\.05\. \*\*p<0\.005p<0\.005\.
## 5 Related work

This section reviews recent literature on the design of agentic systems\. Related language design literature is reviewed in Appendix [B\.3](https://arxiv.org/html/2605.06898#A2.SS3), and related computability theory literature is reviewed in Appendix [A\.1](https://arxiv.org/html/2605.06898#A1.SS1)\.

Partial self\-orchestration\. Many self\-orchestrated architectures, motivated by the limitations of the LM’s context window, provide special tools for context management or subagent delegation\. MemGPT creates a memory storage system and provides the agent with special tools to manage memory and context \(Packer et al\., [2023](https://arxiv.org/html/2605.06898#bib.bib6)\)\. A series of papers empowers a model to purge, summarize, and more generally manipulate its own context window, including MemAct \(Zhang et al\., [2025b](https://arxiv.org/html/2605.06898#bib.bib7)\), FoldGRPO \(Sun et al\., [2025](https://arxiv.org/html/2605.06898#bib.bib8)\), and AgentFold \(Ye et al\., [2025](https://arxiv.org/html/2605.06898#bib.bib9)\)\. These papers are precedent for the context management features of Spell\. Another axis of partial self\-orchestration is delegation\. Conductor \(Nielsen et al\., [2026](https://arxiv.org/html/2605.06898#bib.bib60)\) is a language model specifically trained to orchestrate other models on agentic tasks, including via self\-delegation\. RLMs \(Zhang et al\., [2026](https://arxiv.org/html/2605.06898#bib.bib20)\) are especially relevant to SPE because the means of delegation is programmatic\. Delegation also features in commercial systems, particularly coding agents \(Anthropic, [2025](https://arxiv.org/html/2605.06898#bib.bib11); Kimi Team \(Moonshot AI\), [2026](https://arxiv.org/html/2605.06898#bib.bib18)\)\. Whereas these systems add features to an otherwise fixed orchestration policy, SPE simply yields responsibility for this policy to the model\.

Several self\-orchestrating architectures are paired with LMs trained or tuned for the corresponding interface\. FoldGRPO and MemAct train models to manage their own context \(Sun et al\., [2025](https://arxiv.org/html/2605.06898#bib.bib8); Zhang et al\., [2025b](https://arxiv.org/html/2605.06898#bib.bib7)\); RLMs train models for programmatic delegation \(Zhang et al\., [2026](https://arxiv.org/html/2605.06898#bib.bib20)\); and Conductor and Kimi K2\.6 report gains from models or systems specialized for subagent orchestration \(Nielsen et al\., [2026](https://arxiv.org/html/2605.06898#bib.bib60); Kimi Team \(Moonshot AI\), [2026](https://arxiv.org/html/2605.06898#bib.bib18)\)\. These successes suggest that a parallel approach may be viable for Spell \(see Section [6](https://arxiv.org/html/2605.06898#S6)\)\.

Self\-evolving harnesses\. Another nearby line of work attempts to optimize prompts, programs, or scaffolds through self\-improvement\. Voyager \(Wang et al\., [2024a](https://arxiv.org/html/2605.06898#bib.bib26)\) accumulates LM\-authored skills over time; DSPy \(Khattab et al\., [2024](https://arxiv.org/html/2605.06898#bib.bib24)\) optimizes prompts and other LM\-program fragments against a metric; ENCOMPASS \(Li et al\., [2025](https://arxiv.org/html/2605.06898#bib.bib62)\) superimposes tree search over a space of workflows\. STOP \(Zelikman et al\., [2024](https://arxiv.org/html/2605.06898#bib.bib22)\) starts with a program\-improver program and runs the improver on its own source code; ADAS \(Hu et al\., [2025](https://arxiv.org/html/2605.06898#bib.bib25)\) introduces a meta\-agent which updates an archive of harness designs; AFlow \(Zhang et al\., [2025a](https://arxiv.org/html/2605.06898#bib.bib61)\) augments this meta\-agent with tree search\. Several other methods vary the means of self\-evolution or the space of possible designs \(Yin et al\., [2024](https://arxiv.org/html/2605.06898#bib.bib23); Lee et al\., [2026](https://arxiv.org/html/2605.06898#bib.bib63)\)\. These methods share with SPE and Spell the motivation that a fixed orchestration policy is likely suboptimal, but instead of optimizing this logic, SPE makes it a part of the model’s action space\.

Program execution\. Many methods involve the execution of a model\-written program in a general\-purpose programming language\. Examples include PAL and Program\-of\-Thoughts, which are oriented toward numerical calculation, and CodeAct, which uses code as a substrate for tool calling \(Gao et al\., [2023](https://arxiv.org/html/2605.06898#bib.bib27); Chen et al\., [2023](https://arxiv.org/html/2605.06898#bib.bib28); Wang et al\., [2024b](https://arxiv.org/html/2605.06898#bib.bib31)\)\. RLMs also belong to this category \(Zhang et al\., [2026](https://arxiv.org/html/2605.06898#bib.bib20)\)\. SPE differs from these systems by making the model\-written program responsible for orchestration policy\.

## 6 Discussion

This paper shows that a stateless language model can act as an agent, reasoning and acting across multiple turns, with no fixed agent loop or orchestration policy\. Instead, one need only execute the model completion as a program\. This approach is conceptually simple enough to be defined and studied using formal abstractions\. It also presents major challenges in practice, motivating the development of Spell, and experiments show that Spell addresses these challenges well enough to support capable agents on difficult coding benchmarks\.

On the other hand, this paper does not show that SPE supports any specific orchestration strategy which is impossible to express using a traditional harness\. The behavior of any particular SPE program probably could be implemented externally; the difference is not what logic can be expressed, but what entity expresses it\. Moreover, current models primarily use Spell for relatively simple but practically useful orchestration policies: context pruning, programmatic tool batching, and simple agent\-loop emulation\. They show the capability to write more interesting orchestration programs but do not usually choose to do so\.

This behavior could possibly change if an LM were trained to use Spell natively\. Prior work shows that models trained via reinforcement learning to use subagent orchestration or context management tools within an agent loop learn to do so productively \(Sun et al\., [2025](https://arxiv.org/html/2605.06898#bib.bib8); Zhang et al\., [2026](https://arxiv.org/html/2605.06898#bib.bib20); Nielsen et al\., [2026](https://arxiv.org/html/2605.06898#bib.bib60)\); a Spell\-native model might plausibly acquire similar capabilities\. It is unclear how far this would scale: would long\-running tasks accumulate increasing benefit over time? Would a strong Spell\-native model actually discover optimal strategies during training? Would this end\-to\-end approach outperform self\-evolving agent architectures? Despite these uncertainties, there is precedent for end\-to\-end learning to outscale human effort \(Sutton, [2019](https://arxiv.org/html/2605.06898#bib.bib64)\); Spell\-native training may provide the means by which to scale orchestration with computation\.

## References

- M\. Abadi and L\. Lamport \(1991\)The existence of refinement mappings\.Theoretical Computer Science82\(2\),pp\. 253–284\.External Links:[Document](https://dx.doi.org/10.1016/0304-3975%2891%2990224-P)Cited by:[§A\.1](https://arxiv.org/html/2605.06898#A1.SS1.p2.1),[§A\.3](https://arxiv.org/html/2605.06898#A1.SS3.p1.2)\.
- M\. S\. Ager, D\. Biernacki, O\. Danvy, and J\. Midtgaard \(2003\)A functional correspondence between evaluators and abstract machines\.InProceedings of the 5th ACM SIGPLAN International Conference on Principles and Practice of Declarative Programming \(PPDP\),pp\. 8–19\.External Links:[Document](https://dx.doi.org/10.1145/888251.888254)Cited by:[§A\.1](https://arxiv.org/html/2605.06898#A1.SS1.p1.1),[§A\.1](https://arxiv.org/html/2605.06898#A1.SS1.p3.1),[§A\.4](https://arxiv.org/html/2605.06898#A1.SS4.1.p1.1),[§A\.4](https://arxiv.org/html/2605.06898#A1.SS4.p1.1)\.
- Anthropic Applied AI Team \(2025\)Effective context engineering for AI agents\.Note:[https://www\.anthropic\.com/engineering/effective\-context\-engineering\-for\-ai\-agents](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)Engineering blog, published September 29, 2025Cited by:[§1](https://arxiv.org/html/2605.06898#S1.p1.1)\.
- Anthropic \(2025\) Claude Code: subagents and agent teams documentation\. Note: [https://code\.claude\.com/docs/en/sub\-agents](https://code.claude.com/docs/en/sub-agents), [https://code\.claude\.com/docs/en/agent\-teams](https://code.claude.com/docs/en/agent-teams)\. Cited as Anthropic 2025b\. Cited by: [§5](https://arxiv.org/html/2605.06898#S5.p2.1)\.
- W\. Chen, X\. Ma, X\. Wang, and W\. W\. Cohen \(2023\)Program of thoughts prompting: disentangling computation from reasoning for numerical reasoning tasks\.External Links:2211\.12588,[Link](https://arxiv.org/abs/2211.12588)Cited by:[§5](https://arxiv.org/html/2605.06898#S5.p5.1)\.
- M\. Felleisen and D\. P\. Friedman \(1986\) Control operators, the SECD\-machine, and the $\lambda$\-calculus\. In Formal Description of Programming Concepts III: Proceedings of the Third IFIP WG 2\.2 Working Conference, M\. Wirsing \(Ed\.\), pp\. 193–219\. Cited by: [§A\.1](https://arxiv.org/html/2605.06898#A1.SS1.p1.1), [§A\.1](https://arxiv.org/html/2605.06898#A1.SS1.p3.1), [§A\.4](https://arxiv.org/html/2605.06898#A1.SS4.1.p1.1), [§A\.4](https://arxiv.org/html/2605.06898#A1.SS4.p1.1)\.
- L\. Gao, A\. Madaan, S\. Zhou, U\. Alon, P\. Liu, Y\. Yang, J\. Callan, and G\. Neubig \(2023\)PAL: program\-aided language models\.External Links:2211\.10435,[Link](https://arxiv.org/abs/2211.10435)Cited by:[§5](https://arxiv.org/html/2605.06898#S5.p5.1)\.
- K\. Gödel \(1931\) Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I\. Monatshefte für Mathematik und Physik 38, pp\. 173–198\. External Links: [Document](https://dx.doi.org/10.1007/BF01700692) Cited by: [§B\.3](https://arxiv.org/html/2605.06898#A2.SS3.p1.1)\.
- R\. Hickey \(2020\)A history of Clojure\.Proceedings of the ACM on Programming Languages4\(HOPL\)\.External Links:[Document](https://dx.doi.org/10.1145/3386321)Cited by:[§B\.3\.1](https://arxiv.org/html/2605.06898#A2.SS3.SSS1.p1.1),[§3](https://arxiv.org/html/2605.06898#S3.p3.1)\.
- D\. R\. Hofstadter \(1979\) Gödel, Escher, Bach: an eternal golden braid\. Basic Books, New York\. Cited by: [§B\.3](https://arxiv.org/html/2605.06898#A2.SS3.p1.1)\.
- S\. Hong, M\. Zhuge, J\. Chen, X\. Zheng, Y\. Cheng, C\. Zhang, J\. Wang, Z\. Wang, S\. K\. S\. Yau, Z\. Lin, L\. Zhou, C\. Ran, L\. Xiao, C\. Wu, and J\. Schmidhuber \(2024\)MetaGPT: meta programming for a multi\-agent collaborative framework\.InInternational Conference on Learning Representations \(ICLR\),External Links:2308\.00352,[Link](https://arxiv.org/abs/2308.00352)Cited by:[§1](https://arxiv.org/html/2605.06898#S1.p1.1)\.
- D\. Horthy \(2025\)12\-factor agents: principles for building reliable LLM applications\.Note:GitHub repositorySee especially Factor 3, “Own your context window”External Links:[Link](https://github.com/humanlayer/12-factor-agents)Cited by:[§1](https://arxiv.org/html/2605.06898#S1.p1.1)\.
- S\. Hu, C\. Lu, and J\. Clune \(2025\)Automated design of agentic systems\.InInternational Conference on Learning Representations \(ICLR\),External Links:2408\.08435,[Link](https://arxiv.org/abs/2408.08435)Cited by:[§5](https://arxiv.org/html/2605.06898#S5.p4.1)\.
- O\. Khattab, A\. Singhvi, P\. Maheshwari, Z\. Zhang, K\. Santhanam, S\. Vardhamanan, S\. Haq, A\. Sharma, T\. T\. Joshi, H\. Moazam, H\. Miller, M\. Zaharia, and C\. Potts \(2024\)DSPy: compiling declarative language model calls into self\-improving pipelines\.InInternational Conference on Learning Representations \(ICLR\),External Links:2310\.03714,[Link](https://arxiv.org/abs/2310.03714)Cited by:[§5](https://arxiv.org/html/2605.06898#S5.p4.1)\.
- Kimi Team \(Moonshot AI\) \(2026\)Kimi K2\.6: advancing open\-source coding\.Note:Reports Kimi K2\.6 benchmark results and Agent Swarm architectureExternal Links:[Link](https://www.kimi.com/blog/kimi-k2-6)Cited by:[§5](https://arxiv.org/html/2605.06898#S5.p2.1),[§5](https://arxiv.org/html/2605.06898#S5.p3.1)\.
- S\. C\. Kleene \(1952\)Introduction to metamathematics\.North\-Holland,Amsterdam\.Cited by:[§A\.1](https://arxiv.org/html/2605.06898#A1.SS1.p3.1),[§A\.4](https://arxiv.org/html/2605.06898#A1.SS4.1.p1.1),[§B\.3](https://arxiv.org/html/2605.06898#A2.SS3.p1.1)\.
- P\. J\. Landin \(1964\)The mechanical evaluation of expressions\.The Computer Journal6\(4\),pp\. 308–320\.External Links:[Document](https://dx.doi.org/10.1093/comjnl/6.4.308)Cited by:[§A\.1](https://arxiv.org/html/2605.06898#A1.SS1.p1.1)\.
- Y\. Lee, R\. Nair, Q\. Zhang, K\. Lee, O\. Khattab, and C\. Finn \(2026\)Meta\-Harness: end\-to\-end optimization of model harnesses\.External Links:2603\.28052,[Document](https://dx.doi.org/10.48550/arXiv.2603.28052),[Link](https://arxiv.org/abs/2603.28052)Cited by:[§5](https://arxiv.org/html/2605.06898#S5.p4.1)\.
- X\. Leroy and H\. Grall \(2009\)Coinductive big\-step operational semantics\.Information and Computation207\(2\),pp\. 284–304\.External Links:[Document](https://dx.doi.org/10.1016/j.ic.2007.12.004)Cited by:[§A\.4](https://arxiv.org/html/2605.06898#A1.SS4.p5.2)\.
- Z\. Li, A\. Solar\-Lezama, Y\. Yue, and S\. Zheng \(2025\)EnCompass: enhancing agent programming with search over program execution paths\.External Links:2512\.03571,[Link](https://arxiv.org/abs/2512.03571)Cited by:[§5](https://arxiv.org/html/2605.06898#S5.p4.1)\.
- R\. Lopopolo \(2026\)Harness engineering: leveraging Codex in an agent\-first world\.Note:[https://openai\.com/index/harness\-engineering/](https://openai.com/index/harness-engineering/)OpenAI engineering blog, published February 11, 2026Cited by:[§1](https://arxiv.org/html/2605.06898#S1.p1.1)\.
- N\. A\. Lynch and F\. W\. Vaandrager \(1995\)Forward and backward simulations\. Part I: untimed systems\.Information and Computation121\(2\),pp\. 214–233\.External Links:[Document](https://dx.doi.org/10.1006/inco.1995.1134)Cited by:[§A\.1](https://arxiv.org/html/2605.06898#A1.SS1.p2.1),[§A\.3](https://arxiv.org/html/2605.06898#A1.SS3.p1.2)\.
- J\. McCarthy \(1960\)Recursive functions of symbolic expressions and their computation by machine, part I\.Communications of the ACM3\(4\),pp\. 184–195\.External Links:[Document](https://dx.doi.org/10.1145/367177.367199)Cited by:[§B\.3](https://arxiv.org/html/2605.06898#A2.SS3.p1.1),[§3](https://arxiv.org/html/2605.06898#S3.p3.1)\.
- S\. Nielsen, E\. Cetin, P\. Schwendeman, Q\. Sun, J\. Xu, and Y\. Tang \(2026\)Learning to orchestrate agents in natural language with the Conductor\.InInternational Conference on Learning Representations \(ICLR\),External Links:2512\.04388,[Link](https://arxiv.org/abs/2512.04388)Cited by:[§5](https://arxiv.org/html/2605.06898#S5.p2.1),[§5](https://arxiv.org/html/2605.06898#S5.p3.1),[§6](https://arxiv.org/html/2605.06898#S6.p3.1)\.
- C\. Packer, S\. Wooders, K\. Lin, V\. Fang, S\. G\. Patil, I\. Stoica, and J\. E\. Gonzalez \(2023\)MemGPT: towards LLMs as operating systems\.External Links:2310\.08560,[Link](https://arxiv.org/abs/2310.08560)Cited by:[§5](https://arxiv.org/html/2605.06898#S5.p2.1)\.
- G\. D\. Plotkin \(1981\) A structural approach to operational semantics\. Technical Report DAIMI FN\-19, Computer Science Department, Aarhus University\. Cited by: [§A\.1](https://arxiv.org/html/2605.06898#A1.SS1.p1.1)\.
- J\. Schmidhuber \(2007\)Gödel machines: fully self\-referential optimal universal self\-improvers\.InArtificial General Intelligence,B\. Goertzel and C\. Pennachin \(Eds\.\),pp\. 199–226\.External Links:[Document](https://dx.doi.org/10.1007/978-3-540-68677-4%5F7)Cited by:[§B\.3](https://arxiv.org/html/2605.06898#A2.SS3.p4.1)\.
- R\. Segala and N\. A\. Lynch \(1995\)Probabilistic simulations for probabilistic processes\.Nordic Journal of Computing2\(2\),pp\. 250–273\.Cited by:[§A\.1](https://arxiv.org/html/2605.06898#A1.SS1.p2.1),[§A\.3](https://arxiv.org/html/2605.06898#A1.SS3.p1.2)\.
- B\. C\. Smith \(1984\)Reflection and semantics in LISP\.InProceedings of the 11th ACM SIGACT\-SIGPLAN Symposium on Principles of Programming Languages \(POPL\),pp\. 23–35\.External Links:[Document](https://dx.doi.org/10.1145/800017.800513)Cited by:[§B\.3](https://arxiv.org/html/2605.06898#A2.SS3.p1.1),[§B\.3](https://arxiv.org/html/2605.06898#A2.SS3.p4.1)\.
- R\. I\. Soare \(2009\)Turing oracle machines, online computing, and three displacements in computability theory\.Annals of Pure and Applied Logic160\(3\),pp\. 368–399\.External Links:[Document](https://dx.doi.org/10.1016/j.apal.2009.01.008)Cited by:[§A\.1](https://arxiv.org/html/2605.06898#A1.SS1.p2.1),[§A\.2](https://arxiv.org/html/2605.06898#A1.SS2.p1.1)\.
- W\. Sun, M\. Lu, Z\. Ling, K\. Liu, X\. Yao, Y\. Yang, and J\. Chen \(2025\)Scaling long\-horizon LLM agent via context\-folding\.Note:Introduces the FoldGRPO training objectiveExternal Links:2510\.11967,[Link](https://arxiv.org/abs/2510.11967)Cited by:[§5](https://arxiv.org/html/2605.06898#S5.p2.1),[§5](https://arxiv.org/html/2605.06898#S5.p3.1),[§6](https://arxiv.org/html/2605.06898#S6.p3.1)\.
- R\. S\. Sutton \(2019\)The bitter lesson\.Note:[http://www\.incompleteideas\.net/IncIdeas/BitterLesson\.html](http://www.incompleteideas.net/IncIdeas/BitterLesson.html)Online essayCited by:[§6](https://arxiv.org/html/2605.06898#S6.p3.1)\.
- W\. Taha and T\. Sheard \(2000\)MetaML and multi\-stage programming with explicit annotations\.Theoretical Computer Science248\(1–2\),pp\. 211–242\.External Links:[Document](https://dx.doi.org/10.1016/S0304-3975%2800%2900053-0)Cited by:[§B\.3](https://arxiv.org/html/2605.06898#A2.SS3.p3.1)\.
- V\. Trivedy \(2026\)The anatomy of an agent harness\.Note:[https://www\.langchain\.com/blog/the\-anatomy\-of\-an\-agent\-harness](https://www.langchain.com/blog/the-anatomy-of-an-agent-harness)LangChain blog, published March 10, 2026Cited by:[§1](https://arxiv.org/html/2605.06898#S1.p1.1)\.
- A\. M\. Turing \(1939\)Systems of logic based on ordinals\.Proceedings of the London Mathematical Society45\(1\),pp\. 161–228\.External Links:[Document](https://dx.doi.org/10.1112/plms/s2-45.1.161)Cited by:[§A\.1](https://arxiv.org/html/2605.06898#A1.SS1.p2.1),[§A\.2](https://arxiv.org/html/2605.06898#A1.SS2.p1.1)\.
- G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2024a\)Voyager: an open\-ended embodied agent with large language models\.Transactions on Machine Learning Research \(TMLR\)\.Note:First arXiv version 2023; cited in\-text as \(Wang et al\., 2023\)\.External Links:2305\.16291,[Link](https://arxiv.org/abs/2305.16291)Cited by:[§5](https://arxiv.org/html/2605.06898#S5.p4.1)\.
- X\. Wang, Y\. Chen, L\. Yuan, Y\. Zhang, Y\. Li, H\. Peng, and H\. Ji \(2024b\)Executable code actions elicit better LLM agents\.External Links:2402\.01030,[Link](https://arxiv.org/abs/2402.01030)Cited by:[§4](https://arxiv.org/html/2605.06898#S4.p4.1),[§5](https://arxiv.org/html/2605.06898#S5.p5.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations \(ICLR\),External Links:2210\.03629,[Link](https://arxiv.org/abs/2210.03629)Cited by:[Figure 1](https://arxiv.org/html/2605.06898#S1.F1),[§1](https://arxiv.org/html/2605.06898#S1.p5.1)\.
- R\. Ye, Z\. Zhang, K\. Li, H\. Yin, Z\. Tao, Y\. Zhao, L\. Su, L\. Zhang, Z\. Qiao, X\. Wang, P\. Xie, F\. Huang, S\. Chen, J\. Zhou, and Y\. Jiang \(2025\)AgentFold: long\-horizon web agents with proactive context management\.External Links:2510\.24699,[Link](https://arxiv.org/abs/2510.24699)Cited by:[§5](https://arxiv.org/html/2605.06898#S5.p2.1)\.
- X\. Yin, X\. Wang, L\. Pan, X\. Wan, and W\. Y\. Wang \(2024\)Gödel agent: a self\-referential agent framework for recursive self\-improvement\.Note:ACL 2025 version adds Li Lin as coauthor\.External Links:2410\.04444,[Link](https://arxiv.org/abs/2410.04444)Cited by:[§5](https://arxiv.org/html/2605.06898#S5.p4.1)\.
- E\. Zelikman, E\. Lorch, L\. Mackey, and A\. T\. Kalai \(2024\)Self\-taught optimizer \(STOP\): recursively self\-improving code generation\.InConference on Language Modeling \(COLM\),External Links:2310\.02304,[Link](https://arxiv.org/abs/2310.02304)Cited by:[§5](https://arxiv.org/html/2605.06898#S5.p4.1)\.
- A\. L\. Zhang, T\. Kraska, and O\. Khattab \(2026\)Recursive language models\.External Links:2512\.24601,[Link](https://arxiv.org/abs/2512.24601)Cited by:[Figure 1](https://arxiv.org/html/2605.06898#S1.F1),[§1](https://arxiv.org/html/2605.06898#S1.p5.1),[§5](https://arxiv.org/html/2605.06898#S5.p2.1),[§5](https://arxiv.org/html/2605.06898#S5.p3.1),[§5](https://arxiv.org/html/2605.06898#S5.p5.1),[§6](https://arxiv.org/html/2605.06898#S6.p3.1)\.
- J\. Zhang, J\. Xiang, Z\. Yu, F\. Teng, X\. Chen, J\. Chen, M\. Zhuge, X\. Cheng, S\. Hong, J\. Wang, B\. Zheng, B\. Liu, Y\. Luo, and C\. Wu \(2025a\)AFlow: automating agentic workflow generation\.InInternational Conference on Learning Representations \(ICLR\),External Links:2410\.10762,[Link](https://arxiv.org/abs/2410.10762)Cited by:[§5](https://arxiv.org/html/2605.06898#S5.p4.1)\.
- Y\. Zhang, J\. Shu, Y\. Ma, X\. Lin, S\. Wu, and J\. Sang \(2025b\)Memory as action: autonomous context curation for long\-horizon agentic tasks\.External Links:2510\.12635,[Link](https://arxiv.org/abs/2510.12635)Cited by:[§5](https://arxiv.org/html/2605.06898#S5.p2.1),[§5](https://arxiv.org/html/2605.06898#S5.p3.1)\.

## Appendix A Agentic evaluators and self\-programmed execution

### A\.1 Overview

This appendix formalizes sequential language\-model agents at the granularity of model\-call boundaries\. The ambient evaluator is treated as an abstract machine: a structured transition system whose states expose the evaluator\-level control, environment, and continuation rather than hardware\-level execution details \(Landin \[[1964](https://arxiv.org/html/2605.06898#bib.bib46)\], Plotkin \[[1981](https://arxiv.org/html/2605.06898#bib.bib47)\], Felleisen and Friedman \[[1986](https://arxiv.org/html/2605.06898#bib.bib55)\], Ager et al\. \[[2003](https://arxiv.org/html/2605.06898#bib.bib56)\]\)\. The new object isolated here is the *agentic machine*, which records only the prompt exposed at a boundary, the completion returned by the model interface, and the deterministic computation that resumes until the next boundary, halt, or divergence\.

This boundary\-to\-boundary view is related to interactive and oracle computation, but specialized to sequential language\-model systems in which each abstract transition contains exactly one model call \(Turing \[[1939](https://arxiv.org/html/2605.06898#bib.bib49)\], Soare \[[2009](https://arxiv.org/html/2605.06898#bib.bib50)\]\)\. It is also related to simulation and refinement mappings: an embedding must preserve the prompt seen by the model and must commute with every possible returned completion \(Abadi and Lamport \[[1991](https://arxiv.org/html/2605.06898#bib.bib52)\], Lynch and Vaandrager \[[1995](https://arxiv.org/html/2605.06898#bib.bib53)\], Segala and Lynch \[[1995](https://arxiv.org/html/2605.06898#bib.bib54)\]\)\. What is new in this appendix is the use of that boundary\-level embedding to define self\-programmed execution: a state is SPE when one completion from a fixed seed can load an embedded copy of every state of the same machine\.

For a CEK\-style evaluator with quotation, `eval`, and a distinguished model\-call primitive, the seed program `let y = lm q in eval y` produces such an SPE state\. The CEK component is the standard operational evaluator for a call\-by\-value language \(Felleisen and Friedman \[[1986](https://arxiv.org/html/2605.06898#bib.bib55)\], Ager et al\. \[[2003](https://arxiv.org/html/2605.06898#bib.bib56)\]\); first\-class quotation and `eval` are the additional source\-language facilities that make the self\-loading step literal\. The final results show that this seed not only completion\-generates the evaluator itself, but also completion\-generates any agentic machine whose prompt and harness functions are realizable by the underlying evaluator, with computable realizability following from general recursion \(Kleene \[[1952](https://arxiv.org/html/2605.06898#bib.bib38)\]\)\.
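As a rough illustration of the seed program `let y = lm q in eval y`, the following Python sketch uses Python's built\-in `eval` as a stand\-in for the CEK evaluator\. This is an assumption made for illustration, not the paper's evaluator or Spell\. The names `run_seed` and `scripted_lm` are hypothetical; `scripted_lm` is a scripted function standing in for a language model\.

```python
from typing import Callable


def run_seed(lm: Callable[[str], str], q: str) -> object:
    """Toy version of `let y = lm q in eval y`.

    Python's eval is only a stand-in for the evaluator; it is not a
    sandbox and should not be used on untrusted model output.
    """
    y = lm(q)  # one model call at a boundary
    # The completion is evaluated as the next program. It may call `lm`
    # again, or re-enter the seed, so the turn-to-turn policy is whatever
    # the completion says it is.
    env = {"lm": lm, "run_seed": run_seed, "__builtins__": {}}
    return eval(y, env)


def scripted_lm(prompt: str) -> str:
    """Hypothetical scripted 'model' with two canned completions."""
    if prompt == "start":
        # First completion: choose to take another turn via the seed.
        return 'run_seed(lm, "step 2")'
    # Second completion: a plain expression that ends the run.
    return "40 + 2"


result = run_seed(scripted_lm, "start")
print(result)
```

In this toy run the first completion decides to invoke the seed again and the second completion returns a value, so the two\-turn loop exists only because the completions wrote it\.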

This appendix is structured as follows\. Section [A\.2](https://arxiv.org/html/2605.06898#A1.SS2) defines an agentic machine\. Section [A\.3](https://arxiv.org/html/2605.06898#A1.SS3) defines three notions related to equivalence or simulation of agentic machines \(embedding, indistinguishability, and completion\-generation\), and it proves results relating these notions to each other; the most important is Theorem [A\.4](https://arxiv.org/html/2605.06898#A1.Thmdefinition4), which shows that a machine state and its embedded copy are indistinguishable to the model\. Section [A\.4](https://arxiv.org/html/2605.06898#A1.SS4) defines an agentic evaluator, specifically the CEK agentic evaluator, which is the agentic machine in which we locate an SPE state\. Finally, Section [A\.5](https://arxiv.org/html/2605.06898#A1.SS5) proves the main theorem, that the CEK agentic evaluator encodes an SPE state\. The following table maps main\-text definitions to this appendix\.

| Main text item | Description | Appendix A |
| --- | --- | --- |
| Definition [2\.1](https://arxiv.org/html/2605.06898#S2.Thmdefinition1) | Agentic machines | Definition [A\.1](https://arxiv.org/html/2605.06898#A1.Thmdefinition1); model semantics in Section [A\.2](https://arxiv.org/html/2605.06898#A1.SS2) |
| Definition [2\.2](https://arxiv.org/html/2605.06898#S2.Thmdefinition2) | Embeddings of agentic machines | Definitions [A\.2](https://arxiv.org/html/2605.06898#A1.Thmdefinition2) and [A\.3](https://arxiv.org/html/2605.06898#A1.Thmdefinition3); Theorem [A\.4](https://arxiv.org/html/2605.06898#A1.Thmdefinition4) |
| Definition [2\.3](https://arxiv.org/html/2605.06898#S2.Thmdefinition3) | Reachability, completion\-generation, and SPE states | Definitions [A\.7](https://arxiv.org/html/2605.06898#A1.Thmdefinition7) and [A\.15](https://arxiv.org/html/2605.06898#A1.Thmdefinition15) |
| Definition [2\.4](https://arxiv.org/html/2605.06898#S2.Thmdefinition4) | Agentic evaluators and boundary states | Definitions [A\.11](https://arxiv.org/html/2605.06898#A1.Thmdefinition11) and [A\.13](https://arxiv.org/html/2605.06898#A1.Thmdefinition13); Lemma [A\.14](https://arxiv.org/html/2605.06898#A1.Thmdefinition14) |
| $x \xleftarrow{\partial} E$ | First boundary state reached by evaluating $E$ | Definition [A\.13](https://arxiv.org/html/2605.06898#A1.Thmdefinition13); Theorem [A\.17](https://arxiv.org/html/2605.06898#A1.Thmdefinition17) |
| Theorem [2\.5](https://arxiv.org/html/2605.06898#S2.Thmdefinition5) | SPE seed theorem | Theorem [A\.17](https://arxiv.org/html/2605.06898#A1.Thmdefinition17) |
| Corollary [2\.6](https://arxiv.org/html/2605.06898#S2.Thmdefinition6) | Universality for computable/realizable machines | Corollaries [A\.19](https://arxiv.org/html/2605.06898#A1.Thmdefinition19) and [A\.21](https://arxiv.org/html/2605.06898#A1.Thmdefinition21) |

### A\.2 Agentic machines

An agentic machine describes a system that sends a prompt to a model, receives a completion or response, and performs a deterministic transition to a new state that depends on the previous state and the model completion\. This setup is related to oracle\-parameterized machines \(Turing \[[1939](https://arxiv.org/html/2605.06898#bib.bib49)\], Soare \[[2009](https://arxiv.org/html/2605.06898#bib.bib50)\]\)\.

We fix a prompt space $P$ and a completion space $C$\. A model semantics for this interface, when one is needed, is any map

$$M : P \to \Delta(C),$$

where $\Delta(A)$ denotes the set of probability measures on $A$ \(deterministic models are also permitted\)\. The results below depend only on $P$ and $C$; whether $M$ is deterministic or stochastic does not matter\.

###### Definition A\.1 \(Agentic machine\)\.

An *agentic machine* over $(P,C)$ is a triple

$$X = (S, p, h)$$

consisting of a state space $S$, a prompt function

$$p : S \to P,$$

and a harness function

$$h : S \times C \to S \cup \{\mathbf{1}, \uparrow\},$$

where $\mathbf{1}$ is a distinguished halting point and $\uparrow$ is a distinguished divergence point\.

For each model semantics $M : P \to \Delta(C)$, the induced transition kernel

$$T_M^X : S \to \Delta(S \cup \{\mathbf{1}, \uparrow\})$$

is defined by

$$T_M^X(s) := h(s,-)_* M(p(s)),$$

where $h(s,-) : C \to S \cup \{\mathbf{1}, \uparrow\}$ denotes the map $c \mapsto h(s,c)$ and $h(s,-)_*$ denotes pushforward along that map\.

Thus each transition records exactly one model invocation and one returned completion\. The machine data $(S,p,h)$ are deterministic; sampling occurs only when a model semantics $M$ is actually queried at the prompt $p(s)$\.

Elements of $C$ may be raw model responses, full prompt\-response completions, or typed values delivered by an evaluator interface\. A real API may expose bytes or text that must first be parsed, adapted, or rejected; such deterministic adapter behavior is part of the harness $h$\. In SPE the returned source is naturally treated as extending a prefix already resident in the evaluator, so we use the word *completion*\.

The outcome $\uparrow$ records deterministic divergence between one completion and the next boundary or halt\.
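To make Definition A\.1 and the induced kernel concrete, here is a minimal Python sketch for a finite toy machine\. Everything in it is an illustrative assumption: the `AgenticMachine` class, the counter machine, the `toy_model` distribution, and the sentinel strings `HALT` and `DIVERGE`, which stand in for $\mathbf{1}$ and $\uparrow$\. A model semantics with finite support is represented as a dictionary from completions to probabilities\.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, Dict, Hashable

HALT = "HALT"        # stands in for the halting point 1
DIVERGE = "DIVERGE"  # stands in for the divergence point (up arrow)


@dataclass(frozen=True)
class AgenticMachine:
    prompt: Callable[[Hashable], str]             # p : S -> P
    harness: Callable[[Hashable, str], Hashable]  # h : S x C -> S u {1, up}


def transition_kernel(machine, model, s) -> Dict[Hashable, Fraction]:
    """Pushforward of M(p(s)) along c -> h(s, c), for finite-support M."""
    out: Dict[Hashable, Fraction] = {}
    for c, prob in model(machine.prompt(s)).items():
        nxt = machine.harness(s, c)
        # Completions that lead to the same outcome pool their mass.
        out[nxt] = out.get(nxt, Fraction(0)) + prob
    return out


# Toy machine: the state is an integer counter.
counter = AgenticMachine(
    prompt=lambda s: f"count={s}",
    harness=lambda s, c: HALT if c == "stop" else (s + 1 if c == "inc" else s),
)


def toy_model(prompt: str) -> Dict[str, Fraction]:
    """Hypothetical model semantics: a fixed distribution over completions."""
    return {"inc": Fraction(1, 2), "noop": Fraction(1, 4), "stop": Fraction(1, 4)}


kernel = transition_kernel(counter, toy_model, 0)
print(kernel)
```

The machine itself is deterministic; randomness enters only through `toy_model`, mirroring the remark that sampling occurs only when $M$ is queried at $p(s)$\.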

### A\.3 Embeddings and completion generation

###### Definition A\.2 \(Embedding\)\.

Let

$$X = (S, p, h), \qquad X' = (S', p', h')$$

be agentic machines over the same interface\. An *embedding* of $X'$ into $X$ is an injective map

$$e : S' \to S,$$

which we extend by convention to

$$e : S' \cup \{\mathbf{1}, \uparrow\} \to S \cup \{\mathbf{1}, \uparrow\}$$

with $e(\mathbf{1}) = \mathbf{1}$ and $e(\uparrow) = \uparrow$, and which satisfies

$$p(e(s')) = p'(s')$$

for every $s' \in S'$ and

$$h(e(s'), c) = e(h'(s', c))$$

for every $s' \in S'$ and $c \in C$\.

This is an injective, pointwise special case of simulation/refinement ideas from the verification literature \(Abadi and Lamport \[[1991](https://arxiv.org/html/2605.06898#bib.bib52)\], Lynch and Vaandrager \[[1995](https://arxiv.org/html/2605.06898#bib.bib53)\], Segala and Lynch \[[1995](https://arxiv.org/html/2605.06898#bib.bib54)\]\)\. From the model’s perspective an embedding is invisible: the only part of the state that the model sees is the prompt, and the condition $p \circ e = p'$ says that the ambient machine uses exactly the prompt the embedded machine would have used\. Because the post\-completion dynamics also commute with $e$, the model sees the same prompt process in the embedded run as in the original one\. The next definition makes this precise by comparing the prompts and terminal outcomes seen after any fixed finite sequence of future completions\.
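The two embedding conditions can be checked mechanically when the state and completion spaces are finite\. The sketch below does this for a toy pair of machines; the machines, the helper `is_embedding`, and the sentinel strings `HALT` and `DIVERGE` \(standing in for $\mathbf{1}$ and $\uparrow$\) are all hypothetical, and restricting $C$ to three completions is an assumption made so the check terminates\.

```python
from itertools import product

HALT, DIVERGE = "HALT", "DIVERGE"  # stand-ins for the outcomes 1 and up-arrow


def is_embedding(e, inner_states, completions, p, h, p_inner, h_inner):
    """Check injectivity, p(e(s')) = p'(s'), and h(e(s'), c) = e(h'(s', c))."""
    def e_ext(x):
        # Extension of e that fixes the two terminal outcomes.
        return x if x in (HALT, DIVERGE) else e(x)

    injective = len({e(s) for s in inner_states}) == len(inner_states)
    prompts_agree = all(p(e(s)) == p_inner(s) for s in inner_states)
    commutes = all(
        h(e(s), c) == e_ext(h_inner(s, c))
        for s, c in product(inner_states, completions)
    )
    return injective and prompts_agree and commutes


COMPLETIONS = ["flip", "noop", "stop"]

# Inner machine X': the state is a single bit.
def p_inner(s):
    return f"bit={s}"

def h_inner(s, c):
    if c == "stop":
        return HALT
    return 1 - s if c == "flip" else s

# Ambient machine X: tagged copies of the bit plus an unrelated "idle" state.
def p_outer(s):
    return "idle" if s == "idle" else f"bit={s[1]}"

def h_outer(s, c):
    if s == "idle":
        return s
    if c == "stop":
        return HALT
    return ("emb", 1 - s[1]) if c == "flip" else s


good = is_embedding(lambda b: ("emb", b), [0, 1], COMPLETIONS,
                    p_outer, h_outer, p_inner, h_inner)
bad = is_embedding(lambda b: ("emb", 1 - b), [0, 1], COMPLETIONS,
                   p_outer, h_outer, p_inner, h_inner)
print(good, bad)
```

The second map is injective and commutes with the dynamics, but it swaps the prompts, so it fails the condition $p \circ e = p'$\.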

For an agentic machineX=\(S,p,h\)X=\(S,p,h\), a starting states∈Ss\\in S, and a finite completion sequenceσ=c1⋯cn\\sigma=c\_\{1\}\\cdots c\_\{n\}, write

x0X\(s;σ\),…,xnX\(s;σ\)∈S∪\{𝟏,↑\}x\_\{0\}^\{X\}\(s;\\sigma\),\\ldots,x\_\{n\}^\{X\}\(s;\\sigma\)\\in S\\cup\\\{\\mathbf\{1\},\\uparrow\\\}for the unique run prefix withx0X\(s;σ\)=sx\_\{0\}^\{X\}\(s;\\sigma\)=sand, for0≤k<n0\\leq k<n,

xk\+1X\(s;σ\)=\{h\(xkX\(s;σ\),ck\+1\),xkX\(s;σ\)∈S,xkX\(s;σ\),xkX\(s;σ\)∈\{𝟏,↑\}\.x\_\{k\+1\}^\{X\}\(s;\\sigma\)=\\begin\{cases\}h\\bigl\(x\_\{k\}^\{X\}\(s;\\sigma\),c\_\{k\+1\}\\bigr\),&x\_\{k\}^\{X\}\(s;\\sigma\)\\in S,\\\\ x\_\{k\}^\{X\}\(s;\\sigma\),&x\_\{k\}^\{X\}\(s;\\sigma\)\\in\\\{\\mathbf\{1\},\\uparrow\\\}\.\\end\{cases\}Thus halt and divergence are absorbing for purposes of comparing later prompts\. Also write

visX⁡\(x\)=\{p\(x\),x∈S,𝟏,x=𝟏,↑,x=↑\.\\operatorname\{vis\}\_\{X\}\(x\)=\\begin\{cases\}p\(x\),&x\\in S,\\\\ \\mathbf\{1\},&x=\\mathbf\{1\},\\\\ \\uparrow,&x=\\uparrow\.\\end\{cases\}
###### Definition A\.3\(LM\-indistinguishability\)\.

If $(X,s)$ and $(X',s')$ are pointed states over the same interface, we say that they are *LM-indistinguishable*, and write

$$(X,s)\sim(X',s'),$$

if for every finite completion sequence $\sigma=c_1\cdots c_n$ and every prefix length $0\le k\le n$,

$$\operatorname{vis}_X\bigl(x_k^X(s;\sigma)\bigr)=\operatorname{vis}_{X'}\bigl(x_k^{X'}(s';\sigma)\bigr).$$

Equivalently, after any shared finite sequence of completions, the two runs have both halted, have both diverged, or remain at states exposing the same prompt. In particular, $\sim$ is an equivalence relation on pointed states over a fixed interface.

###### Theorem A\.4\(Embeddings are invisible to the LM\)\.

Let $e:S'\to S$ be an embedding of $X'=(S',p',h')$ into $X=(S,p,h)$. Then

$$(X',s')\sim(X,e(s'))$$

for every $s'\in S'$.

###### Proof\.

Extend $e$ to outcomes by $e(\mathbf{1})=\mathbf{1}$ and $e(\uparrow)=\uparrow$. Fix a finite completion sequence $\sigma=c_1\cdots c_n$. Let

$$x'_k=x_k^{X'}(s';\sigma),\qquad x_k=x_k^X(e(s');\sigma)$$

be the two run prefixes. We prove by induction on $k$ that

$$x_k=e(x'_k)$$

for every $0\le k\le n$. The case $k=0$ is immediate. For the induction step, assume the claim holds at some $k<n$. If $x'_k\in\{\mathbf{1},\uparrow\}$, then by the induction hypothesis $x_k$ is the same terminal outcome, and both run prefixes keep that outcome at step $k+1$. If $x'_k\in S'$, then

$$x_{k+1}=h(x_k,c_{k+1})=h(e(x'_k),c_{k+1})=e(h'(x'_k,c_{k+1}))=e(x'_{k+1}),$$

using transition commutation for the embedding.

It remains to compare what the model sees. If $x'_k\in S'$, then

$$\operatorname{vis}_X(x_k)=p(e(x'_k))=p'(x'_k)=\operatorname{vis}_{X'}(x'_k),$$

and if $x'_k\in\{\mathbf{1},\uparrow\}$ the two visible outcomes are equal because $e$ fixes $\mathbf{1}$ and $\uparrow$. Thus the visible run prefixes agree for every $\sigma$ and every $k$, so $(X',s')\sim(X,e(s'))$. ∎
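Definition A.3 and Theorem A.4 can be checked on a small finite example. The sketch below is illustrative only: the two toy machines, the names `p_small`, `h_small`, `p_big`, `h_big`, and the bounded search depth are assumptions made for this example, and a bounded search cannot establish indistinguishability over all finite sequences.

```python
from itertools import product

# Sentinels standing in for the terminal outcomes (halt and divergence).
HALT, DIVERGE = object(), object()

def is_terminal(x):
    return x is HALT or x is DIVERGE

def run_prefix(h, s, sigma):
    """Return [x_0, ..., x_n] from start state s under completions sigma."""
    xs = [s]
    for c in sigma:
        x = xs[-1]
        # Halt and divergence are absorbing.
        xs.append(x if is_terminal(x) else h(x, c))
    return xs

def vis(p, x):
    """What the model sees: a terminal outcome, or the prompt of a state."""
    return x if is_terminal(x) else p(x)

def indistinguishable_up_to(p_a, h_a, s_a, p_b, h_b, s_b, completions, max_len):
    """Bounded check of (A, s_a) ~ (B, s_b) over sequences of length <= max_len."""
    for n in range(max_len + 1):
        for sigma in product(completions, repeat=n):
            xa = run_prefix(h_a, s_a, sigma)
            xb = run_prefix(h_b, s_b, sigma)
            if any(vis(p_a, a) != vis(p_b, b) for a, b in zip(xa, xb)):
                return False
    return True

# Hypothetical target machine X': a counter mod 3 whose prompt shows the count.
def p_small(s):
    return f"count={s}"

def h_small(s, c):
    if c == "stop":
        return HALT
    if c == "spin":
        return DIVERGE
    return (s + 1) % 3

# Hypothetical ambient machine X: states carry an extra label.
def p_big(t):
    return f"count={t[1]}"

def h_big(t, c):
    if c == "stop":
        return HALT
    if c == "spin":
        return DIVERGE
    return (t[0], (t[1] + 1) % 3)

def e(s):
    """Embedding of X' into X: injective, prompt-preserving, commutes with h."""
    return ("inner", s)

COMPLETIONS = ["inc", "stop", "spin"]
same = indistinguishable_up_to(p_small, h_small, 0, p_big, h_big, e(0), COMPLETIONS, 4)
different = indistinguishable_up_to(p_small, h_small, 0, p_big, h_big, e(1), COMPLETIONS, 4)
print(same, different)
```

Here `same` compares a state with its image under the embedding, as in Theorem A.4, while `different` compares a state with the image of another state, whose prompt already differs at step 0.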

###### Example A\.6\(Indistinguishable machines may lack an embedding\)\.

Let $S_1=\{s_1\}$ and $S_2=\{s_2\}$ be disjoint singletons, let $P=S_1\sqcup S_2$, and take any completion space $C$. Define

$$T_1=(S_1\times\{0,1\})\sqcup S_2,\qquad T_2=S_1\sqcup(S_2\times\{0,1\}),$$

and let $X_i=(T_i,p_i,h_i)$ for $i=1,2$, where

$$p_1((s_1,j))=s_1,\quad p_1(s_2)=s_2,\qquad p_2(s_1)=s_1,\quad p_2((s_2,j))=s_2,$$

and both harnesses stutter: $h_i(t,c)=t$ for every $t\in T_i$ and $c\in C$. Then every state of $X_1$ is LM-indistinguishable from some state of $X_2$, and conversely, by choosing a state with the same prompt; all finite completion sequences leave that prompt unchanged.

However, neither machine embeds in the other. An embedding of $X_1$ into $X_2$ would have to send the two distinct states $(s_1,0)$ and $(s_1,1)$ to two distinct states of $X_2$ with prompt $s_1$, but $X_2$ has only one such state. Symmetrically, an embedding of $X_2$ into $X_1$ would require two distinct states of $X_1$ with prompt $s_2$, but $X_1$ has only one.
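The counting argument in Example A.6 can be confirmed by brute force. The following is a minimal sketch, assuming a two-element completion space and string labels `"s1"`, `"s2"` for the two prompts; the helper names `embeddings` and `covered` are introduced here for illustration only.

```python
from itertools import product

C = ["a", "b"]                      # any completion space works; the harnesses ignore it
T1 = [("s1", 0), ("s1", 1), "s2"]   # (S1 x {0,1}) disjoint-union S2
T2 = ["s1", ("s2", 0), ("s2", 1)]   # S1 disjoint-union (S2 x {0,1})

def prompt(t):
    """p_1 and p_2: forget the {0,1} tag, if there is one."""
    return t[0] if isinstance(t, tuple) else t

def stutter(t, c):
    """h_1 and h_2: every completion leaves the state unchanged."""
    return t

def embeddings(src, dst):
    """Enumerate all embeddings src -> dst by exhaustive search."""
    found = []
    for images in product(dst, repeat=len(src)):
        e = dict(zip(src, images))
        injective = len(set(images)) == len(src)
        keeps_prompts = all(prompt(e[t]) == prompt(t) for t in src)
        commutes = all(stutter(e[t], c) == e[stutter(t, c)] for t in src for c in C)
        if injective and keeps_prompts and commutes:
            found.append(e)
    return found

def covered(src, dst):
    """Every state of src has a state of dst exposing the same prompt."""
    return all(any(prompt(t) == prompt(u) for u in dst) for t in src)

print(covered(T1, T2), covered(T2, T1))        # prompt matches exist in both directions
print(embeddings(T1, T2), embeddings(T2, T1))  # no embedding exists in either direction
```

Because both harnesses stutter, two states are LM-indistinguishable exactly when their prompts agree, which is why `covered` suffices for the first claim.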

###### Definition A\.7\(Completion\-generation by a state\)\.

Let $X=(S,p,h)$ be an agentic machine and let $x_0\in S$. Define the one-step state-reachable set

$$\operatorname{Reach}_X(x_0):=\{t\in S:\exists c\in C\text{ such that }h(x_0,c)=t\}.$$

For another agentic machine $X'=(S',p',h')$ over the same interface, we say that the pair $(X,x_0)$ *completion-generates* $X'$ if there exists an embedding $e:S'\to S$ of $X'$ into $X$ such that

$$\operatorname{im}(e)\subseteq\operatorname{Reach}_X(x_0).$$

If there exists $x_0\in S$ such that $(X,x_0)$ completion-generates $X'$, we say that $X$ *completion-generates* $X'$.

The state-only convention in $\operatorname{Reach}_X(x_0)$ is not a separate mathematical assumption. Equivalently, one could take the full one-step image in $S\cup\{\mathbf{1},\uparrow\}$ and then require the embedding image to lie in its state part. We use the state-only form because completion-generation asks whether target states, not terminal outcomes, can be loaded into the ambient state space.

This one-hop requirement says that the embedded state is not merely present somewhere in the ambient machine, but reachable from one fixed state $x_0$ by a single completion-mediated step. That is the formal analogue of loading a state through one completion.

###### Example A\.8\(Completion\-generation need not hold\)\.

Let $P=C=S=\mathbb{F}_2$, and let the prompt map be the identity $p:S\to P$. For each harness $h$ below, consider the agentic machine $X=(S,p,h)$. Then:

1. If $h(a,b)=a+b$, then $(X,x_0)$ completion-generates $X$ for every $x_0\in S$.
2. If $h(a,b)=ab$, then $(X,1)$ completion-generates $X$, but $(X,0)$ does not.
3. If $h(a,b)=a$, then $(X,x_0)$ does not completion-generate $X$ for either $x_0\in S$.

Indeed, in case (1) the image of $h(x_0,-)$ is all of $\mathbb{F}_2$ for every $x_0$; in case (2) the image of $h(1,-)$ is all of $\mathbb{F}_2$ but the image of $h(0,-)$ is $\{0\}$; and in case (3) the image of $h(x_0,-)$ is always $\{x_0\}$. Because $p$ is the identity, any self-embedding of $X$ must be the identity on states. Thus the reachability claims above exactly determine when $X$ completion-generates itself, and case (3) shows that embedding alone does not imply completion-generation.
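The three cases of Example A.8 reduce to computing the image of $h(x_0,-)$ over $\mathbb{F}_2$. A short sketch, with `reach` and `generates_self` as illustrative helper names and arithmetic mod 2 standing in for $\mathbb{F}_2$:

```python
F2 = (0, 1)

# The three harnesses from the example, with arithmetic taken mod 2.
HARNESSES = {
    "h(a,b)=a+b": lambda a, b: (a + b) % 2,
    "h(a,b)=ab": lambda a, b: (a * b) % 2,
    "h(a,b)=a": lambda a, b: a,
}

def reach(h, x0):
    """One-step reachable set: the image of h(x0, -)."""
    return {h(x0, c) for c in F2}

def generates_self(h, x0):
    # With the identity prompt map, the only self-embedding is the identity,
    # so (X, x0) completion-generates X exactly when Reach(x0) is all of F2.
    return reach(h, x0) == set(F2)

table = {name: [generates_self(h, x0) for x0 in F2] for name, h in HARNESSES.items()}
for name, row in table.items():
    print(name, dict(zip(F2, row)))
```

Each row lists, for $x_0=0$ and $x_0=1$, whether $(X,x_0)$ completion-generates $X$.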

###### Proposition A\.9\(Downward closure under embedded targets\)\.

Let

$$X=(S_X,p_X,h_X),\qquad Y=(S_Y,p_Y,h_Y),\qquad Z=(S_Z,p_Z,h_Z)$$

be agentic machines over the same interface, and let $x_0\in S_X$. If $(X,x_0)$ completion-generates $Y$ and there exists an embedding $j:S_Z\to S_Y$ of $Z$ into $Y$, then $(X,x_0)$ completion-generates $Z$.

###### Proof\.

Let $e:S_Y\to S_X$ witness that $(X,x_0)$ completion-generates $Y$. Then $e\circ j:S_Z\to S_X$ is an embedding of $Z$ into $X$, since injectivity, prompt preservation, and transition commutation are each preserved under composition, and

$$\operatorname{im}(e\circ j)\subseteq\operatorname{im}(e)\subseteq\operatorname{Reach}_X(x_0).$$

Thus $(X,x_0)$ completion-generates $Z$. ∎

The next result changes the ambient machine rather than the target machine\. It is therefore a separate source\-invariance statement, not merely another instance of downward closure\.

###### Proposition A\.10\(Completion\-generation is invariant under embeddings\)\.

Let

$$X=(S_X,p_X,h_X),\qquad Y=(S_Y,p_Y,h_Y),\qquad Z=(S_Z,p_Z,h_Z)$$

be agentic machines over the same interface, let

$$e:S_Y\to S_X$$

be an embedding of $Y$ into $X$, and let $y_0\in S_Y$ with

$$x_0=e(y_0).$$

Then

$$(X,x_0)\text{ completion-generates }Z\quad\Longleftrightarrow\quad(Y,y_0)\text{ completion-generates }Z.$$

###### Proof\.

First assume that $(Y,y_0)$ completion-generates $Z$, witnessed by an embedding

$$f:S_Z\to S_Y$$

with $\operatorname{im}(f)\subseteq\operatorname{Reach}_Y(y_0)$. Then $e\circ f:S_Z\to S_X$ is an embedding of $Z$ into $X$. For each $z\in S_Z$, choose $c\in C$ with $f(z)=h_Y(y_0,c)$; then

$$(e\circ f)(z)=e(h_Y(y_0,c))=h_X(e(y_0),c)=h_X(x_0,c).$$

Thus $\operatorname{im}(e\circ f)\subseteq\operatorname{Reach}_X(x_0)$, so $(X,x_0)$ completion-generates $Z$.

Conversely, assume that $(X,x_0)$ completion-generates $Z$, witnessed by an embedding

$$g:S_Z\to S_X$$

with $\operatorname{im}(g)\subseteq\operatorname{Reach}_X(x_0)$. Since $x_0=e(y_0)$ and $e$ is an embedding,

$$h_X(x_0,c)=h_X(e(y_0),c)=e(h_Y(y_0,c))\qquad(c\in C).$$

Therefore every state in $\operatorname{Reach}_X(x_0)$ lies in $\operatorname{im}(e)$, and hence $\operatorname{im}(g)\subseteq\operatorname{im}(e)$. Because $e$ is injective, the map

$$\tilde{g}:=e^{-1}\circ g:S_Z\to S_Y$$

is well defined. Extend $\tilde{g}$ to $S_Z\cup\{\mathbf{1},\uparrow\}$ by $\tilde{g}(\mathbf{1})=\mathbf{1}$ and $\tilde{g}(\uparrow)=\uparrow$.

For prompts, if $z\in S_Z$ then

$$p_Y(\tilde{g}(z))=p_X(e(\tilde{g}(z)))=p_X(g(z))=p_Z(z).$$

For transitions, if $z\in S_Z$ and $c\in C$ then

$$e(h_Y(\tilde{g}(z),c))=h_X(e(\tilde{g}(z)),c)=h_X(g(z),c)=g(h_Z(z,c))=e(\tilde{g}(h_Z(z,c))).$$

Since $e$ is injective, this implies

$$h_Y(\tilde{g}(z),c)=\tilde{g}(h_Z(z,c)).$$

Thus $\tilde{g}$ is an embedding of $Z$ into $Y$. Finally, if $g(z)=h_X(x_0,c)$ then

$$\tilde{g}(z)=e^{-1}(g(z))=e^{-1}(h_X(e(y_0),c))=h_Y(y_0,c),$$

so $\operatorname{im}(\tilde{g})\subseteq\operatorname{Reach}_Y(y_0)$. Hence $(Y,y_0)$ completion-generates $Z$. ∎

### A.4 Agentic evaluators

The evaluator used below is an ordinary call-by-value CEK evaluator, together with source-language facilities for first-class quotation and $\texttt{eval}$. The classical CEK part provides states with a control component, environment, and continuation, and it runs by a deterministic transition relation Felleisen and Friedman \[[1986](https://arxiv.org/html/2605.06898#bib.bib55)\], Ager et al. \[[2003](https://arxiv.org/html/2605.06898#bib.bib56)\]. The quotation and $\texttt{eval}$ assumptions are object-language assumptions used for self-programming; they are not additional claims about the original CEK papers.

Fix a call-by-value language $\mathcal{L}$ with branching, general recursion, and a first-class value domain $V$. Assume $V$ contains a subset $Q\subseteq V$ of quotations of closed source expressions and a built-in form $\texttt{eval}$: if $q=\ulcorner E\urcorner\in Q$ quotes a closed expression $E$, then evaluating $\texttt{eval}\,q$ continues as evaluation of $E$; if $v\in V\setminus Q$, then evaluating $\texttt{eval}\,v$ halts at the distinguished terminal point $\mathbf{1}$. Also assume a value-quotation map

$$\operatorname{quote}:V\to Q$$

such that, for every $v\in V$, the quotation $\operatorname{quote}(v)$ denotes a closed value expression for $v$; hence evaluating $\texttt{eval}\,\operatorname{quote}(v)$ deterministically yields $v$ without invoking the model.

###### Definition A\.11\(CEK evaluator\)\.

For such a language $\mathcal{L}$, a *CEK evaluator* has a state space $\Sigma_{\mathrm{CEK}}^{\mathcal{L}}$ whose states are triples

$$(e,\rho,\kappa)$$

of control component, environment, and continuation. The control component $e$ may be either a source expression or a returned first-class value. The evaluator steps by a partial deterministic transition function

$$T_{\mathrm{CEK}}^{\mathcal{L}}:\Sigma_{\mathrm{CEK}}^{\mathcal{L}}\rightharpoonup\Sigma_{\mathrm{CEK}}^{\mathcal{L}}\cup\{\mathbf{1}\},$$

where all finite terminal, error, and stuck outcomes are identified with the distinguished point $\mathbf{1}$.

###### Theorem A\.12\(Computational completeness of the CEK evaluator\)\.

Under standard finite encodings, the CEK evaluator for $\mathcal{L}$ computes every partial computable function.

###### Proof\.

A call-by-value language with branching and general recursion can represent partial computable functions under standard finite encodings Kleene \[[1952](https://arxiv.org/html/2605.06898#bib.bib38)\]. The CEK machine is an operational evaluator for such a language: it makes the evaluation steps explicit while preserving the corresponding call-by-value evaluator Felleisen and Friedman \[[1986](https://arxiv.org/html/2605.06898#bib.bib55)\], Ager et al. \[[2003](https://arxiv.org/html/2605.06898#bib.bib56)\]. Therefore, for every partial computable function, there is a closed program of $\mathcal{L}$ whose CEK evaluation computes it, halting with the encoded output when the function is defined and otherwise diverging. ∎

The agentic evaluator observes the CEK machine only at model-call boundaries. Its states are the source-addressable CEK states whose next action is to call the primitive $\operatorname{lm}$ on an already evaluated prompt value. Resuming such a state supplies one completion value, after which execution is ordinary deterministic CEK evaluation until the next model-call boundary, halt, or divergence.

###### Definition A\.13\(CEK agentic evaluator\)\.

Fix a CEK evaluator as in Definition [A.11](https://arxiv.org/html/2605.06898#A1.Thmdefinition11), and fix an interface $(P,C)$ with

$$P\subseteq V,\qquad Q\subseteq C\subseteq V.$$

Add a unary primitive $\operatorname{lm}$ to $\mathcal{L}$, yielding an extended language $\mathcal{L}^{\operatorname{lm}}$ and CEK state space

$$\Sigma:=\Sigma_{\mathrm{CEK}}^{\mathcal{L}^{\operatorname{lm}}}.$$

Quotations and $\texttt{eval}$ extend to closed expressions of $\mathcal{L}^{\operatorname{lm}}$: if $E$ is closed, then $\ulcorner E\urcorner\in Q\subseteq C$ and evaluating $\texttt{eval}\,\ulcorner E\urcorner$ continues as evaluation of $E$; evaluating $\texttt{eval}$ on a value outside $Q$ halts at $\mathbf{1}$.

Let

$$B_{\mathrm{raw}}:=\{(\operatorname{lm}q,\rho,\kappa)\in\Sigma:q\in P\}.$$

For a closed term $E$ of $\mathcal{L}^{\operatorname{lm}}$ or a CEK state $\sigma\in\Sigma$, and a raw boundary state $\beta\in B_{\mathrm{raw}}$, write

$$\beta\xleftarrow{\partial}E\qquad\text{or}\qquad\beta\xleftarrow{\partial}\sigma$$

to mean that deterministic evaluation from that term or state first reaches the boundary $\beta$. The state space $B\subseteq B_{\mathrm{raw}}$ consists of the source-addressable boundary states: those $\beta\in B_{\mathrm{raw}}$ for which $\beta\xleftarrow{\partial}E$ for some closed term $E$.

For a boundary state $\beta=(\operatorname{lm}q,\rho,\kappa)\in B$, define

$$p(\beta):=q,\qquad r(\beta,c):=(c,\rho,\kappa).$$

Define

$$h:B\times C\to B\cup\{\mathbf{1},\uparrow\}$$

as follows. If the deterministic CEK run from $r(\beta,c)$ first reaches the terminal point $\mathbf{1}$, or another finite stuck/error state, set $h(\beta,c)=\mathbf{1}$. If the run is infinite and never reaches $B_{\mathrm{raw}}\cup\{\mathbf{1}\}$, set $h(\beta,c)=\uparrow$. Otherwise, set

$$h(\beta,c)\xleftarrow{\partial}r(\beta,c).$$

The boundary reached in the last case is source-addressable by Lemma [A.14](https://arxiv.org/html/2605.06898#A1.Thmdefinition14) below, so $h$ is well typed. The associated agentic machine is the *CEK agentic evaluator*

$$X_{\mathrm{CEK}}:=(B,p,h).$$

###### Lemma A\.14\(Source\-addressability is stable under resumption\)\.

Let $\beta\in B$ and $c\in C$. If the deterministic CEK run from $r(\beta,c)$ first reaches a raw boundary $\beta'\in B_{\mathrm{raw}}$, then $\beta'\in B$.

###### Proof\.

Write $\beta=(\operatorname{lm}q,\rho,\kappa)$. Since $\beta$ is source-addressable, it is reached from a closed term, so the environment and continuation at $\beta$ represent a closed evaluation context $K_\beta[-]$ in the standard CEK/context correspondence. Let

$$F_c:=\texttt{eval}\,\operatorname{quote}(c).$$

The expression $F_c$ is closed and evaluates deterministically to the value $c$ without invoking $\operatorname{lm}$. Therefore deterministic evaluation of the closed term $K_\beta[F_c]$ agrees, after this initial value computation, with the CEK run from $r(\beta,c)=(c,\rho,\kappa)$. If that resumed run first reaches $\beta'$, then $\beta'$ is the first model-call boundary reached by the closed term $K_\beta[F_c]$. Hence $\beta'$ is source-addressable by definition, so $\beta'\in B$. ∎

The source-addressability restriction ensures that every state in $B$ can be loaded by evaluating a closed quotation, and Lemma [A.14](https://arxiv.org/html/2605.06898#A1.Thmdefinition14) ensures that resuming a source-addressable boundary with a fixed completion cannot leave the state space of the agentic evaluator except by halt or divergence.

Thus the agentic evaluator contributes exactly one externally controlled step,

$$(\operatorname{lm}q,\rho,\kappa)\longmapsto(c,\rho,\kappa),$$

after which all further computation is ordinary deterministic CEK evaluation until the next boundary, halt, or divergence Leroy and Grall \[[2009](https://arxiv.org/html/2605.06898#bib.bib57)\]. The word “evaluator” emphasizes that this object is not itself a hand-written agent loop; it is the boundary-to-boundary behavior of a generic program evaluator around model calls.
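The boundary-to-boundary view can be sketched without writing a full CEK machine. The example below is only an analogy under stated assumptions: a Python closure stands in for the environment/continuation pair $(\rho,\kappa)$, the class name `Boundary` and the program `ask_twice` are invented for this illustration, and divergence is not modeled (it would simply be a `resume` call that never returns).

```python
from dataclasses import dataclass
from typing import Callable, Union

HALT = "halt"  # stands in for the terminal point

@dataclass(frozen=True)
class Boundary:
    """A paused model call: the prompt q plus a closure standing in for (rho, kappa)."""
    prompt: str
    resume: Callable[[str], Union["Boundary", str]]

def p(beta):
    """Prompt map: expose only the pending prompt."""
    return beta.prompt

def h(beta, c):
    """One externally supplied completion, then deterministic code runs
    until the next boundary or halt."""
    return beta.resume(c)

def ask_twice():
    """A hypothetical program that makes two model calls and then halts."""
    def after_first(c1):
        def after_second(c2):
            return HALT
        return Boundary(f"You said {c1!r}. Anything else?", after_second)
    return Boundary("Say something.", after_first)

b0 = ask_twice()
b1 = h(b0, "hi")
print(p(b0))
print(p(b1))
print(h(b1, "no"))
```

The harness `h` contains no orchestration policy of its own: what happens between two model calls is determined entirely by the program that was paused.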

### A.5 SPE states and universality

Throughout this section assume that the prompt space is nonempty\.

###### Definition A\.15\(SPE state\)\.

Let $X=(S,p,h)$ be an agentic machine. A state $x\in S$ is an *SPE state of $X$* if $(X,x)$ completion-generates $X$ itself. More generally, for a class $\mathcal{A}$ of agentic machines over the same interface, $x$ is *universal for $\mathcal{A}$* if $(X,x)$ completion-generates every $X'\in\mathcal{A}$.

###### Proposition A\.16\(SPE states under embeddings\)\.

Let $X=(S_X,p_X,h_X)$ and $Y=(S_Y,p_Y,h_Y)$ be agentic machines over the same interface, let

$$e:S_X\to S_Y$$

be an embedding of $X$ into $Y$, and let $x\in S_X$. If $y=e(x)$ is an SPE state of $Y$, then $x$ is an SPE state of $X$.

###### Proof\.

Since $(Y,y)$ completion-generates $Y$ and $e$ embeds $X$ into $Y$, Proposition [A.9](https://arxiv.org/html/2605.06898#A1.Thmdefinition9) implies that $(Y,y)$ completion-generates $X$; Proposition [A.10](https://arxiv.org/html/2605.06898#A1.Thmdefinition10) then implies that $(X,x)$ completion-generates $X$. ∎

The intrinsic SPE property is the first clause: the model can, by choosing a completion, reach an embedded copy of any state of the same machine\. The universal clause is useful for comparing a single evaluator state to other agentic machines\.

###### Theorem A\.17\.

The CEK agentic evaluator contains an SPE state\.

###### Proof\.

Fix a prompt value $q_\ast\in P$, choose a closed value expression $Q_\ast$ denoting it (for example, $\texttt{eval}\,\operatorname{quote}(q_\ast)$), and let $x_0\in B$ be defined by

$$x_0\xleftarrow{\partial}\mathsf{Seed}_\ast,\qquad\mathsf{Seed}_\ast:=\texttt{let }y=\operatorname{lm}\,Q_\ast\texttt{ in eval }y.$$

This boundary exists because $Q_\ast$ evaluates to a prompt value and the next control expression is a call to $\operatorname{lm}$.

We show that

$$\operatorname{Reach}_{X_{\mathrm{CEK}}}(x_0)=B.$$

The inclusion $\operatorname{Reach}_{X_{\mathrm{CEK}}}(x_0)\subseteq B$ holds by definition of $\operatorname{Reach}$ for the agentic machine $X_{\mathrm{CEK}}=(B,p,h)$. For the reverse inclusion, let $\beta\in B$ be arbitrary. Since $B$ consists of source-addressable boundary states, choose a closed term $E_\beta$ with $\beta\xleftarrow{\partial}E_\beta$, and set

$$c_\beta:=\ulcorner E_\beta\urcorner\in Q\subseteq C.$$

When $x_0$ is resumed with completion $c_\beta$, the unique external return step binds $y$ to $c_\beta$ in the continuation of $\mathsf{Seed}_\ast$. The evaluator then runs $\texttt{eval}\,c_\beta$, which continues as deterministic evaluation of $E_\beta$. By the choice of $E_\beta$, the first boundary reached is $\beta$, so

$$h(x_0,c_\beta)=\beta.$$

Hence $B\subseteq\operatorname{Reach}_{X_{\mathrm{CEK}}}(x_0)$.

The identity map on $B$ is an embedding of $X_{\mathrm{CEK}}$ into itself, and the equality above gives

$$\operatorname{im}(\mathrm{id}_B)=B\subseteq\operatorname{Reach}_{X_{\mathrm{CEK}}}(x_0).$$

Thus $(X_{\mathrm{CEK}},x_0)$ completion-generates $X_{\mathrm{CEK}}$. Therefore $x_0$ is an SPE state of $X_{\mathrm{CEK}}$ by Definition [A.15](https://arxiv.org/html/2605.06898#A1.Thmdefinition15). ∎

The proof uses the identity embedding, so the exhibited seed state reaches all of $B$. That is stronger than the definition requires. An SPE state only needs to reach the image of some self-embedding $e:B\to B$; if the evaluator admits a proper self-embedded copy of itself, an SPE state can have $\operatorname{Reach}(x)$ strictly smaller than $B$ while still completion-generating the whole evaluator.
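The seed construction in the proof of Theorem A.17 can be mimicked in a few lines. This is a sketch under loud assumptions: zero-argument Python callables stand in for quotations of closed terms, a closure stands in for the CEK continuation, and the names `Boundary`, `seed`, and `some_program` are hypothetical. It is not the Spell runtime.

```python
from dataclasses import dataclass
from typing import Any, Callable

HALT = "halt"  # stands in for the terminal point

@dataclass(frozen=True)
class Boundary:
    """A paused model call: a prompt plus a closure for the rest of the program."""
    prompt: str
    resume: Callable[[Any], Any]

def h(beta, c):
    return beta.resume(c)

def seed(q_star):
    """Analogue of: let y = lm Q* in eval y."""
    def eval_completion(y):
        # "eval" on a quotation (modeled as a thunk) runs it to its first
        # boundary; "eval" on any other value halts.
        return y() if callable(y) else HALT
    return Boundary(q_star, eval_completion)

def some_program():
    """A hypothetical closed term whose first boundary asks for a plan."""
    return Boundary("Write a plan.", lambda c: HALT)

x0 = seed("Write your own orchestrator.")
loaded = h(x0, some_program)                 # the completion loads an arbitrary boundary
reloaded = h(x0, lambda: seed("Again."))     # including a fresh copy of the seed
print(loaded.prompt, "|", reloaded.prompt, "|", h(x0, "plain text"))
```

Whatever boundary a closed term first reaches, the seed reaches it in one step when given that term as its completion, which is the content of $\operatorname{Reach}(x_0)=B$ in this toy setting.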

###### Definition A\.18\(Realizable agentic machine\)\.

Let $X'=(S',p',h')$ be an agentic machine over $(P,C)$. We say that $X'$ is *realizable in the underlying CEK evaluator* if there exist:

- an injective encoding $\operatorname{enc}:S'\to V$,
- a pure deterministic program $\mathsf{Prompt}(z)$ such that evaluating $\mathsf{Prompt}(z)$ with $z=\operatorname{enc}(s')$ yields $p'(s')$,
- a first-class result type $R$ with constructors $\mathsf{next}:V\to R$ and $\mathsf{halt}\in R$, together with deterministic case analysis on $R$,
- a pure deterministic program $\mathsf{Step}(z,c)$ such that, when $z=\operatorname{enc}(s')$, it yields $\mathsf{next}(\operatorname{enc}(t'))$ if $h'(s',c)=t'\in S'$, it yields $\mathsf{halt}$ if $h'(s',c)=\mathbf{1}$, and it diverges without invoking $\operatorname{lm}$ if $h'(s',c)=\uparrow$.

In other words, each target function is realized by pure CEK code on an encoded state. The only model call in the simulated loop is the single call to $\operatorname{lm}$ made by the ambient agentic evaluator. The tagged result type $R$ is only a convenient first-class way to distinguish “continue with this encoded state” from “halt”; any equivalent tagging convention would do.

###### Corollary A\.19\(Computable machines are realizable\)\.

Under standard finite encodings, an agentic machine $X'=(S',p',h')$ is realizable in the underlying CEK evaluator if $p'$ is computable and $h'$ is computable as a partial procedure that returns encoded next states or halt and diverges exactly on inputs for which $h'(s',c)=\uparrow$.

###### Proof\.

By Theorem [A.12](https://arxiv.org/html/2605.06898#A1.Thmdefinition12), the CEK evaluator has programs computing the encoded prompt function and the encoded harness procedure. Use those programs as $\mathsf{Prompt}$ and $\mathsf{Step}$ in Definition [A.18](https://arxiv.org/html/2605.06898#A1.Thmdefinition18), with the result type $R$ tagging next-state and halt outputs. ∎

###### Proposition A\.20\(Embedding CEK\-realizable machines\)\.

Every agentic machine $X'$ over $(P,C)$ that is realizable in the underlying CEK evaluator embeds into $X_{\mathrm{CEK}}$.

###### Proof\.

Fix an arbitrary realizable agentic machine $X'=(S',p',h')$. If $S'=\varnothing$, the empty map is an embedding, so assume $S'\neq\varnothing$. Fix a realization $\operatorname{enc}$, $\mathsf{Prompt}$, and $\mathsf{Step}$ as in Definition [A.18](https://arxiv.org/html/2605.06898#A1.Thmdefinition18).

Define a tail-recursive source-level wrapper $\mathsf{Loop}(z)$ schematically as

$$\begin{array}{l}\mathsf{Loop}(z):=\;\texttt{let }q=\mathsf{Prompt}(z)\texttt{ in}\\ \qquad\texttt{let }c=\operatorname{lm}\,q\texttt{ in}\\ \qquad\texttt{case }\mathsf{Step}(z,c)\texttt{ of}\\ \qquad\quad\mathsf{next}(z')\Rightarrow\mathsf{Loop}(z')\\ \qquad\quad\mathsf{halt}\Rightarrow v_{\mathsf{halt}},\end{array}$$

where $v_{\mathsf{halt}}$ is any fixed terminal value. This term is well formed because the language has ordinary recursion and case analysis. The important point is that at the boundary for $\operatorname{lm}\,q$, the variable $z$ occurs free in the continuation

$$\texttt{let }c=[\,]\texttt{ in case }\mathsf{Step}(z,c)\texttt{ of }\cdots,$$

so the CEK environment/continuation retains the binding of $z$ across the model call.

For each $s'\in S'$, let

$$E_{s'}:=\texttt{let }z=\texttt{eval}\,\operatorname{quote}(\operatorname{enc}(s'))\texttt{ in }\mathsf{Loop}(z),$$

a closed term of $\mathcal{L}^{\operatorname{lm}}$, and define $e(s')\in B$ by $e(s')\xleftarrow{\partial}E_{s'}$. Since evaluating $\texttt{eval}\,\operatorname{quote}(\operatorname{enc}(s'))$ deterministically yields $\operatorname{enc}(s')$, and since $\mathsf{Prompt}$ is pure and terminating on encoded states, this boundary exists and satisfies

$$p(e(s'))=p'(s').$$

Now resume $e(s')$ with a completion $c\in C$. By construction, the resumed CEK run continues exactly as evaluation of $\mathsf{Loop}(z)$ with $z=\operatorname{enc}(s')$ after the call to $\operatorname{lm}$ has returned $c$. If $h'(s',c)=\uparrow$, then $\mathsf{Step}(\operatorname{enc}(s'),c)$ diverges, so the resumed CEK run diverges before the next boundary and hence

$$h(e(s'),c)=\uparrow.$$

If $h'(s',c)=\mathbf{1}$, then $\mathsf{Step}(\operatorname{enc}(s'),c)$ yields $\mathsf{halt}$, so the resumed run reaches the terminal point $\mathbf{1}$. Finally, if $h'(s',c)=t'\in S'$, then $\mathsf{Step}(\operatorname{enc}(s'),c)$ yields $\mathsf{next}(\operatorname{enc}(t'))$, and the wrapper tail-calls $\mathsf{Loop}(z')$ with $z'=\operatorname{enc}(t')$. Equivalently, it continues as deterministic evaluation of the closed term $E_{t'}$ after its initial $\texttt{let}$-binding has been discharged, so the next boundary reached is exactly $e(t')$. Therefore

$$h(e(s'),c)=e(h'(s',c))$$

for all $s'\in S'$ and $c\in C$, with the conventions $e(\mathbf{1})=\mathbf{1}$ and $e(\uparrow)=\uparrow$.

It remains only to check that $e$ is injective. At the boundary state $e(s')$, the control component is the same syntactic model call template $\operatorname{lm}\,q$, and the continuation has the displayed form with $z$ free in it; the environment binds this retained variable to $\operatorname{enc}(s')$. If $e(s'_1)=e(s'_2)$, then the two CEK states have the same environment and hence the same retained value for $z$. Thus $\operatorname{enc}(s'_1)=\operatorname{enc}(s'_2)$, and since $\operatorname{enc}$ is injective, $s'_1=s'_2$. Therefore $e$ is an embedding of $X'$ into $X_{\mathrm{CEK}}$. ∎
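The $\mathsf{Loop}$ wrapper can be sketched in the same closure style. Everything here is illustrative: the target machine (a counter mod 4 that halts on `"stop"`), the identity encoding `enc`, and the `retained` field exposing the value of $z$ are assumptions made for the example, and divergence is left out.

```python
from dataclasses import dataclass
from typing import Any, Callable

HALT = "halt"  # stands in for the terminal point

@dataclass(frozen=True)
class Boundary:
    prompt: str
    resume: Callable[[Any], Any]
    retained: Any  # the value of z kept alive across the model call

def h(beta, c):
    return beta.resume(c)

def loop(prompt_fn, step_fn, z):
    """Analogue of Loop(z): compute the prompt, pause at the model call,
    then dispatch on Step(z, c) once a completion arrives."""
    def resume(c):
        tag, *payload = step_fn(z, c)
        if tag == "next":
            return loop(prompt_fn, step_fn, payload[0])
        return HALT
    return Boundary(prompt_fn(z), resume, z)

# Hypothetical target machine X': a counter mod 4; "stop" halts.
def p_target(s):
    return f"n={s}"

def h_target(s, c):
    return HALT if c == "stop" else (s + 1) % 4

# A realization of X': identity encoding plus pure Prompt and Step programs.
def enc(s):
    return s

def prompt_fn(z):
    return f"n={z}"

def step_fn(z, c):
    return ("halt",) if c == "stop" else ("next", (z + 1) % 4)

def e(s):
    """Embedding: the boundary first reached by Loop(enc(s))."""
    return loop(prompt_fn, step_fn, enc(s))

def commutes(s, c):
    """Check h(e(s), c) = e(h'(s, c)), comparing boundaries by retained state."""
    out, want = h(e(s), c), h_target(s, c)
    if want == HALT:
        return out == HALT
    return out != HALT and out.retained == enc(want) and out.prompt == p_target(want)

print(all(commutes(s, c) for s in range(4) for c in ["inc", "stop"]))
```

Boundaries are compared through `retained` rather than by closure identity, mirroring the injectivity argument, which reads the encoded state back out of the environment.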

###### Corollary A\.21\(Universal seed\)\.

The seed state $x_0$ from the proof of Theorem [A.17](https://arxiv.org/html/2605.06898#A1.Thmdefinition17) completion-generates every agentic machine over $(P,C)$ that is realizable in the underlying CEK evaluator. In particular, under standard finite encodings, it completion-generates every agentic machine whose prompt function and harness procedure are computable in the sense of Corollary [A.19](https://arxiv.org/html/2605.06898#A1.Thmdefinition19).

###### Proof\.

By Theorem [A.17](https://arxiv.org/html/2605.06898#A1.Thmdefinition17), $(X_{\mathrm{CEK}},x_0)$ completion-generates $X_{\mathrm{CEK}}$; by Proposition [A.20](https://arxiv.org/html/2605.06898#A1.Thmdefinition20), each realizable $X'$ embeds into $X_{\mathrm{CEK}}$; Proposition [A.9](https://arxiv.org/html/2605.06898#A1.Thmdefinition9) gives the first claim, and Corollary [A.19](https://arxiv.org/html/2605.06898#A1.Thmdefinition19) gives the second. ∎

## Appendix B Self-programmed execution language for LMs

### B.1 Overview

Spell v0.1.0 is a language for self-programmed execution embedded within Clojure, a modern dialect of Lisp. The central mechanism is the self-call: a running Spell program passes a program prefix to an LM for completion and evaluates the result. Often, the prefix sent for completion is computed from the program’s own source code. To support this pattern, Spell provides explicit self-reference operations, gates effectful expressions behind a double-evaluation mechanism, enforces strict scoping rules, and provides an idiomatic wrapper for programs that edit and re-run themselves recursively.

This appendix is organized as follows. Section [B.2](https://arxiv.org/html/2605.06898#A2.SS2) states the design principles behind Spell and explains how the main language mechanisms resolve tensions among those principles. Section [B.3](https://arxiv.org/html/2605.06898#A2.SS3) places Spell in the context of Lisp, reflection, and metaprogramming. Sections [B.4](https://arxiv.org/html/2605.06898#A2.SS4)–[B.7](https://arxiv.org/html/2605.06898#A2.SS7) describe the core language mechanisms: self-calls, visible reasoning text and provider-side reasoning tokens, quines, double evaluation, the completion wrapper, local runtime environments, context management, and recovery from invalid LM-written programs. Sections [B.8](https://arxiv.org/html/2605.06898#A2.SS8)–[B.10](https://arxiv.org/html/2605.06898#A2.SS10) describe concurrent agents, messaging, ordinary language features, and prebuilt orchestration patterns. Sections [B.11](https://arxiv.org/html/2605.06898#A2.SS11)–[B.15](https://arxiv.org/html/2605.06898#A2.SS15) document the prompt surface, namespace guides, transport variants, agent/provider configuration, and runtime implementation used by the experiments.

### B.2 Design principles

Spell was designed around three principles closely connected to SPE itself. Other language designs for SPE are possible, but I argue that such alternatives should respect similar principles and solve similar challenges.

1. Orchestration is ordinary program logic. It should be easy for the model to emulate common orchestration policies, such as a ReAct-style loop, and to generalize them in arbitrary ways. This is a general principle of programming languages: complex programs are expressed by composing modular logical components.
2. The completion is the program. Any part of the model’s completion which is not a part of the executable program cannot be addressed by the program and therefore cannot be passed as context through the turn boundary, except by quoting it in expensive output tokens. For this reason, the prefix in particular should be executed as part of the program.
3. The model controls what its program does. The term “self-programmed” refers not only to who writes the source code but also to who determines its behavior. This principle has important consequences because often, a model writes only part of its completion; its prefix is already given, and when it makes a self-call, it evaluates an inner program written by a different instance.

These principles explain why the three practical challenges in the main text are problematic, and they motivate certain solutions. First, *context persistence as code* is the challenge that a program must carry its own future context forward as executable source. This need arises due to Principle 1, since context persistence is a feature of common orchestration policies, and Principle 2 creates the challenge that this context is itself the code which is responsible for its own persistence. This motivates the use of Lisp because Lisp makes it easy and idiomatic to manipulate source code programmatically, unlike most other languages. It also motivates the `quine` form, because a program that edits or extends itself needs a reliable source-level handle on its own code.

Second, *replay-safe effects* is the challenge that replaying source as context must not replay old tool calls or self-calls. By Principle 2, the prefix supplied to the model is part of the program, but the current model instance did not write that prefix; if the prefix directly performs effects, then those effects run again without being chosen, violating Principle 3. Principle 1 also implies that the solution should be part of the structure of the program; in particular, this rules out a stateful interpreter that tracks what expressions were previously evaluated. Spell solves this problem using effect function gating and the trailing expression pattern, a natural structural solution though not necessarily the only one.

Third, *turn-boundary interference* is the challenge that when a program written by model A performs a self-call and evaluates the program written by model B, neither A nor B controls or even sees the other’s program. This creates tension with Principle 3 if, for example, program B could overwrite a binding in the runtime environment of program A. Spell therefore evaluates self-calls in fresh local environments: a child turn neither inherits nor mutates the parent environment, and the only durable medium across the turn boundary is source code written into the prefix. This is also why Spell does not feature closures; a closure is an opaque function value with an attached environment, and the only way to pass such objects across the turn boundary is to serialize their dependencies as source code, which defeats the point. Closures are not inherently problematic, however, and may be useful in other designs.

Various design choices in Spell are motivated not by any particular principle but rather by what behaviors they elicit from existing models. This is similar to programming languages for humans, which are always designed around programmer behavior to some extent. Such choices are most likely to change in the future, as model behaviors evolve.

### B.3 Background

Spell belongs to a tradition of self-referential languages and programs. Theoretical work on self-referential programs includes Kleene’s second recursion theorem \[Kleene, [1952](https://arxiv.org/html/2605.06898#bib.bib38)\], and Gödel famously studied self-referential statements \[Gödel, [1931](https://arxiv.org/html/2605.06898#bib.bib36)\]. With the creation of Lisp \[McCarthy, [1960](https://arxiv.org/html/2605.06898#bib.bib39)\], it became practical and idiomatic to write programs that transform programs, specifically because in Lisp, programs and data share the same representation (this property is *homoiconicity*). A powerful application of this approach is to customize the language itself, giving rise to Lisp dialects. Homoiconicity is also useful when writing programs that reference or reflect upon their own source code; the computing term *quine* was introduced by Hofstadter for such programs, after W. V. O. Quine’s work on indirect self-reference \[Hofstadter, [1979](https://arxiv.org/html/2605.06898#bib.bib37)\]. Smith \[Smith, [1984](https://arxiv.org/html/2605.06898#bib.bib42)\] described reflection this way:

> It is as if we were creating a magic kingdom, where from a cake you could automatically get a recipe, and from a recipe you could automatically get a cake.

Another language with first-class support for metaprogramming is MetaML, which adds type safety and strict scoping rules for runtime-generated code \[Taha and Sheard, [2000](https://arxiv.org/html/2605.06898#bib.bib43)\]. Many compiled languages support at least some compile-time metaprogramming (e.g., Rust), and many interpreted languages support but discourage the evaluation of generated code at runtime (e.g., Python).

Smith \[Smith, [1984](https://arxiv.org/html/2605.06898#bib.bib42)\] promoted a self-referential motif, namely that of programs which interpret programs, into a system architecture with 3-Lisp: in 3-Lisp, a program is interpreted by an interpreter which is itself interpreted by another interpreter, forming a *reflective tower*. Schmidhuber \[Schmidhuber, [2007](https://arxiv.org/html/2605.06898#bib.bib41)\] similarly took a self-referential motif, recursive self-improvement, and promoted it into a system architecture with the Gödel machine. This theoretical machine iteratively rewrites its own code, proving at each iteration that the rewrite is beneficial. Spell likewise promotes a self-referential motif into a system architecture: an outer program Q invokes a language model to compute an inner program, often by appending expressions to Q, and evaluates the inner program. Like 3-Lisp, the outer program Q causes an inner program to run, but the actual implementation of this behavior is conventional: the interpreter is external, invoked but not implemented by Q. Like the Gödel machine, moreover, Spell’s motif is goal-oriented: the program Q is meant to accomplish some task, and it produces the completed program P′ as a means of doing so.

#### B.3.1 Clojure

Spell is embedded within Clojure \[Hickey, [2020](https://arxiv.org/html/2605.06898#bib.bib45)\]. It is implemented in Clojure and also adopts most of Clojure’s semantics. Clojure is a modern Lisp with functional programming features, particularly immutability. It is a hosted language, running on the Java virtual machine or within JavaScript, and it interoperates with these languages without a translation layer. It was chosen over other Lisps because immutability enables a powerful concurrency model, which addresses challenges that naturally arise in multi-agent systems. Multi-agent features of Spell are not emphasized in this paper, mostly because existing LLMs do not utilize them, but they motivate the implementation of the Spell runtime.

### B.4 Core semantics

#### B.4.1 Completions and `!llm-self`

A *completion* is the prefix passed to the LM together with the suffix produced by the LM. In Spell, completions are executable programs.

The central primitive is `!llm-self`. At a high level, it behaves as follows:

    (defn !llm-self [prefix]
      ;; Pseudocode only
      (let [response   (call-llm prefix)
            completion (str prefix response)
            result     (spell-eval completion {})]
        (:ok result)))

The actual implementation includes parsing, balancing, error recovery, and handle management, but this pseudocode captures the key idea: Spell passes the LM an *open program prefix*, receives a completion, evaluates it, and returns its value.
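The same control flow can be sketched outside Spell. The following Python model is only an illustration of the prefix, suffix, and evaluation steps described above; `fake_llm` and `toy_eval` are hypothetical stand-ins (not part of Spell) for the LM call and for `spell-eval`, and the toy evaluator understands only a single `(+ ...)` form.

```python
from typing import Callable


def llm_self(prefix: str,
             call_llm: Callable[[str], str],
             evaluate: Callable[[str], dict]) -> object:
    """Toy self-call: complete the open prefix, evaluate it, return its value."""
    suffix = call_llm(prefix)        # the LM writes only the suffix
    completion = prefix + suffix     # prefix + suffix is the whole program
    result = evaluate(completion)    # result map: {"ok": value} or {"err": message}
    if "err" in result:
        raise RuntimeError(result["err"])
    return result["ok"]


def fake_llm(prefix: str) -> str:
    """Hypothetical LM stub: always closes the open form with one more operand."""
    return " 2)"


def toy_eval(program: str) -> dict:
    """Hypothetical evaluator: handles only a flat (+ a b ...) form."""
    tokens = program.strip().strip("()").split()
    if not tokens or tokens[0] != "+":
        return {"err": f"cannot evaluate: {program!r}"}
    return {"ok": sum(int(t) for t in tokens[1:])}


if __name__ == "__main__":
    # The open prefix "(+ 1" is completed to "(+ 1 2)" and then evaluated.
    print(llm_self("(+ 1", fake_llm, toy_eval))  # prints 3
```

Because the self-call is an ordinary function here, as in Spell, it can be placed inside loops or conditionals without any change to the harness.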

Because `!llm-self` is an ordinary callable form, it can appear inside conditionals, loops, recursive functions, map-style dispatch, or user-defined orchestration helpers.

#### B.4.2 String literals

A model’s context window often contains mostly natural language, including prompts, reasoning traces, and tool call results. In Spell, these text fragments are usually string literals inside the program. They can be integrated into the program using `def` bindings, `quine` forms, or the inert `think` form. The reason to put this content inside the program is so that it can be manipulated programmatically and used to construct a prefix for a self-call. For some model-provider configurations, Spell also allows the model to emit reasoning tokens that are not part of the program; this is a concession to model behavior that technically violates Principle 2, but only superficially, because the model can always choose to embed reasoning traces into the program instead.

#### B.4.3 Quines and explicit self-reference

To extend or rewrite its own context, the program must be able to refer to its current completion as structured data. Spell provides the special form `quine` for this purpose. The following program prints its own source code:

    (quine self (pr-str self))
    ;; => "(quine self (pr-str self))"

`(quine name body)` binds `name` to the entire quine form as data and then evaluates `body`. This allows the body to inspect or transform the very program that contains it. A quine form evaluates to the value of its body.
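To make the binding rule concrete, here is a minimal Python sketch of a `quine`-style special form. It is an assumption-laden toy, not Spell’s evaluator: forms are nested Python lists, every string is treated as a symbol, and only `quine` and `pr-str` are supported.

```python
def pr_str(form) -> str:
    """Print a nested-list form back as Lisp-style source text."""
    if isinstance(form, list):
        return "(" + " ".join(pr_str(f) for f in form) + ")"
    return str(form)


def evaluate(form, env):
    """Evaluate a toy form; strings are symbols looked up in env."""
    if isinstance(form, str):
        return env[form]
    if not isinstance(form, list):
        return form                      # numbers etc. evaluate to themselves
    head = form[0]
    if head == "quine":                  # (quine name body)
        _, name, body = form
        # Bind `name` to the entire quine form as data, then evaluate the body.
        return evaluate(body, {**env, name: form})
    if head == "pr-str":
        return pr_str(evaluate(form[1], env))
    raise ValueError(f"unknown form: {head}")


program = ["quine", "self", ["pr-str", "self"]]

if __name__ == "__main__":
    print(evaluate(program, {}))  # prints (quine self (pr-str self))
```

The key design point mirrored here is that the self-reference is explicit: the body receives the enclosing form as an ordinary value that it can print or transform.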

#### B.4.4 Double-evaluation of effectful expressions

Because Spell programs often edit and re-run themselves, expressions that have side effects (e.g., making an LLM call or interacting with the filesystem) must be treated with special care. In the default evaluator applied to a Spell program, effectful functions are entirely unavailable; they raise an error if called. The single exception is the `eval` function, which can be called from the main program to evaluate an inner program with side effects. This produces a clean boundary between pure code that can be re-evaluated safely and effectful code that should not be. Specifically, the Spell wrapper uses this mechanism to create programs with side effects that run exactly once.

#### B.4.5 The completion wrapper

Normal Spell programs comprise a body and a standard *wrapper*:

    (quine completion
      (eval
        (do
          ... ; the body
          )))

This wrapper has two purposes:

1. It enables self-reference via the outer `quine`.
2. It ensures that only the *trailing expression*, which is the last expression of the `do` block, can have externally visible effects.

The outer `eval` performs a second evaluation on the value returned by the `do` block, namely its *trailing expression*. If this expression is quoted, then `eval` evaluates it, allowing it to trigger side effects such as LLM calls. This pattern guarantees that even though the LLM writes only a suffix of the completion, and cannot rewrite its prefix before it is evaluated, this prefix cannot have undesired effects. Only one expression of the prefix, the trailing expression, has effects at all; these effects are “cancelled” by appending another expression after it, since a `do` block returns the value of its last expression.
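The gating and the trailing-expression pattern can be modeled in a few lines of Python. This is a sketch under simplifying assumptions, not the Spell implementation: forms are nested lists, any head starting with `!` counts as an effect, and effects are recorded in a `log` list instead of being performed.

```python
def evaluate(form, log, effects_ok=False):
    """Toy evaluator: effects are allowed only during eval's second evaluation."""
    if not isinstance(form, list):
        return form                          # atoms evaluate to themselves
    head, *args = form
    if head == "quote":
        return args[0]                       # quoted data is not evaluated
    if head == "do":
        value = None
        for sub in args:
            value = evaluate(sub, log, effects_ok)
        return value                         # only the last value survives
    if head == "eval":
        inner = evaluate(args[0], log, effects_ok=False)  # first pass: pure
        return evaluate(inner, log, effects_ok=True)      # second pass: effects allowed
    if head.startswith("!"):
        if not effects_ok:
            raise PermissionError(f"effect {head} called outside the trailing expression")
        log.append(head)                     # record the effect instead of performing it
        return f"result of {head}"
    raise ValueError(f"unknown form: {head}")


# First turn: the quoted trailing expression is the only thing eval runs.
first = ["eval", ["do", "some context", ["quote", ["!tool"]]]]

# Next turn: the old call is still in the source, but a new trailing
# expression follows it, so the old call stays inert data.
extended = ["eval", ["do", "some context", ["quote", ["!tool"]],
                     "tool output", ["quote", ["!extend"]]]]

if __name__ == "__main__":
    log1, log2 = [], []
    evaluate(first, log1)
    evaluate(extended, log2)
    print(log1, log2)  # prints ['!tool'] ['!extend']
```

Re-evaluating `extended` replays the source of the earlier tool call but not its effect, which is the replay-safety property the wrapper is meant to provide. An unquoted effect inside the `do` block raises an error in this model, matching the rule that effectful functions are unavailable in the default evaluator.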

#### B.4.6 Extensions and `!call-now`

A common pattern is to call a tool, serialize the result into the completion, and then continue reasoning with the result in context. Spell packages this pattern into `!call-now`:

    ...
    '(!call-now result-name (tool-call))

On the next turn, the prefix observed by the LM is its original program, including the tool call itself, and the result of the tool call materialized as a binding:

    ...
    '(!call-now result-name (tool-call))
    (def result-name "literal result of tool-call")
    ...

It supports multiple name-expression pairs, and later pairs can reference earlier ones in a manner similar to `let`. The related function `!extend` simply performs a self-call without making any tool calls; it is often used in the initial program to trigger the first LM turn. The implementation of `!call-now` takes `completion`, adds the tool call result inside the `do` block, and calls `!llm-self`. It can be used for purposes other than traditional tool calls, such as subagent calls and mathematical calculations.
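The source-level rewrite performed by `!call-now` can be illustrated with a small Python sketch. It assumes a simplified representation in which the body of the `do` block is a list of source strings and the tool result is a string serialized with `json.dumps`; `run_tool` is a hypothetical callback, and the subsequent self-call is left out.

```python
import json
from typing import Callable, List


def call_now(body: List[str], name: str, tool_src: str,
             run_tool: Callable[[str], str]) -> List[str]:
    """Return the next-turn body: the old body plus a def of the tool result."""
    trailing = f"'(!call-now {name} {tool_src})"
    if not body or body[-1] != trailing:
        raise ValueError("the call must be the trailing expression")
    result = run_tool(tool_src)                    # the effect runs exactly once
    literal = json.dumps(result)                   # serialize as a string literal
    return body + [f"(def {name} {literal})"]      # old call stays in source, now inert


def fake_ls(tool_src: str) -> str:
    """Hypothetical tool: pretends to list files."""
    return "a.txt b.txt"


body = ['(def task "count the files")', "'(!call-now files (io/ls \".\"))"]
next_body = call_now(body, "files", '(io/ls ".")', fake_ls)

if __name__ == "__main__":
    print("\n".join(next_body))
```

In this model the returned body would then be wrapped and passed to the LM as the next prefix; because the `def` now follows the quoted call, the call is no longer the trailing expression and does not run again.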

### B.5 Runtime environments and local state

In Clojure, the `eval` function is global: it both reads from and writes to the global environment. This behavior is undesirable when `!llm-self` evaluates a completion because the LM has no way to read the global environment. A binding defined by a parent LM could be overwritten by a child LM, or the environment could become cluttered with forgotten functions and variables. This challenge is mostly specific to the agentic setting, where the code-writing entity can read only its own subprogram.

Spell solves this challenge by giving its evaluator (`spell-eval`) a local runtime environment which is separate from those of its parent and children. Together with the gating of effect expressions, this choice creates a simple distinction between local state, which depends only on the source code of the program, and global state, with which a Spell program interacts only through the trailing expression.

An implication of strict local scoping is that Spell has little use for closures, which are opaque functions that capture the environment in which they are defined (in particular, closures often capture helper functions). Due to the scoping rules of Spell, there is no way for such an object to pass through the boundary between LM turns except by serializing it as source code, and doing so would defeat the point of closures. Therefore, functions in Spell are dynamically scoped; if a function body uses a free binding (e.g., `completion`), then this binding is looked up in the environment wherever the function is called.
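A short Python sketch may help contrast this with lexical scoping. It is a toy evaluator written for illustration only, with hypothetical names: a function value is plain data (parameters and body, with no captured environment), and free symbols in the body are resolved in the caller’s environment.

```python
def evaluate(form, env):
    """Toy dynamically scoped evaluator over nested-list forms."""
    if isinstance(form, str):
        return env[form]                       # symbol: resolved at evaluation time
    if not isinstance(form, list):
        return form                            # numbers evaluate to themselves
    head, *args = form
    if head == "fn":                           # (fn [params] body): no environment captured
        return {"params": args[0], "body": args[1]}
    if head == "+":
        return sum(evaluate(a, env) for a in args)
    fn = evaluate(head, env)                   # otherwise: a function call
    values = [evaluate(a, env) for a in args]
    call_env = {**env, **dict(zip(fn["params"], values))}  # caller's env plus arguments
    return evaluate(fn["body"], call_env)


# The function is defined where offset = 1 ...
add_offset = evaluate(["fn", ["x"], ["+", "x", "offset"]], {"offset": 1})

# ... but called where offset = 100, and the caller's binding is the one used.
result = evaluate(["add-offset", 5], {"add-offset": add_offset, "offset": 100})

if __name__ == "__main__":
    print(result)  # prints 105; a lexically scoped closure would give 6
```

Since the function value carries no environment, it can be written back into source without losing information, which is the property the turn boundary requires.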

### B.6 Context management

As a completion grows through repeated extensions, it accumulates stale context. Spell provides two high-level ways to manage this accumulation: the LM can build a new prefix from scratch by adding together the pieces it wishes to keep, or it can subtract from its current completion the pieces it wishes to drop.

The first approach requires only the `!llm-self` primitive, and Spell additionally provides a convenience function, `wrap-cat`, which supports this by concatenating any number of forms into the `do` block of the Spell wrapper. This approach is maximally flexible, but in practice LMs do not use it, and it may add substantial overhead if the LM is required to decide on every turn what context is still relevant. The second approach is used more often in practice. It relies on a pair of marker forms, `prune` and `persist`, which are signals to the macro-like `prune-and-reopen` form.

#### B.6.1 `prune` and `persist`

The `prune` function marks one or more preceding expressions for deletion. At runtime, it is inert, but when a program is passed through `prune-and-reopen`, both the `prune` expression itself and the appropriate number of preceding expressions are deleted.

The `persist` operator lets the LM keep a derived value after pruning away the object from which that value was computed. In particular, the combination of `prune` and `persist` allows the LM to delete a large tool-call result from context while retaining a slice or some other computed summary:

    ...
    '(!call-now big-file (io/read-lines "big-file.txt"))
    (def big-file [...]) ; 1000 lines
    (prune 1)
    (persist lines (subvec big-file 32 42))
    '(!extend)

At runtime, `persist` behaves like `def`: it evaluates the expression and binds it to a name. When `prune-and-reopen` is applied, however, the expression inside the `persist` form is replaced with the literal value which is currently bound to that name. On the subsequent turn, the program above produces the following:

    ...
    '(!call-now big-file (io/read-lines "big-file.txt"))
    (persist lines [...]) ; 10 lines
    '(!extend)
    ...

Most turn-producing expressions in Spell, except the primitive `!llm-self`, apply `prune-and-reopen` to their argument before making the LM call.
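The rewrite described above can be sketched in Python as a single pass over the top-level forms. This is an illustrative model rather than Spell’s `prune-and-reopen`: forms are nested lists, `env` is a plain dictionary holding the values bound at runtime, and a bare `(prune)` is assumed to delete one preceding form.

```python
def prune_and_reopen(forms, env):
    """Drop pruned forms and replace persist expressions with their literal values."""
    out = []
    for form in forms:
        is_list = isinstance(form, list) and len(form) > 0
        if is_list and form[0] == "prune":
            count = form[1] if len(form) > 1 else 1   # assumption: (prune) means (prune 1)
            del out[max(0, len(out) - count):]        # delete the preceding forms
            continue                                  # the prune form itself is dropped too
        if is_list and form[0] == "persist":
            name = form[1]
            out.append(["persist", name, env[name]])  # expression -> current literal value
            continue
        out.append(form)
    return out


big_file = [f"line {i}" for i in range(1000)]
env = {"big-file": big_file, "lines": big_file[32:42]}   # bindings after evaluation

forms = [
    ["quote", ["!call-now", "big-file", ["io/read-lines", "big-file.txt"]]],
    ["def", "big-file", big_file],
    ["prune", 1],
    ["persist", "lines", ["subvec", "big-file", 32, 42]],
    ["quote", ["!extend"]],
]

reopened = prune_and_reopen(forms, env)

if __name__ == "__main__":
    print(len(forms), "forms before,", len(reopened), "forms after")
```

In the reopened program the 1000-line `def` is gone, while the `persist` form carries its ten-line value as a literal and therefore no longer depends on `big-file`.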

#### B.6.2 `!peek`

In order to elicit active context management, a convenience function `!peek` combines `prune` with `!call-now` to produce an ephemeral tool call whose result is visible for only one turn (unless the model keeps it using `persist`). When the model writes:

    '(!peek contents (io/read-lines "big-file.txt"))

the expression appends both the tool call result(s) and a `prune` expression to the prefix for the following turn. If there are multiple tool calls, then all of them are pruned.

#### B.6.3 `think` and `rethink`

Spell encourages LMs to use string literals for their internal reasoning, using `think`:

    (think "I am a helpful assistant...")

At runtime, `think` evaluates its body and returns `nil`. It is meant to be combined with `rethink`, which is sugar for `prune` followed by `think`. The idea is that the LM can backtrack and prune away unproductive reasoning traces in this way. The language itself does not couple these functions; `rethink` can be used to prune any preceding expression, not only one involving `think`. This pruning occurs when the program passes through `prune-and-reopen`.

### B.7 Error recovery

LM-written programs will sometimes contain errors. Spell includes error recovery mechanisms which allow the LM to recover after an initial error.

#### B.7.1 Result-map errors

`spell-eval` returns result maps rather than throwing host-language exceptions directly:

    Success: {:ok value :env env'}
    Error:   {:err message :env env :expr failing-expression :trace [...]}

The `:trace` field records the Spell-level call path through which the error propagated. For host-function errors, Spell rewrites the message so it is expressed in terms of Spell-facing names rather than internal Clojure machinery.

#### B.7.2 Deterministic namespace fixup

A frequent LM mistake is to use a symbol without the required namespace qualification, for example `trim` instead of `strings/trim`. Spell first attempts a deterministic repair by searching the available namespaces. If there is exactly one matching resolution, the system substitutes it and re-evaluates.
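The repair rule is simple enough to sketch directly. The Python below is a hedged illustration, not Spell’s implementation: the `namespaces` table and its contents are hypothetical, and the function only resolves a symbol, leaving substitution and re-evaluation to the caller.

```python
from typing import Dict, Optional, Set


def fix_namespace(symbol: str, namespaces: Dict[str, Set[str]]) -> Optional[str]:
    """Return the qualified symbol if exactly one namespace defines it, else None."""
    if "/" in symbol:
        return symbol                              # already qualified: nothing to repair
    matches = sorted(f"{ns}/{symbol}"
                     for ns, names in namespaces.items() if symbol in names)
    return matches[0] if len(matches) == 1 else None   # ambiguous or unknown: no repair


# Hypothetical namespace table for the example.
namespaces = {
    "strings": {"trim", "split", "join"},
    "io": {"ls", "read-lines"},
    "paths": {"join"},
}

if __name__ == "__main__":
    print(fix_namespace("trim", namespaces))   # prints strings/trim
    print(fix_namespace("join", namespaces))   # prints None (two candidates)
```

Refusing to guess when there are several candidates keeps the repair deterministic; in that case the error would presumably fall through to the LM-driven recovery paths described next.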

#### B.7.3 Trailing expression error recovery

When the program is successfully parsed, but an error is thrown by the trailing expression inside `eval`, it is possible to rescue the program by rewriting that one expression. This is handled by appending an error message and an error recovery prompt after the trailing expression inside the `do` block, and re-invoking the model. The model can rewrite the trailing expression and continue.

#### B.7.4 Other evaluation error recovery

When the outer evaluator throws an error, the mechanism described above would fail because appending new expressions inside the `do` block would still allow the error to be re-triggered. Instead, the error message and recovery prompt are appended inside a new `(eval (do ...))` block within the top-level `quine` form. For example, suppose the failing program contains an effect call outside the trailing expression:

    (quine completion (eval (do
      (quine prompt "List the files.")
      (def files (io/ls "."))
      '(!call-now n (count files)))))

The runtime catches the error, appends `(prune)` and a fresh `(eval (do ...))` block as additional arguments to the top-level `(quine completion ...)`, and re-evaluates:

    (quine completion
      (eval (do
        (quine prompt "List the files.")
        (def files (io/ls "."))
        '(!call-now n (count files))))
      (prune)
      (eval (do
        (def _recovery_prompt
          "The previous Spell program threw an error....")
        (def _error {:error "Unbound symbol: io/ls"
                     :in '(io/ls ".")})
        '(!llm-self (reopen completion)))))

This works because `quine` with arity greater than two evaluates only the last form; appending a recovery form after the error-producing form avoids re-raising the error. The error-causing form is visible to the model for only one turn, which avoids the accumulation of dead code. The recovery prompt instructs the model to retain any context from the error-causing form that will be needed on subsequent turns.

#### B.7.5 Reader recovery

If the completion cannot be parsed at all (for example, because of unbalanced parentheses), Spell cannot embed it as normal code. In that case, the raw text is wrapped into a fresh recovery quine as a string. The LM then gets another chance to produce a valid continuation. Compared with the more common evaluation recovery path, this path can be expensive because when the error-producing program is wrapped as a string literal, it misses the KV cache.

### B.8 Concurrent agents and inter-agent communication

Spell supports concurrent agents and enables both synchronous and asynchronous communication between them. Messages are addressed using handles. During execution, every program is associated with one handle, and the `!llm-self` function (as well as convenience functions like `!call-now`) produces an execution trace that has the same handle as the program which produces it. Programs with the same handle run synchronously in the same thread. The initial program has the handle `:main`. A special function, `agents/spawn`, creates an execution trace with a new handle, and this trace runs asynchronously in its own Clojure thread. Thus, there are two primary models for subagent delegation in Spell: synchronous self-delegation, in which there is no communication except argument-passing, and asynchronous agent-spawning, which allows for communication.

#### B.8.1 Inbox-based messaging

Each handle has an inbox that stores queued message macros. These macros transform a completion by appending a message binding (a serialized map containing the sender handle and message body) and then `'(!extend)`. Whatever completion the recipient produces, its trailing expression is intercepted and replaced, giving the model an opportunity to see the incoming message before taking further action. The underlying inbox-macro mechanism is quite general and could be used for other kinds of inter-agent coordination (for example, graceful shutdown).

#### B.8.2 Synchronous communication

Ordinarily, inbox-based messaging is asynchronous: an agent sends a message and then takes another turn immediately, not waiting for a reply. Synchronous communication, where an agent sends a message and blocks for a response, poses a design challenge because it introduces the potential for deadlock, for example if agents Alice and Bob send blocking messages to each other and wait for a reply simultaneously. A related challenge is that if Alice has finished its work and gone dormant, then Bob has no way to know this before sending a message.

Spell rules out this kind of deadlock by coupling the states of the two agents: when Alice sends a blocking message to Bob and goes to sleep, the state of Alice (awake or asleep) is coupled with that of Bob. The message awakens Bob, and if Bob subsequently sleeps, then this awakens Alice. If two agents send blocking messages to each other simultaneously, it causes them both to be awake, not asleep.

#### B.8.3 Other ways to sleep

The same sleeping mechanism is used in three other situations. First, when an agent other than `main` finishes its work and returns, any agent that was sleeping for a response from this agent is awakened. Then, the agent enters a sleeping state from which it can be awakened by any incoming message. This allows a worker to start a new task if assigned, or answer a question about work that it has completed, without losing its previous context.

Second, an agent can spawn one or more child agents and sleep until all of them finish their work. As usual, an agent that does this can also be awakened by an incoming message.

Third, an agent can sleep until an arbitrary computation finishes in a separate thread. For example, an agent could listen for a change to some file and awaken when the file is modified; it could spawn subagents in some sort of task loop and awaken when the task loop terminates. If one of those subagents messages the main agent, then the main agent is awakened early, without interrupting the spawned execution thread. Such programs can fail to halt, but this failure mode is no worse than that of any program; deadlock between agents is a worse problem because the agents in question lack the information needed to avoid it.

This claim is about deadlock between agent handles, not arbitrary host-thread liveness. Spell does not prevent ordinary futures from waiting on one another forever. This is a less concerning failure mode: futures created independently by different agents do not know about one another through the communication system, while mutually blocking futures created by the same agent are part of a single program and therefore visible to the author of that program.

#### B.8.4 Why this avoids deadlock

Consider the directed graph $(A,E)$ where $A$ is the set of agent handles, and $(a,b)$ belongs to $E$ if agent $a$ is sleeping for a response from agent $b$ at some time $t$. Each node is either awake or asleep. The following transformations are possible from time $t$ to $t+1$:

1. A new node can be created, either awake or asleep.
2. For a node $a$ which is awake at time $t$, any number of new edges $(a,b)$ can be created ($b\neq a$); at time $t+1$, $a$ will be asleep and $b$ will be awake. If $b$ goes from asleep to awake, then any edge $(b,c)$ is deleted.
3. A node $b$ can go from awake to asleep, and this deletes any edge $(a,b)$. If the out-degree of $a$ becomes zero, then $a$ becomes awake at time $t+1$.

Deadlock occurs when every node is asleep and $E$ is nonempty. A non-deadlocked state never gives rise to a deadlocked state. In particular, transformation (2) never generates a directed cycle. Clojure provides synchronization primitives that make it possible to avoid undesired state transitions (other than (1)–(3) above) via race conditions; the Spell v0.1.0 implementation appears to accomplish this.
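The transition rules can be exercised with a small randomized simulation. The Python sketch below is an illustration of the graph model only, under two assumptions that are mine rather than the paper’s: transitions are applied one at a time, and transformation (2) always creates at least one edge. It is a sanity check, not a proof and not the Spell scheduler.

```python
import random


class AgentGraph:
    """Awake/asleep nodes plus edges (a, b) meaning 'a sleeps waiting on b'."""

    def __init__(self):
        self.awake = {}      # handle -> bool
        self.edges = set()   # set of (waiter, target) pairs

    def add(self, handle, awake=True):
        self.awake[handle] = awake                       # transformation (1)

    def ask(self, a, targets):
        """Transformation (2): awake a blocks on targets; every target is awake after."""
        assert self.awake[a] and targets and a not in targets
        for b in targets:
            if not self.awake[b]:
                # A target that wakes up stops waiting on anyone.
                self.edges = {(x, y) for (x, y) in self.edges if x != b}
                self.awake[b] = True
            self.edges.add((a, b))
        self.awake[a] = False

    def sleep(self, b):
        """Transformation (3): awake b sleeps; waiters left with no edges wake up."""
        assert self.awake[b]
        waiters = {x for (x, y) in self.edges if y == b}
        self.edges = {(x, y) for (x, y) in self.edges if y != b}
        self.awake[b] = False
        for a in waiters:
            if not any(x == a for (x, _) in self.edges):
                self.awake[a] = True

    def deadlocked(self):
        return bool(self.edges) and not any(self.awake.values())


def simulate(steps=2000, n_agents=5, seed=0):
    """Apply random transitions; return False if a deadlocked state ever appears."""
    rng = random.Random(seed)
    g = AgentGraph()
    for i in range(n_agents):
        g.add(i)
    for _ in range(steps):
        awake = [a for a, up in g.awake.items() if up]
        if not awake:
            break                                        # everyone finished
        a = rng.choice(awake)
        if rng.random() < 0.5:
            others = [b for b in g.awake if b != a]
            g.ask(a, rng.sample(others, rng.randint(1, len(others))))
        else:
            g.sleep(a)
        if g.deadlocked():
            return False
    return not g.deadlocked()


if __name__ == "__main__":
    print(all(simulate(seed=s) for s in range(20)))
```

In this model the sequential version of the Alice and Bob scenario resolves itself: if Alice asks Bob and Bob then asks Alice, Bob’s request wakes Alice and deletes her pending edge, so exactly one of them remains awake.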

#### B.8.5 Communication API

The communication forms are:

- `(agents/spawn prompt)`: spawn a new agent asynchronously and return its handle;
- `(agents/spawn agent prompt :handle-name)`: spawn with an explicit compiled agent and/or handle name;
- `(agents/send target message)`: fire-and-forget message delivery;
- `(agents/reply msg-map message)`: reply to a message map (the map must contain a `:from` handle);
- `(agents/!ask target msg)`: send a message and sleep until a reply arrives;
- `(agents/!ask target)`: wake the target and sleep;
- `(agents/!ask [a b c])`: wake several targets and sleep until all return;
- `(agents/!reply-ask msg-map message)`: reply to `msg-map` and block for the next message;
- `(agents/!spawn-ask prompt)`: spawn a child and sleep until it completes;
- `(agents/!spawn-ask [prompt-a prompt-b ...])`: spawn several children and sleep until all complete;
- `(!ask-await fut)`: sleep for the completion of a thread.

When a non-root handle finishes, its completion is preserved in a sleeping state so that another agent can later wake it again.

### B.9 Other language features

#### B.9.1 Macros

Spell supports Clojure’s built-in macros such as `when`, `cond`, `defn`, `->`, and `->>`, as well as user-defined macros via `defmacro`:

    (defmacro unless [test & body]
      (list 'if test nil (cons 'do body)))

#### B.9.2 Error handling

Spell provides structured exception-style control flow:

    (try
      (/ 1 0)
      (catch e "division failed"))

    (try
      (throw {:code 404})
      (catch e (:code e)))

Bindings created before the error remain available in the catch handler.

#### B.9.3 Destructuring and iteration

`let`, `fn`, `loop`, and `for` support standard vector and map destructuring. `loop`/`recur` provide tail-recursive iteration, and `for` provides list-comprehension-style iteration with `:when` and `:let`.

    (loop [n 5 acc 1]
      (if (= n 0)
        acc
        (recur (- n 1) (* acc n))))

    (for [x [1 2 3 4] :when (> x 1) :let [sq (* x x)]]
      sq)

#### B.9.4 Futures

`(future expr)` evaluates `expr` in a new thread while capturing the current environment. Futures are isolated from the parent’s later environment updates. Raw blocking operations live in the `blocking/` namespace, which is only injected inside futures. This keeps direct thread blocking out of ordinary agent turns, where sleeping should go through the inbox/wakeup protocol (`agents/!ask`, `agents/!spawn-ask`, or `!ask-await`) so incoming messages can wake or preempt the agent. Inside a future, there is no active agent completion to reopen, so `blocking/` is the explicit API for waiting on futures or agent completion tokens from scheduler-style code.

#### B.9.5 Other features shared with Clojure

Spell implements many features of Clojure, most of which are rarely or never used. A common failure mode observed during development of Spell was that models attempted to use built-in Clojure functions which are not implemented in Spell; as a result, Spell implements Clojure built-in functions by default, unless there is a reason not to do so.

### B.10 Prebuilt orchestration patterns

The `patterns/` namespace includes several prebuilt orchestration patterns, implemented in Spell as opposed to Clojure. In the future, this could include a library of patterns which the model invokes when appropriate; at present, models do not choose to use this in my experiments, and the `patterns/` namespace is not made available in the benchmarking analyses of this paper.

An example of such a pattern is a worker-checker loop. The main agent spawns the loop with a prompt. The checker agent is tasked with diagnosing a problem and returns a structured map containing instructions for the worker agent; the worker agent attempts a solution, and this is sent back to the checker agent for verification. The checker agent can either continue the loop or announce completion.

### B\.11 Prompts and internal documentation

Spell v0\.1\.0 features three categories of prompts: system prompts, which are injected by the `call-llm` function every turn and are configured by the `agents.edn` file; error recovery prompts, which are injected when a Spell program throws an error and is sent to the LM for recovery; and model\-discoverable prompts and documentation, which can be injected into context using the `!describe` function\.

#### B\.11\.1 System prompts

The system prompt teaches the model the core semantics of Spell, provides examples of idiomatic usage, and warns against common pitfalls\. There are three variants of the prompt, based on the transport used to obtain a Spell suffix from the model \(Section [B\.13](https://arxiv.org/html/2605.06898#A2.SS13)\), and prompts additionally have slight modifications based on the available namespaces\. The main experiments using Anthropic, OpenAI, and Fireworks tool\-call transports used the tool\-call variant, shown below with the prompt cursor marker rendered as \| for LaTeX compatibility\.

INTRODUCTION

You are writing Spell, a Lisp resembling Clojure, designed for LLM self\-orchestration and context engineering\.

Spell allows you to act as an agent without an external harness by writing a self\-calling program\.

Your input is the prefix of a Spell program; your output completes it\.

The completion is evaluated by the Spell interpreter\. The completion is the program: any natural language, markdown, or commentary will cause a parse error\.

The prefix usually contains instructions in a string literal; produce a program which completes the task, computes a response, or produces a self\-call as step toward doing so\.

TRANSPORT

This prompt is for a tool\-call transport\. Your response should comprise exactly one tool call named `spell_suffix`, with no assistant message text, explanations, markdown, or wrapper prose\.

The content of your user message will be a Spell program prefix\. The payload of your tool call must be the raw Spell suffix\. These are concatenated to produce a Spell program\.

Only visible Spell code is parsed or preserved\. Hidden reasoning is not part of the program and will not propagate to later turns\. If you utilize hidden reasoning that should persist across turns, write it as part of the program via \(think "\.\.\."\)\.

Example: Anthropic tool\-call transport

\{

"messages":\[

\{

"role":"user",

"content":"\(quine completion \(eval \(do \(quine prompt \\"Inspect the project root\.\\"\)"

\}

\],

"tools":\[

\{

"name":"spell\_suffix",

"description":"Return the full Spell suffix in input\.suffix",

"input\_schema":\{

"type":"object",

"properties":\{

"suffix":\{

"type":"string"

\}

\},

"required":\["suffix"\],

"additionalProperties":false

\}

\}

\],

"tool\_choice":\{

"type":"any"

\}

\}

\{

"type":"tool\_use",

"name":"spell\_suffix",

"input":\{

"suffix":"\(think \\"Goal: inspect the project root\. Next action: list top\-level files\. Success: identify the main entry points\.\\"\)\\n'\(\!call\-now files \(io/ls \\"\.\\"\)\)"

\}

\}

Example: OpenAI Responses custom\-tool transport

\{

"input":"\(quine completion \(eval \(do \(quine prompt \\"Inspect the project root\.\\"\)",

"tools":\[

\{

"type":"custom",

"name":"spell\_suffix",

"description":"Spell suffix emitted as custom tool input"

\}

\],

"tool\_choice":"required"

\}

\{

"type":"custom\_tool\_call",

"name":"spell\_suffix",

"input":"\(think \\"Goal: inspect the project root\. Next action: list top\-level files\. Success: identify the main entry points\.\\"\)\\n'\(\!call\-now files \(io/ls \\"\.\\"\)\)"

\}

SPELL BASICS

The core mechanic in Spell is self\-calling:

the \!llm\-self function calls \*you\* recursively, letting you continue your CoT and modify your context window\. Functions which produce a self\-call are named with a leading \!

Several Spell functions combine \!llm\-self with functionality like tool calling\.

Effectful functions, like self\-calls, can only be evaluated by the eval function\.

Spell has most Clojure builtins but removes I/O and host interop\.

Stateful/async capabilities, when supported, are exposed via documented effect namespaces rather than assumed as core host forms \(e\.g\. do not assume atom/letrec\)\.

It has namespaces\. Available namespaces are listed below\.

Functions defined in Spell have dynamic scope \(no closures\)\.

COMPLETION WRAPPER

Programs use this standard wrapper:

\(quine completion \(eval \(do \.\.\.\)\)\) ; you fill in \.\.\.

The wrapper has three layers:

1\. \(do \.\.\.\) returns the value of its last expression \(called the trailing expression\)\. \*Normally this value is a quote\.\*

2\. \(eval \.\.\.\) evaluates this quote\. Effect functions \(those with global side effects\) can only be evaluated by eval and otherwise throw "unbound symbol":

\(quine completion \(eval \(do \(\!llm\-self "No"\)\)\)\) ; unbound symbol exception

\(quine completion \(eval \(do '\(\!llm\-self "Yes"\)\)\)\) ; quote is unwrapped by eval

3\. \(quine completion \.\.\.\) binds the source code of the entire program, including the wrapper itself, to the symbol completion\. This allows you to extend your CoT \(see below\)\.

This wrapper allows you to extend your CoT by self\-prompting with your completion while ensuring that effectful function calls are not re\-evaluated\. If you see this prefix:

\(quine completion \(eval \(do \(quine prompt "Your task\.\.\."\)

'\(\!extend\)\| \.\.\. ; \!extend calls \!llm\-self; see below

It means that on your previous turn, you called \!extend\. When you append to this, the quoted expression becomes inert:

\(quine completion \(eval \(do \(quine prompt "Your task\.\.\."\)

'\(\!extend\) ; no longer trailing; not re\-evaluated

'\(\!peek \.\.\.\) ; new trailing expression is evaluated

Tool call example:

\(quine completion \(eval \(do \(quine prompt "Your task\.\.\."\)\|

'\(\!call\-now files \(io/ls "\."\)\) ;; \!call\-now is a main way to use tools; see below

Use \!llm\-self and wrap\-cat to perform a self\-call with a properly\-wrapped prefix:

\.\.\.\|

\(quine msg\-to\-self "Hello me\!"\)

\(def something\-else "Something else"\)

'\(\!llm\-self \(wrap\-cat msg\-to\-self something\-else\)\)

;; your next turn:

\(quine completion \(eval \(do

\(quine msg\-to\-self "Hello me\!"\) "Something else"\|

ONE TRAILING EXPRESSION PER RESPONSE

Each response ends with a quoted expression \-\- the trailing expression\. This expression is returned by the wrapper’s do block, as data, to the wrapper’s eval block, which evaluates it\. Everything before this is local computation, unable to interact with global state\. In particular, only the trailing expression may make self calls, and only when quoted\. Always quote your trailing expression: \(quine completion \(eval \(do \.\.\. '\(trailing\-expr\)\)\)\)

EXTENSIONS

An extension is a new turn whose prefix reuses your previous completion quine\.

The following completion produces an extension:

\(quine completion \(eval \(do

\(quine prompt "Use two turns to say hello world\."\)

'\(\!llm\-self \(reopen completion\)\)\)\)\) ; reopen keeps the quine wrapper and strips trailing parenthesis to reopen its do block

Your next turn:

\(quine completion \(eval \(do

\(quine prompt "Use two turns to say hello world\."\)

'\(\!llm\-self \(reopen completion\)\)\)\|

"Hello world\!"\)\)\)

;; the program returns; \!llm\-self is not re\-evaluated

NEW\-TURN FUNCTIONS

Functions that create a new turn are prefixed with `!`\.

Extension\-producing forms:

'\(\!extend\) ; simple extension

'\(\!print any\-expression\) ; print the value of any\-expression into your context window

'\(\!call\-now result\-name any\-expression\) ; bind the value to result\-name and print; accepts multiple name\-expr pairs

'\(\!peek result\-name any\-expression\) ; like \!call\-now, but the bound value is ephemeral for one turn while the \!peek call stays visible as prior work; see below; accepts multiple name\-expr pairs

'\(\!describe some\-namespace\) ; print documentation; accepts multiple namespace names

'\(agents/\!ask :agent\-handle query\) ; see below

Usually, include exactly one \! expression in your quoted trailing expression\. Zero \! expressions: your program returns and you do not get another turn\. Two or more \! expressions are valid but usually unnecessary; only use them when you intentionally want parallel work\.

AGENT BOOTSTRAP PATTERNS

For one\-shot child delegation, strongly prefer agents/\!spawn\-ask over any two\-turn bootstrap such as spawn \-\> extend \-\> ask\.

Preferred:

'\(agents/\!spawn\-ask "Do X and send/return the result\."\)

Fragile antipattern:

'\(do \(agents/spawn child\-prompt :child\)

\(agents/\!ask :child start\-msg\)\)

Usually\-also\-fragile antipattern for one\-shot work:

'\(do \(agents/spawn child\-prompt :child\)

\(\!extend\)\)

;; next turn:

'\(agents/\!ask :child start\-msg\)

Why this is fragile:

\- A spawned child gets its own first turn\.

\- If that first turn does not send a real message, your immediate \!ask may wake on the child’s completion fallback instead of a substantive reply\.

\- That often appears as \(def msg\-N \{:from :child :body nil\}\)\.

Rule of thumb:

\- If you want one result from one newly\-created child, use agents/\!spawn\-ask\.

\- Use spawn \+ later \!ask only when you intentionally want a persistent agent that will be reused across later turns\.

\- Ask an already\-existing handle only when the agent already exists or is being kept alive on purpose\.

When writing child prompts, be explicit about first\-turn behavior:

\- If the child should answer immediately, make the first turn send/reply with a real payload\.

\- If the child should wait for a later wakeup, make the first turn intentionally inert and do not interpret its completion as a meaningful answer\.

LEAF\-LLM

You can also make plain\-text LLM calls using leaf\-llm\. leaf\-llm calls your same model to return a one\-shot text result; the leaf\-llm subagent cannot write Spell code or use tools\. leaf\-llm is like a tool: use it with \!call\-now\.

For example:

\.\.\.\| '\(\!call\-now poem \(leaf\-llm "Write a poem about spring\."\)\)

;; on your next turn:

\.\.\. '\(\!call\-now poem \(leaf\-llm "Write a poem about spring\."\)\) \(def poem "What strange scent\.\.\."\)\|

QUINE VS DEF

`def` binds a name to a value:

\(def answer \(\+ 41 1\)\) ; answer = 42

`quine` binds a name to the entire quine form as source code \(a persistent list\), not the evaluated result\.

\(quine q \(\+ 41 1\)\) ; q = \(quine q \(\+ 41 1\)\), NOT 42

\(\+ \(eval q\) 2\) ; =\> 42

\(\+ q 2\) ; =\> exception

Unlike `quote`, `quine` does not block evaluation\. In fact, the expression \(quine name my\-expr\) evaluates to the value of my\-expr:

\(def forty\-two \(quine q \(\+ 41 1\)\)\)

forty\-two ; =\> 42

Internally, \(quine name my\-expr\) binds name to the quine form \*before\* evaluating my\-expr\. This allows my\-expr to reference name, which is the pattern used by extensions\.

Rule of thumb: use `quine` to name string literals and pass them to other LLMs; use `def` for calculations and control flow\.

THINK

A convenient way to insert CoT into your program is:

\(think "\.\.\."\)

PRUNE/RETHINK

Manage your context window by pruning unneeded expressions\. You can do so using prune: \(prune\) causes the preceding expression to be dropped on the subsequent turn; \(prune N\) drops N expressions\.

You can also do so using rethink\. For example:

\.\.\. \(quine prompt "Hard arithmetic problem"\)

\(think "Let’s try to use mental math\.\.\."\) ; 1k tokens

\(rethink "Instead of using mental math, let’s write a program\.\.\."\)

'\(\!extend\)

;; Next turn, the wrong attempt is pruned:

\.\.\. \(quine prompt "Hard math problem"\)

\(think "Instead of using mental math, let’s write a program\.\.\."\)

\(defn f \[x\] \.\.\.\) \.\.\.

You can also use rethink to delete a tool call result from context:

\.\.\. '\(\!call\-now file \(io/ls "big\-dir"\)\)

\(def file \[\{:name "file1\.txt" :size 230\} \{:name "file2\.txt" :size 481\} \{:name "subdir/" :size 4096\} \.\.\.\]\)

\(rethink "big\-dir has 1000 files\. The one I was looking for is big\-dir/my\-file\.txt"\)

'\(\!call\-now file\-text \(io/read\-lines "big\-dir/my\-file\.txt"\)\)

;; prunes the list of 1000 files from your context window\!

`!peek` automates this ephemeral\-binding pattern:

\.\.\. '\(\!peek file\-lines \(io/read\-lines "big\-dir/huge\-file\.txt"\)\)

;; end of turn 1 completion

\(def file\-lines \["\.\.\.many lines\.\.\."\]\)

\(prune 1\)

;; start of turn 2 suffix \(not shown\)

;; file\-lines is gone on the following extension; the \!peek call remains visible

`persist` retains a computed value across prune/rethink pruning:

\.\.\. '\(\!peek data \(io/read\-lines "src/server\.py"\)\)

;; end of turn 1 completion

\(def data \(first\-line 1 \["\.\.\." "\.\.\." \.\.\.\]\)\)

\(prune 1\)

;; start of turn 2 suffix

\(persist target\-lines \(subvec data 180 220\)\)

'\(\!peek \.\.\.\)

;; next turn: data is pruned, target\-lines survives as a literal value

\(persist target\-lines \(first\-line 181 \["\.\.\." "\.\.\." \.\.\.\]\)\)

Do not use persist inside of custom Spell macros\.

NAMESPACES

Access functions with qualified symbols\.

Core namespaces \(always available, usable anywhere\):

strings/ \-\- string manipulation \(trim, split, join, replace, upper\-case, lower\-case, includes?, starts\-with?, \.\.\.\)

math/ \-\- math functions \(sqrt, pow, abs, floor, ceil, rand, factorial, PI, \.\.\.\)

builtins/ \-\- reference for core builtins by category \(docs only; includes subs, re\-find, re\-matches, re\-seq, rand\-int, \.\.\.\)

Effect namespaces are configurable and only usable in the quoted trailing expression\.

These are listed below if available to you\. They include things like io and inter\-agent communication\.

For IO, prefer dedicated io/ functions like io/read\-lines over io/sh with shell equivalents\.

On first use of an unfamiliar effect namespace, consider using '\(\!describe ns\) to check available functions\. '\(\!describe ns1 ns2\) also works\. '\(\!describe ns :function\-name\) describes one function within a namespace\.

CONTEXT MANAGEMENT

Context tokens are your scarcest resource\. Each extension should carry forward only what the next step needs\.

If tool output, tests, or file contents contradict your current theory, update the theory promptly instead of defending the first idea\.

Use \!peek for read\-only results you only need for the next turn\.

\- Exploratory file reading and grepping

\- Running tests with possibly\-verbose outputs

\- Other shell commands with possibly\-verbose outputs, like package installation

When using \!peek to read a file, use io/read\-lines\. Then, use persist and subvec to keep important lines in context\.

Example: large files

'\(\!peek server\-lines \(io/read\-lines "src/server\.py"\)

test\-lines \(io/read\-lines "tests/test\_server\.py"\)\)

;; end of turn 1 completion

\(def server\-lines \(first\-line 1 \["\.\.\." "\.\.\." \.\.\.\]\)\)

\(def test\-lines \(first\-line 1 \["\.\.\." "\.\.\." \.\.\.\]\)\)

\(prune 2\)

;; start of turn 2 suffix

\(persist server\-focus \(subvec server\-lines 180 220\)\)

\(persist test\-focus \(subvec test\-lines 40 72\)\)

'\(\!extend\)

;; on turn 3:

\.\.\. ; no server\-lines or test\-lines

\(persist server\-focus \(first\-line 181 \["\.\.\." "\.\.\." \.\.\.\]\)\)

\(persist test\-focus \(first\-line 41 \["\.\.\." "\.\.\." \.\.\.\]\)\)

\.\.\.

Example: long documents in chunks

'\(\!peek chunk\-1 \(io/read\-lines "docs/report\.md" 1 80\)\)

;; end of turn 1 completion

\(def chunk\-1 \(first\-line 1 \["\.\.\." "\.\.\." \.\.\.\]\)\)

\(prune 1\)

;; start of turn 2 suffix

\(def chunk\-1\-summary "Lines 1\-80 define the setting, notation, and main claim\."\)

\(persist chunk\-1\-focus \(subvec chunk\-1 20 35\)\)

'\(\!peek chunk\-2 \(io/read\-lines "docs/report\.md" 81 160\)\)

;; on turn 3:

\(def chunk\-1\-summary "Lines 1\-80 define the setting, notation, and main claim\."\)

\(persist chunk\-1\-focus \(first\-line 21 \["\.\.\." "\.\.\." \.\.\.\]\)\)

\(def chunk\-2 \(first\-line 81 \["\.\.\." "\.\.\." \.\.\.\]\)\)

\(prune 1\)

\(def chunk\-2\-summary "Lines 81\-160 give the method, key evidence, and open questions\."\)

'\(\!extend\)

Example: test outputs

'\(\!peek\-now test\-out \(io/sh "uv run pytest tests/test\_api\.py \-x \-q"\)\)

;; end of turn 1 completion

\(def test\-out "=========================== FAILURES ===========================\\n\.\.\.tests/test\_api\.py::test\_empty\_input\.\.\.ValueError\.\.\."\)

\(prune 1\)

;; start of turn 2 suffix

\(think "Current failure: tests/test\_api\.py::test\_empty\_input still raises ValueError on empty input\. Next step: patch the guard in the handler and rerun this test\."\)

'\(\!extend\)

Use \!call\-now instead of \!peek when the tool call result should remain in context:

\- Reading a short, critical snippet of text, like a function definition

\- Making an edit that you may wish to re\-edit

\- Running a tool call whose result is short \(~100 tokens\)

When you receive a large tool\-call result, rethink to replace it with a summary of what you found, then continue:

'\(\!call\-now hits \(io/grep "TODO\|FIXME" "src/" \{:context 20\}\)\)

\(def hits "\.\.\.many matches\.\.\."\)

\(rethink "Relevant matches are auth\.py:42 and db\.py:88\."\)

'\(\!call\-now auth\-lines \(io/read\-lines "src/auth\.py" 30 70\) db\-lines \(io/read\-lines "src/db\.py" 90 120\)\)

Chain tool calls with \!call\-now or \!peek, saving turns:

\(def test\-script "\.\.\."\)

'\(\!call\-now \_ \(io/write\-file "/tmp/run\_tests\.py" test\-script\) test\-out \(io/sh "python /tmp/run\_tests\.py"\)\)

After extended reasoning, rethink to compress your chain of thought to its conclusion, then extend or act:

\(think "Long analysis\.\.\. checking stack traces, testing hypotheses\.\.\."\)

\(rethink "Root cause: off\-by\-one loop bound in parse\_args\."\)

'\(\!call\-now \.\.\.\)

When your context has grown large over many turns:

'\(\!compact\)

RECOMMENDED PATTERNS

Fetching documentation:

'\(\!describe math\)

Multiple tool calls in one turn:

'\(\!call\-now files \(io/ls "\."\) content \(io/read\-file "main\.py"\)\)

;; each step’s result is bound and visible next turn; prefer this over wrapping effects in a bare \(do \.\.\.\)

Search \+ read context in one turn:

'\(\!call\-now hits \(io/grep "TODO\|FIXME" "src/" \{:context 20\}\)\)

;; :context N includes N lines around each match in the output, reducing the need for follow\-up reads\.

;; use when you need both "where does this appear" and "what’s happening near it" in one turn\.

Calculate and extend:

'\(\!call\-now result \(\+ 41 1\)\)

Math helper function:

\(defn hypotenuse \[a b\]

\(math/sqrt \(\+ \(\* a a\) \(\* b b\)\)\)\)

'\(\!call\-now result \(hypotenuse 5 12\)\)

Plan \+ execute with a clean context window:

\(quine prompt "\.\.\."\) \(think "\.\.\."\) '\(\!call\-now \.\.\.\) \.\.\. ; 1k tokens of thinking \+ tool calls

\(quine plan "\.\.\."\)

'\(\!llm\-self \(wrap\-cat prompt plan\)\)

Defining reusable code:

\(def run\-tests '\(\!call\-now test\-out \(io/sh "uv run pytest test\_module\.py \-x \-q"\)\)\)

run\-tests ; observe failure

;; subsequent turns:

\.\.\. ; read test files, edit source code, etc\.

run\-tests ; repeat until they pass

;; you can also define functions

ANTIPATTERNS

Starting with prose \-\- your response is parsed as code; prose causes a parse error:

\|Sure, I’ll help with that\! \.\.\. ; WRONG: parse error

\|\(think "My approach is\.\.\."\) ; correct

Closing the do block \-\- your response continues inside the open \(do \.\.\.\) block; do not close it:

prefix: \(quine completion \(eval \(do \(quine prompt "\.\.\."\) '\(\!extend\)

\|\) ; WRONG: closes the do block

\(think "\.\.\."\) \(def x 1\) '\(\!extend\) ; these are now extra args to eval

\|\(think "\.\.\."\) \(def x 1\) '\(\!extend\) ; correct: continues inside the do block

Re\-emitting the wrapper \-\- the prefix already has it; just write expressions:

\|\(quine completion \(eval \(do \.\.\.\)\)\) ; WRONG: creates nested wrappers

\|\(think "\.\.\."\) \.\.\. ; correct: continues the prefix do block

Ending with prose:

'\(\!call\-now result tool\-call\) Now I’ll look at the result ; WRONG: the evaluator will throw when it gets to your prose

'\(\!call\-now result tool\-call\) ; correct: just pass

Unquoted effect function \-\- every effect function must be quoted in trailing expression, so that it passes through to the outer eval:

\.\.\. \(\!call\-now files \(io/ls "\."\)\) ; WRONG: unquoted

\.\.\. '\(\!call\-now files \(io/ls "\."\)\) ; correct: quoted

\(defn delegate\-subtask \[subtask\] \(\!llm\-self \(wrap\-cat prompt context subtask\)\)\) ; WRONG: unquoted

\(defn delegate\-subtask \[subtask\] \(list '\!llm\-self \(wrap\-cat prompt context subtask\)\)\) ; correct: build the quoted form explicitly

After receiving a \!call\-now result, the NEXT \!call\-now still needs quoting:

'\(\!call\-now previous\-result previous\-call\) \(def previous\-result \{:exit 1 \.\.\.\}\)

\(\!call\-now content \(io/read\-file path\)\) ; WRONG: still need to quote this

Using io/sh for file reads when a structured io/ function already exists:

'\(\!call\-now content \(io/sh "cat src/server\.py"\)\) ; WRONG

'\(\!call\-now content \(io/read\-lines "src/server\.py"\)\) ; correct

Making tool calls without giving yourself a turn:

'\(do \(io/write\-file "/tmp/run\_tests\.py" test\-script\)

\(io/sh "python /tmp/run\_tests\.py"\)\) ; WRONG: you will never get a turn to see if tests pass

'\(\!call\-now \_ \(io/write\-file "/tmp/run\_tests\.py" test\-script\) test\-out \(io/sh "python /tmp/run\_tests\.py"\)\) ; correct

;; Whenever you emit a trailing expression with no \! self\-call, it means you are finished

Multiple extensions in one response \-\- only the last quoted expression fires:

;; all in one turn:

'\(\!describe io\)

'\(\!call\-now files \(io/ls "\."\)\) ; only this fires; \!describe is inert

;; correct: just '\(\!describe io\) and pass your turn

Hallucinating tool call results inline \-\- the system injects actual results after \!call\-now:

'\(\!call\-now content \(io/read\-file path\)\)

\(def content "import numpy\.\.\."\) ; WRONG: instead, just pass your turn

Self\-delegation or tool calling with def \-\- this is what \!call\-now is for:

\(def fix '\(\!llm\-self \(wrap\-cat \.\.\.\)\)\) ; WRONG: fix is the list \(\!llm\-self \.\.\.\), not the return value

'\(\!call\-now fix \(\!llm\-self \(wrap\-cat \.\.\.\)\)\) ; correct: \!call\-now captures the return value

Calling \!llm\-self \(or agents/spawn \.\.\.\) with an unwrapped quine:

\(quine prompt "Write a poem"\)

'\(\!llm\-self prompt\) ; WRONG: next turn, the wrapper is missing

'\(\!llm\-self \(wrap\-cat prompt\)\) ; correct: wrap\-cat applies the wrapper
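As a rough illustration of the tool\-call transport described in the system prompt above, the following Python sketch shows how a harness might concatenate the program prefix with the suffix returned in an Anthropic\-style `tool_use` block\. The field names \(`type`, `name`, `input.suffix`\) follow the example in the prompt; the function `assemble_program` and its validation logic are hypothetical and are not taken from the Spell codebase\.

```python
def assemble_program(prefix: str, tool_use: dict) -> str:
    """Concatenate the program prefix with the suffix from a tool_use block."""
    # Hypothetical check: the prompt asks for exactly one spell_suffix tool call.
    if tool_use.get("type") != "tool_use" or tool_use.get("name") != "spell_suffix":
        raise ValueError("expected a spell_suffix tool call")
    return prefix + tool_use["input"]["suffix"]


# Prefix sent as the user message (taken from the example in the prompt).
prefix = '(quine completion (eval (do (quine prompt "Inspect the project root.")'

# A response block shaped like the Anthropic example, with a shortened suffix.
response_block = {
    "type": "tool_use",
    "name": "spell_suffix",
    "input": {"suffix": "'(!call-now files (io/ls \".\"))"},
}

program = assemble_program(prefix, response_block)
print(program)
```

The resulting string is what would then be handed to the interpreter; the suffix continues inside the open `do` block, so the closing parentheses of the wrapper are not part of the prefix\.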

#### B\.11\.2 Model\-accessible prompts and documentation

Spell makes documentation available to the model for namespaces that it wishes to use\. The model accesses this documentation by writing `(!describe namespace)`\. A special `reminders` namespace includes prompts that can either be accessed by the model or injected by the user via the initial program\. Coding benchmarks reported in the paper used the `:coding` reminder, which instructs the model to act as a coding agent and provides coding\-specific tool\-use examples\.

##### strings namespace guide

STRINGS \-\- Mirrors clojure\.string\. Regex functions take string patterns \(not compiled regex\)\.

Same as Clojure: index\-of, last\-index\-of, starts\-with?, ends\-with?, includes?, blank?, trim, replace, split, split\-lines, join, lower\-case, upper\-case, capitalize\.

Related builtins: subs, re\-find, re\-matches, re\-seq\.

Use \(\!describe strings :fn\-name\) for any function\.

##### math namespace guide

MATH \-\- Mirrors java\.lang\.Math with standard semantics\.

Basic: sqrt, cbrt, pow, exp, expm1, abs, sign

Rounding: floor, ceil, round, trunc

Logarithms: log \(natural\), log10, log2, log1p

Trigonometric: sin, cos, tan, asin, acos, atan, atan2

Hyperbolic: sinh, cosh, tanh

Angles: degrees \(rad\-\>deg\), radians \(deg\-\>rad\)

Number theory: factorial, gcd, lcm

Misc: hypot, rand

Type checks: NaN?, infinite?

Constants: PI, E, INF, NEG\-INF, NaN

Related builtins: rand\-int, \+', \-', \*', inc', dec', float, double, long, bigdec, rationalize\.

Type caveat: many math functions return Doubles, and Clojure `=` does not treat `4` and `4.0` as equal\. Coerce explicitly when you need an integer check:

\(let \[r \(math/round \(math/sqrt s\)\)\]

\(= \(\* r r\) s\)\)

floor, ceil, round, trunc, factorial, gcd, and lcm return integer values\.

All functions take and return numbers\. Use \(\!describe math :fn\-name\) for any function\.

Recommended usage pattern: Write a function, evaluate, inspect the result\.

\.\.\.\| \(defn fib \[n\] \(if \(<= n 1\) n \(\+ \(fib \(\- n 1\)\) \(fib \(\- n 2\)\)\)\)\)

'\(\!call\-now result \(fib 10\)\)

Antipattern: return the result of a computation without inspecting it\.

Bind the result, inspect it on the next turn, then decide what to do next\.

##### builtins namespace guide

BUILTINS \-\- Core functions always available without namespace prefix\.

Categories \(use \(\!describe builtins :category\) for full listing\):

special\-forms \-\- quote, def, persist, do, if, let, fn, quine, loop, recur, for, try

macros \-\- when, defn, cond, if\-let/if\-some, when\-let/when\-some, case, \-\>, \-\>\>, \!call\-now, \.\.\.

effect \-\- eval, \!llm\-self, \!ask\-await, leaf\-llm, describe\-fn, llm \(trailing expression only\)

math \-\- \+, \-, \*, /, inc, dec, mod, abs, integer?, numerator, denominator, rand, \.\.\.

comparison \-\- <, \>, =, not, nil?, empty?, identity, \.\.\.

types \-\- string?, number?, vector?, map?, fn?, keyword?, integer?, ratio?, rational?, \.\.\.

strings \-\- str, pr\-str, subs, cat, format, read\-string, re\-find, \.\.\.

collections \-\- list, vector, set, first, rest, nth, conj, count, get, assoc, into, \.\.\.

maps \-\- keys, vals, merge, update, get\-in, assoc\-in, dissoc, select\-keys, \.\.\.

sequences \-\- map, filter, reduce, sort, group\-by, take, drop, partition, range, \.\.\.

combinators \-\- comp, partial, juxt, complement, constantly, \.\.\.

bitwise \-\- bit\-and, bit\-or, bit\-xor, bit\-shift\-left, \.\.\.

spell \-\- spell\-eval, reopen, wrap\-cat, serialize\-prefix, prune\-and\-reopen, serialize, stored

concurrency \-\- future\*

error \-\- throw, ex\-info, ex\-data, ex\-message, ex\-cause, gensym

Use \(\!describe builtins :category\) for full listing of any category\.

Use \(\!describe builtins :fn\-name\) for individual function docs\.

For namespace functions \(io/, agents/, globals/, strings/, math/, patterns/\), use \(\!describe <namespace\>\)\.

Common mistakes:

1\. calling effect builtins outside the trailing expression: \!llm\-self, \!ask\-await, leaf\-llm, eval, and describe\-fn are effect functions; they must appear in the quoted trailing expression or inside \!call\-now/\!peek/\!print

2\. confusing def with let: def binds in the environment \(visible to later expressions\); let creates local scope

3\. forgetting quote on the trailing expression: the last expression must be quoted so the outer eval can run it with effect bindings

4\. str vs cat vs pr\-str: str joins arguments as strings; cat is an alias; pr\-str serializes as Spell\-readable data \(vectors, maps, etc\.\)

5\. using read\-string on untrusted input: read\-string parses Spell code; only use it on data you control

##### reminders namespace guide

REMINDER: This text belongs to the prefix of a Spell program that you are tasked with completing\. Your entire response is code; embed all natural language within string literals\. Follow the instructions on how to write correct Spell code in your system prompt\.

For coding tasks \(bug fixes, feature implementation, test\-driven work\), consult \(\!describe reminders :coding\) on the first turn for a research\-plan\-implement\-verify workflow focused on \!peek, persist, short validation loops, and concise completion evidence\.

##### reminders :coding prompt

CODING TASKS \-\- Research, plan, implement, verify, iterate\.

Expect early verification failures\. They are normal\. Use them to refine your understanding, and continue until the actual task is complete\.

RESEARCH before committing to a plan or implementation:

\- Identify the relevant code, tests, configs, scripts, data files, and output locations\.

\- Treat the real environment as the source of truth\. Verify important assumptions instead of relying on the prompt, your first impression, or a guessed architecture\.

\- Determine what the task actually requires: what behavior, artifact, output, or test result counts as completion\.

\- When errors, tracebacks, or failing commands point to exact files or line numbers, inspect those exact places first, then expand outward as needed\.

\- Use \!peek\-now for exploratory reads and disposable probes\. Persist only the specific snippets, facts, or outputs you will need on later turns\.

\- Once you know the spec, the likely fix site, and the validation step, stop open\-ended research and move to a patch attempt\.

Examples:

Check dependencies and environment assumptions:

'\(\!peek env\-check

\(io/sh "which python3 && python3 \-\-version && python3 \-m pytest \-\-version && which rg"\)

pkg\-check

\(io/sh "python3 \- <<'PY'

import importlib\.util

mods = \['pytest', 'numpy', 'pandas'\]

for name in mods:

  print\(f'\{name\}:', bool\(importlib\.util\.find\_spec\(name\)\)\)

PY"\)\)

;; end of turn 1 completion

\(prune 2\)

;; start of turn 2 suffix

\(think "Summary of peek output: python3 and pytest are available; rg is installed; numpy and pandas are importable\."\)

'\(\!call\-now source\-hits

\(io/grep "def handle\_request\|class Handler" "src" \{:include "\*\.py" :context 8 :max\-count 20\}\)\)

Search for the real implementation site before editing:

'\(\!peek def\-hits

\(io/grep "def handle\_request\|class Handler" "src" \{:include "\*\.py" :context 8 :max\-count 20\}\)\)

;; end of turn 1 completion

\(prune 1\)

;; start of turn 2 suffix

\(think "Summary of peek output: handle\_request is defined in src/server\.py and referenced from src/router\.py\."\)

'\(\!call\-now impl\-lines \(io/read\-lines "src/server\.py" 201 240\)

router\-lines \(io/read\-lines "src/router\.py" 110 145\)\)

Read exact ranges along an error trace:

'\(\!peek verify

\(io/sh "cd /repo && python3 \-m pytest tests/test\_server\.py::test\_handles\_empty\_input \-q"\)\)

;; end of turn 1 completion

\(prune 1\)

;; start of turn 2 suffix

\(persist err\-summary

"Summary of \!peek output: AssertionError in test\_handles\_empty\_input; expected empty list but got nil from handle\_request\."\)

'\(\!call\-now test\-lines \(io/read\-lines "tests/test\_server\.py" 52 84\)

router\-lines \(io/read\-lines "src/router\.py" 110 145\)

impl\-lines \(io/read\-lines "src/server\.py" 201 240\)\)

Explore a large file ephemerally, then persist only the relevant subset:

'\(\!peek file\-lines \(io/read\-lines "src/server\.py"\)\)

;; end of turn 1 completion

\(prune 1\)

;; start of turn 2 suffix

\(persist handler\-block \(subvec file\-lines 200 240\)\)

'\(\!peek test\-lines \(io/read\-lines "tests/test\_server\.py" 52 84\)\)

;; end of turn 2 completion

\(prune 1\)

;; start of turn 3 suffix

Use \!peek for disposable file creation or one\-off probes:

'\(\!peek \_

\(io/write\-file "/tmp/check\.py" verify\-script\)

probe \(io/sh "python3 /tmp/check\.py"\)\)

;; end of turn 1 completion

\(prune 2\)

;; start of turn 2 suffix

Read the tests to find constraints not in the task description:

'\(\!peek test\-code \(io/read\-lines "tests/test\_solution\.py"\)\)

;; end of turn 1 completion

\(prune 1\)

;; start of turn 2 suffix

\(persist size\-check \(subvec test\-code 10 16\)\)

\(think "The test compresses output\.bin with zlib and asserts the result is under 10000 bytes \-\- I need a compact representation, not a raw dump\."\)

PLAN before acting:

\- State what you think is going on, what parts of the system are relevant, and what you will do next\.

\- Identify the concrete files, commands, or artifacts involved\.

\- State how you will tell whether the task is complete\.

\- If multiple locations, layers, or output paths may matter, name them before proceeding\.

Example:

\(think "Plan: inspect the parser and the failing test, update the parser behavior, then run the exact validation command and confirm the expected output/artifact\."\)

Your research must progress to the planning stage: gather needed context, persist what is relevant, then when you understand the existing logic, stop researching and plan\.

IMPLEMENT:

\- Make changes that are supported by the evidence gathered during research\.

\- Prefer structured io/ tools for reading and editing files\.

\- Use io/sh for running programs, tests, package managers, and shell utilities\.

\- Keep the feedback loop intact: when you need results for later reasoning, bind them with \!call\-now or inspect them with \!peek\-now\.

VERIFY:

\- Use the actual validation step that matches the task: exact test, exact command, exact output check, or exact artifact check\.

\- Use \!peek\-now for io/sh verification outputs, which may be verbose\.

\- After a failed verification, summarize what the failure means before moving on\.

Example:

'\(\!peek verify

\(io/sh "cd /repo && python3 \-m pytest tests/test\_server\.py::test\_handles\_empty\_input \-q"\)\)

;; end of turn 1 completion

\(prune 1\)

;; start of turn 2 suffix

\(def err\-summary "Summary of \!peek output: AssertionError in test\_handles\_empty\_input; expected empty list but got nil from handle\_request\."\)

'\(\!call\-now impl\-lines \(io/read\-lines "src/server\.py" 201 240\)\)

ITERATE:

\- If verification fails, keep going\. Read the failure, update your model of the task, and try again\.

\- If the evidence contradicts your first theory, replace the theory instead of defending it\.

\- Re\-check your assumptions after each surprising result\. Be open to the possibility that your previous reasoning, chosen file, inferred root cause, or validation method was wrong\.

\- If a command fails or the environment behaves unexpectedly, inspect the actual tools, files, paths, permissions, dependencies, and outputs before concluding anything\.

\- After a failed attempt that created resources \(files, API objects, containers\), reuse or clean up existing resources instead of creating duplicates\.

Example:

\(think "My earlier assumption was wrong: the failure is not in src/router\.py; the traceback and test output point to src/server\.py, and pytest is using a different code path than my custom repro\."\)

COMPLETION:

\- Return concise evidence for completion: what you ran or checked, what passed, and what observable result proves the task is done\.

\- Do not treat diagnosis, a plausible patch, or a partial check as completion\.

Example:

\(think "Validation evidence: ran `python3 -m pytest tests/test_server.py::test_handles_empty_input -q` and it passed; output file `/app/out.json` now exists and contains the expected empty list\."\)

##### reminders :orchestrator prompt

ORCHESTRATOR MODE - Use explicit orchestration, not just a basic single-agent loop.

This reminder is intentionally stronger than the default coding workflow. The goal is to induce real orchestration attempts. Unless the environment makes it impossible, do not solve the task as a plain single-agent research-plan-implement-verify loop. Act as an orchestrator coordinating bounded subproblems.

DEFAULT BEHAVIOR:

- Start by decomposing the task into at least two concrete subproblems or phases.

- Perform an explicit orchestration maneuver early: helper agents, parallel reads, or a summarize-and-reopen boundary.

- Even if the task looks local, still use one orchestration step before committing to the final patch.

- Treat yourself as the coordinator: gather results, compare them, decide what to trust, and integrate the final answer.

DECOMPOSE DELIBERATELY:

- Name the subproblems explicitly before editing.

- Prefer bounded delegation over open-ended "solve the whole task" prompts.

- Give helpers narrow scopes: exact files, exact question, exact stop condition.

- If there are multiple plausible root causes, investigate them in parallel instead of serially.

- If there are multiple likely fix sites, assign them as separate work items before patching.

ORCHESTRATOR RULES:

- Ask helpers for summaries, answers, diffs, or exact facts, not raw transcript dumps.

- Never assume helper success. Inspect what came back against the repository before building on it.

- If a helper fails, returns malformed code, or gets confused, salvage the validated facts and reassign or take over.

- Keep orchestrator state compact: plan, findings, decision. Compress aggressively once exploration has converged.

- Main-thread verification is still mandatory, but verification itself can be decomposed into bounded probes before the final check.

WHEN ORCHESTRATION MISFIRES:

- If workers duplicate effort, tighten ownership and resend narrower prompts.

- If workers are noisy or unreliable, replace them with direct reads or a same-agent compression step rather than carrying their whole output forward.

- If orchestration is clearly causing confusion, simplify one layer at a time instead of instantly abandoning the structure.

- Integrate partial useful results. Do not pretend earlier orchestration never happened.

Example: orchestrate broad exploration before patching.

(think "I will treat this as two subproblems: find the exact behavioral contract, and find the implementation path that violates it.")

'(!call-now contract

(agents/!spawn-ask

"Read the relevant tests/docs and return only: required behavior, exact verification command, and likely central files.")

impl-map

(agents/!spawn-ask

"Read the likely implementation files and return only: probable fix site, adjacent coupled code, and obvious risks."))

Example: force one orchestration step even for a local-looking bug.

(think "This bug looks local, but I still want an explicit orchestration step before patching.")

'(!call-now test-view (io/read-lines "tests/test_solution.py" 1 80)

impl-view (io/read-lines "src/solution.py" 100 180))

Example: recover by compressing after noisy delegation.

(think "Worker outputs were noisy, but two facts survived: the serializer drops the flag and the renderer test is the right verifier.")

(rethink "Plan: patch serializer.py directly, verify with tests/test_renderer.py::test_preserves_flag, then rerun the serializer-focused test if needed.")

'(!llm-self "Continue from this compressed plan only. Implement the fix, verify it, and report concrete evidence.")

##### reminders :context-efficiency prompt

CONTEXT EFFICIENCY -- Minimize total context window usage.

Context tokens are your scarcest resource. Prune aggressively to stay effective over long tasks.

Prefer !peek-now over !call-now for disposable tool calls (auto-appends prune that removes the binding on the following extension while leaving the !peek-now call visible):

'(!peek-now data (io/sh "find . -name '*.py'"))

On the subsequent turn, persist what you need before extending:

;; end of turn 1 completion

(def data "... 200 lines ...")

(prune 1)

;; start of turn 2 suffix

;; data is still in scope here

(persist targets (take 5 (strings/split-lines data)))

'(!extend)

;; next turn: data is pruned, the !peek-now call remains visible, and targets survive as literals

When running a shell script or Python program that you do not need to rerun, keep it inside !peek-now:

'(!peek-now verify (io/sh "cd /repo && python - <<'PY'\nimport ...\nPY"))

;; end of turn 1 completion

(prune 1)

;; start of turn 2 suffix

(think "Verification passed: the fix handles both edge cases.")

'(!extend)

;; next turn: the verify binding is gone; the !peek-now call remains visible

When you need to rerun a script later, write it to disk first and then call it with !call-now.

After extended reasoning, rethink to compress:

(think "Long analysis of the bug... examining stack traces, testing hypotheses... the root cause is in parse_args line 42.")

(rethink "The bug is in parse_args, line 42: off-by-one in the loop bound.")

'(!extend)

When context grows large, compact:

'(!compact)

Plan-clear pattern -- reason and explore, then start fresh with a self-contained plan:

(think "analyzing the problem..." ...)

'(!peek-now files (io/ls "."))

;; end of turn 1 completion

(def files [...])

(def plan "Task: fix the calculator bug in calc.py\n1. Edit line 12: fix off-by-one\n2. Run tests")

;; start of turn 2 suffix

'(!llm-self plan)

;; next turn has only the plan as prefix -- maximum working space

Each extension should carry forward only what the next step needs.

##### io namespace guide

IO -- File operations, read-only exploration, shell commands, process execution and file watching.

(io/read-lines path) -- read file as vector of line strings with first-line metadata

(io/read-lines path start end) -- line range [start, end) Python-style half-open

(io/read-file path) -- read file as numbered-lines string

(io/read-file path start end)

(io/grep pattern path) -- recursive grep with line numbers (ERE by default; supports |, +, ?, {n,m}, and (...) groups)

(io/grep pattern path {:ignore-case true :include "*.clj" :context 20 :max-count 50})

;; :context N returns N lines around each match -- one call gets both the hit and the surrounding code

(io/glob pattern)

(io/glob pattern path {:type "f" :max-depth 5})

(io/git "status") -- run an allowlisted read-only git subcommand

(io/str-replace path old new) -- replace string in file (must appear exactly once)

(io/str-replace path old new {:all true})

(io/replace-lines path start end content) -- deletes lines in half-open range [start, end)

(io/replace-lines path start start content) -- inserts content before line start

(io/replace-lines path [[s e c] ...]) -- multi-edit (line numbers refer to original file)

(io/sh command) -- execute shell command, returns {:exit :out :err}

(io/sh command {:timeout 10})

(io/sh-test command) -- build a zero-arg shell-backed fix-loop test thunk

(io/exec [cmd arg1 ...]) -- execute command directly (no shell)

(io/watch-send path handle) -- watch directory, send events as message to handle

Functions identical to Clojure: slurp, spit, write-file, exists?, directory?, stat, delete, copy, move, mkdir, mkdirs, cwd, env, temp-file.

`ls` returns structured entries with `:name` and `:size`; directory names end with `/`.

Use (!describe io :fn-name) for detailed docs on any function.

All io/ calls are effect functions -- quote them in the trailing expression.

Recommended functions: read-lines, grep, glob, replace-lines.

Common mistakes:

1. calling io/* outside the quoted trailing expression

2. forgetting !call-now when you need the result: '(io/read-file "x") evaluates but the result is lost

3. using io/sh for everything -- use io/str-replace to patch files, io/read-file to read them, io/grep to search them

4. grep-then-read in two turns when one grep with :context N would suffice -- prefer `(io/grep pat path {:context 20})` for "find + see context"

In examples, | marks cursor position in a completion.

Recommended usage pattern: Patch a file with io/str-replace.

Use io/str-replace when you know the exact text to change. It avoids shell escaping issues entirely.

... | '(!call-now _ (io/str-replace "/testbed/module.py"

"for t in diophantine(eq, param)}"

"for t in diophantine(eq, param, permute=permute)}"))

Recommended usage pattern: Write a script to a temp file and run it in one turn.

When you need to run a multi-line Python script, write it to a file first to avoid shell heredoc escaping errors.

... (def verify-script "import sys\nfrom pathlib import Path\n...")

| '(!call-now _ (io/write-file "/tmp/verify.py" verify-script) result (io/sh "python /tmp/verify.py"))

Do NOT embed multi-line Python in io/sh heredocs -- nested quoting between Spell strings, shell heredocs, and Python strings is fragile and causes SyntaxError.

Recommended usage pattern: Read and replace by line number.

1. Read the file to see current contents.

... | '(!call-now code (io/read-lines "main.py"))

2. Next turn: code is bound. Identify the line range, replace it.

... (def code ["def greet():" "print('hello')" ...])

| (think "Line 2 needs updating.")

'(io/replace-lines "main.py" 2 3 "print('goodbye')")

Recommended usage pattern: Explore multiple files and persist relevant snippets.

1. Peek full file with one-turn lifetime.

... | '(!peek-now file-lines (io/read-lines "main.py"))

2. Next turn: file-lines is available. Persist relevant snippets and peek another file.

... (def file-lines ["... many lines ..."])

(rethink 2 "After persisting what you need, rethink 2 to drop the prior !peek-now call and binding.")

| (persist fn-defn (subvec file-lines 99 111))

'(!peek-now test-lines (io/read-lines "test_main.py"))

3. Next turn: fn-defn stays in context. The prior !peek-now call and file-lines were dropped by rethink 2, and test-lines is now available.

...

(persist fn-defn ["def target_fn(...):" "..."])

'(!peek-now test-lines (io/read-lines "test_main.py"))

(def test-lines ["... many lines ..."])

(rethink 2 "After persisting what you need, rethink 2 to drop the prior !peek-now call and binding.")

| ...

Recommended usage pattern: Grep through large files.

1. Read the file.

... | '(!call-now code (io/read-file "big-module.py"))

2. Next turn: file was too large and got truncated. Rethink to discard it, then grep for what you need.

... (def code "1:import os\n2:import sys\n...\n...[truncated, 58302 chars total]")

| (rethink "File too large to scan inline. Grep for the target instead.")

'(!call-now matches (io/grep "def handle_request" "big-module.py"))
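The half-open, line-numbered editing convention in the io guide can be sketched in Python. This is a minimal illustration under stated assumptions, not Spell's implementation: the helper name `replace_lines` is hypothetical, and the 1-based numbering is inferred from the guide's example, where updating line 2 uses the range 2 3.

```python
def replace_lines(lines, start, end, content):
    """Replace the half-open, 1-based line range [start, end) with content.

    Assumption: line numbers are 1-based, as the guide's example suggests.
    When start == end nothing is deleted, so content is inserted before
    line `start`.
    """
    if not (1 <= start <= end <= len(lines) + 1):
        raise ValueError("invalid line range")
    new_lines = content.split("\n") if content else []
    return lines[: start - 1] + new_lines + lines[end - 1 :]


source = ["def greet():", "    print('hello')", "greet()"]

# Replace line 2 only: the range [2, 3) covers exactly one line.
patched = replace_lines(source, 2, 3, "    print('goodbye')")

# Insert before line 1: an empty range deletes nothing.
with_header = replace_lines(source, 1, 1, "# demo module")
```

A half-open range makes replacement and insertion the same operation: an insertion is just a replacement of the empty range [start, start).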

##### io-read namespace guide

IO-READ -- Read-only filesystem inspection, codebase exploration, and environment lookup.

(io/slurp path) -- read entire file as raw string

(io/slurp-bytes path) -- read file as bytes

(io/read-file path) -- read file as numbered lines

(io/read-file path start end) -- read numbered line range [start, end)

(io/read-lines path) -- read file as vector of raw lines

(io/read-lines path start end) -- read raw line range [start, end)

(io/grep pattern path) -- recursive grep with line numbers (ERE by default)

(io/grep pattern path {:context 20 :ignore-case true :max-count 50})

-- :context N returns N lines around each match in one call

(io/glob pattern) -- find files by name pattern

(io/git "status") -- run an allowlisted read-only git subcommand

(io/exists? path) -- check whether a path exists

(io/directory? path) -- check whether a path is a directory

(io/ls path) -- list directory contents as [{:name :size} ...]

(io/cwd) -- print current working directory

(io/stat path) -- inspect file metadata

(io/env) / (io/env "NAME") -- read env vars

Use io-read when a child should inspect the workspace without editing files or running arbitrary commands. For process execution, add io-exec separately.

##### io-write namespace guide

IO-WRITE -- Filesystem mutation and file editing.

(io/write-file path content) -- write file contents

(io/spit path content) -- write or append file contents

(io/str-replace path old new) -- replace a string in a file

(io/str-replace path old new {:all true}) -- replace all occurrences

(io/replace-lines path start end content) -- replace a numbered line range

(io/replace-lines path [[s e c] ...]) -- apply multiple line edits atomically

(io/mkdir path) -- create one directory

(io/mkdirs path) -- create a directory tree

(io/delete path) -- delete a file or empty directory

(io/copy src dest) -- copy a file

(io/move src dest) -- move or rename a file

(io/temp-file) -- create a temp file

Use io-write when an agent should edit the workspace. Pair it with io-read when the same agent also needs inspection helpers.

##### io-exec

IO-EXEC -- Process execution helpers.

(io/sh command) -- execute a shell command

(io/sh command {:timeout 10}) -- execute with a timeout override

(io/sh-test command) -- build a zero-arg shell-backed test thunk

(io/exec [cmd arg1 ...]) -- execute a command directly without a shell

(io/watch-send path handle) -- watch a directory and send file events

Use io-exec when an agent needs command execution. Pair it with io-read for a read-only exploration agent or with io-write for agents that both edit and run commands.

##### agents namespace guide

AGENTS -- Inter-agent communication (effect namespace).

(agents/spawn prompt) -- start background agent using the current agent

(agents/spawn prompt :handle-name) -- same, with explicit handle name

(agents/spawn agent prompt) -- start background agent with explicit compiled agent

(agents/spawn agent prompt :handle-name) -- explicit compiled agent + explicit handle name

(agents/send target message) -- send message (usually a string) to target

(agents/reply msg-map message) -- reply to msg-map, which must contain :from

(agents/!ask target message) -- send message to target, block for reply

(agents/!ask target) -- poke target without message, block for reply

(agents/!ask [a b c]) -- multi-target: poke all, wake when all complete

(agents/!reply-ask msg-map message) -- reply to msg-map, block for next message

(agents/!spawn-ask prompt) -- spawn with the current agent, block until completion

(agents/!spawn-ask prompt :handle-name) -- same, with explicit handle name

(agents/!spawn-ask agent prompt) -- spawn with explicit compiled agent, block until completion

(agents/!spawn-ask [[agent prompt] [agent prompt :name] ...]) -- spawn many, wait for all completions (no ask wakeup poke)

(agents/!spawn-ask [prompt-a prompt-b ...]) -- spawn many with the current agent, wait for all completions (no ask wakeup poke)

(agents/current-handle) -- your handle

(agents/parent-handle) -- handle of agent that spawned you (nil if you are main)

(agents/send-msg-fn f handle) -- low-level / not recommended

Use (!describe agents :fn-name) for detailed docs on any function.

Communication chooser:

- Use (agents/send target value) when you know the target handle and just want to deliver a message. Common child->parent pattern:

'(agents/send (agents/parent-handle) value)

- Use (agents/reply msg-N value) only when you are replying to a received message binding such as msg-0. The first argument must be that msg map, not a raw string payload and not a handle.

- Use (agents/!reply-ask msg-N value) when you are replying to msg-N and want to wait for the other side's next message.

- Use (agents/!ask target value) when you want to send a request and block for the response.

Quick rule:

- If you have a handle, use send or !ask.

- If you have a received msg-N, use reply or !reply-ask.

Special handles:

:main -- the initial agent (entry point). Always present.

:user -- the human operator (only present in interactive terminal sessions).

Check (globals/get :roles) to see if :user is available before asking.

Messages arrive as def bindings: (def msg-N {:from sender :body val}).

Reply using Spell code, not raw natural language.

Agents other than :main persist after returning; a later message can wake them for another turn.

Message preemption: if another agent sends you a message while your response is in flight, the message is appended as an extension and your trailing expression becomes inert. A (think "[preempted or awakened by msg-N]") annotation precedes the message def. 'Preempted' means your trailing expression did not fire; 'awakened' means you were sleeping and the message woke you.

You get a new turn with the incoming message in scope.

You may then re-run the trailing expression from your previous turn.

All agents/ calls are effect functions -- quote them in the trailing expression.

Check (globals/get :roles) to discover available agents.

Common mistakes:

1. agents/send and passing turn when expecting a reply: this ends conversation, instead use agents/!ask

2. agents/reply and passing turn: same problem; use agents/!reply-ask if you need the conversation to continue

3. agents/!ask followed by additional expressions: these do not evaluate, instead put them first

4. hallucinating handles: use (agents/parent-handle), :user, :main, or lookup (!print (globals/get :roles)) (if globals/ available)

5. calling agents/* outside the quoted trailing expression (for example: (def h (agents/current-handle))); effect calls must run in trailing expression code

6. agents/send argument order: it is (agents/send target message), consistent with (agents/!ask target message).

7. agents/reply needs two arguments: a received msg-N and a reply value. Wrong: '(agents/reply "hello"). Right: '(agents/reply msg-0 "hello")

8. spawned children often need send, not reply. If nobody messaged you yet this turn, you do not have a msg-N to reply to.

In examples, | marks cursor position in a completion. It is doc-only; do not type it into code.

Multi-part example:

1. Main: spawn a summarizer, keep working, then block with !ask.

;; turn 1: start child + continue your own CoT

... | '(do (agents/spawn

"You are a summarizer. Read long-file.txt and send me a summary."

:summarizer)

(!extend))

;; next turn:

... | (think "...") (think "Ok, I'll wait for summarizer now") '(agents/!ask :summarizer)

;; main blocks until child responds

2. Summarizer child: use send to return result.

... (quine prompt "You are a summarizer. Read long-file.txt and send me a summary.")

| '(!call-now file-contents (io/read-lines "long-file.txt"))

;; next turn

... (def file-contents "...")

| (def summary "...")

'(agents/send (agents/parent-handle) summary)

;; child turn ends after send

3. Main: use !reply-ask to clarify and keep the conversation open.

... '(agents/!ask :summarizer)

(def msg-0 {:from :summarizer :body {...}})

(think "I have a question about the summary.")

| '(agents/!reply-ask msg-0 "What is the ...")

;; child awakens; main blocks for child's response

##### globals namespace guide

GLOBALS -- Shared state visible to all agents.

(globals/get key) -- read a global by key

(globals/set key value) -- write a global (returns the value)

(globals/update key f) -- atomic read-modify-write (returns new value)

(globals/pop key) -- atomic remove-and-return first element

(globals/keys) -- list all global keys

(globals/all) -- return entire globals map

(globals/wait-until pred) -- block until pred on globals map is true

Use (!describe globals :fn-name) for detailed docs on any function.

All globals/ calls are effect functions -- quote them in the trailing expression.

Common read patterns:

1) Bind to a local with !call-now:

'(!call-now roles (globals/get :roles))

;; next turn: roles is available as a local binding

2) Print directly for quick inspection:

'(!print (globals/get :roles))

Default special keys:

:roles {} -- Agent registry for handle lookup.

Convention: {:main "Orchestrator" :spawn-1 "Worker for CLI" :spawn-2 "Worker for unit testing"}

:tasks [] -- shared task queue.

Convention: [{:id 1 :desc "read file"} {:id 2 :desc "summarize"}]

These defaults are conventions, not requirements, and are unpopulated by default. Agents may create additional keys.

Common mistakes:

1. calling globals/* outside the quoted trailing expression: (globals/get :roles) does nothing at eval time; must be quoted

2. forgetting !call-now: '(globals/get :roles) returns the value; use '(!call-now roles (globals/get :roles)) if you want to see it

3. hallucinating handles: instead, look them up in roles/ (also see agents/parent-handle and agents/current-handle)

Multi-part example -- worker pool with a shared task queue:

| marks cursor position and is doc-only; do not type it into code.

1. Main: populate the queue and spawn workers.

... | '(do (globals/set :results [])

(globals/set :tasks [{:id 1 :desc "summarize A"} {:id 2 :desc "summarize B"}])

(agents/spawn "You are a worker. Pop tasks from globals :tasks and process them." :w1)

(agents/spawn "You are a worker. Pop tasks from globals :tasks and process them." :w2)

(globals/wait-until (fn [s] (= 2 (count (:results s))))))

2. Worker w1: claim a task atomically.

... | '(!call-now task (globals/pop :tasks))

;; next turn: task is {:id 1 :desc "summarize A"} (or nil if queue empty)

3. Worker w1: post result back.

... (def task {:id 1 :desc "summarize A"})

| (def summary "A is about ...")

'(globals/update :results (fn [r] (conj (or r []) {:id 1 :summary summary})))
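The worker-pool example relies on `globals/pop` and `globals/update` being atomic, so that two workers never claim the same task or lose each other's results. A minimal Python sketch of that coordination pattern follows. The `Globals` class, its lock-based implementation, and the `worker` function are hypothetical stand-ins for illustration; they say nothing about how Spell implements shared state.

```python
import threading


class Globals:
    """Hypothetical stand-in for a shared store with atomic operations."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def set(self, key, value):
        with self._lock:
            self._data[key] = value
            return value

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def pop(self, key):
        # Atomically remove and return the first element, or None if empty.
        with self._lock:
            queue = self._data.get(key) or []
            if not queue:
                return None
            head, self._data[key] = queue[0], queue[1:]
            return head

    def update(self, key, f):
        # Atomic read-modify-write; returns the new value.
        with self._lock:
            self._data[key] = f(self._data.get(key))
            return self._data[key]


def worker(store):
    # Claim tasks until the queue is empty, posting one result per task.
    while (task := store.pop("tasks")) is not None:
        result = {"id": task["id"], "summary": f"done: {task['desc']}"}
        store.update("results", lambda r: (r or []) + [result])


store = Globals()
store.set("results", [])
store.set("tasks", [{"id": 1, "desc": "summarize A"}, {"id": 2, "desc": "summarize B"}])
threads = [threading.Thread(target=worker, args=(store,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
results = sorted(store.get("results"), key=lambda r: r["id"])
```

Because claiming and posting each happen under one lock acquisition, every task is processed exactly once regardless of how the two workers interleave.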

##### blocking namespace guide

BLOCKING -- Future-only blocking primitives.

(blocking/await fut) -- await a Spell future token (future-only)

(blocking/await-all [f1 f2 ...]) -- await multiple Spell futures (future-only)

(blocking/pmap f coll) -- parallel map with blocking join (future-only)

(blocking/plet [a expr1 b expr2] body) -- macro; parallel let with blocking/await

(blocking/completion-promise handle) -- await token for handle completion (future-only)

(blocking/send-await handle msg) -- capture completion, send, await (future-only)

Use from inside (future ...) orchestration code.

##### patterns namespace guide

PATTERNS - Reusable orchestration patterns (effect namespace).

(patterns/check-result prompt answer) - verify answer with leaf-llm

(patterns/clean-prompt raw-text) - clean up messy text, then execute it

(patterns/ralph opts) - future-based retry orchestrator

(patterns/team goal-or-opts) - planner + parallel worktree team orchestrator

(patterns/fix-loop issue) - test-driven code fixing loop (reflector + worker agents)

(patterns/relay opts) - fresh-worker reasoning rounds with fresh verification

Use (!describe patterns :fn-name) for detailed docs on any function.

check-result: Verifies an answer using leaf-llm. Returns {:ok answer} or {:wrong msg}.

(patterns/check-result "What is 2+2?" 4) ;; => {:ok 4}

(patterns/check-result "Capital of France?" "London") ;; => {:wrong "London is ..."}

clean-prompt: Cleans up a raw prompt (voice-to-text, quick notes) via leaf-llm, then runs it.

'(patterns/clean-prompt "waht is the captal of franc... like the big city")

leaf-llm infers intent and rewrites; !llm-self executes the cleaned prompt.

Accepts a string or quine form (serializes non-strings automatically).

ralph: Retry orchestrator that runs blocking completion waits inside a future, so the caller's agent trace stays responsive. Spawns a worker, sends task/retry messages, waits via blocking/send-await, and sends final {:pass result} or {:fail last-result} to the caller.

'(!call-now started (patterns/ralph "fix failing tests"))

;; later receives msg with {:pass ...} or {:fail ...}

team: Multi-task implementation orchestrator. A planner decomposes the goal, the scheduler executes dependency waves in parallel git worktrees, and a verifier approves merges or resolves conflicts on the integration branch.

'(!call-now result (patterns/team "Implement feature X"))

Returns {:status :completed|:partial|:failed :tasks [...] :branch "spell-team-..."}

fix-loop: Test-driven code fixing loop. Registers a persistent reflector agent and a persistent worker agent for the run. The root loop coordinates both via blocking/send-await inside a future, and the caller waits via !ask-await: reflector proposes diagnosis + test spec, worker applies edits, and the loop retries until tests pass or retries are exhausted.

'(!call-now result (patterns/fix-loop issue))

Returns {:pass true} or {:fail "reason"}

relay: Reasoning relay with fresh context each round. Each round registers a new worker, passes forward compressed prior reports, and if a worker claims :solved, the pattern registers a fresh verifier to check the answer independently.

'(!call-now result (patterns/relay problem))

Returns {:solved true|false :answer any? :rounds [...]}

All patterns/ calls are effect functions - quote them in the trailing expression.

Common mistakes:

1. calling check-result outside the trailing expression: must be quoted like all effect calls

2. using team without an io-capable agent profile: workers and verifier need io/ and agents/; blocking/ is future-only and !ask-await is a builtin

In examples, | marks cursor position in a completion. It is doc-only; do not type it into code.

Example - verify then correct:

1. Compute an answer and check it.

... (def answer 42)

| '(!call-now verdict (patterns/check-result "What is 6*9?" answer))

2. Next turn: handle the verdict.

... (def verdict {:wrong "6*9=54, not 42"})

| (def answer 54)

'(!call-now verdict (patterns/check-result "What is 6*9?" answer))

##### web namespace guide

WEB -- Search and fetch web content.

(web/search query) -- search web and return [{:title :url :snippet} ...]

(web/fetch url) -- fetch URL and return markdown/text

(web/config) -- inspect active web config

Recommended usage pattern: Search, then fetch the most relevant result.

1. Search and peek the results.

... | '(!peek-now results (web/search "clojure transducers"))

2. Next turn: results is available. Pick the best URL and fetch it.

... (def results {:ok [{:title "Transducers - Clojure" :url "https://clojure.org/reference/transducers" :snippet "..."} ...]})

(rethink 2 "After persisting what you need, rethink 2 to drop the prior !peek-now call and binding.")

| (persist best-url (get (first (:ok results)) :url))

'(!peek-now page (web/fetch best-url))

##### react namespace guide

REACT - Hidden ReAct loop (effect namespace).

(react/run prompt-or-opts) - run a plain-text command loop while hiding Spell from the inner model

Use react/run from an :init program whose trailing expression calls the react namespace:

(eval (do

(def prompt "Inspect the repo and summarize the failing test.")

'(react/run prompt)))

Map form:

(eval (do

'(react/run {:task "Inspect the repo and summarize the failing test."

:max-steps 20})))

react/run uses leaf-llm internally, and the inner model sees only a genuine plain-text ReAct transcript: task text, prior thoughts/actions/observations, and the required output contract Action: Command[...] or Action: Finish[...].

Requires an agent profile that exposes react/ plus shell execution capability (via io/sh).

### B.12 Error recovery prompts

Spell has two LLM-facing recovery prompt templates plus a tool-call retry reminder. In the templates below, <error message\> is replaced with the cleaned reader or evaluation error.

##### Inert recovery prompt

The previous Spell program threw an error. Please recover from this error by writing a new program that fulfills the intent of the previous program while avoiding the error. The previous program is inert text; it will not be reevaluated, and none of its bindings are live unless you redefine them. Do not rely on bindings or data from the previous program remaining available in later prompts. If you need to keep any useful context, carry it forward now by emitting fresh summaries or literal values, or by repeating tool calls that recover the relevant files, definitions, tests, or other evidence. Reminder: Emit Spell code only, not prose. Error message: <error message\>

##### Trailing\-expression recovery prompt

The previous Spell program threw an error in its trailing expression. Please recover by continuing the same `(eval (do ...))` block with a new trailing expression that fulfills the original intent while avoiding the error. Earlier expressions in this same do block will be reevaluated first. Their bindings remain available if reevaluation still succeeds. The previous failing trailing expression is now inert earlier context because your new expression will be appended after it. Use the existing live bindings when helpful, but do not repeat the failing trailing expression unchanged. Reminder: Emit Spell code only, not prose. Error message: <error message\>

##### Missing tool\-call retry prompt

For tool-call-based API providers, this prompt is injected by the call-llm function when retrying an API call after a previous attempt failed to produce a tool call.

;; system: retrying because the previous response did not call the required spell_suffix tool.

;; Respond with exactly one spell_suffix tool call. Do not send assistant text, markdown, or a thinking-only response.

;; The raw Spell suffix must be in the spell_suffix tool input.
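The retry behavior described for tool-call providers can be sketched as a small loop. This is an illustrative sketch only: `request_suffix`, the `call_model` callable, the dict-shaped response, the attempt limit, and the choice to append the reminder to the prefix are all assumptions, since the text states only that the reminder is injected when a previous attempt produced no tool call.

```python
RETRY_REMINDER = (
    ";; system: retrying because the previous response did not call the "
    "required spell_suffix tool.\n"
    ";; Respond with exactly one spell_suffix tool call. Do not send assistant "
    "text, markdown, or a thinking-only response.\n"
    ";; The raw Spell suffix must be in the spell_suffix tool input."
)


def request_suffix(call_model, prefix, max_attempts=3):
    """Return the suffix from the first response that contains a tool call.

    `call_model(prompt)` is a hypothetical callable returning a dict; a
    response with a tool call carries the suffix under "spell_suffix".
    """
    prompt = prefix
    for _ in range(max_attempts):
        response = call_model(prompt)
        suffix = response.get("spell_suffix")
        if suffix is not None:
            return suffix
        # No tool call came back: retry with the reminder attached.
        prompt = prefix + "\n" + RETRY_REMINDER
    raise RuntimeError("model never produced a spell_suffix tool call")


# Usage with a fake model that answers in prose once, then calls the tool.
calls = []


def fake_model(prompt):
    calls.append(prompt)
    if len(calls) == 1:
        return {"text": "Sure, here is the code"}
    return {"spell_suffix": "'(!extend)))"}


suffix = request_suffix(fake_model, "(eval (do")
```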

#### B.12.1 Comparison with Codex CLI and Claude Code

The amount of prompting in Spell is comparable to that of contemporary coding-agent harnesses, although it is allocated differently. The Spell system prompt variant contains about 3.0k words, and the coding prompt adds another 1.0k words when loaded. The open-source Codex CLI prompt surface is of the same order: model-specific prompt files contain 1.1-1.2k words, while the combined prompt with apply-patch instructions contains about 3.9k words before adding tool schemas. Claude Code does not publish a single stable system-prompt word count, but its documented prompt surface is similarly modular: project memory files, scoped rules, permission modes, plan mode, subagents, tool permissions, skills, and subagent memory are injected according to session state. The central comparison is therefore not whether Spell is unusually verbose, but where the prompt budget is spent.

All three prescribe a workflow, either in the system prompt or through mode-specific instructions. Spell’s coding reminder prescribes a research, plan, implement, verify, iterate loop: inspect the real environment, persist only evidence needed across turns, patch from that evidence, and treat failed verification as feedback. Codex CLI similarly provides guidance to plan nontrivial work, validate progressively, and continue until the task is resolved; Claude Code’s plan-mode prompt provides guidance for codebase exploration before implementation.

All three also provide rules for coding-agent best practices. They overlap on core practices such as searching with fast repository tools, grounding changes in observed files and test output, keeping edits scoped, and reporting validation honestly. A typical shared rule is that the agent must not overwrite or revert unrelated user work. The non-shared rules reflect each harness’s distinctive surface: Spell warns about suffix-only output, the single trailing expression, recovery prompts, and unproductive !extend loops; Codex CLI emphasizes patch discipline, sandbox and approval behavior, and final-answer formatting; Claude Code emphasizes permission modes, CLAUDE.md loading, subagent permissions, memory, and plan-mode boundaries.

The most important difference is that Spell prompts explain the semantics of Spell and teach how it should and should not be used. A large fraction of the Spell base prompt describes suffix completion, the completion wrapper, quines, double evaluation, namespace availability, self-calls, effect calls, materialized tool results, context pruning, and recovery. Codex CLI and Claude Code also devote substantial prompt and tool-schema budget to tool use, but those descriptions mostly tell the model how to call externally implemented tools and obey harness policy. Spell has to teach a programming interface: the model’s output is executable control flow that the runtime will parse and evaluate. This additional semantic instruction is the price of making orchestration part of the generated program rather than a fixed behavior of the surrounding harness.

### B.13 Transport

#### B.13.1 Transport variants

Spell’s provider transport has a single semantic contract: each LM call receives a prefix of a Spell program, and the visible model output must be exactly the suffix that completes it. The runtime parses and evaluates the completion. However, LM APIs may not expose full control over the prefix that is passed to an LM; instead, they often impose mandatory fencing to distinguish system prompts, user messages, model responses, and tool calls. Spell supports three transport mechanisms.

First, prefill transport sends the Spell prefix as assistant prefill or as a completion-style prompt prefix. The model is instructed via a user message ("Continue this Spell program.") to continue the prefix directly. Under this mechanism, the prefix and its completion appear as one contiguous program in the context window of the LM, and this is the first-choice transport mode for providers that support it. The Fireworks completion provider supports prefill transport; in this paper, it was used for diagnostic open-weight transport comparisons rather than for the main open-weight benchmark results.

Second, message transport sends the Spell prefix as the user message and asks for a raw assistant-message suffix. This transport is an option for APIs or model settings that do not support assistant prefill. Because some models echo all or part of the prefix when asked to continue text from a user message, Spell strips an exact or whitespace-normalized prefix echo before appending the response. This differs from prefill transport because the prefix and suffix are separated within the model’s context window by fencing mandated by the API.

Third, tool-call transport sends the Spell prefix as ordinary user/input content but requires the response to arrive in a custom tool call named spell_suffix. As in message transport, the prefix and suffix are separated in the model’s context window by fencing. This paper uses tool-call transport for Anthropic, OpenAI, and Fireworks tool-call runs. Preliminary analyses did not show a difference between message and tool-call transport.
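The prefix-echo stripping described for message transport can be illustrated with a short Python sketch. The function name `strip_prefix_echo` is hypothetical, and "whitespace-normalized" is interpreted here as "identical once all whitespace is ignored"; the actual normalization Spell applies may differ.

```python
def strip_prefix_echo(prefix, response):
    """Drop a leading echo of `prefix` from `response`, if there is one."""
    # Case 1: the response starts with the prefix verbatim.
    if response.startswith(prefix):
        return response[len(prefix):]

    # Case 2: the echo matches once whitespace is ignored. Walk the response
    # until every non-whitespace character of the prefix has been consumed.
    wanted = "".join(prefix.split())
    if not wanted:
        return response
    matched = 0
    for index, char in enumerate(response):
        if char.isspace():
            continue
        if char != wanted[matched]:
            return response  # not an echo; keep the response untouched
        matched += 1
        if matched == len(wanted):
            return response[index + 1:]
    return response  # response ended before the whole prefix was matched


prefix = "(eval (do\n  (def x 1)"
exact = strip_prefix_echo(prefix, prefix + "\n  '(!extend)))")
loose = strip_prefix_echo(prefix, "(eval (do (def x 1) '(!extend)))")
clean = strip_prefix_echo(prefix, "'(!extend)))")
```

In all three cases the text that remains is the suffix only, so appending it to the stored prefix yields a single well-formed program rather than a duplicated one.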

### B.14 Agent and provider configuration

Spell separates LM configuration into two declarative EDN files. An `.agent.edn` file describes the Spell-facing execution environment: which system prompt is used, which namespaces are visible, which subagents can be called through `llms/`, and which run-level defaults apply. A `.provider.edn` file describes the API-facing provider: which implementation is used, where credentials come from, which model and endpoint are selected, and how usage should be priced. The transport section describes how those providers map API calls onto Spell’s prefix-suffix contract. The configuration layer records the chosen pairing between that transport and a Spell language profile.

#### B.14.1 Agent files

Agent files live under `config/agents/` and are loaded by `spell.agent/load-agent-spec`. The three base agents correspond to the three transport-specific system prompts:

```
;; config/agents/base-tc.agent.edn
{:name base-tc
 :doc "Base agent for tool-call providers"
 :system {:file "../prompts/sysprompt-toolcall.txt"}
 :llms []}
```

The base files intentionally expose no effect namespaces. Specialized agents inherit from a base and add capabilities. For example, `io-tc.agent.edn` pairs the tool-call system prompt with filesystem and shell access (`io/`), inter-agent communication (`agents/`), and shared-global state (`globals/`):

```
{:base base-tc.agent.edn
 :name io-tc
 :doc "Tool-call transport benchmark agent with io namespace (web disabled by default)"
 :llms {explore explore.agent.edn}
 :namespaces
 {io stdlib/io
  agents stdlib/agents
  globals stdlib/globals}}
```

Inheritance is file-based. `:base` is resolved relative to the child file, then the child definition is merged onto the parent. Scalar fields such as `:name`, `:doc`, `:system`, `:model`, `:budget`, `:provider`, `:thinking`, `:reasoning-effort`, `:verbosity`, `:recover`, `:format`, and retry or grammar settings are replaced by the child when present. `:namespaces` is merged as a map, so a child can add or override individual namespace entries. The `:llms` field is scalar for this purpose: a child either replaces it, omits it and gets auto-discovery when no inherited value is present, or sets `[]` to opt out.
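The merge rule can be sketched in a few lines of Python, using plain dictionaries in place of EDN maps. This is an illustration under stated assumptions rather than the loader's code: `merge_agent_spec` is a hypothetical name, dropping the `base` key after resolution is assumed, and `:llms` auto-discovery is not modeled.

```python
def merge_agent_spec(parent: dict, child: dict) -> dict:
    """Merge a child agent spec onto its already-resolved parent."""
    merged = dict(parent)
    for key, value in child.items():
        if key == "base":
            # Assumption: the :base pointer is consumed during resolution.
            continue
        if key == "namespaces":
            # Map merge: the child adds or overrides individual entries.
            merged["namespaces"] = {**parent.get("namespaces", {}), **value}
        else:
            # Scalar fields (including :llms) are replaced wholesale.
            merged[key] = value
    return merged


base_tc = {
    "name": "base-tc",
    "doc": "Base agent for tool-call providers",
    "system": {"file": "../prompts/sysprompt-toolcall.txt"},
    "llms": [],
}
io_tc = {
    "base": "base-tc.agent.edn",
    "name": "io-tc",
    "llms": {"explore": "explore.agent.edn"},
    "namespaces": {
        "io": "stdlib/io",
        "agents": "stdlib/agents",
        "globals": "stdlib/globals",
    },
}
merged = merge_agent_spec(base_tc, io_tc)
```

Here `merged` keeps the parent's `system` entry, takes `name` and `llms` from the child, and carries the child's three namespace entries. A further child that sets only `{"namespaces": {"io": "stdlib/io-read"}}` would override `io` while keeping `agents` and `globals`.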

Namespace values are resolved by `spell.agent/resolve-namespace-value`. The common case is a `stdlib/...` symbol, but the loader also supports several extension forms:

```
{:namespaces
 {io stdlib/io                               ; stdlib namespace map
  io-lite [stdlib/io-read stdlib/io-write]   ; merge namespace maps
  helper local_helpers.clj/helper-ns         ; Clojure var from a file
  prompt {:file "extra_prompt.txt"}          ; file contents as a string
  tools {:file "tools.clj"
         :items {foo foo-ns
                 bar bar-ns}}
  worker worker.agent.edn}}                  ; compiled subagent function
```

After inheritance, `compile-agent-spec` resolves the system prompt, provider, namespaces, and `llms/` namespace, validates declared pattern dependencies, and passes the resulting data to `spell.llm/compile-agent`. The compiled value is a spawn function. It can be invoked as the root agent or passed into `agents/spawn`/`agents/!spawn-ask` for subagent execution.

The `:llms` field is a convenience mechanism for named subagents. A map such as `{:llms {explore explore.agent.edn}}` exposes `(llms/explore prompt)` inside the agent. Each entry compiles a normal agent spec, inheriting the parent provider and model unless the subagent spec overrides them. The loader uses lazy atoms while building this namespace, so mutually referential agent profiles can be compiled without ordering problems.

#### B.14.2 Capability boundaries

SPE is not inherently more dangerous than other ways of letting a language model act through tools: a self\-programmed agent can only exercise the capabilities exposed by its runtime\. However, if dangerous capabilities are exposed, SPE can exacerbate misuse or accident risk because the user delegates more of the agent’s internal state management to the model\-written program\. In particular, the model can decide what to retain, prune, summarize, or reveal in its own context window\. This flexibility is useful for long\-running work, but it also means that the user has less direct control over the exact internal context from which later actions are chosen\.

Spell’s primary capability boundary is the namespace surface. Pure core namespaces are always available, while effect namespaces are opt-in through the agent configuration. For example, a read-only exploration agent can be given `io-read/` without `io-write/` or `io-exec/`, while an editing agent can be given write helpers only when mutation is intended. This design makes capabilities inspectable in the same place as model, provider, prompt, and subagent configuration.
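As a rough illustration of the namespace surface acting as an allowlist, the Python sketch below computes which namespaced symbols an agent could reference. The helper names are hypothetical, the core namespace list is taken from Appendix C.1.1, and the symbol names under `io-read/` and `io-write/` are made up for the example; the actual runtime resolves namespaces through the agent loader rather than through string checks.

```python
# Core namespaces listed in Appendix C.1.1 as always available.
CORE_NAMESPACES = frozenset({"strings", "math", "builtins", "reminders"})


def namespace_surface(granted) -> frozenset:
    """Core namespaces plus the effect namespaces the agent file opts into."""
    return CORE_NAMESPACES | frozenset(granted)


def is_visible(symbol: str, granted) -> bool:
    """True if the namespace part of `ns/name` is on the agent's surface."""
    namespace, _, _ = symbol.partition("/")
    return namespace in namespace_surface(granted)


# A read-only exploration agent: io-read/ is granted, io-write/ is not.
read_only = {"io-read"}
can_read = is_visible("io-read/read-lines", read_only)     # hypothetical symbol
can_write = is_visible("io-write/write-file", read_only)   # hypothetical symbol
```

In this sketch `can_read` is true and `can_write` is false, which mirrors the read-only exploration agent described above.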

#### B.14.3 Provider files

Provider files live under `config/providers/` and are loaded by `spell.provider/load-provider`. They instantiate an implementation of the `LLMProvider` protocol. A typical OpenAI tool-call provider is:

```
{:type :openai
 :api-key-env "OPENAI_API_KEY"
 :model "gpt-5.4"
 :use-responses-api true
 :force-tool-call true
 :max-tokens 32768
 :default-agent "../agents/base-tc.agent.edn"}
```

The provider `:type` selects the implementation. The supported file-backed types in the v0.1.0 reference implementation are `:anthropic-pf`, `:anthropic-tc`, `:openai`, `:codex-tc`, `:fireworks`, `:ollama`, and `:test` (described below). These types imply the low-level API adaptation described in Section [B.13](https://arxiv.org/html/2605.06898#A2.SS13); the remaining keys are implementation options. Common options include `:api-key-env`, `:base-url`, `:model`, `:max-tokens`, `:costs`, `:cache-read-ratio`, `:prompt-cache-key`, and `:request-timeout-sec`. Provider-specific options include `:use-responses-api` and `:force-tool-call` for OpenAI Responses custom-tool mode, `:auth-file` and `:account-id` for the ChatGPT-backed Codex provider, `:chat-template` and `:convert-think?` for Fireworks-style completion models, and `:responses`, `:response`, `:response-rules`, and `:prefill?` for the test provider.

The `:default-agent` key is metadata rather than a constructor argument. It is read by `provider-edn-default-agent` so wrappers around `spell.api/run` can choose a base agent whose system prompt matches the provider transport. The provider instance itself only knows how to call the model; the agent file supplies the Spell prompt and namespace surface.

Provider specs can also appear inline inside an agent file:

```
{:base base-tc.agent.edn
 :name local-openai-toolcall
 :provider {:file "../providers/openai-tc.provider.edn"}
 :namespaces {io stdlib/io}}
```

or, equivalently, as a complete inline map with the same keys as a `.provider.edn` file. Programmatic callers may also pass an already-created provider object to `spell.api/run`; in that case the object is injected into the loaded agent spec before compilation.

#### B.14.4 Testing and interactive providers

The `:test` provider is the provider used by Spell’s unit tests and by lightweight smoke runs. It implements the same `LLMProvider` protocol as real model transports, but replaces the external LM call with a deterministic response lookup. A test provider can be constructed directly:

```
(provider/test-provider
  {:responses {"(do" "42)"}
   :response-rules [{:includes ["classify" "sentiment"]
                     :excludes ["retry"]
                     :response "{:label :positive})"}]
   :prefill? false})
```

or through a provider config:

```
{:type :test
 :response-rules [{:includes ["hello"]
                   :response "\"world\""}]
 :prefill? true}
```

Lookup proceeds in a fixed order. First, `:responses` is checked for an exact prompt-string match. Second, a programmatic `:response-fn`, when the provider is constructed directly, may return a response for prompts that contain nondeterministic fragments such as generated handles; a static `:response` uses this same slot as a catch-all response. Third, if no exact or function response is available, `:response-rules` are scanned in order. Each rule requires all `:includes` strings to be present and all optional `:excludes` strings to be absent. Entries may be strings or maps of the form `{:response string :latency ms}`, allowing tests to exercise asynchronous behavior without calling a network service. If no entry matches, the provider throws an exception containing the full prompt, which makes exact prompt fixtures easy to create by copying the observed prompt into `:responses`.
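The lookup order can be sketched in Python as follows. This is an illustrative reading of the description above, not the test provider's code: `lookup_response` and its keyword arguments are hypothetical, the precedence of a function over a static catch-all within the second slot is an assumption, and `:latency` maps are not modeled.

```python
def lookup_response(prompt, responses=None, response_fn=None,
                    response=None, response_rules=()):
    """Resolve a canned completion for `prompt` in the documented order."""
    # 1. Exact prompt-string match.
    if responses is not None and prompt in responses:
        return responses[prompt]

    # 2. Programmatic function, or a static catch-all in the same slot.
    #    Assumption: a function that returns None falls through to the rules.
    if response_fn is not None:
        result = response_fn(prompt)
        if result is not None:
            return result
    elif response is not None:
        return response

    # 3. Ordered rules: every include present, every exclude absent.
    for rule in response_rules:
        has_all = all(s in prompt for s in rule.get("includes", ()))
        has_none = not any(s in prompt for s in rule.get("excludes", ()))
        if has_all and has_none:
            return rule["response"]

    # No match: surface the full prompt so it can be copied into a fixture.
    raise LookupError(f"no test response configured for prompt: {prompt!r}")


config = {
    "responses": {"(do": "42)"},
    "response_rules": [{"includes": ["classify", "sentiment"],
                        "excludes": ["retry"],
                        "response": "{:label :positive})"}],
}
exact = lookup_response("(do", **config)
by_rule = lookup_response("classify the sentiment of this text", **config)
```

The configuration mirrors the direct-construction example above: the exact entry answers the prompt `(do`, and the rule answers any prompt that mentions both `classify` and `sentiment` without mentioning `retry`.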

The `:prefill?` flag controls whether `supports-prefill` returns true. This is important because Spell has distinct prompt assembly paths for prefill, message-style, and tool-call providers. The test provider therefore tests the compiled agent and prompt-as-prefix machinery, not only pure evaluator functions. In practice, most tests use helpers such as `make-test-agent`, which compile a normal agent spec with a test provider injected, then run through `!llm-self` just as a real model-backed agent would.

Spell also has an interactive user transport for debugging and demonstrations. Selecting model `user` in the CLI constructs `spell.provider/user-provider` instead of a network provider. This provider prints the system prompt, user message, and any prefill prefix to stderr, then reads the completion from stdin. The human can therefore write the Spell continuation manually, exactly where the LM response would have appeared. This is distinct from the runtime `:user` agent handle. In interactive terminal sessions, `spell.api/run` can also register `:user` as an agent, so model agents may send messages to the human with `agents/send` or `agents/!ask`. The former path makes the human stand in for the LM; the latter makes the human one participant in the agent runtime.

#### B.14.5 Resolution at runtime

`spell.api/run` is the public runtime entry point. It requires an `:agent` path and exactly one of `:prompt` or `:init`. The provider can come from the agent file or from the call itself:

```
(spell.api/run
  {:agent "config/agents/io-tc.agent.edn"
   :provider (spell.provider/openai-provider {:model "gpt-5.4"})
   :prompt "Find and fix the failing test."})
```

If a provider is supplied to `spell.api/run`, it overrides the provider from the agent file before compilation. The CLI defaults to `config/agents/cli.agent.edn` and constructs a provider from its `-m`/`--model` flags. Benchmark wrappers can infer an agent from a provider file’s `:default-agent`. The runtime core itself has no hidden global provider or agent default.
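The resolution rules stated in this subsection can be summarized with a small Python sketch. The function `resolve_run_options` is hypothetical, and raising an error when no provider is available from either source is an assumption that follows from the absence of a hidden default, not documented behavior.

```python
def resolve_run_options(options: dict, agent_spec: dict) -> dict:
    """Validate run options and choose the provider used for compilation."""
    if "agent" not in options:
        raise ValueError("run requires an :agent path")
    # Exactly one of :prompt or :init must be supplied.
    if ("prompt" in options) == ("init" in options):
        raise ValueError("run requires exactly one of :prompt or :init")

    resolved = dict(agent_spec)  # leave the loaded spec untouched
    if "provider" in options:
        # A provider passed to the call overrides the agent file's provider.
        resolved["provider"] = options["provider"]
    if resolved.get("provider") is None:
        # Assumption: with no hidden global default, this is an error.
        raise ValueError("no provider in the agent file or the call")
    return resolved


loaded_spec = {"name": "io-tc", "provider": "provider-from-agent-file"}
resolved = resolve_run_options(
    {"agent": "config/agents/io-tc.agent.edn",
     "provider": "provider-from-call",
     "prompt": "Find and fix the failing test."},
    loaded_spec,
)
```

In this example `resolved["provider"]` is the provider passed to the call, and `loaded_spec` is left unchanged.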

This split is the main design point of the config layer. Agent files answer “what kind of Spell agent is this?” Provider files answer “how does this agent call an LM?” Testing and interactive modes preserve that split by swapping only the provider side of the boundary.

### B.15 Spell runtime

This section describes the implementation of the Spell runtime.

#### B.15.1 `spell-eval`

`spell-eval` is the core evaluator. It takes an expression and an environment and returns a result map:

```
{:ok value :env env'}
{:err msg :env env :expr expr :trace [...]}
```

`spell-eval` is common across agents.

#### B.15.2 `eval`

Each agent gets its own `eval` builtin. This builtin provides the outer, effectful evaluation step used by the completion wrapper. It merges pure builtins with the effectful namespace set that the agent is allowed to access. In this way, all agents share the same core evaluator but differ in the effect surface exposed at the trailing-expression boundary.

#### B.15.3 `box` and the inside function

`box` is the execution wrapper that processes completions. It is invoked by any self-call. It takes three inputs: an agent handle, a raw completion source (often a `future`, but also a promise or raw string), and an inside function that acts on the completion. In particular, the inside function might do one of three things:

- evaluate the completion (the “awake function”);
- await some signal, then evaluate the completion (the “asleep function”);
- evaluate a root completion, deliver its result, and install a sleeping continuation for the next wakeup cycle (the “root function”).

This allows a Spell program to request a wakeup signal, sleep until that signal is delivered, install a new program which might depend on the signal or some other global state (for example, an incoming message), and resume execution of the modified program. The waiting step is dictated by the model’s program, but the actual waiting occurs within `box` instead of `spell-eval`. If the waiting were to occur within `spell-eval`, then it would necessitate some mechanism to replace the program currently being evaluated with a new, global-state-dependent program mid-evaluation.

#### B.15.4 `call-llm`

`call-llm` makes the actual LM API call. It is responsible for model routing, token usage tracking, system prompt injection, API error handling and retry logic, prompt caching configuration, and more generally any logic that is API-provider specific. It is configured by a `provider.edn` file.

#### B.15.5 End-to-end flow

A typical root execution looks like this:

1. For each agent, create an `eval` function and install it within agent-specific inside functions.
2. Construct an initial program from a user prompt.
3. For the root inside function of the main agent, run `(box :main init-program root-inside-fn)`.
4. All subsequent execution occurs inside of this function call; for example, the initial program usually makes a self-call, which triggers the creation of a new `box`.

## Appendix C Benchmarking methods and results

### C.1 Shared evaluation configuration

#### C.1.1 Compared Agents

##### Spell agent.

The Spell agent was configured with the tool-call transport agent profile `config/agents/io-tc.agent.edn`. This profile exposes `io/` for filesystem and shell operations, `agents/` for background agents and inter-agent communication, and `globals/` for shared state. The benchmark profile intentionally omits the `web/` namespace, so Spell did not have live web access in the retained benchmark analyses. The core namespaces `strings/`, `math/`, `builtins/`, and `reminders/` are always available as documented language functions and prompt reminders.

Spell runs are initialized by running the following Spell program:

```
(quine completion
  (eval (do
    (quine prompt "<benchmark prompt>")
    '(!extend))))
```

The trailing expression `'(!extend)` produces a self-call with the initial program (minus trailing parentheses) as its prefix. The one exception is a secondary Terminal-Bench analysis, in which a coding-task prompt (Appendix [B.11.2](https://arxiv.org/html/2605.06898#A2.SS11.SSS2.Px5)) was given via the trailing expression `'(!describe reminders :coding)`. This expression likewise produces a self-call, appending a prompt as a string literal to the prefix.

##### Codex CLI\.

Codex CLI v0.120.0 was invoked as a baseline harness with the same underlying OpenAI model where applicable. The benchmark adapter passed the benchmark prompt directly to `codex exec`, set the requested model, disabled live web search with `-c web_search="disabled"`, emitted JSON logs with `--json`, skipped repository checks with `--skip-git-repo-check`, used `--sandbox danger-full-access`, and set `model_reasoning_effort` when the run specified low, medium, or high effort. It received no additional benchmark-specific or user-specific prompt beyond the task prompt.

##### Claude Code\.

Claude Code v2.1.107 was invoked with the benchmark prompt through `claude -p`, stream JSON output, verbose mode, an explicit allowed-tools list, the Opus 4.6 model, medium effort, and a run budget. It received no additional benchmark-specific or user-specific prompt beyond the task prompt.

#### C.1.2 Transport, Caching, and Effort

##### GPT\-5\.4\.

When configured with GPT-5.4, the Spell agent used tool-call transport and the OpenAI Responses API. The API request requires the model to emit a `spell_suffix` tool call whose input is the raw Spell suffix (see Appendix [B.11.1](https://arxiv.org/html/2605.06898#A2.SS11.SSS1)). OpenAI prompt caching used a stable `prompt_cache_key` after the first self-call, so repeated calls from one Spell run shared a cache partition.

##### Claude Opus 4\.6\.

When configured with Claude Opus 4.6, the Spell agent used tool-call transport and the Anthropic Messages API. The API request likewise requires the model to emit a `spell_suffix` tool use whose input contains the raw Spell suffix. Anthropic prompt caching used ephemeral `cache_control` markers on cacheable system text and on shared user-prompt prefixes after the first self-call. Medium effort was used, equating to a 10,000-token thinking budget.

##### Open\-weight models\.

GLM-5.1, Kimi-K2.6, and Qwen3.6 Plus were run through Fireworks serverless inference using tool-call transport. The API request requires the model to emit a `spell_suffix` tool call whose input is the raw Spell suffix, matching the transport pattern used for the closed-source tool-call runs. GLM-5.1 and Qwen3.6 Plus were run with each endpoint’s high-effort thinking mode enabled: `reasoning_effort="high"` for GLM-5.1, and the equivalent 32,000-token thinking budget for Qwen3.6 Plus. Kimi-K2.6 was run without a reasoning parameter because the deployment used does not expose one. I also tested a completion-prefill transport, in which the model was prompted to continue the open Spell program directly (Appendix [C.4](https://arxiv.org/html/2605.06898#A3.SS4)).

### C.2 Orchestration games

I created three “orchestration games” that mimic plausibly useful orchestration patterns. The three game prompts were:

Auction:

```
Return a sealed-bid auction report for a painting. Ask three independent !llm-self child bidders for one integer bid from 100 to 1000. Then return all bids and the winner or tie.
```

Telephone:

```
Return an 8-step telephone relay report. Start with: "The museum closes at five because the winter storm is approaching." Write a deterministic game loop over relay numbers 1 through 8. At iteration k, pass the message from relay k-1 to a fresh !llm-self relay k with a prompt asking it to rewrite the message while preserving meaning. Bind the returned wording as the message for the next iteration. When the loop completes, return the initial wording, each relay wording, and the final wording. Do not rewrite relay messages yourself and do not use agents/.
```

Twenty questions:

```
Play an 8-turn yes/no guessing game. The host should choose one ordinary animal as the hidden answer and keep it out of all worker prompts. On each turn, call a fresh !llm-self worker with only the public transcript and a prompt asking for one yes/no question or GUESS: animal. The host answers truthfully from the hidden answer, appends Worker: and Host: lines to the public transcript, and passes only that public transcript to the next fresh worker. Stop after a correct guess or after 8 worker turns. Return the hidden answer and the full transcript. Do not use agents/ or leaf-llm.
```

Table 1: Prompt-only orchestration-game results with GPT-5.4 medium and tool-call transport. Completed reports counts instances in which the Spell agent returned a plausible report; the trace audit, performed by GPT-5.5, scored whether the generated Spell programs implemented the intended orchestration pattern and all players followed the game rules.

| Game | Intended orchestration | Completed reports | Trace audit |
|---|---|---|---|
| Sealed-bid auction | Three independent `!llm-self` bidders | 8/8 | 8/8 pass |
| Telephone relay | Deterministic relay loop over fresh `!llm-self` calls | 7/8 | 4/8 approximate pass |
| Twenty questions | Hidden-answer host with fresh public-transcript workers | 7/8 | 7/8 pass |

In the auction task, the orchestrator made independent self-calls for three bidders and collected them back before declaring a winner on a subsequent turn. A representative fragment is:

```
'(!call-now bid1 (!llm-self (wrap-cat bidder1-prompt))
            bid2 (!llm-self (wrap-cat bidder2-prompt))
            bid3 (!llm-self (wrap-cat bidder3-prompt)))
```

On the telephone task, four of eight trials approximately used the intended deterministic loop pattern; one of these was weaker because it performed an initial setup relay outside the loop before looping over the remaining relays\. The relevant fragment of a representative clean example was:

```
(defn relay-prompt [k wording]
  (str "You are telephone relay " k " of 8.\n"
       "Rewrite the message while preserving meaning exactly.\n"
       "Return exactly one Spell string literal containing only the rewritten message.\n"
       "Do not add commentary, markdown, or any extra text.\n"
       "Message: " wording))

(defn relay-program [k wording]
  (wrap-cat
    (str "(quine prompt " (pr-str (relay-prompt k wording)) ")")))

'(let [relays
       (loop [k 1
              current initial-wording
              acc []]
         (if (> k 8)
           acc
           (let [next-wording (!llm-self (relay-program k current))]
             (recur (+ k 1)
                    next-wording
                    (conj acc {:relay k :wording next-wording})))))
       final-wording (if (empty? relays) initial-wording (:wording (last relays)))]
   {:initial initial-wording
    :relays relays
    :final final-wording})
```

The twenty\-questions task tested whether the model could separate private host state from public worker context\. The prompt did not name the hidden animal; successful programs chose an ordinary animal internally, sent only the public transcript to each fresh worker call, answered the worker truthfully as the host, and stopped after a correct guess or eight turns\. A representative beginning was:

```
(def hidden-answer "cat")
(def public-transcript "")
(def max-turns 8)
(quine worker-rules "You are a fresh worker in an 8-turn yes/no animal guessing game. You know only the public transcript provided as the next expression. Output exactly one Spell string literal and nothing else. The string content must be either one yes/no question about the hidden ordinary animal or GUESS: animal.")
'(!call-now worker-1 (!llm-self (wrap-cat worker-rules public-transcript)))
```

### C.3 Benchmark protocols

Benchmark analyses were performed on GCP virtual machines. Table 2 summarizes the compute resources used by the retained benchmark analyses. API spend is model-provider API spend reconstructed from benchmark usage logs, not GCP VM cost. Pilot runs, failed dispatches, and aborted diagnostic reruns are excluded. Some rows share source artifacts: for example, Figure [4](https://arxiv.org/html/2605.06898#S4.F4.5) reuses the medium-effort Terminal-Bench and SWE-bench Lite rows from the GPT-5.4 frontier analyses, and Figure [2](https://arxiv.org/html/2605.06898#S4.F2) includes GPT-5.4 subset values drawn from the corresponding full-run artifacts.

Table 2: Compute resources for retained benchmark analyses. Wall-clock is elapsed source-run time when run metadata was retained; for sharded JSONL-only runs it is the approximate span from the first to last completed item record.

| Analysis | VMs and machine type | Tasks | Concurrency and caps | Wall-clock | API spend |
|---|---|---|---|---|---|
| Observation 1 model diagnostics, Terminal-Bench 1.1 | Dedicated subset runs used `e2-standard-4`; the GPT-5.4 subset was drawn from a full `e2-standard-8` run | 5 models × 32 tasks | `--n-concurrent` 2–4; 3600 s agent cap; 600 s verifier cap | 2.5–5.0 h for full GPT source; open-weight reruns 5–6 h end-to-end | $47.34 |
| Observation 1 model diagnostics, SWE-bench Lite | `e2-standard-16`, 300 GB boot disk; one or two VMs per model subset, with GPT-5.4 drawn from three full-run shards | 5 models × 32 tasks | `--parallel` 4 for most rows; Opus subset used 8; 1800–3600 s per item | 5–7 h for retained source runs | $54.33 |
| GPT-5.4 Terminal-Bench frontier | 6 VMs, `e2-standard-8` (Spell and Codex CLI at low/medium/high effort) | 6 × 80 tasks | `--n-concurrent` 4; 3600 s agent cap; 600 s verifier cap | 2.5–5.0 h | $248.95 |
| GPT-5.4 Terminal-Bench coding intervention | Same Terminal-Bench harness and caps as the frontier Spell rows | 3 × 80 tasks | `--n-concurrent` 4; coding reminder injected in the initial Spell program | Not separately retained | $84.35 |
| Opus 4.6 Terminal-Bench comparison | 2 VMs, `e2-standard-8` (Spell and Claude Code) | 2 × 80 tasks | `--n-concurrent` 4; 3600 s agent cap; 600 s verifier cap; $5 Spell/Claude-Code budget cap | 3.5–5.1 h | $108.94 |
| GPT-5.4 SWE-bench Lite frontier | 18 VMs, `e2-standard-16`, 300 GB boot disk (3 shards × 2 systems × 3 efforts) | 6 × 300 tasks | `--parallel` 4; `--timeout` 3600; `--prewarm-envs`; `--paper-compliant` | 7.3 h | $883.78 |
| LongBench v2 comparison | 4 VMs, `e2-standard-4`, 100 GB boot disk (pilot/rest for each system) | 2 × 200 items | `--parallel` 4; Spell used 1200 s timeout and $3 budget cap; Codex used the general-runner default timeout | 0.9–1.9 h per system | $53.41 |
| AppWorld dev comparison | 2 retained full-run VMs, `e2-standard-16`, 200 GB boot disk | 2 × 57 tasks | `--n-concurrent` 4; 3600 s agent cap; 600 s verifier cap; $5 Spell budget cap | 0.8–2.5 h per system | $43.97 |

#### C.3.1 Terminal-Bench 1.1

Terminal-Bench 1.1 is a set of containerized command-line tasks with hidden tests over the final filesystem state or program behavior. The GPT-5.4 Spell-Codex comparison used the full 80-task `terminal-bench-core==0.1.1` old-core task set, one trial per task, Docker execution, disabled live web access, and the Terminal-Bench verifier as the source of truth. A subset of 32 items was selected for Figure [2](https://arxiv.org/html/2605.06898#S4.F2). The benchmark harness passed each task instruction as the task prompt. No items were excluded due to errors.

The same runtime settings were used across compared systems: four-way within-VM concurrency, a 3600 s per-task agent wall-clock cap, and a 600 s global verifier/test timeout. Spell and Claude Code received a $5 per-task dollar cap from their respective harness adapters; Codex CLI was run with the same wall-clock and verifier caps but without a dollar budget cap. The comparison between Spell and Codex CLI remained fair because no Spell item actually hit the $5 cap (Appendix [C.6](https://arxiv.org/html/2605.06898#A3.SS6)).

For the Observation 1 analysis, a subset of 32 items was selected and run using open-weight models. Fatal Spell errors were counted from Spell agent outcomes rather than from the benchmark’s correctness flag. A task was counted as fatal when the final Spell execution ended in an unrecovered reader/evaluator/runtime failure, such as `recovery-exhausted`. Benchmark-level failures that did not represent a final Spell-language failure, such as a hidden-test timeout, remained in the denominator for accuracy but were not counted as fatal Spell errors.
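The accounting rule can be made concrete with a short Python sketch. The record layout and the status strings other than `recovery-exhausted` are hypothetical; the point is only that resolved and fatal counts share one denominator and are derived from different fields.

```python
from dataclasses import dataclass


@dataclass
class ItemOutcome:
    resolved: bool      # benchmark verifier's correctness flag
    agent_status: str   # final Spell agent outcome (labels are illustrative)


# Assumption: statuses treated as unrecovered Spell-language failures.
FATAL_STATUSES = frozenset({"recovery-exhausted"})


def summarize(outcomes):
    """Resolved and fatal rates over the same denominator (all items)."""
    total = len(outcomes)
    resolved = sum(o.resolved for o in outcomes)
    fatal = sum(o.agent_status in FATAL_STATUSES for o in outcomes)
    return {"total": total,
            "resolved_rate": resolved / total,
            "fatal_rate": fatal / total}


outcomes = [
    ItemOutcome(resolved=True, agent_status="ok"),
    ItemOutcome(resolved=False, agent_status="recovery-exhausted"),
    ItemOutcome(resolved=False, agent_status="verifier-timeout"),  # not fatal
    ItemOutcome(resolved=True, agent_status="ok"),
]
summary = summarize(outcomes)
```

In this toy example the hidden-test timeout lowers accuracy (two of four resolved) without being counted as a fatal Spell error (one of four).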

#### C.3.2 SWE-bench Lite

SWE-bench Lite contains 300 real GitHub issues from 11 Python repositories. Each task gives the agent an issue description and a repository checkout, and the agent must produce a patch that passes the SWE-bench tests. The Figure [3](https://arxiv.org/html/2605.06898#S4.F3) comparison used the public `princeton-nlp/SWE-bench_Lite` split, official per-instance SWE-bench images, the repository path `/testbed`, and one trial per item. A subset of 32 items was selected for Figure [2](https://arxiv.org/html/2605.06898#S4.F2). The SWE-bench Lite prompt includes the issue description and omits `hints_text`. It instructs the agent to make minimal non-test changes, explore the repository, reproduce the error when practical, edit source code, rerun focused validation, consider edge cases, and run selected tests. Live web access and task-specific solution lookup were disabled by policy for all compared agents. Items were scored by applying patches against the official SWE-bench evaluation harness. No items were excluded due to errors.

The same runtime settings were used across compared systems: `--parallel 4`, `--timeout 3600` per item, `--paper-compliant`, and `--prewarm-envs`. Spell received the SWE-bench adapter’s $10 per-item dollar cap; Codex CLI received no dollar budget cap. The comparison between Spell and Codex CLI remained fair because no item from either method actually hit $10 (Appendix [C.6](https://arxiv.org/html/2605.06898#A3.SS6)). The 32-item model-diagnostic runs used a per-item timeout of 1800 s.

#### C.3.3 LongBench v2

LongBench v2 is a long-context multiple-choice benchmark spanning single-document question answering, multi-document question answering, long-dialogue history understanding, long in-context learning, and structured-data understanding. The retained Figure [4](https://arxiv.org/html/2605.06898#S4.F4.5) comparison used 200 LongBench v2 items. Each document was supplied on disk for tool-using agents, and the prompt included the document path, document size, question, answer options, and instruction to return only the option letter.

Runs used four-way parallelism and GPT-5.4 medium effort for both Spell and Codex CLI, but the runtime and budget caps were not identical. The retained Spell rerun used `--timeout 1200` and a $3 per-item Spell budget cap. The retained Codex CLI rerun used the general-benchmark default 300 s per-item subprocess timeout and no dollar budget cap. The comparison between Spell and Codex CLI remained fair because no Spell item hit the $3 cap and no Codex item exceeded either $3 cost or 300 s latency (Appendix [C.6](https://arxiv.org/html/2605.06898#A3.SS6)).

#### C.3.4 AppWorld Dev

AppWorld evaluates computer-use tasks over a simulated application state. The retained Figure [4](https://arxiv.org/html/2605.06898#S4.F4.5) comparison used the full 57-item `appworld-dev` split through the Terminal-Bench adapter, GPT-5.4 medium effort, and four-way within-VM parallelism. The Spell condition used `openai-tc:gpt-5.4`, `config/agents/io-tc.agent.edn`, and the plain trailing `'(!extend)`, matching the LongBench, Terminal-Bench, and SWE-bench Spell rows. The Codex condition used Codex CLI with `gpt-5.4`.

The AppWorld dev Spell and Codex CLI runs used the same Terminal-Bench-adapter caps: `--n-concurrent 4`, `--agent-timeout-sec 3600`, and `--test-timeout-sec 600`. Spell received the Terminal-Bench adapter’s default $5 per-task dollar cap; Codex CLI used the same wall-clock and verifier caps but no dollar budget cap. The comparison between Spell and Codex CLI remained fair with respect to the budget cap because no Spell item hit the $5 cap and no Codex item exceeded $5 (Appendix [C.6](https://arxiv.org/html/2605.06898#A3.SS6)).

#### C.3.5 Pairwise Statistics

For paired comparisons, the analysis counts decisive items only: an item where method A solved and method B did not, or vice versa. Reported pairwise p values use an exact two-sided binomial test with null probability p = 0.5 over decisive items.
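A minimal Python sketch of this test is given below. The function name `exact_sign_test` is hypothetical and this is not the analysis script; it relies on the fact that with null probability 0.5 the binomial distribution is symmetric, so the two-sided p value is twice the smaller tail, capped at 1.

```python
from math import comb


def exact_sign_test(a_only: int, b_only: int) -> float:
    """Exact two-sided binomial p value over decisive items (null p = 0.5)."""
    n = a_only + b_only          # decisive items only; shared outcomes drop out
    if n == 0:
        return 1.0               # no decisive items: no evidence either way
    k = min(a_only, b_only)
    # P(X <= k) for X ~ Binomial(n, 0.5), computed with exact integer counts.
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Symmetry of the null distribution gives the two-sided value.
    return min(1.0, 2 * lower_tail)


# Decisive counts reported for the Opus 4.6 Terminal-Bench comparison:
# 6 Spell-only successes and 6 Claude-Code-only successes.
p_opus = exact_sign_test(6, 6)
```

For the 6-versus-6 split the sketch returns 1.0, which agrees with the sign-test value reported in the Table 6 caption.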

### C.4 Comparison between models

Table 3: Tool-call transport accuracy and fatal Spell-error rates on 32-item model-comparison subsets.

| Model | TB resolved | TB fatal errors | SWE resolved | SWE fatal errors |
|---|---|---|---|---|
| GPT-5.4 | 16/32 (50.0%) | 0/32 (0.0%) | 23/32 (71.9%) | 0/32 (0.0%) |
| Opus 4.6 | 15/32 (46.9%) | 2/32 (6.2%) | 15/32 (46.9%) | 5/32 (15.6%) |
| GLM-5.1 | 10/32 (31.2%) | 10/32 (31.2%) | 14/32 (43.8%) | 5/32 (15.6%) |
| Kimi-K2.6 | 12/32 (37.5%) | 9/32 (28.1%) | 14/32 (43.8%) | 5/32 (15.6%) |
| Qwen3.6 Plus | 4/32 (12.5%) | 9/32 (28.1%) | 0/32 (0.0%) | 21/32 (65.6%) |

Inspection of representative fatal tool-call traces showed that some remaining open-weight failures were still invalid Spell continuations of the kind warned against in the system prompt. For example, in Kimi-K2.6 on SWE-bench Lite `astropy__astropy-7746`, the model attempted to use `io/read-lines` directly in the evaluated program rather than inside a quoted trailing `!call-now` form, producing the unrecovered error `io/read-lines: io/ is an effect namespace - use it in the trailing expression via eval`. In Kimi-K2.6 on `django__django-11422`, the model emitted an unbound `final-diff` symbol. In Qwen3.6 Plus on `django__django-10914`, the model emitted `println`, which is not a Spell binding, producing `Unbound symbol: println`. Thus, tool-call transport reduced wrapper and chat-template leakage, but it did not eliminate invalid Spell programs.

#### C\.4\.1Prefill transport mode

Because the open\-weight Fireworks endpoints support assistant prefill, I tested a prefill transport mode in which the model prefix is literally the prefix of theSpellprogram\. This transport is natural forSpellbecause the model’s visible completion is the program to be evaluated, but for the closed\-source models it was unavailable because the corresponding APIs impose mandatory fencing that separates user inputs from model responses and tool calls\. The prefill transport did not improve the open\-weight results and instead made them substantially worse \(TableLABEL:tab:obs1\-prefill\-diagnostic\-values\)\. The likely reason is that tool\-call transport imposes a useful output constraint: the model must place the rawSpellsuffix in thespell\_suffixargument, which separates executableSpellcode from ordinary assistant text, hidden\-reasoning markers, wrapper echoes, and chat\-template artifacts\. For example, a common error in prefill mode was that models would emit thinking tags \(“<think\>”\) inside theSpellprogram, causing reader failure\.

Table 4: Prefill-transport diagnostic results for open-weight models on the same 32-item subsets.

| Model | TB resolved | TB fatal errors | SWE resolved | SWE fatal errors |
| --- | --- | --- | --- | --- |
| GLM-5.1 | 1/32 (3.1%) | 16/32 (50.0%) | 0/32 (0.0%) | 17/32 (53.1%) |
| Kimi-K2.6 | 0/32 (0.0%) | 17/32 (53.1%) | 0/32 (0.0%) | 12/32 (37.5%) |
| Qwen3.6 Plus | 0/32 (0.0%) | 10/32 (31.2%) | 0/32 (0.0%) | 14/32 (43.8%) |

### C.5 Harness comparison on coding benchmarks

#### C.5.1 Terminal-Bench 1.1 Cost-Accuracy Frontier

Two secondary analyses were performed. First, a coding intervention that prescribes a research/plan/implement/verify workflow increased both accuracy and cost in the GPT-5.4 Terminal-Bench runs (Table 5). Second, a Spell agent running Opus 4.6 was compared against Claude Code running the same model, both with medium effort. Both agents had the same accuracy and similar cost (Table 6).

Table 5: Terminal-Bench 1.1 GPT-5.4 source values for Figure [3](https://arxiv.org/html/2605.06898#S4.F3). The default Spell configuration uses `'(!extend)`; `:coding` additionally injects the coding-task reminder from Appendix [B.11.2](https://arxiv.org/html/2605.06898#A2.SS11.SSS2.Px5).

| System | Config | Effort | Resolved | Cost | Total tokens |
| --- | --- | --- | --- | --- | --- |
| Spell GPT-5.4 | default | low | 35/80 (43.8%) | $12.26 | 10.11M |
| Spell GPT-5.4 | default | medium | 36/80 (45.0%) | $25.72 | 6.96M |
| Spell GPT-5.4 | default | high | 40/80 (50.0%) | $40.72 | 6.52M |
| Spell GPT-5.4 | `:coding` | low | 35/80 (43.8%) | $19.11 | 22.51M |
| Spell GPT-5.4 | `:coding` | medium | 40/80 (50.0%) | $27.36 | 10.21M |
| Spell GPT-5.4 | `:coding` | high | 43/80 (53.8%) | $37.88 | 6.48M |
| Codex CLI GPT-5.4 | default | low | 38/80 (47.5%) | $39.43 | 39.93M |
| Codex CLI GPT-5.4 | default | medium | 39/80 (48.8%) | $46.96 | 51.14M |
| Codex CLI GPT-5.4 | default | high | 43/80 (53.8%) | $83.86 | 94.65M |

Table 6: Terminal-Bench 1.1 Claude Opus 4.6 source values. Both rows use medium effort on the full 80-task old-core set. The paired comparison had 6 Spell-only successes, 6 Claude-Code-only successes, 36 shared successes, 32 shared failures, and exact sign-test $p = 1.0$.

| System | Config | Effort | Resolved | Cost | Total tokens |
| --- | --- | --- | --- | --- | --- |
| Spell Opus 4.6 | default | medium | 42/80 (52.5%) | $54.98 | 13.38M |
| Claude Code Opus 4.6 | default | medium | 42/80 (52.5%) | $53.96 | 33.26M |

#### C.5.2 SWE-bench Lite Cost-Accuracy Frontier

Table 7: SWE-bench Lite GPT-5.4 source values for Figure [3](https://arxiv.org/html/2605.06898#S4.F3). All rows use the default initialization; no `:coding` prompt-intervention rows are included for SWE-bench Lite.

| System | Config | Effort | Resolved | Cost | Total tokens |
| --- | --- | --- | --- | --- | --- |
| Spell GPT-5.4 | default | low | 153/300 (51.0%) | $68.94 | 74.30M |
| Spell GPT-5.4 | default | medium | 171/300 (57.0%) | $102.12 | 51.37M |
| Spell GPT-5.4 | default | high | 171/300 (57.0%) | $177.38 | 49.26M |
| Codex CLI GPT-5.4 | default | low | 170/300 (56.7%) | $136.79 | 134.09M |
| Codex CLI GPT-5.4 | default | medium | 172/300 (57.3%) | $161.82 | 155.96M |
| Codex CLI GPT-5.4 | default | high | 185/300 (61.7%) | $236.73 | 220.67M |

#### C.5.3 Error and Scoring Anomalies

Table 8 lists benchmark items that failed with an error code across Terminal-Bench and SWE-bench Lite. No results were excluded on the basis of these errors.

Table 8: Error and scoring anomalies retained in the Figure [3](https://arxiv.org/html/2605.06898#S4.F3) denominators.

| Benchmark | Condition | Affected rows | Notes |
| --- | --- | --- | --- |
| Terminal-Bench 1.1 | Spell default, all efforts | 5 test-timeout rows per effort | Same five items at every effort: `build-initramfs-qemu`, `swe-bench-astropy-1`, `swe-bench-astropy-2`, `swe-bench-fsspec`, `swe-bench-langcodes`. |
| Terminal-Bench 1.1 | Codex CLI, all efforts | 5 test-timeout rows per effort | Same five timeout items as Spell, so Spell-Codex differences are unaffected. |
| Terminal-Bench 1.1 | Codex CLI, all efforts | 1 parse error and 1 unknown-agent error per effort | `cron-broken-network` produced a parse error; `intrusion-detection` produced an unknown-agent error. |
| Terminal-Bench 1.1 | Spell default | 1 low, 1 medium, and 3 high agent-timeout rows | Spell-specific agent timeouts occurred in addition to the shared test-timeout items. |
| Terminal-Bench 1.1 | Spell default | 1 low agent-installation failure | The low-effort row `qemu-startup` had an agent-installation failure. |
| SWE-bench Lite | Spell low | 3 depth-exceeded and 2 recovery-exhausted rows | These are Spell runtime errors. |
| SWE-bench Lite | Spell medium | 0 generation errors | All rows generated patches. |
| SWE-bench Lite | Spell high | 2 install errors and 1 container error | These were marked as failures. |
| SWE-bench Lite | Codex CLI low | 1 install error | This was marked as a failure. |
| SWE-bench Lite | Codex CLI medium/high | 0 generation errors | All rows generated patches. |

A follow-up analysis examined the five Terminal-Bench rows that ended in `test_timeout` in all conditions tested. Rerunning the five items (Codex low patches) with `--test-timeout-sec 3600` and `--agent-timeout-sec 3600` still ended in `test_timeout`. However, a custom empty-patch/no-op probe of `swe-bench-langcodes`, which installed the Codex runtime but did not produce a patch, completed the test phase under the original 600 s budget and failed normally; for this reason, these items were retained in all analyses.

### C.6 Token and cost accounting

Token usage was tracked per API call and summed across calls, stratified by token type (cached/uncached input, output, and output cache writes if priced differently). Token totals in the tables below are the sum over all token types.

Table 9: Per-million-token prices used for benchmark cost reconstruction.

| Model family | Uncached input / 1M | Cached input / 1M | Cache write / 1M | Output / 1M |
| --- | --- | --- | --- | --- |
| GPT-5.4 | $2.50 | $0.25 | $3.125 | $15.00 |
| Opus 4.6 | $5.00 | $0.50 | $6.25 | $25.00 |

These numbers do not account for the GPT-5.4 long-context pricing tier, which applies higher cost to API calls above a token threshold; this omission applies to both Spell and Codex CLI. Because Codex CLI has much larger average context length than Spell (see below), omitting the higher long-context tier is more likely to favor Codex.
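As a rough illustration of how a cost figure can be reconstructed from stratified token counts and the Table 9 prices, the following is a minimal Python sketch rather than the accounting code used for the paper. The function name `reconstruct_cost`, the dictionary layout, and the model keys are assumptions. The example input uses the rounded per-item averages reported in Table 10 for Spell on Terminal-Bench 1.1 (medium effort, 80 items); cache-write tokens are not reported there and are set to zero, so the result only approximates the reported $25.72.

```python
# Per-million-token prices in USD, copied from Table 9.
# The dictionary keys are hypothetical names chosen for this sketch.
PRICES = {
    "gpt-5.4": {"uncached": 2.50, "cached": 0.25, "cache_write": 3.125, "output": 15.00},
    "opus-4.6": {"uncached": 5.00, "cached": 0.50, "cache_write": 6.25, "output": 25.00},
}


def reconstruct_cost(model, tokens):
    """Sum price * tokens over token types; `tokens` maps type -> raw token count."""
    prices = PRICES[model]
    return sum(prices[kind] * count / 1_000_000 for kind, count in tokens.items())


# Rounded per-item averages from Table 10, scaled to the 80 Terminal-Bench items.
items = 80
tokens = {
    "uncached": 20_000 * items,
    "cached": 49_000 * items,
    "output": 17_000 * items,
    "cache_write": 0,  # assumption: cache writes are not reported in Table 10
}

approx_cost = reconstruct_cost("gpt-5.4", tokens)
print(f"approximate cost: ${approx_cost:.2f}")
```

With these rounded inputs the sketch gives about $25.38, close to the reported $25.72; the gap is consistent with rounding of the per-item averages. Output tokens account for most of the cost in this configuration.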

#### C.6.1 Token utilization

Table 10: Medium-effort GPT-5.4 token and cost breakdown for the two Figure [3](https://arxiv.org/html/2605.06898#S4.F3) coding benchmarks. Token columns report per-item averages rounded to the nearest 1k tokens.

| Benchmark | System | Resolved / items | Cost | Uncached (k/item) | Cached (k/item) | Output (k/item) | Total (k/item) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Terminal-Bench 1.1 | Spell | 36/80 | $25.72 | 20k | 49k | 17k | 87k |
| Terminal-Bench 1.1 | Codex CLI | 39/80 | $46.96 | 48k | 586k | 5k | 639k |
| SWE-bench Lite | Spell | 171/300 | $102.12 | 39k | 118k | 14k | 171k |
| SWE-bench Lite | Codex CLI | 172/300 | $161.82 | 45k | 469k | 5k | 520k |

![Medium-effort token breakdown](https://arxiv.org/html/2605.06898v1/figures/appendix_med_token_breakdown.png)

Figure 5: Mean cached input, uncached input, and output tokens per task in medium-effort GPT-5.4 coding runs.

#### C.6.2 Budget-Cap Sensitivity

Per-item budget caps were applied to Spell but not Codex CLI, motivating a sensitivity analysis. In Terminal-Bench 1.1, where a $5 per-task cap was imposed for Spell but not Codex, no Spell item reached the cap at any effort. In SWE-bench Lite, no Spell item reached the $10 per-item cap, and no Codex CLI item exceeded the same $10 threshold. In LongBench v2, no Spell item reached the $3 per-item cap, and no Codex CLI item exceeded the same $3 threshold. In AppWorld dev, no Spell item reached the $5 per-item cap, and no Codex CLI item exceeded the same $5 threshold.

Table 11: Observed cap hits in retained GPT-5.4 Spell-Codex runs.

| Benchmark | Measure | Low | Med. | High | Max |
| --- | --- | --- | --- | --- | --- |
| Terminal-Bench | Spell at $5 cap | 0/80 | 0/80 | 0/80 | $2.66 |
| Terminal-Bench | Codex > $5 | 1/80 | 1/80 | 2/80 | $12.66 |
| SWE-bench | Spell at $10 cap | 0/300 | 0/300 | 0/300 | $2.57 |
| SWE-bench | Codex > $10 | 0/300 | 0/300 | 0/300 | $4.16 |
| LongBench | Spell at $3 cap | – | 0/200 | – | $0.70 |
| LongBench | Codex > $3 | – | 0/200 | – | $0.71 |
| AppWorld | Spell at $5 cap | – | 0/57 | – | $2.40 |
| AppWorld | Codex > $5 | – | 0/57 | – | $0.55 |

Repricing Terminal-Bench 1.1 GPT-5.4 Codex CLI rows whose observed cost exceeded $5 as exactly $5 has the following effect:

Table 12: Terminal-Bench 1.1 Codex CLI cost sensitivity under a $5 per-task cap.

| Effort | Rows > $5 | Observed | Repriced | Change | Reduction |
| --- | --- | --- | --- | --- | --- |
| Low | 1 | $39.43 | $37.83 | $1.60 | 4.1% |
| Medium | 1 | $46.96 | $40.67 | $6.29 | 13.4% |
| High | 2 | $83.86 | $69.22 | $14.64 | 17.5% |

This adjustment was not made in Figure [3](https://arxiv.org/html/2605.06898#S4.F3) because no Spell item actually hit the budget cap. Making this adjustment would not change the ordering by cost of any pair of comparators.
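The repricing rule and the reduction column can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's analysis script: the helper names `reprice_with_cap` and `reduction_pct` are hypothetical, and the per-item cost list is invented, since per-item costs are not listed here. Only the observed and repriced totals are taken from Table 12.

```python
def reprice_with_cap(item_costs, cap):
    """Total cost after clamping every per-item cost to `cap`."""
    return sum(min(cost, cap) for cost in item_costs)


def reduction_pct(observed_total, repriced_total):
    """Percentage by which repricing lowers the observed total."""
    return 100.0 * (observed_total - repriced_total) / observed_total


# Hypothetical per-item costs: only the middle item exceeds the $5 cap.
example_costs = [1.00, 6.60, 2.00]
capped_total = reprice_with_cap(example_costs, cap=5.00)

# Observed and repriced totals copied from Table 12.
table12 = {
    "low": (39.43, 37.83),
    "medium": (46.96, 40.67),
    "high": (83.86, 69.22),
}
reductions = {
    effort: round(reduction_pct(observed, repriced), 1)
    for effort, (observed, repriced) in table12.items()
}
print(capped_total, reductions)
```

Recomputing the percentages from the two totals reproduces the reduction column of Table 12 (4.1%, 13.4%, 17.5%).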

### C.7 Context management and pruning

Feature-usage and pruning analyses are based on program traces emitted during each Spell run. To estimate how many tokens were removed from context, the analysis replays the structural effect of pruning forms on the parsed program. It traverses the AST and maintains a stack of ordered sibling forms under the current parent expression. A `prune` or `rethink` form with argument $k$ is interpreted as removing the $k$ immediately preceding sibling forms; these are popped, and their source is recorded for tokenization. Removed source text and full turn context are tokenized with the same tokenizer used for GPT-5.4 accounting. For turn $t$, $D_t$ is the number of tokens removed by pruning events attributed to that turn, $A_t = \sum_{i<t} D_i$ is the cumulative amount already absent from the prompt at the start of the turn, and $C_t$ is the full token count for the turn, combining the prefix and response. The summary table reports means over successfully parsed traced turns; malformed traces are excluded from these aggregates.
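The replay procedure described above can be sketched as follows. This is a minimal Python illustration under several assumptions, not the analysis code: the program is represented as nested Python lists, the pruning form is assumed to stay in context after it fires, the head symbol of a form is not treated as a prunable sibling, and a whitespace split stands in for the real tokenizer. All function names and file names are hypothetical.

```python
PRUNING_HEADS = {"prune", "rethink"}


def is_pruning_form(form):
    """True for forms shaped like (prune k) or (rethink k) with integer k."""
    return (
        isinstance(form, list)
        and len(form) == 2
        and form[0] in PRUNING_HEADS
        and isinstance(form[1], int)
    )


def replay_pruning(siblings):
    """Replay pruning over one ordered list of sibling forms.

    Returns (kept, removed): the surviving siblings and every popped form.
    """
    kept, removed = [], []
    for form in siblings:
        if is_pruning_form(form):
            # Pop the k immediately preceding siblings (fewer if not enough exist).
            for _ in range(min(form[1], len(kept))):
                removed.append(kept.pop())
            kept.append(form)  # assumption: the pruning form itself stays in context
        elif isinstance(form, list):
            # Recurse into the arguments; the head symbol is not a prunable sibling.
            inner_kept, inner_removed = replay_pruning(form[1:])
            removed.extend(inner_removed)
            kept.append(form[:1] + inner_kept)
        else:
            kept.append(form)
    return kept, removed


def render(form):
    """Render a nested-list form back to S-expression text."""
    if isinstance(form, list):
        return "(" + " ".join(render(f) for f in form) + ")"
    return str(form)


def count_tokens(form):
    """Whitespace token count, a stand-in for the real tokenizer."""
    return len(render(form).split())


def cumulative_absent(removed_per_turn):
    """A_t for each turn t: the sum of D_i over earlier turns i < t."""
    totals, running = [], 0
    for d in removed_per_turn:
        totals.append(running)
        running += d
    return totals


program = [
    ["io/read-lines", "a.txt"],
    ["io/read-lines", "b.txt"],
    ["prune", 2],
    ["io/read-lines", "c.txt"],
]
kept, removed = replay_pruning(program)
d_t = sum(count_tokens(form) for form in removed)  # tokens removed in this turn
print(render(kept), d_t, cumulative_absent([d_t, 0, 3]))
```

In this toy program the `(prune 2)` form removes the two preceding reads, so $D_t$ is the token count of those two forms and later turns start with that amount already absent.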

Table 13: Per-turn context-pruning token accounting in GPT-5.4 Spell traces.

| Condition | Mean turns | Mean pruned / turn | Mean completion | Mean pruned / completion | Mean cumulative pruned |
| --- | --- | --- | --- | --- | --- |
| SWE-bench Lite low | 9.7 | 507 | 17,912 | 2.8% | 2,891 |
| SWE-bench Lite medium | 9.3 | 1,198 | 11,886 | 10.1% | 12,124 |
| SWE-bench Lite high | 9.0 | 1,730 | 9,894 | 17.5% | 13,282 |
| Terminal-Bench 1.1 low | 7.9 | 369 | 11,113 | 3.3% | 3,350 |
| Terminal-Bench 1.1 medium | 7.1 | 883 | 5,428 | 16.3% | 7,664 |
| Terminal-Bench 1.1 high | 5.9 | 748 | 4,358 | 17.2% | 3,342 |

### C.8 Feature utilization

Spell feature utilization was measured from the same model-response suffixes used in the context-management analysis. Prefixes were excluded; only newly emitted Spell forms were counted. Table 14 summarizes the retained GPT-5.4 Spell traces used for the Terminal-Bench 1.1 and SWE-bench Lite analyses, pooling low-, medium-, and high-effort runs within each benchmark. Context-management and ordinary tool-call features were common. Self-call primitives, auxiliary model calls, multi-agent delegation, shared-state access, web access, and ordinary higher-order/control forms were rare or absent.

Table 14: Feature utilization in retained GPT-5.4 Spell traces. “Uses” counts emitted forms; “traces” counts traces with at least one use. Function rows count direct calls. Namespace rows, such as `agents/*`, count any call to a function within that namespace. The `web` namespace was unavailable in the configuration.

| Feature | SWE uses | SWE traces | Terminal uses | Terminal traces |
| --- | --- | --- | --- | --- |
| `!peek` | 1,980 | 452/898 | 288 | 89/227 |
| `!call-now` | 5,015 | 872/898 | 876 | 207/227 |
| `!llm-self` | 0 | 0/898 | 0 | 0/227 |
| `leaf-llm` | 0 | 0/898 | 9 | 3/227 |
| `rethink` | 425 | 331/898 | 91 | 60/227 |
| `persist` | 53 | 29/898 | 24 | 12/227 |
| `let` | 0 | 0/898 | 6 | 2/227 |
| `map` | 32 | 10/898 | 19 | 3/227 |
| `get` | 7 | 3/898 | 7 | 3/227 |
| `apply` | 0 | 0/898 | 1 | 1/227 |
| `if` | 6 | 4/898 | 8 | 8/227 |
| `agents/*` | 0 | 0/898 | 1 | 1/227 |
| `globals/*` | 0 | 0/898 | 0 | 0/227 |
| `io/*` | 25,995 | 898/898 | 3,624 | 222/227 |
| `strings/*` | 308 | 93/898 | 151 | 44/227 |
| `web/*` | 0 | 0/898 | 0 | 0/227 |

In Terminal-Bench, the only `agents/` hit was low-effort `path-tracing-reverse`, where the model attempted `(agents/!ask :explore ...)`. This did not spawn a helper; it asked an unregistered handle and failed with `agents/!ask: Handle not registered`, after which the agent recovered.

The most concentrated auxiliary-model use was high-effort Terminal-Bench `play-zork`, which made six `leaf-llm` calls. The agent requested Zork I walkthroughs, treasure lists, and final-ending text from a plain-text model, then tried to validate the generated walkthrough through `dfrotz`. One `leaf-llm` call timed out, triggering recovery; the harness still scored the task unresolved, with both end-of-game and maximum-score checks failed.

To test whether non-use of multi-agent orchestration was merely a prompting artifact, a SWE-bench Lite intervention was run on the 32-item model-comparison subset. The Spell agent used GPT-5.4 at medium reasoning effort, the ordinary `io-tc` coding setup, and the initial expression:

`'(!describe io reminders :coding reminders :orchestrator)`

Feature usage was counted over completed traces, again restricted to model-response forms. The intervention solved 20/32 items, produced 3 error rows, cost $9.48, and had 29 completed traces. It did not elicit multi-agent or recursive self-call orchestration: the completed traces contained zero instances of `agents/spawn` or `agents/!spawn-ask`.

Control-flow features like `map` were used sparsely, mostly in support of programmatic tool calling. For example, `heterogeneous-dates` used `let`, `map`, `get`, and `apply` to parse two CSV files and compute an average; several other traces used simple `if` or `let` forms around shell-command success and structured status outputs. These examples show that the model sometimes used Spell as a dataflow language around tool results.

Tool batching was much more common than higher-order control flow. I counted `io/*` forms in each model-response suffix, again excluding inherited prefix text. On the GPT-5.4 Spell trace sets used for the paper, most turns emitted more than one tool call.

Table 15: Per-turn `io/*` batching in GPT-5.4 Spell traces.

| Benchmark | Effort | Traced turns | Mean `io/*` calls / turn | Turns with ≥2 `io/*` calls | Max `io/*` calls / turn |
| --- | --- | --- | --- | --- | --- |
| SWE-bench Lite | low | 3,122 | 2.85 | 2,406/3,122 (77.1%) | 13 |
| SWE-bench Lite | medium | 2,776 | 2.89 | 2,166/2,776 (78.0%) | 11 |
| SWE-bench Lite | high | 2,712 | 3.34 | 2,176/2,712 (80.2%) | 14 |
| Terminal-Bench 1.1 | low | 605 | 2.12 | 343/605 (56.7%) | 13 |
| Terminal-Bench 1.1 | medium | 542 | 2.46 | 372/542 (68.6%) | 13 |
| Terminal-Bench 1.1 | high | 439 | 2.26 | 262/439 (59.7%) | 11 |
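A per-turn batching count of this kind could be approximated as below. This is a rough Python sketch and not the counting procedure used for Table 15: it matches `io/` call heads with a regular expression instead of a real Spell reader, so it would miscount occurrences inside string literals. The helper names and the example suffixes (including their argument syntax) are invented for illustration.

```python
import re

# Matches an opening parenthesis followed by a symbol in the io/ namespace.
IO_CALL = re.compile(r"\(\s*io/[^\s()]+")


def count_io_calls(suffix):
    """Number of io/* call heads in one model-response suffix."""
    return len(IO_CALL.findall(suffix))


def batching_summary(suffixes):
    """Aggregate per-turn io/* counts into the kinds of columns reported in Table 15."""
    counts = [count_io_calls(s) for s in suffixes]
    turns = len(counts)
    return {
        "turns": turns,
        "mean": sum(counts) / turns if turns else 0.0,
        "batched": sum(1 for c in counts if c >= 2),  # turns with >= 2 io/* calls
        "max": max(counts, default=0),
    }


# Hypothetical suffixes, one per turn.
suffixes = [
    '(io/read-lines "a.py") (io/read-lines "b.py")',
    '(!call-now (io/read-lines "c.py"))',
    "(rethink 2)",
]
summary = batching_summary(suffixes)
print(summary)
```

A reader-based count over the parsed AST would be more robust; the regular expression is used here only to keep the sketch short.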
### C.9 LongBench and AppWorld benchmarks

The source values for Figure [4](https://arxiv.org/html/2605.06898#S4.F4.5) compare Spell and Codex CLI across the four GPT-5.4 medium-effort benchmark settings emphasized in the main text. Accuracy is the fraction of tasks resolved under each benchmark’s native scoring rule. Cost and token totals are summed over all evaluated tasks, including tasks that were not resolved. The paired $p$ values use the exact two-sided sign test described above.
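An exact two-sided sign test on discordant pairs can be sketched as follows. This is a minimal Python illustration, not the paper's analysis code, and the function name `sign_test_p` is hypothetical. The first call uses the discordant counts from the Table 6 caption (6 Spell-only, 6 Claude-Code-only successes). The second uses the 13 Codex-only AppWorld items mentioned in the trace analysis together with a single Spell-only item; that count is an assumption inferred from the resolved totals (36 versus 24), not a reported figure.

```python
from math import comb


def sign_test_p(a_only, b_only):
    """Exact two-sided sign test p value for discordant pair counts.

    Under the null hypothesis each discordant pair favors either system
    with probability 1/2; ties (shared successes or failures) are ignored.
    """
    n = a_only + b_only
    if n == 0:
        return 1.0
    k = min(a_only, b_only)
    # Probability of a split at least as lopsided as the one observed, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


p_terminal_opus = sign_test_p(6, 6)   # Table 6 discordant counts
p_appworld = sign_test_p(1, 13)       # Spell-only count of 1 is inferred, see above
print(p_terminal_opus, round(p_appworld, 3))
```

With these inputs the sketch returns 1.0 for the Opus 4.6 comparison and about 0.002 for AppWorld, in line with the reported values.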

Table 16: Figure [4](https://arxiv.org/html/2605.06898#S4.F4.5) source table. Total tokens is the sum of input plus visible and reasoning output tokens.

| Benchmark | System | Resolved | Cost | p | Total tokens |
| --- | --- | --- | --- | --- | --- |
| Terminal-Bench 1.1 | Spell | 36/80 (45.0%) | $25.72 | 0.549 | 6.96M |
| Terminal-Bench 1.1 | Codex CLI | 39/80 (48.8%) | $46.96 | 0.549 | 51.14M |
| SWE-bench Lite | Spell | 171/300 (57.0%) | $102.12 | 1.000 | 51.37M |
| SWE-bench Lite | Codex CLI | 172/300 (57.3%) | $161.82 | 1.000 | 155.96M |
| LongBench v2 | Spell | 122/200 (61.0%) | $27.83 | 0.041 | 10.36M |
| LongBench v2 | Codex CLI | 135/200 (67.5%) | $25.58 | 0.041 | 35.19M |
| AppWorld dev | Spell | 24/57 (42.1%) | $32.99 | 0.002 | 10.73M |
| AppWorld dev | Codex CLI | 36/57 (63.2%) | $10.98 | 0.002 | 16.43M |

##### AppWorld trace analysis.

Investigation into causes of the AppWorld gap revealed an unclear mix of behaviors and issues. Among the 13 Codex-only items, approximately 8–9 involved Spell-side query formulation, data-source selection, action-selection, or final-submission mistakes. One item was a clear Spell-side refusal to perform a simulated Venmo transfer (`37a8675_1`). Three items appeared to be task-completion or evaluator mismatches after Spell had performed the requested action, and one was an `unknown_agent_error`. In rows missed by both agents, the most common shared pattern was temporal anchoring to the real run date (May 1, 2026) rather than the 2022–2023 fixture dates used by the AppWorld state, producing false-zero or no-op answers.