Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages
Summary
This paper evaluates six frontier coding agents on esoteric programming languages and finds that stronger agents use metaprogramming—writing Python programs to generate and debug code in the unfamiliar target language. Forbidding this strategy causes large performance drops, while providing Python helper code improves weaker agents.
# Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages
Source: [https://arxiv.org/html/2606.10933](https://arxiv.org/html/2606.10933)
Aman Sharma, Lossfunk, aman\.sharma@lossfunk\.com; Sushrut Thorat, Lossfunk, sushrut\.thorat@lossfunk\.com; Paras Chopra, Lossfunk, paras@lossfunk\.com
###### Abstract
LLM\-based coding agents are usually evaluated in familiar software settings: mainstream languages, common libraries, and public repositories\. These benchmarks remain important, but they can hide how agents behave when the language itself is unfamiliar\. We evaluate six contemporary coding agents on four esoteric programming languages using a sequential setup with file editing, local execution, and hidden\-test grading\. Our protocol exposes capability differences between these agents that mainstream coding and agentic benchmarks such as SWE\-Bench Verified and Terminal\-Bench 2\.0 compress into much narrower bands\. We observe that the strongest agents, Claude Opus 4\.6 and GPT\-5\.4 xhigh, often avoid writing the target language directly\. On Brainfuck and Befunge\-98, they write Python programs that generate target\-language code and debug those generators locally\. Forbidding this metaprogramming strategy causes large performance drops\. Text guidance distilled from this strategy does not materially improve weaker agents\. In contrast, Opus\-derived Python helper code for building generators, with no solved benchmark programs or hidden\-test answers, sharply improves Sonnet 4\.6 and GPT\-5\.4 mini on the same problems, while Haiku 4\.5 remains low\. More interpreter calls and output tokens improve stronger agents but leave weaker agents near their original performance, indicating that these resources amplify useful strategies rather than create them\. Together, these results show that strong coding agents adapt to unfamiliar languages by using tools, feedback, and workspace state to build a working model of the target language\. Metaprogramming is the clearest case, but the broader gap is constructing and debugging a strategy that works under the target language’s rules\.
## 1 Introduction
Coding is one of the central applications of large language model \(LLM\) agents\. Most prominent benchmarks for coding agents evaluate them in familiar software ecosystems: mainstream languages, common libraries, and public open\-source repositories\. SWE\-Bench Verified \(Jimenez et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib10)\) is a canonical example, testing agents on real GitHub issues from widely used Python projects\. These benchmarks measure important progress on realistic software\-engineering tasks, but they also test settings where frontier models have extensive prior exposure to the relevant syntax, APIs, libraries, coding patterns, and repository structure\. A complementary question is how the same agents behave when the programming language itself is unfamiliar: when the agent must figure out how to write, run, debug, and revise code in a language whose syntax and execution rules are not already familiar\. This setting has received comparatively less attention in agentic coding evaluation, despite its practical relevance\. Production systems often require models to work with internal domain\-specific languages, proprietary configuration files, generated APIs, and local tool conventions that are absent from public corpora or differ from standard programming environments\. In such settings, success depends less on recalling familiar code patterns and more on building a working understanding of the target interface during the session\.
Figure 1: Task substrate and agentic runtime\. \(a\) The same simple input\-and\-print task in Python, Brainfuck, and Befunge\-98 shows how different esolang code looks from ordinary code\. \(b\) Each model runs in a coding harness \(Claude Code, Codex, or OpenCode\) with file editing, shell access, benchmark commands, and a persistent workspace for local execution and hidden\-test submission\.

To study how contemporary LLM\-based coding agents behave when the programming language itself is unfamiliar, we use languages from EsoLang\-Bench \(Sharma and Chopra, [2026](https://arxiv.org/html/2606.10933#bib.bib48)\)\. These esoteric languages are not realistic production targets; they are controlled proxies for unfamiliar executable interfaces\. For example, Brainfuck is a minimal pointer\-machine language, and Befunge\-98 introduces two\-dimensional control flow over a stack\-based grid \(Figure [1](https://arxiv.org/html/2606.10933#S1.F1)a\)\. This makes them useful for testing whether agents can learn an unfamiliar language well enough during a session to write, run, debug, and improve working programs\.
We therefore build an agentic evaluation pipeline around EsoLang\-Bench and use it to compare a capability ladder of six contemporary LLM\-based coding agents under a common tool\-use protocol: Claude Opus 4\.6, Sonnet 4\.6, and Haiku 4\.5; GPT\-5\.4 xhigh and GPT\-5\.4 mini \(xhigh and medium reasoning effort, respectively\); and Kimi K2\.5 \(Figure[1](https://arxiv.org/html/2606.10933#S1.F1)\)\. Each agent works in a persistent workspace where it can edit files, run code locally, and submit final answers to hidden tests\. The evaluation therefore tests an interactive problem\-solving process, not a single code completion\. We analyze final pass rates together with agent logs and targeted interventions, allowing us to ask which agents succeed and how they adapt\.
##### Central observations:
1. Unfamiliar\-language evaluation separates agents that look similar on mainstream coding benchmarks\. Under our EsoLang\-Bench protocol, where the target language must be worked out within the session, these agents are separated over a much wider range than on mainstream coding benchmarks such as SWE\-Bench Verified and Terminal\-Bench 2\.0, exposing capability differences that those benchmarks compress into narrower bands \(Table [2](https://arxiv.org/html/2606.10933#S3.T2)\)\.
2. The strongest agents use metaprogramming\. They write Python generators that emit target\-esolang programs, reuse helpers across problems, and test locally before submission\. This emerges without language\-specific prompting\. Forbidding metaprogramming on Brainfuck and Befunge\-98 drops performance by tens of percentage points\.
3. Strategy transfer works through executable scaffolds, but not with distilled written strategies\. A textual description of the strategy does not close the gap, but providing the implementation of that strategy as a reference library of working Python generators substantially improves Sonnet 4\.6 and GPT\-5\.4 mini\. Haiku 4\.5 remains low, showing that some agents still cannot compose the provided machinery into working solutions\.
4. Extra inference\-time resources help only when agents can use them\. More interpreter calls and output tokens improve stronger agents but leave weaker agents near their original performance, indicating that resources amplify useful strategies rather than create them\.
## 2 Experimental setup
##### Task substrate\.
We evaluate on EsoLang\-Bench\(Sharma and Chopra,[2026](https://arxiv.org/html/2606.10933#bib.bib48)\), using the original 80\-problem sequences for Brainfuck, Befunge\-98, Whitespace, and Shakespeare\. EsoLang\-Bench includes a fifth language, Unlambda, which we exclude because its interpreter made local execution substantially slower in our agentic setting; including it would make wall\-clock runtime depend heavily on interpreter latency rather than adaptation behavior\. The problems themselves are short standard programming tasks \(echoing input, sorting integers, GCD/LCM, and similar list\- and number\-manipulation tasks across the four difficulty tiers; the full per\-tier task list is in Appendix[A\.2](https://arxiv.org/html/2606.10933#A1.SS2)\)\. The original EsoLang\-Bench evaluation reports near\-ceiling performance on these same problem statements when models answer in Python or JavaScript, so here the difficulty is primarily expressing, implementing, and debugging solutions in an unfamiliar target language\.
##### Agentic protocol\.
We use the same four\-language task substrate as EsoLang\-Bench, but evaluate each model as an interactive coding agent rather than a one\-pass generator\. Each model × language run is one sequential session over all 80 problems for that language\. Problems are fetched in fixed forward order\. For each problem, the agent receives the statement, edits files in an isolated workspace, runs candidates locally, and may make up to three hidden submissions\. A problem is finalized when one submission passes all six hidden tests or when the three submissions are exhausted; finalized problems are not revisited\.
Local interpreter calls expose ordinary execution feedback such as stdout, stderr, and runtime errors\. Hidden submissions return only the aggregate number of private tests passed, not the private inputs, expected outputs, or per\-test diagnostics\. Figure [2](https://arxiv.org/html/2606.10933#S2.F2) summarizes the state machine\. The primary protocol uses 80 problems per language, six hidden tests per problem, up to three hidden submissions, unlimited local interpreter calls, a 32k\-token output budget per assistant turn, and isolated workspaces; all parameters are summarized in Appendix Table [4](https://arxiv.org/html/2606.10933#A1.T4)\.
Figure 2: Per\-problem state machine under the primary protocol\. Each model–language run is a fixed forward session over 80 problems\. For each problem, the agent fetches the specification, edits and executes candidate programs locally, and makes up to three hidden submissions\. Hidden submissions return only aggregate hidden\-test feedback; finalized problems are not revisited\.
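The per-problem rules above can be sketched as a small state object. This is a minimal illustration under stated assumptions, not the authors' harness: `ProblemState`, `submit`, `run_demo`, and the `grade` callback are hypothetical names, and only the constants (six hidden tests, three submissions) come from the protocol description.

```python
from dataclasses import dataclass, field
from typing import Callable, List

HIDDEN_TESTS = 6      # hidden tests per problem (from the protocol)
MAX_SUBMISSIONS = 3   # hidden submissions allowed per problem (from the protocol)


@dataclass
class ProblemState:
    """Hypothetical tracker for one problem in a fixed forward session."""
    passed_counts: List[int] = field(default_factory=list)

    @property
    def solved(self) -> bool:
        # Solved only if a single submission passes every hidden test;
        # passes are never pooled across submissions.
        return any(n == HIDDEN_TESTS for n in self.passed_counts)

    @property
    def finalized(self) -> bool:
        return self.solved or len(self.passed_counts) >= MAX_SUBMISSIONS

    def submit(self, grade: Callable[[str], int], source: str) -> int:
        """Record one hidden submission and return only the aggregate pass count."""
        if self.finalized:
            raise RuntimeError("problem is finalized and cannot be revisited")
        passed = grade(source)
        self.passed_counts.append(passed)
        return passed


def run_demo() -> ProblemState:
    """Three scripted submissions that each pass only a subset of the tests."""
    scripted = iter([3, 5, 4])  # made-up aggregate results, for illustration only
    state = ProblemState()
    while not state.finalized:
        state.submit(lambda src: next(scripted), "candidate program")
    return state
```

In `run_demo`, the problem ends finalized but unsolved: no single submission reached six passes, so no credit is given even though the three attempts together passed many tests.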
##### Models and harnesses\.
We evaluate deployed coding agents rather than bare models\. Claude Opus 4\.6, Claude Sonnet 4\.6, and Claude Haiku 4\.5 run under Claude Code \(Anthropic, [2026](https://arxiv.org/html/2606.10933#bib.bib57)\); GPT\-5\.4 xhigh and GPT\-5\.4 mini run under Codex \(OpenAI, [2026a](https://arxiv.org/html/2606.10933#bib.bib58), [b](https://arxiv.org/html/2606.10933#bib.bib59)\); and Kimi K2\.5 runs under OpenCode \(Moonshot AI, [2026](https://arxiv.org/html/2606.10933#bib.bib60)\)\. This model × harness pairing is part of what we evaluate, because tool mediation, file editing, shell access, and workspace management are part of deployed coding\-agent systems\. Per\-agent API endpoints, model identifiers, sampling settings, and harness invocations are documented in Appendix [A](https://arxiv.org/html/2606.10933#A1)\.
Every agent receives the same benchmark\-facing operations and the same per\-language system prompt \(a simple task prompt for benchmarking, with no problem\-specific guidance, no solved examples, and no hidden\-test material\); the system prompt and the per\-condition deviations from this primary prompt are in Appendix[A\.12](https://arxiv.org/html/2606.10933#A1.SS12)\. As a cross\-harness check, we re\-ran Opus 4\.6 and GPT\-5\.4 xhigh under OpenCode on Brainfuck and Befunge\-98; we observed similar performance and the qualitative ordering is unchanged \(Appendix[B\.4](https://arxiv.org/html/2606.10933#A2.SS4)\)\.
##### Logging and behavioral measurements\.
For each run, we log problem fetches, shell commands, local interpreter calls, hidden submissions, file edits, generated files, command outputs, and final workspace state\. We use*metaprogramming*to mean that the agent writes a program in a familiar host language, such as Python, JavaScript, or Rust, whose output is source code in the target esolang\. This differs from direct authoring, where the agent edits the target esolang source itself\. A helper file is reusable if it persists in the workspace and is called, imported, copied, or modified across multiple problems in the same session\. These labels describe behavior only; scoring depends solely on hidden\-test success\.
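To make the definition of metaprogramming concrete, the sketch below shows a host-language (Python) function whose output is Brainfuck source, plus a tiny interpreter for checking the generated program locally. Both `emit_print` and `run_bf` are hypothetical illustrations written for this example, not code from the agents' traces, and the interpreter assumes 8-bit wrapping cells, ASCII text, and zero on end of input.

```python
def emit_print(text: str) -> str:
    """Generate Brainfuck source that prints `text` using a single cell."""
    parts, current = [], 0
    for ch in text:
        delta = ord(ch) - current
        parts.append(("+" if delta >= 0 else "-") * abs(delta))
        parts.append(".")
        current = ord(ch)
    return "".join(parts)


def run_bf(src: str, stdin: str = "") -> str:
    """Minimal Brainfuck interpreter (assumed: 30000 wrapping 8-bit cells)."""
    # Pre-compute matching bracket positions.
    stack, jump = [], {}
    for i, c in enumerate(src):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jump[i], jump[j] = j, i

    tape, ptr, pc = [0] * 30000, 0, 0
    out, inp = [], iter(stdin)
    while pc < len(src):
        c = src[pc]
        if c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == ",":
            tape[ptr] = ord(next(inp, "\0")) % 256  # EOF reads as 0
        elif c == "[" and tape[ptr] == 0:
            pc = jump[pc]  # skip the loop body
        elif c == "]" and tape[ptr] != 0:
            pc = jump[pc]  # repeat the loop body
        pc += 1
    return "".join(out)


generated = emit_print("Hi!")
assert run_bf(generated) == "Hi!"
```

Direct authoring would mean typing the long run of `+`, `-`, and `.` characters by hand; here the agent edits only the Python function and regenerates the target source after every change.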
##### Scoring and reporting\.
A problem is counted as solved if and only if one of the agent’s submissions passes all six private hidden tests for that problem; a submission that passes only a subset of the hidden tests counts as a failure for that submission, with no partial credit, and we do not aggregate hidden\-test passes across the up\-to\-three submissions allowed per problem\. Primary scores are solved problems out of 80 for each model × language run\.
For each model × language cell, we run three independent sessions under the same task order and protocol\. We report the first session’s solved count as the headline value in Table [1](https://arxiv.org/html/2606.10933#S2.T1) and Table [2](https://arxiv.org/html/2606.10933#S3.T2), with Wilson 95% binomial confidence intervals \(Wilson, [1927](https://arxiv.org/html/2606.10933#bib.bib63)\) computed over its 80 per\-problem outcomes \(k/80 per\-language and k/320 pooled\)\. The remaining two sessions serve as session\-to\-session sanity checks; per\-session counts for all three sessions are tabulated in Appendix [B\.7](https://arxiv.org/html/2606.10933#A2.SS7)\. Full reporting and confidence\-interval details are in Appendices [A\.6](https://arxiv.org/html/2606.10933#A1.SS6), [A\.7](https://arxiv.org/html/2606.10933#A1.SS7), and [B\.6](https://arxiv.org/html/2606.10933#A2.SS6)\.
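For readers who want to reproduce the interval arithmetic, the following is a minimal sketch of the standard Wilson score interval for k successes out of n trials. The function name `wilson_interval` and the default z value for a two-sided 95% interval are choices made for this illustration; the paper's own implementation details are in its appendices and may differ.

```python
import math
from typing import Tuple


def wilson_interval(k: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("require n > 0 and 0 <= k <= n")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Per-language cell: k solved out of 80 problems.
low, high = wilson_interval(64, 80)
# Pooled cell: k solved out of 4 * 80 = 320 problems.
pooled_low, pooled_high = wilson_interval(256, 320)
```

At the ceiling (k = n) the upper bound is 1 and the lower bound reduces to n / (n + z²), which is why ceiling and floor cells have one-sided subscripts while interior cells have intervals on both sides.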
##### Diagnostic protocol variants\.
All headline results use the primary protocol above\. We use protocol variants only for targeted diagnostic experiments\. In Section[3\.3](https://arxiv.org/html/2606.10933#S3.SS3), we restrict metaprogramming by requiring agents to author Brainfuck and Befunge\-98 directly, without using a host\-language generator\. In Section[3\.4](https://arxiv.org/html/2606.10933#S3.SS4), we test strategy transfer by giving weaker agents either text guidance distilled from Claude Opus 4\.6’s traces or a small reference library of working generator code, with no solved benchmark problems or hidden\-test answers included\. In Section[3\.5](https://arxiv.org/html/2606.10933#S3.SS5), we vary local interpreter\-call budgets and output\-token budgets while holding the task substrate fixed\.
These variants are used to explain the performance gaps observed under the primary protocol, not to define the main score\. Unless explicitly stated otherwise, model prompts, task order, hidden tests, submission limits, and scoring rules remain unchanged from the primary protocol\.
Table 1: Main EsoLang\-Bench results under the primary protocol\. EsoLang cells report Session 1 percentage solved \(the headline session; two further sessions per cell are reported in Appendix [B\.7](https://arxiv.org/html/2606.10933#A2.SS7)\) with Wilson 95% binomial confidence intervals as subscripts\. The Mean column averages the four esolang scores\. BF = Brainfuck, B98 = Befunge\-98, WS = Whitespace, Sh = Shakespeare\. Subscripts denote Wilson 95% intervals: −x for ceiling cells, \+x for near\-floor cells, and ±x for interior cells\. Raw counts are in Appendix [B\.2](https://arxiv.org/html/2606.10933#A2.SS2), per\-session counts in Appendix [B\.7](https://arxiv.org/html/2606.10933#A2.SS7), and full asymmetric intervals in Appendix [B\.6](https://arxiv.org/html/2606.10933#A2.SS6)\.
## 3 Results
### 3\.1 Unfamiliar\-language evaluation sharply separates contemporary coding agents
We first ask how contemporary agentic models fare under unfamiliar\-language evaluation\. Table [1](https://arxiv.org/html/2606.10933#S2.T1) reports per\-language EsoLang\-Bench scores for the six evaluated agents under the primary protocol, with Wilson 95% binomial confidence intervals \(Section [2](https://arxiv.org/html/2606.10933#S2), Appendix [B\.6](https://arxiv.org/html/2606.10933#A2.SS6)\)\. We observe a large performance spread between the agents\. The separation is not driven by a single uniformly hard language\. Per\-language ranges are 5\.0–98\.8 on Brainfuck, 5\.0–100 on Befunge\-98, 31\.3–100 on Whitespace, and 2\.5–100 on Shakespeare\. Whitespace is near\-ceiling for several agents, whereas Brainfuck and Befunge\-98 expose large separations\. What these two languages share is that they are low\-ecosystem targets whose syntax and idioms are far from mainstream software work; the spread is therefore informative about within\-session adaptation to an unfamiliar executable interface\.
Comparing the same six agents across mainstream coding benchmarks reveals a much smaller spread\. Table [2](https://arxiv.org/html/2606.10933#S3.T2) reports the spread and standard deviation across three mainstream coding benchmarks \(SWE\-Bench Verified, Terminal\-Bench 2\.0, and LiveCodeBench v6\) together with the EsoLang\-Bench mean; EsoLang\-Bench produces both the widest spread and the largest SD of the four\. Figure [6](https://arxiv.org/html/2606.10933#A2.F6) in the appendix visualizes the SWE\-V versus EsoLang\-Bench cell as a scatter plot\. This result shows that unfamiliar\-language evaluation exposes capability differences that these mainstream benchmarks compress\.
Table 2: Mainstream coding scores cluster while unfamiliar\-language scores separate\. Six contemporary coding agents on three mainstream coding benchmarks and the EsoLang\-Bench four\-language mean\. Spread is best minus worst across the six agents; SD is the sample standard deviation \(n = 6, n − 1 denominator\)\. The EsoLang SD \(36\.0\) is roughly 12× SWE\-Bench Verified’s \(2\.9\) and 2× LiveCodeBench v6’s \(17\.2\)\. † vendor\-reported; ∗ Vals\.ai\. Per\-agent sourcing in Appendix Table [6](https://arxiv.org/html/2606.10933#A2.T6); scatter view in Figure [6](https://arxiv.org/html/2606.10933#A2.F6)\.
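The two summary statistics in the Table 2 caption are simple to compute. The sketch below shows spread (best minus worst) and the sample standard deviation with an n − 1 denominator. The `placeholder_scores` list is made-up data for illustration only, not the paper's per-agent scores; only the three SD values in the final lines are taken from the caption.

```python
import statistics
from typing import Sequence, Tuple


def spread_and_sd(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (best minus worst, sample standard deviation with n-1 denominator)."""
    return max(scores) - min(scores), statistics.stdev(scores)


# Placeholder scores for six hypothetical agents (NOT values from the paper).
placeholder_scores = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
spread, sd = spread_and_sd(placeholder_scores)

# Ratios implied by the SDs quoted in the caption (36.0, 2.9, 17.2).
ratio_vs_swe = 36.0 / 2.9    # about 12.4, reported as "roughly 12x"
ratio_vs_lcb = 36.0 / 17.2   # about 2.1, reported as "roughly 2x"
```

`statistics.stdev` uses the n − 1 denominator, matching the caption's definition; `statistics.pstdev` would give the population version and slightly smaller values.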
### 3\.2 Strong agents discover metaprogramming strategies
The performance spread in Section[3\.1](https://arxiv.org/html/2606.10933#S3.SS1)raises a behavioral question: what do high\-performing agents do differently during the session? Inspecting the logged trajectories shows a consistent pattern on the low\-level languages, especially Brainfuck and Befunge\-98\. The strongest agents often avoid direct target\-language authoring\. Instead, they write generators in a familiar host language that emit source code in the target esolang, then run the generated programs against the local interpreter before submission\. We call this metaprogramming\. This subsection describes the behavior we observe in the logs; the next subsection tests whether removing it changes performance\.
This strategy is not requested by the primary prompt\. The system prompts are fixed across primary runs and reproduced in Appendix[A\.12](https://arxiv.org/html/2606.10933#A1.SS12); the harness accepts either direct target\-language files or generated target\-language files\. The strategy therefore emerges from the agent’s interaction with the task substrate, rather than from language\-specific prompting\.
A representative within\-session switch occurs in Brainfuck E04\. Opus 4\.6 first submits a hand\-written Brainfuck program of 1884 bytes, which fails the hidden tests\. After the failure, it writes a Python generator; the generated Brainfuck source is 24500 bytes and passes all six hidden tests\. This illustrates why the strategy helps: the host program can name and reuse structure that is implicit and fragile in raw Brainfuck, such as cell allocation, pointer position, sign flags, decimal\-digit layout, BCD arithmetic, and conditional macros\. Sample programs and trajectory excerpts are provided in Appendices [F](https://arxiv.org/html/2606.10933#A6) and [C\.5](https://arxiv.org/html/2606.10933#A3.SS5)\.
### 3\.3 Metaprogramming is causally important on Brainfuck and Befunge\-98
To test whether metaprogramming merely correlates with success or supports it, we run a no\-metaprogramming variant for the two strongest agents\. In this variant, agents must author the target esolang directly and may not use a host\-language program to generate target source\. All other aspects of the protocol are held fixed\.
Figure[3](https://arxiv.org/html/2606.10933#S3.F3)shows that removing metaprogramming causes large drops on Brainfuck and Befunge\-98 for both Opus 4\.6 and GPT\-5\.4 xhigh\. These are the two languages where direct authoring requires the agent to maintain long, low\-level program state across edits\. In direct Brainfuck authoring, for example, the agent must track cell offsets, pointer position, flags, and numeric encodings implicitly while editing raw symbols\. Small local changes can invalidate this bookkeeping, and local smoke tests often miss hidden edge cases\. A host\-language generator externalizes this state into named variables and reusable functions, so the same cell allocation, arithmetic, and branching patterns can be emitted consistently across problems\.
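The idea of externalizing Brainfuck bookkeeping into named variables can be sketched as follows. `BFBuilder` and its methods are hypothetical names invented for this illustration; they are not the agents' generators or the library used in the transfer experiment. The builder tracks the pointer position itself, so every emitted move is computed rather than counted by hand.

```python
class BFBuilder:
    """Hypothetical Brainfuck code builder with named cells and a tracked pointer."""

    def __init__(self) -> None:
        self.cells = {}   # cell name -> tape index
        self.ptr = 0      # where the generator knows the pointer to be
        self.code = []    # emitted Brainfuck fragments

    def alloc(self, name: str) -> str:
        self.cells[name] = len(self.cells)
        return name

    def goto(self, name: str) -> None:
        delta = self.cells[name] - self.ptr
        self.code.append((">" if delta >= 0 else "<") * abs(delta))
        self.ptr = self.cells[name]

    def set(self, name: str, value: int) -> None:
        self.goto(name)
        self.code.append("[-]" + "+" * value)  # clear the cell, then count up

    def add_into(self, src: str, dst: str) -> None:
        """dst += src, leaving src at zero (the usual move loop)."""
        if src == dst:
            raise ValueError("src and dst must differ")
        self.goto(src)
        self.code.append("[-")
        self.goto(dst)
        self.code.append("+")
        self.goto(src)  # the pointer must be back on src when the loop closes
        self.code.append("]")

    def output(self, name: str) -> None:
        self.goto(name)
        self.code.append(".")

    def source(self) -> str:
        return "".join(self.code)


def demo() -> str:
    """Emit a program that computes 40 + 25 and prints the resulting byte."""
    b = BFBuilder()
    a, total = b.alloc("a"), b.alloc("total")
    b.set(a, 40)
    b.set(total, 25)
    b.add_into(a, total)
    b.output(total)
    return b.source()
```

Changing the layout, for example allocating a new cell before `a`, only changes the dictionary; all pointer moves are recomputed on the next run, which is the kind of consistency that is hard to maintain when editing raw symbols.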
By contrast, Whitespace and Shakespeare are less diagnostic for this intervention because successful solutions are often short enough or structured enough to author directly\. The result therefore supports a narrower claim: metaprogramming is a major mechanism for high performance on the low\-level esolangs where direct target\-language editing becomes fragile\. Code excerpts and trajectory examples are provided in Appendices[C\.5](https://arxiv.org/html/2606.10933#A3.SS5)and[C\.6](https://arxiv.org/html/2606.10933#A3.SS6)\.
Figure 3: Forcing direct authoring sharply reduces performance on Brainfuck and Befunge\-98\. Solved problems out of 80 for Opus 4\.6 and GPT\-5\.4 xhigh with metaprogramming allowed versus forced direct authoring\. The largest drops occur on the low\-level languages where target programs are long and fragile\.

##### The benefit is host\-language generation, not Python specifically\.
On Brainfuck, swapping the generator host language preserves most of the gain: Opus 4\.6 solves 64/80 with Python, 63/80 with JavaScript, and 55/80 with Rust, while GPT\-5\.4 xhigh solves 79/80, 77/80, and 79/80, respectively\. Direct authoring remains low for both agents \(27/80 and 29/80\)\. Thus the critical ingredient is access to a familiar general\-purpose host language for constructing target programs, not Python itself; the corresponding generator code in each host language is shown in Appendix [C\.7](https://arxiv.org/html/2606.10933#A3.SS7)\.
### 3\.4 Strategy transfer works through executable scaffolds, but not with distilled written strategies
We next ask whether lower\-performing agents fail because they lack the high\-level idea or because they cannot construct the machinery needed to execute it\. We use the strongest agents’ traces \(primarily Claude Opus 4\.6, plus a single generic Brainfuck builder pattern from a successful GPT\-5\.4 xhigh session\) to create two forms of strategy transfer for three lower\-performing agents\. In the*text*condition, we add a system\-prompt preamble summarizing the strategy: use a generator, build reusable primitives, verify locally, and regenerate components rather than hand\-patching target code\. In the*library*condition, we additionally provide a small strategy\-only host\-language helper library distilled from those traces, containing generic code\-generation primitives \(cell allocator, BCD\-arithmetic helpers, decimal\-print primitives, and a local Befunge\-98 simulator\) and a notes document; the exact files shipped are listed in Appendix[G\.2](https://arxiv.org/html/2606.10933#A7.SS2)\. No per\-problem generators, no solved benchmark programs, no hidden\-test inputs, no expected outputs, and no ground\-truth answers are included\.
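To give a sense of what a local simulator component might look like, the sketch below interprets a small Befunge-93-style subset (digits, `+ - *`, direction changes, `_` and `|` conditionals, `:` duplicate, string mode, `.` and `,` output, `@` halt). It is an illustration written for this example and is not the simulator shipped in the library condition; `run_befunge` is a hypothetical name, and Befunge-98 features such as string-mode space collapsing, fingerprints, and input instructions are deliberately omitted.

```python
def run_befunge(src: str, max_steps: int = 10_000) -> str:
    """Run a small Befunge subset and return its output (illustrative only)."""
    rows = src.split("\n")
    width = max(len(r) for r in rows) or 1
    grid = [r.ljust(width) for r in rows]   # pad rows into a rectangle
    x, y, dx, dy = 0, 0, 1, 0               # start top-left, moving right
    stack, out, string_mode = [], [], False

    def pop() -> int:
        return stack.pop() if stack else 0  # popping an empty stack yields 0

    for _ in range(max_steps):
        c = grid[y][x]
        if string_mode:
            if c == '"':
                string_mode = False
            else:
                stack.append(ord(c))
        elif c == '"':
            string_mode = True
        elif c in "0123456789":
            stack.append(int(c))
        elif c in "+-*":
            b, a = pop(), pop()
            stack.append({"+": a + b, "-": a - b, "*": a * b}[c])
        elif c in "><^v":
            dx, dy = {">": (1, 0), "<": (-1, 0), "^": (0, -1), "v": (0, 1)}[c]
        elif c == "_":
            dx, dy = (1, 0) if pop() == 0 else (-1, 0)   # zero goes right
        elif c == "|":
            dx, dy = (0, 1) if pop() == 0 else (0, -1)   # zero goes down
        elif c == ":":
            v = pop()
            stack.extend([v, v])
        elif c == ".":
            out.append(f"{pop()} ")   # integers print with a trailing space
        elif c == ",":
            out.append(chr(pop()))
        elif c == "@":
            return "".join(out)
        # Advance the instruction pointer with wraparound at the grid edges.
        x, y = (x + dx) % width, (y + dy) % len(grid)
    raise RuntimeError("step limit reached without '@'")


assert run_befunge('"!iH",,,@') == "Hi!"
```

A step limit is included because generated programs can loop forever; failing fast in a local check is cheaper than spending a hidden submission on a non-terminating program.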
Table [3](https://arxiv.org/html/2606.10933#S3.T3) reports solved problems out of 80 for Brainfuck and Befunge\-98\. Written advice alone produces little improvement\. The reference library, in contrast, substantially improves Sonnet 4\.6 and GPT\-5\.4 mini, while Haiku 4\.5 remains near the floor\. This pattern suggests that the mid\-tier agents do not mainly lack the high\-level idea; they struggle to build the reusable code needed to carry it out\.
Table 3: Strategy transfer works through executable scaffolds, but not with distilled written strategies\. Results are problems solved out of 80; Base is the primary protocol\. *\+Text* adds written strategy guidance distilled from Opus 4\.6’s trajectory; *\+Lib* also provides a small strategy\-only host\-language helper library distilled from the strong\-agent traces \(primarily Opus 4\.6, plus one generic GPT\-5\.4 xhigh Brainfuck builder pattern\), with no per\-problem generators, no solved programs, no hidden\-test inputs, no expected outputs, and no ground\-truth answers\.

Figure 4: More interpreter calls help only agents that can use feedback\. Problems solved out of 80 on Brainfuck and Befunge\-98 under local\-interpreter\-call budgets of 3, 5, 15, 30, and unlimited\. Opus improves with budget; Haiku remains near the floor; Sonnet improves on Befunge\-98 but not Brainfuck\.
### 3\.5 Additional inference\-time resources help only when agents can use them
##### Interpreter\-call budget\.
We cap local interpreter calls per problem at 3, 5, 15, 30, or unlimited, holding the task substrate, hidden submissions, and scoring rule fixed\. Figure [4](https://arxiv.org/html/2606.10933#S3.F4) shows the result on Brainfuck and Befunge\-98\. Additional interpreter access helps agents that already convert local feedback into progress: Opus 4\.6 improves on both languages, and Sonnet 4\.6 improves on Befunge\-98\. Agents that are near the floor at the smallest budget remain near the floor even when given many more local runs\. Thus, tool access is not a uniform substitute for strategy construction; it amplifies agents that can use the feedback\.
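A per-problem interpreter-call cap of this kind could be enforced with a thin wrapper around the local interpreter. The sketch below is an assumption about how such a cap might be implemented, not the paper's harness code; `BudgetedInterpreter`, `BudgetExhausted`, and the `run` callback signature are hypothetical.

```python
from typing import Callable, Optional


class BudgetExhausted(RuntimeError):
    """Raised when the per-problem local-call budget is used up."""


class BudgetedInterpreter:
    """Wraps a local interpreter and counts calls for the current problem."""

    def __init__(self, run: Callable[[str, str], str], budget: Optional[int]) -> None:
        self._run = run
        self.budget = budget   # e.g. 3, 5, 15, 30, or None for unlimited
        self.calls = 0

    def new_problem(self) -> None:
        """Reset the counter when the session moves to the next problem."""
        self.calls = 0

    def __call__(self, source: str, stdin: str = "") -> str:
        if self.budget is not None and self.calls >= self.budget:
            raise BudgetExhausted(f"local interpreter budget of {self.budget} reached")
        self.calls += 1
        return self._run(source, stdin)


# Usage with a stand-in interpreter that just echoes its input in upper case.
interp = BudgetedInterpreter(lambda src, stdin: src.upper(), budget=3)
outputs = [interp("abc") for _ in range(3)]   # the fourth call would raise
```

Hidden submissions are tracked separately in the protocol, so a wrapper like this limits only local feedback and leaves the three-submission rule unchanged.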
##### Output\-token use\.
We also ask whether the gap is explained simply by stronger models spending more output tokens\. For the first 20 Brainfuck and Befunge\-98 problems, we log cumulative API output tokens for the three Claude agents, including extended\-thinking tokens for Opus and Sonnet\. Figure [5](https://arxiv.org/html/2606.10933#S4.F5) plots cumulative solves against cumulative output tokens\. Opus 4\.6 solves more problems with fewer tokens than Sonnet 4\.6 on Brainfuck and reaches 20/20 on Befunge\-98 with roughly half Sonnet’s token use\. The difference is therefore not just that Opus spends more; it finds a reusable strategy earlier, after which additional problems become cheaper to solve\. More output budget does not substitute for finding the strategy\.
## 4 Limitations and validity checks
##### Closed models and training exposure\.
The strongest agents we evaluate are closed\-source, so we cannot inspect their training data, post\-training environments, or exact exposure to esolang examples\. We therefore do not claim formal distributional novelty\. Our claim is operational: these are low\-ecosystem programming targets relative to mainstream languages, and they induce large differences in how deployed coding agents adapt during a session\. Appendix [D](https://arxiv.org/html/2606.10933#A4) reports public\-code prevalence and n\-gram overlap analyses showing orders\-of\-magnitude less public\-code presence for these esolangs than for mainstream languages\. Hidden tests, isolated workspaces, and fixed forward sessions reduce shallow memorization but do not prove zero exposure\.
Figure 5: Output\-token use does not explain the gap\. Cumulative solves versus cumulative API output tokens on the first 20 Brainfuck and Befunge\-98 problems for Claude agents\. Opus reaches 20/20 on both languages with fewer tokens than Sonnet; Haiku saturates early\.
##### Artificial but controlled proxies for testing adaptation\.
The esolangs are artificial, but that is what makes them useful here\. They are public, fully specified, runnable, automatically graded languages with unusual syntax, execution rules, and debugging surfaces\. This gives us a controlled proxy for a practical pressure that is otherwise hard to study publicly: whether an agent can build, test, and revise a working interface model when familiar language and library priors are weak\. We therefore treat success on these tasks as evidence about adaptation to low\-ecosystem executable interfaces, not as evidence that esolangs themselves are important production targets\.
##### Harnesses and mechanism scope\.
We evaluate deployed coding agents rather than bare language models, because tool use, file editing, shell access, and workspace management are part of the systems users actually run\. The tradeoff is that Claude Code, Codex, and OpenCode are not identical internally\. Our protocol fixes the benchmark\-facing interface: every agent receives the same problem sequence, local interpreters, hidden\-test rule, and scoring criterion\. Selected OpenCode re\-runs preserve the qualitative ordering \(Appendix[B\.4](https://arxiv.org/html/2606.10933#A2.SS4)\), making the separation unlikely to be wrapper\-only\. The metaprogramming claim is specific to Brainfuck and Befunge\-98 where direct authoring is fragile, not Whitespace or Shakespeare\.
## 5 Related work
##### Code and agentic coding benchmarks\.
Execution\-based grading is central to modern code evaluation, from HumanEval \(Chen et al\., [2021](https://arxiv.org/html/2606.10933#bib.bib1)\), MBPP \(Austin et al\., [2021](https://arxiv.org/html/2606.10933#bib.bib2)\), APPS \(Hendrycks et al\., [2021](https://arxiv.org/html/2606.10933#bib.bib3)\), DS\-1000 \(Lai et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib4)\), and MultiPL\-E \(Cassano et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib5)\) to LiveCodeBench \(Jain et al\., [2025](https://arxiv.org/html/2606.10933#bib.bib6)\), BigCodeBench \(Zhuo et al\., [2025](https://arxiv.org/html/2606.10933#bib.bib7)\), OJBench \(Wang et al\., [2025](https://arxiv.org/html/2606.10933#bib.bib9)\), SciCode \(Tian et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib8)\), MHPP \(Dai et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib53)\), and SWE\-bench \(Jimenez et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib10)\)\. Agentic benchmarks extend this to multi\-step work in repositories, terminals, desktops, browsers, and research environments \(Yang et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib15); Xia et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib16); Zhang et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib17); Merrill et al\., [2026](https://arxiv.org/html/2606.10933#bib.bib11); Xie et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib14); Drouin et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib20); Koh et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib21); He et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib22); Starace et al\., [2025](https://arxiv.org/html/2606.10933#bib.bib12); Chan et al\., [2025](https://arxiv.org/html/2606.10933#bib.bib13)\)\. These settings are realistic but mix many factors: repository navigation, dependency management, long\-horizon planning, environment quirks, and familiarity with public software ecosystems\.
We inherit executable grading, but change the controlled variable: using EsoLang\-Bench \(Sharma and Chopra, [2026](https://arxiv.org/html/2606.10933#bib.bib48)\), we keep the tasks simple and vary the familiarity of the executable interface itself\. ARC\-AGI\-3 \(ARC Prize Foundation, [2026](https://arxiv.org/html/2606.10933#bib.bib50)\) is close in spirit, but stresses rule inference in unfamiliar games; we stress unfamiliar programming interfaces with explicit task specifications\.
##### Tools, feedback, and language transfer\.
Prior work shows that tools, feedback, and intermediate computation can improve LLM performance \(Yao et al\., [2023b](https://arxiv.org/html/2606.10933#bib.bib23); Schick et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib24); Shinn et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib25); Madaan et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib26); Yao et al\., [2023a](https://arxiv.org/html/2606.10933#bib.bib27); Gao et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib28); Chen et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib29), [2024](https://arxiv.org/html/2606.10933#bib.bib30); Nye et al\., [2021](https://arxiv.org/html/2606.10933#bib.bib31)\), and benchmarks such as TAU\-bench \(Yao et al\., [2025](https://arxiv.org/html/2606.10933#bib.bib18)\) and BFCL \(Patil et al\., [2025](https://arxiv.org/html/2606.10933#bib.bib19)\) evaluate interface following and tool use\. Our question is which agents can turn local execution and workspace persistence into reliable program construction when the target language is unfamiliar\. The cross\-host experiment also connects to multilingual code generation and translation \(Cassano et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib5); Paul et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib39); Twist et al\., [2025](https://arxiv.org/html/2606.10933#bib.bib40); Roziere et al\., [2020](https://arxiv.org/html/2606.10933#bib.bib41); Ahmad et al\., [2021](https://arxiv.org/html/2606.10933#bib.bib42); Wang et al\., [2021](https://arxiv.org/html/2606.10933#bib.bib43); Fried et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib44); Nijkamp et al\., [2022](https://arxiv.org/html/2606.10933#bib.bib45); Li et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib46); Roziere et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib47)\)\. Our strongest agents often do not simply generate the target language directly; they write generators in a familiar host language and treat the target language as generated output\.
##### Benchmark validity and long\-tail coding\.
Benchmark\-design and contamination work \(Ribeiro et al\., [2020](https://arxiv.org/html/2606.10933#bib.bib32); Nie et al\., [2020](https://arxiv.org/html/2606.10933#bib.bib33); Ye et al\., [2021](https://arxiv.org/html/2606.10933#bib.bib34); Bowman and Dahl, [2021](https://arxiv.org/html/2606.10933#bib.bib35); Oren et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib36); Deng et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib37); Xu et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib38)\) warns that high scores can mask weaknesses and that training overlap is hard to rule out for proprietary models\. We therefore avoid formal OOD claims and report public\-code frequency and n\-gram overlap analyses\. The motivation also connects to long\-tail production coding, where agents face internal DSLs, generated APIs, proprietary configuration formats, and platform\-specific frameworks that are sparse in public training data\. Industry evaluations such as Convex Evals \(Hunt, [2025](https://arxiv.org/html/2606.10933#bib.bib51)\) and Fullstack\-Bench \(Jayakar, [2025](https://arxiv.org/html/2606.10933#bib.bib52)\) document failures on such platform\-specific invariants\. Our esolang setting makes an analogous pressure public, runnable, inspectable, and automatically graded\.
## 6 Conclusion
Unfamiliar programming languages make a normally hidden agent capability visible\. When the target language is low\-ecosystem, success requires more than writing code in a familiar software environment\. The agent has to work out how the language behaves, test candidate programs, revise failures, and decide when a solution is ready\. Because the problems are short standard tasks and every agent receives the same benchmark\-facing interface, the separation reflects how well agents turn tools, feedback, and workspace state into a working strategy for the target language\.
For the strongest agents, the clearest strategy is metaprogramming: host\-language generators, reusable primitives, and local verification loops\. The no\-metaprogramming ablation shows this machinery is causally important on Brainfuck and Befunge\-98, where direct authoring is long and fragile\. Strategy transfer sharpens the point: written advice does little, while working helper code lifts mid\-tier agents and still fails on the weakest\. The key capability is not knowing that a strategy should help, but building and debugging machinery that works under the target language’s rules\.
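As a minimal illustration of the three ingredients named above (a host\-language generator, reusable primitives, and a local verification loop), the following Python sketch generates Brainfuck and checks it before use\. It is not taken from any agent transcript: the helper names `clear`, `emit_text`, `run_bf`, and `verify` are hypothetical, and the interpreter assumes 8\-bit wrapping cells, a 30,000\-cell tape, and end\-of\-input read as 0, which may differ from the benchmark's interpreter\.

```python
def clear() -> str:
    """Reusable primitive: zero the current cell."""
    return "[-]"


def emit_text(text: str) -> str:
    """Reusable primitive: print ASCII `text` from one cell that starts at 0.

    Each character is reached by stepping up or down from the previous one.
    """
    parts, current = [], 0
    for ch in text:
        delta = ord(ch) - current
        parts.append(("+" if delta >= 0 else "-") * abs(delta) + ".")
        current = ord(ch)
    return "".join(parts)


def run_bf(code: str, stdin: str = "", max_steps: int = 1_000_000) -> str:
    """Tiny Brainfuck interpreter (assumed semantics: 8-bit cells, EOF = 0)."""
    # Pre-compute matching bracket positions.
    stack, match = [], {}
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            match[i], match[j] = j, i

    tape, ptr, pc, out = [0] * 30000, 0, 0, []
    inp = iter(stdin)
    steps = 0
    while pc < len(code):
        steps += 1
        if steps > max_steps:
            raise RuntimeError("step limit exceeded")
        c = code[pc]
        if c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == ",":
            tape[ptr] = ord(next(inp, "\0"))
        elif c == "[" and tape[ptr] == 0:
            pc = match[pc]  # skip the loop body
        elif c == "]" and tape[ptr] != 0:
            pc = match[pc]  # jump back to the loop start
        pc += 1
    return "".join(out)


def verify(program: str, expected: str, stdin: str = "") -> bool:
    """Local verification loop: run the generated program and compare output."""
    return run_bf(program, stdin) == expected


# Generate target-language code in the host language, then check it locally.
program = emit_text("Hi\n")
assert verify(program, "Hi\n")
```

The point of the sketch is the division of labor: the target\-language program is never written by hand, and a failed `verify` call points at a bug in a Python primitive rather than in a long string of Brainfuck commands\.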
Agentic generalization in this setting means reorganizing an unfamiliar problem into a form the agent can solve\. The strongest agents do not only retrieve familiar patterns; they create intermediate code, tests, and reusable structure that make the target language usable\. Real deployments often involve internal DSLs, generated APIs, proprietary configuration formats, and local tool conventions that are sparse in public code\. Making this capability reliable in smaller or open\-source agents should become a target for training, distillation, and model analysis\.
## References
- W\. U\. Ahmad, S\. Chakraborty, B\. Ray, and K\. Chang \(2021\)Unified pre\-training for program understanding and generation\.InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp\. 2655–2668\.External Links:[Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.211)Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px5.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- Anthropic \(2026\)Claude Opus 4\.6, Claude Sonnet 4\.6, and Claude Haiku 4\.5: model overview\.Note:Anthropic model overview, system cards, and release announcementsOpus 4\.6 system card February 2026; Sonnet 4\.6 system card February 17, 2026\. Accessed 2026\-05\-04\.External Links:[Link](https://platform.claude.com/docs/en/about-claude/models/overview)Cited by:[Table 6](https://arxiv.org/html/2606.10933#A2.T6.2.2.4),[Table 6](https://arxiv.org/html/2606.10933#A2.T6.4.4.4),[Table 6](https://arxiv.org/html/2606.10933#A2.T6.6.6.4),[item 12](https://arxiv.org/html/2606.10933#Ax1.I1.i12.p1.1),[§2](https://arxiv.org/html/2606.10933#S2.SS0.SSS0.Px3.p1.1)\.
- ARC Prize Foundation \(2026\)ARC\-AGI\-3: a new challenge for frontier agentic intelligence\.External Links:2603\.24621,[Link](https://arxiv.org/abs/2603.24621)Cited by:[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- J\. Austin, A\. Odena, M\. Nye, M\. Bosma, H\. Michalewski, D\. Dohan, E\. Jiang, C\. Cai, M\. Terry, Q\. Le, and C\. Sutton \(2021\)Program synthesis with large language models\.arXiv preprint arXiv:2108\.07732\.Cited by:[§B\.8](https://arxiv.org/html/2606.10933#A2.SS8.SSS0.Px4.p1.2),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- S\. R\. Bowman and G\. E\. Dahl \(2021\)What will it take to fix benchmarking in natural language understanding?\.InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp\. 4843–4855\.External Links:[Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.385)Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px8.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px3.p1.1)\.
- F\. Cassano, J\. Gouwar, D\. Nguyen, S\. Nguyen, L\. Phipps\-Costin, D\. Pinckney, M\. Yee, Y\. Zi, C\. J\. Anderson, M\. Q\. Feldman, A\. Guha, M\. Greenberg, and A\. Jangda \(2023\)MultiPL\-E: a scalable and polyglot approach to benchmarking neural code generation\.IEEE Transactions on Software Engineering49\(7\),pp\. 3675–3691\.External Links:[Document](https://dx.doi.org/10.1109/TSE.2023.3267446)Cited by:[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- J\. S\. Chan, N\. Chowdhury, O\. Jaffe, J\. Aung, D\. Sherburn, E\. Mays, G\. Starace, K\. Liu, L\. Maksin, T\. Patwardhan, A\. Madry, and L\. Weng \(2025\)MLE\-bench: evaluating machine learning agents on machine learning engineering\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=6s5uXNWGIh)Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px3.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. d\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, A\. Ray, R\. Puri, G\. Krueger, M\. Petrov, H\. Khlaaf, G\. Sastry, P\. Mishkin, B\. Chan, S\. Gray, N\. Ryder, M\. Pavlov, A\. Power, L\. Kaiser, M\. Bavarian, C\. Winter, P\. Tillet, F\. P\. Such, D\. Cummings, M\. Plappert, F\. Chantzis, E\. Barnes, A\. Herbert\-Voss, W\. H\. Guss, A\. Nichol, A\. Paino, N\. Tezak, J\. Tang, I\. Babuschkin, S\. Balaji, S\. Jain, W\. Saunders, C\. Hesse, A\. N\. Carr, J\. Leike, J\. Achiam, V\. Misra, E\. Morikawa, A\. Radford, M\. Knight, M\. Brundage, M\. Murati, K\. Mayer, P\. Welinder, B\. McGrew, D\. Amodei, S\. McCandlish, I\. Sutskever, and W\. Zaremba \(2021\)Evaluating large language models trained on code\.arXiv preprint arXiv:2107\.03374\.Cited by:[§B\.8](https://arxiv.org/html/2606.10933#A2.SS8.SSS0.Px4.p1.2),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- W\. Chen, X\. Ma, X\. Wang, and W\. W\. Cohen \(2023\)Program of thoughts prompting: disentangling computation from reasoning for numerical reasoning tasks\.Transactions on Machine Learning Research\.Note:arXiv:2211\.12588Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px4.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- X\. Chen, M\. Lin, N\. Schärli, and D\. Zhou \(2024\)Teaching large language models to self\-debug\.InInternational Conference on Learning Representations,Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px4.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- A\. Clark and D\. Chalmers \(1998\)The extended mind\.Analysis58\(1\),pp\. 7–19\.Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px10.p1.1)\.
- J\. Dai, J\. Lu, Y\. Feng, G\. Zeng, R\. Ruan, M\. Cheng, D\. Huang, H\. Tan, and Z\. Guo \(2024\)MHPP: exploring the capabilities and limitations of language models beyond basic code generation\.arXiv preprint arXiv:2405\.11430\.Cited by:[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- C\. Deng, Y\. Zhao, X\. Tang, M\. Gerstein, and A\. Cohan \(2024\)Investigating data contamination in modern benchmarks for large language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 8706–8719\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.482)Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px8.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px3.p1.1)\.
- A\. Drouin, M\. Gasse, M\. Caccia, I\. H\. Laradji, M\. Del Verme, T\. Marty, D\. Vazquez, N\. Chapados, and A\. Lacoste \(2024\)WorkArena: how capable are web agents at solving common knowledge work tasks?\.InProceedings of the 41st International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.235,pp\. 11642–11662\.External Links:[Link](https://proceedings.mlr.press/v235/drouin24a.html)Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- X\. Du, M\. Liu, K\. Wang, H\. Wang, J\. Liu, Y\. Chen, J\. Feng, C\. Sha, X\. Peng, and Y\. Lou \(2024\)Evaluating large language models in class\-level code generation\.InProceedings of the 46th IEEE/ACM International Conference on Software Engineering \(ICSE\),Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px6.p1.1)\.
- D\. Fried, A\. Aghajanyan, J\. Lin, S\. Wang, E\. Wallace, F\. Shi, R\. Zhong, W\. Yih, L\. Zettlemoyer, and M\. Lewis \(2023\)InCoder: a generative model for code infilling and synthesis\.InInternational Conference on Learning Representations,Note:arXiv:2204\.05999Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px5.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- L\. Gao, A\. Madaan, S\. Zhou, U\. Alon, P\. Liu, Y\. Yang, J\. Callan, and G\. Neubig \(2023\)PAL: program\-aided language models\.InProceedings of the 40th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.202,pp\. 10764–10799\.Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px4.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- H\. He, W\. Yao, K\. Ma, W\. Yu, Y\. Dai, H\. Zhang, Z\. Lan, and D\. Yu \(2024\)WebVoyager: building an end\-to\-end web agent with large multimodal models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 6864–6890\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.371)Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- D\. Hendrycks, S\. Basart, S\. Kadavath, M\. Mazeika, A\. Arora, E\. Guo, C\. Burns, S\. Puranik, H\. He, D\. Song, and J\. Steinhardt \(2021\)Measuring coding challenge competence with APPS\.InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track,Cited by:[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- J\. Hunt \(2025\)Convex Evals: behind the scenes of AI coding with Convex\.Note:Convex Stack BlogAccessed 2026\-04\-22External Links:[Link](https://stack.convex.dev/convex-evals)Cited by:[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px3.p1.1)\.
- E\. Hutchins \(1995\)Cognition in the wild\.MIT Press,Cambridge, MA\.Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px10.p1.1)\.
- N\. Jain, K\. Han, A\. Gu, W\. Li, F\. Yan, T\. Zhang, S\. Wang, A\. Solar\-Lezama, K\. Sen, and I\. Stoica \(2025\)LiveCodeBench: holistic and contamination free evaluation of large language models for code\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=chfJJYC3iL)Cited by:[§B\.8](https://arxiv.org/html/2606.10933#A2.SS8.SSS0.Px3.p1.3),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- S\. Jayakar \(2025\)Introducing Fullstack\-Bench\.Note:Convex Stack BlogAccessed 2026\-04\-22External Links:[Link](https://stack.convex.dev/introducing-fullstack-bench)Cited by:[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px3.p1.1)\.
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. Narasimhan \(2024\)SWE\-bench: can language models resolve real\-world GitHub issues?\.InInternational Conference on Learning Representations,Cited by:[§B\.8](https://arxiv.org/html/2606.10933#A2.SS8.SSS0.Px1.p1.1),[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.10933#S1.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- J\. Y\. Koh, R\. Lo, L\. Jang, V\. Duvvur, M\. Lim, P\. Huang, G\. Neubig, S\. Zhou, R\. Salakhutdinov, and D\. Fried \(2024\)VisualWebArena: evaluating multimodal agents on realistic visual web tasks\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 881–905\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.50)Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- Y\. Lai, C\. Li, Y\. Wang, T\. Zhang, R\. Zhong, L\. Zettlemoyer, W\. Yih, D\. Fried, S\. Wang, and T\. Yu \(2023\)DS\-1000: a natural and reliable benchmark for data science code generation\.InProceedings of the 40th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.202,pp\. 18319–18345\.Cited by:[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- R\. Li, L\. B\. Allal, Y\. Zi, N\. Muennighoff, D\. Kocetkov, C\. Mou, M\. Marone, C\. Akiki, J\. Li, J\. Chim, Q\. Liu, E\. Zheltonozhskii, T\. Y\. Zhuo, T\. Wang, O\. Dehaene, M\. Davaadorj, J\. Lamy\-Poirier, J\. Monteiro, O\. Shliazhko, N\. Gontier, N\. Meade, A\. R\. Zebaze, M\. Yee, L\. K\. Umapathi, J\. Zhu, B\. Lipkin, M\. Oblokulov, Z\. Wang,et al\.\(2023\)StarCoder: may the source be with you\!\.arXiv preprint arXiv:2305\.06161\.Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px5.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- Lossfunk \(2026\)Lossfunk/Esolang\-Bench: hugging face dataset\.Note:Hugging Face DatasetsCC BY 4\.0\. Accessed 2026\-05\-07\.External Links:[Link](https://huggingface.co/datasets/Lossfunk/Esolang-Bench)Cited by:[§A\.2](https://arxiv.org/html/2606.10933#A1.SS2.p1.1)\.
- A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang, S\. Gupta, B\. P\. Majumder, K\. Hermann, S\. Welleck, A\. Yazdanbakhsh, and P\. Clark \(2023\)Self\-Refine: iterative refinement with self\-feedback\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px4.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- M\. A\. Merrill, A\. G\. Shaw, N\. Carlini, B\. Li, H\. Raj, I\. Bercovich, L\. Shi, J\. Y\. Shin, T\. Walshe, E\. K\. Buchanan, J\. Shen, G\. Ye, H\. Lin, J\. Poulos, M\. Wang, M\. Nezhurina, J\. Jitsev, D\. Lu, O\. M\. Mastromichalakis, Z\. Xu, Z\. Chen, Y\. Liu, R\. Zhang, L\. L\. Chen, A\. Kashyap, J\. Uslu, J\. Li, J\. Wu, M\. Yan, S\. Bian,et al\.\(2026\)Terminal\-Bench: benchmarking agents on hard, realistic tasks in command line interfaces\.arXiv preprint arXiv:2601\.11868\.External Links:[Link](https://arxiv.org/abs/2601.11868)Cited by:[§B\.8](https://arxiv.org/html/2606.10933#A2.SS8.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- Moonshot AI \(2026\)Kimi K2\.5: open visual agentic model for real work\.Note:Moonshot AI model documentationReleased January 27, 2026\. Accessed 2026\-05\-04\.External Links:[Link](https://www.kimi.com/ai-models/kimi-k2-5)Cited by:[§B\.8](https://arxiv.org/html/2606.10933#A2.SS8.SSS0.Px3.p1.3),[Table 6](https://arxiv.org/html/2606.10933#A2.T6.12.12.4),[item 12](https://arxiv.org/html/2606.10933#Ax1.I1.i12.p1.1),[§2](https://arxiv.org/html/2606.10933#S2.SS0.SSS0.Px3.p1.1)\.
- S\. R\. Motwani, D\. Nichols, C\. London, P\. Li, F\. Pizzati, A\. Blake, H\. Hammoud, T\. McDonald, A\. Naik, A\. Ivanova, V\. Baskaran, I\. Laptev, R\. Glatt, T\. Ben\-Nun, P\. Torr, N\. Jaques, A\. Prabhu, B\. Bartoldson, B\. Kailkhura, and C\. Schroeder de Witt \(2026\)LongCoT: benchmarking long\-horizon chain\-of\-thought reasoning\.arXiv preprint arXiv:2604\.14140\.External Links:[Link](https://arxiv.org/abs/2604.14140)Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px7.p1.1)\.
- Y\. Nie, A\. Williams, E\. Dinan, M\. Bansal, J\. Weston, and D\. Kiela \(2020\)Adversarial NLI: a new benchmark for natural language understanding\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,pp\. 4885–4901\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.441)Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px8.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px3.p1.1)\.
- E\. Nijkamp, B\. Pang, H\. Hayashi, L\. Tu, H\. Wang, Y\. Zhou, S\. Savarese, and C\. Xiong \(2022\)CodeGen: an open large language model for code with multi\-turn program synthesis\.arXiv preprint arXiv:2203\.13474\.Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px5.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- M\. Nye, A\. J\. Andreassen, G\. Gur\-Ari, H\. Michalewski, J\. Austin, D\. Bieber, D\. Dohan, A\. Lewkowycz, M\. Bosma, D\. Luan, C\. Sutton, and A\. Odena \(2021\)Show your work: scratchpads for intermediate computation with language models\.arXiv preprint arXiv:2112\.00114\.Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px4.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- OpenAI \(2026a\)GPT\-5\.4 thinking system card\.Note:OpenAI system card, March 2026Released March 5, 2026\. Accessed 2026\-05\-04\.External Links:[Link](https://openai.com/index/gpt-5-4-thinking-system-card/)Cited by:[Table 6](https://arxiv.org/html/2606.10933#A2.T6.8.8.4),[item 12](https://arxiv.org/html/2606.10933#Ax1.I1.i12.p1.1),[§2](https://arxiv.org/html/2606.10933#S2.SS0.SSS0.Px3.p1.1)\.
- OpenAI \(2026b\)Introducing GPT\-5\.4 mini and nano\.Note:OpenAI release announcementAccessed 2026\-05\-04\.External Links:[Link](https://openai.com/index/introducing-gpt-5-4-mini-and-nano/)Cited by:[Table 6](https://arxiv.org/html/2606.10933#A2.T6.10.10.4),[item 12](https://arxiv.org/html/2606.10933#Ax1.I1.i12.p1.1),[§2](https://arxiv.org/html/2606.10933#S2.SS0.SSS0.Px3.p1.1)\.
- Y\. Oren, N\. Meister, N\. S\. Chatterji, F\. Ladhak, and T\. B\. Hashimoto \(2024\)Proving test set contamination in black\-box language models\.InInternational Conference on Learning Representations,Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px8.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px3.p1.1)\.
- S\. G\. Patil, H\. Mao, F\. Yan, C\. C\. Ji, V\. Suresh, I\. Stoica, and J\. E\. Gonzalez \(2025\)The berkeley function calling leaderboard \(BFCL\): from tool use to agentic evaluation of large language models\.InProceedings of the 42nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.267,pp\. 48371–48392\.External Links:[Link](https://proceedings.mlr.press/v267/patil25a.html)Cited by:[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- I\. Paul, G\. Glavaš, and I\. Gurevych \(2024\)IRCoder: intermediate representations make language models robust multilingual code generators\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 15023–15041\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.802)Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px5.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- M\. T\. Ribeiro, T\. Wu, C\. Guestrin, and S\. Singh \(2020\)Beyond accuracy: behavioral testing of NLP models with CheckList\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,pp\. 4902–4912\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.442)Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px8.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px3.p1.1)\.
- B\. Roziere, J\. Gehring, F\. Gloeckle, S\. Sootla, I\. Gat, X\. E\. Tan, Y\. Adi, J\. Liu, R\. Sauvestre, T\. Remez, J\. Rapin, A\. Kozhevnikov, I\. Evtimov, J\. Bitton, M\. Bhatt, C\. C\. Ferrer, A\. Grattafiori, W\. Xiong, A\. Defossez, J\. Copet, F\. Azhar, H\. Touvron, L\. Martin, N\. Usunier, T\. Scialom, and G\. Synnaeve \(2023\)Code Llama: open foundation models for code\.arXiv preprint arXiv:2308\.12950\.Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px5.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- B\. Roziere, M\. Lachaux, L\. Chanussot, and G\. Lample \(2020\)Unsupervised translation of programming languages\.InAdvances in Neural Information Processing Systems,Vol\.33\.Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px5.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessı, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\)Toolformer: language models can teach themselves to use tools\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px4.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- A\. Sharma and P\. Chopra \(2026\)EsoLang\-Bench: evaluating genuine reasoning in large language models via esoteric programming languages\.arXiv preprint arXiv:2603\.09678\.Cited by:[§A\.2](https://arxiv.org/html/2606.10933#A1.SS2.p1.1),[item 12](https://arxiv.org/html/2606.10933#Ax1.I1.i12.p1.1),[§1](https://arxiv.org/html/2606.10933#S1.p2.1),[§2](https://arxiv.org/html/2606.10933#S2.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px4.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- G\. Starace, O\. Jaffe, D\. Sherburn, J\. Aung, J\. S\. Chan, L\. Maksin, R\. Dias, E\. Mays, B\. Kinsella, W\. Thompson, J\. Heidecke, A\. Glaese, and T\. Patwardhan \(2025\)PaperBench: evaluating AI’s ability to replicate AI research\.InProceedings of the 42nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.267,pp\. 56843–56873\.External Links:[Link](https://proceedings.mlr.press/v267/starace25a.html)Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px3.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- M\. Tian, L\. Gao, S\. D\. Zhang, X\. Chen, C\. Fan, X\. Guo, R\. Haas, P\. Ji, K\. Krongchon, Y\. Li, S\. Liu, D\. Luo, Y\. Ma, H\. Tong, K\. Trinh, C\. Tian, Z\. Wang, B\. Wu, Y\. Xiong, S\. Yin, M\. Zhu, K\. Lieret, Y\. Lu, G\. Liu, Y\. Du, T\. Tao, O\. Press, J\. Callan, E\. Huerta, and H\. Peng \(2024\)SciCode: a research coding benchmark curated by scientists\.arXiv preprint arXiv:2407\.13168\.Cited by:[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- L\. Twist, J\. M\. Zhang, M\. Harman, D\. Syme, J\. Noppen, H\. Yannakoudakis, and D\. Nauck \(2025\)A study of LLMs’ preferences for libraries and programming languages\.arXiv preprint arXiv:2503\.17181\.Note:To appear in Findings of ACL 2026Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px5.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- Vals AI \(2026a\)LiveCodeBench v6: vals\.ai public leaderboard\.Note:Vals\.ai third\-party model evaluation leaderboardAccessed 2026\-05\-06\.External Links:[Link](https://www.vals.ai/benchmarks/lcb)Cited by:[§B\.8](https://arxiv.org/html/2606.10933#A2.SS8.SSS0.Px3.p1.3),[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px9.p1.1),[item 12](https://arxiv.org/html/2606.10933#Ax1.I1.i12.p1.1)\.
- Vals AI \(2026b\)SWE\-Bench Verified: vals\.ai public leaderboard \(bash\-tool\-only harness\)\.Note:Vals\.ai third\-party model evaluation leaderboardUsed for GPT\-5\.4 mini and GPT\-5\.4 xhigh SWE\-Bench Verified scores because OpenAI does not publish vendor SWE\-V numbers for the GPT\-5\.4 family\. Vals\.ai bash\-tool\-only harness scores: 73\.0 for GPT\-5\.4 mini, 78\.2 for GPT\-5\.4 xhigh\. Accessed 2026\-05\-06\.External Links:[Link](https://www.vals.ai/benchmarks/swebench)Cited by:[§B\.2](https://arxiv.org/html/2606.10933#A2.SS2.SSS0.Px1.p1.1),[Table 6](https://arxiv.org/html/2606.10933#A2.T6.10.10.4),[Table 6](https://arxiv.org/html/2606.10933#A2.T6.8.8.4),[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px9.p1.1),[item 12](https://arxiv.org/html/2606.10933#Ax1.I1.i12.p1.1)\.
- Vals AI \(2026c\)Terminal\-Bench 2\.0: vals\.ai public leaderboard\.Note:Vals\.ai third\-party model evaluation leaderboardAccessed 2026\-05\-06\.External Links:[Link](https://www.vals.ai/benchmarks/terminal-bench-2)Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px9.p1.1),[item 12](https://arxiv.org/html/2606.10933#Ax1.I1.i12.p1.1)\.
- Y\. Wang, W\. Wang, S\. Joty, and S\. C\. H\. Hoi \(2021\)CodeT5: identifier\-aware unified pre\-trained encoder\-decoder models for code understanding and generation\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,pp\. 8696–8708\.External Links:[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.685)Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px5.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- Z\. Wang, Y\. Liu, Y\. Wang, W\. He, B\. Gao, M\. Diao, Y\. Chen, K\. Fu, F\. Sung, Z\. Yang, T\. Liu, and W\. Xu \(2025\)OJBench: a competition level code benchmark for large language models\.arXiv preprint arXiv:2506\.16395\.Cited by:[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- E\. B\. Wilson \(1927\)Probable inference, the law of succession, and statistical inference\.Journal of the American Statistical Association22\(158\),pp\. 209–212\.External Links:[Document](https://dx.doi.org/10.1080/01621459.1927.10502953)Cited by:[Table 4](https://arxiv.org/html/2606.10933#A1.T4.12.12.2.2.2),[§B\.6](https://arxiv.org/html/2606.10933#A2.SS6.p1.9),[§2](https://arxiv.org/html/2606.10933#S2.SS0.SSS0.Px5.p2.4)\.
- C\. S\. Xia, Y\. Deng, S\. Dunn, and L\. Zhang \(2024\)Agentless: demystifying LLM\-based software engineering agents\.arXiv preprint arXiv:2407\.01489\.Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- T\. Xie, D\. Zhang, J\. Chen, X\. Li, S\. Zhao, R\. Cao, T\. J\. Hua, Z\. Cheng, D\. Shin, F\. Lei, Y\. Liu, Y\. Xu, S\. Zhou, S\. Savarese, C\. Xiong, V\. Zhong, and T\. Yu \(2024\)OSWorld: benchmarking multimodal agents for open\-ended tasks in real computer environments\.InAdvances in Neural Information Processing Systems,Vol\.37\.Note:Datasets and Benchmarks TrackExternal Links:[Link](https://openreview.net/forum?id=tN61DTr4Ed)Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- C\. Xu, S\. Guan, D\. Greene, and M\. T\. Kechadi \(2024\)Benchmark data contamination of large language models: a survey\.arXiv preprint arXiv:2406\.04244\.Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px8.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px3.p1.1)\.
- J\. Yang, C\. E\. Jimenez, A\. Wettig, K\. Lieret, S\. Yao, K\. Narasimhan, and O\. Press \(2024\)SWE\-agent: agent\-computer interfaces enable automated software engineering\.InAdvances in Neural Information Processing Systems,Vol\.37\.Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- S\. Yao, N\. Shinn, P\. Razavi, and K\. R\. Narasimhan \(2025\)τ\-Bench: a benchmark for tool\-agent\-user interaction in real\-world domains\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=roNSXZpUDN)Cited by:[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. L\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023a\)Tree of thoughts: deliberate problem solving with large language models\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px4.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023b\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations,Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px4.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px2.p1.1)\.
- Q\. Ye, B\. Y\. Lin, and X\. Ren \(2021\)CrossFit: a few\-shot learning challenge for cross\-task generalization in NLP\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,pp\. 7163–7189\.External Links:[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.572)Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px8.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px3.p1.1)\.
- H\. Yu, B\. Shen, D\. Ran, J\. Zhang, Q\. Zhang, Y\. Ma, G\. Liang, Y\. Li, Q\. Wang, and T\. Xie \(2024\)CoderEval: a benchmark of pragmatic code generation with generative pre\-trained models\.InProceedings of the 46th IEEE/ACM International Conference on Software Engineering \(ICSE\),Note:arXiv:2302\.00288Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px6.p1.1)\.
- Y\. Zhang, H\. Ruan, Z\. Fan, and A\. Roychoudhury \(2024\)AutoCodeRover: autonomous program improvement\.InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis,External Links:[Document](https://dx.doi.org/10.1145/3650212.3680384)Cited by:[Appendix H](https://arxiv.org/html/2606.10933#A8.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
- T\. Y\. Zhuo, M\. C\. Vu, J\. Chim, H\. Hu, W\. Yu, R\. Widyasari, I\. N\. B\. Yusuf, H\. Zhan, J\. He, I\. Paul, S\. Brunner, C\. Gong, J\. Hoang, A\. R\. Zebaze, X\. Hong, W\. Li, J\. Kaddour, M\. Xu, Z\. Zhang, P\. Yadav, N\. Jain, A\. Gu, Z\. Cheng, J\. Liu, Q\. Liu, Z\. Wang, B\. Hui, N\. Muennighoff, D\. Lo, D\. Fried, X\. Du, H\. de Vries, and L\. von Werra \(2025\)BigCodeBench: benchmarking code generation with diverse function calls and complex instructions\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=YrycTjllL0)Cited by:[§5](https://arxiv.org/html/2606.10933#S5.SS0.SSS0.Px1.p1.1)\.
## Appendix A Experimental details
### A\.1 Problem order and task structure
Each esolang benchmark contains 80 tasks in a fixed fetch order \(20 easy, 20 medium, 20 hard, 20 extra\-hard\)\. Under the primary protocol used for every main\-text result, problems are fetched and finalized in that fixed forward order: a problem is finalized either when one hidden submission passes all six private tests or when the three\-submission cap is reached, after which the session advances to the next problem and finalized problems are not revisited \(Figure [2](https://arxiv.org/html/2606.10933#S2.F2) of the body\)\. Workspace isolation prevents the agent from reading sibling experiment folders or any other run’s artifacts\.
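The finalization rule above can be sketched as a small state object\. This is a minimal illustration, not the harness implementation: the class and method names \(ProblemState, record\_submission\) are hypothetical, and only the three\-submission cap and the six\-test pass condition come from the text\.

```python
from dataclasses import dataclass

MAX_SUBMISSIONS = 3      # hidden-submission cap stated in the text
TESTS_PER_PROBLEM = 6    # private tests per problem stated in the text


@dataclass
class ProblemState:
    """Hypothetical per-problem record; not the harness's own data model."""
    problem_id: str
    submissions_used: int = 0
    solved: bool = False

    @property
    def finalized(self) -> bool:
        # Finalized once one submission passed everything or the cap is hit.
        return self.solved or self.submissions_used >= MAX_SUBMISSIONS

    def record_submission(self, tests_passed: int) -> None:
        # Finalized problems are never revisited.
        if self.finalized:
            raise RuntimeError(f"{self.problem_id} is finalized")
        self.submissions_used += 1
        if tests_passed == TESTS_PER_PROBLEM:
            self.solved = True
```

A session would advance to the next problem in the fixed order as soon as finalized becomes true\.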
### A\.2 Task substrate details
The four languages we evaluate are drawn from EsoLang\-Bench \[Sharma and Chopra, [2026](https://arxiv.org/html/2606.10933#bib.bib48)\], which releases task statements, hidden tests, problem identifiers, difficulty tiers, and reference interpreters under a single specification\. The task statements, hidden tests, and metadata used in this paper are loaded from the Lossfunk EsoLang\-Bench Hugging Face dataset \[Lossfunk, [2026](https://arxiv.org/html/2606.10933#bib.bib49)\]\. We reproduce only a brief characterization of each language here so that the appendix is self\-contained; the canonical problem list, per\-tier difficulty labels, and reference interpreters are documented in the EsoLang\-Bench release\. The supplementary archive ships the four interpreter Python sources used by the harness at supplementary\_code/benchmark\_harness/interpreters/ \(one file per language: Brainfuck, Befunge\-98, Whitespace, Shakespeare\), and the public problem statements at benchmark\_harness/public/esolang\_full\_public\.json\.
Brainfuck\. Eight\-symbol minimal pointer machine over an unbounded tape of unsigned bytes\. The commands \> and < move the pointer; \+ and \- mutate the cell with byte wrap; \. and , perform byte\-level output and input; \[ and \] form a conditional loop on the current cell\. There are no variables or named functions; numeric I/O must be implemented digit by digit through ASCII conversion\. Programs in this language tend to be long, fragile pointer\-arithmetic sequences\.
Befunge\-98\. Two\-dimensional stack\-based language whose instruction pointer moves over a grid of one\-character commands and can be redirected by direction commands \(\>, <, ^, v\), with stack arithmetic, string mode, an end token \(@\), and grid get / put primitives\. The 2D control flow makes program structure positional rather than sequential, and even small grid edits can change which path the instruction pointer takes\.
Whitespace\. An assembly\-style language whose lexicon consists only of the three characters Space, Tab, and Linefeed\. Numbers are encoded as a sign bit followed by binary digits; control flow uses labels and jumps\. Because the source is invisible to a human reader, programs must be produced by a generator script: any direct edit through standard text\-editing tools tends to silently corrupt the program by altering whitespace\.
Shakespeare\. A natural\-language\-shaped programming language in which programs are written as theatrical scripts with named characters and dialogue between exactly two on\-stage characters at a time\. Statements set, read, print, and stack the listener’s value; expressions are built from positive and negative nouns scaled by adjectives; and Roman\-numeral scenes act as jump targets\. The syntax constraints are stylistic rather than minimal, but the interpretive rules around speaker, listener, and stage state are unfamiliar to typical software training data\. EsoLang\-Bench’s fifth language, Unlambda, is excluded due to interpreter latency \(Section [2](https://arxiv.org/html/2606.10933#S2)\); the canonical problem list, hidden\-test contracts, and reference interpreters for the four languages we keep are released by EsoLang\-Bench\.
##### Task set and tier structure\.
Each language ships 80 problems split into four difficulty tiers of 20 each: easy \(E01–E20\), medium \(M01–M20\), hard \(H01–H20\), and extra\-hard \(X01–X20\)\. The same problem statements appear in every language, so cross\-language differences come from the target syntax and execution model rather than from problem selection\. The problems are short standard programming tasks that an introductory programming course could pose; the difficulty in our setting comes entirely from expressing them in an unfamiliar target language\. The easy tier covers I/O and one\-step arithmetic \(echo a line, sum or multiply two integers, output a constant string, character\-by\-character echo\)\. The medium tier covers list and string operations \(sort a list of integers, count vowels, compute the length of a string, parity / odd–even check, integer absolute value, formatted multiplication tables, simple counters and accumulators\)\. The hard tier introduces multi\-step numeric manipulation \(greatest common divisor, primality test, integer division and modulo, Fibonacci, leading\-zero suppression, signed arithmetic with both inputs negative\)\. The extra\-hard tier exercises combined control flow and data manipulation \(least common multiple, bracket\-depth maximum, count inversions in a list, Roman\-to\-integer conversion, base conversions, signed average / halving, string rotation checks\)\. EsoLang\-Bench’s own paper reports near\-ceiling zero\-shot and few\-shot accuracy on these same statements when models answer in Python or JavaScript, which is the basis for our claim that the difficulty in this benchmark comes from the target language rather than the algorithmic problem\. The full per\-problem statements ship in the supplementary archive at supplementary\_code/benchmark\_harness/public/esolang\_full\_public\.json\.
### A\.3 Harness commands
The harness exposes a small command interface\. The fetch command reveals the next problem statement\. The run command executes a candidate program locally with a provided interpreter or verifier and user\-specified input\. The submit command evaluates the candidate against private tests and updates the problem state\. The status command prints a progress dashboard, and export dumps the full per\-cell session state as JSON\. Agents can edit local files and run local helper scripts inside their workspace\. The full implementation is the single file supplementary\_code/benchmark\_harness/harness\.py in the supplementary archive; the constant MAX\_SUBMISSIONS = 3 encodes the hidden\-submission cap and is enforced inside harness\.py rather than by the agent wrapper, so it cannot be bypassed\.
##### Local execution limits\.
The four reference interpreters \(supplementary\_code/benchmark\_harness/interpreters/\) terminate non\-halting candidate programs by capping the number of executed interpreter steps at MAX\_INTERPRETER\_STEPS = $10^{7}$ instructions per local run \(Brainfuck, Befunge\-98, Whitespace, Shakespeare\)\. Programs that exceed the cap return a StepLimitExceeded runtime error rather than hanging the session, which is important for Brainfuck and Befunge\-98, where infinite loops are easy to author\. The same cap applies inside hidden\-test submit evaluation\. The cap is generous enough that none of the headline solutions reported in the paper hit it; it functions as a watchdog, not as a difficulty knob\.
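A step cap of this kind can be sketched for Brainfuck as follows\. This is an illustrative interpreter, not the shipped one: the function name run\_brainfuck and the choice to count every visited source character as one step are assumptions\. Only the constant name, its value, and the StepLimitExceeded error name come from the text; the EOF and pointer conventions follow the reference card in Section A\.12\.1\.

```python
MAX_INTERPRETER_STEPS = 10**7  # cap named in the text


class StepLimitExceeded(RuntimeError):
    """Raised instead of letting a non-halting program hang the session."""


def run_brainfuck(code: str, stdin: bytes = b"",
                  max_steps: int = MAX_INTERPRETER_STEPS) -> bytes:
    # Pre-match brackets so loop jumps are O(1).
    stack, match = [], {}
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            if not stack:
                raise SyntaxError("unmatched ]")
            j = stack.pop()
            match[i], match[j] = j, i
    if stack:
        raise SyntaxError("unmatched [")

    tape, ptr, pc, steps, read_pos = [0], 0, 0, 0, 0
    out = bytearray()
    while pc < len(code):
        steps += 1  # assumption: every visited character counts as a step
        if steps > max_steps:
            raise StepLimitExceeded(f"exceeded {max_steps} steps")
        c = code[pc]
        if c == ">":
            ptr += 1
            if ptr == len(tape):
                tape.append(0)  # tape grows on demand
        elif c == "<":
            if ptr == 0:
                raise RuntimeError("pointer moved below cell 0")
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(tape[ptr])
        elif c == ",":
            # EOF reads as 0, as on the reference card.
            tape[ptr] = stdin[read_pos] if read_pos < len(stdin) else 0
            read_pos += 1
        elif c == "[" and tape[ptr] == 0:
            pc = match[pc]
        elif c == "]" and tape[ptr] != 0:
            pc = match[pc]
        pc += 1
    return bytes(out)
```

With a cap in place, a program such as \+\[\] raises StepLimitExceeded instead of looping forever\.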
### A\.4 Primary protocol
The protocol used for every result in the main body allows unlimited local interpreter calls and up to three hidden submissions per problem\. The agent decides when it has converged on the local interpreter and then spends a hidden submission\. The score is the number of problems solved out of 80\. The full per\-problem state machine is in Figure [2](https://arxiv.org/html/2606.10933#S2.F2) of the body, and the operating parameters are summarized in Table [4](https://arxiv.org/html/2606.10933#A1.T4)\.
Table 4: Primary protocol parameters\. All main\-text results use this single configuration unless explicitly stated otherwise \(the diagnostic variants in Sections [3\.3](https://arxiv.org/html/2606.10933#S3.SS3)–[3\.5](https://arxiv.org/html/2606.10933#S3.SS5) relax or tighten individual rows but leave hidden tests, scoring, and workspace isolation unchanged\)\.
### A\.5 Per\-agent API endpoints, model identifiers, and harness invocations
Table [5](https://arxiv.org/html/2606.10933#A1.T5) lists the API endpoint, model identifier, sampling configuration, and wrapper used for each of the six agents in the headline runs\. We do not override sampling at the API call level: every agent runs at the wrapper’s default temperature, top\-p, and top\-k settings, so the per\-cell session\-to\-session variation we report comes from independent re\-invocations of the wrapper rather than from explicit sampling changes\. Example CLI invocations: claude \-\-model claude\-opus\-4\-6 \(Claude Code\), codex \-\-model gpt\-5\.4 \-\-reasoning xhigh \(Codex\), opencode run \-\-model moonshot/kimi\-k2\-thinking \(OpenCode\); the exact wrapper versions, environment variables, and per\-cell invocations used for the headline runs ship in the supplementary\_code/ archive\.
Table 5: Per\-agent API endpoint, model identifier, sampling configuration, and harness invocation\. “Wrapper default” indicates that the agentic wrapper sets temperature, top\-p, and top\-k to its built\-in defaults; we do not override at the API call level\. The model strings are the production identifiers used by the listed wrapper at the time of the headline runs\.
### A\.6 Reporting and aggregation
A problem is counted as solved only when all six private hidden tests pass on a single submission\. We do not aggregate per\-test pass counts across submissions, because partial\-credit aggregation can inflate the appearance of progress for runs whose submissions never fully pass\. A problem with three failed hidden submissions is counted as 0, not as the best per\-test count across the three\. Cells that have not yet completed all 80 problems are explicitly marked partial and excluded from matched comparisons rather than imputed as zeros\. Wilson 95% confidence intervals over solved counts are reported alongside the headline cells in Section [B\.6](https://arxiv.org/html/2606.10933#A2.SS6)\.
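The all\-or\-nothing scoring rule and the Wilson interval can be sketched as below\. The function names are hypothetical, and z = 1\.96 is the usual normal quantile for a 95% interval; the paper does not state which implementation it uses\.

```python
import math
from typing import Sequence, Tuple


def problem_solved(passes_per_submission: Sequence[int],
                   tests_per_problem: int = 6) -> bool:
    """Solved only if a single submission passes every hidden test.

    Pass counts are never pooled across submissions.
    """
    return any(p == tests_per_problem for p in passes_per_submission)


def wilson_interval(solved: int, total: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion solved/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = solved / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total
                         + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)
```

For example, three submissions passing 5, 5, and 5 tests count as unsolved, whereas 2 then 6 counts as solved\. Unlike the normal\-approximation interval, the Wilson interval stays inside \[0, 1\] for cells near 0 or 80 solved\.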
### A\.7 Robustness across independent sessions
##### What a run means in this paper\.
We run *independent sessions* of the underlying coding harness rather than seeded re\-inferences\. Each headline model × language cell in Table [1](https://arxiv.org/html/2606.10933#S2.T1) is the solved count from Session 1; we additionally ran two further independent sessions per cell as session\-to\-session sanity checks \(per\-session counts in Appendix [B\.7](https://arxiv.org/html/2606.10933#A2.SS7)\)\. Each ablation cell \(Section [3\.3](https://arxiv.org/html/2606.10933#S3.SS3), Section [3\.4](https://arxiv.org/html/2606.10933#S3.SS4), and Section [3\.5](https://arxiv.org/html/2606.10933#S3.SS5)\) is the mean over two independent sessions\. A session here is one fresh end\-to\-end invocation of the agent’s deployed CLI \(Claude Code for the Anthropic family, Codex for the GPT\-5\.4 family, OpenCode for Kimi K2\.5\) on a freshly initialized workspace, with no state shared across sessions\.
##### Sessions, not seeds\.
The harnesses we evaluate are deployed CLI products and do not expose a deterministic seed at the inference layer; the underlying provider APIs do not return reproducible token sequences across requests at the default temperatures and sampling configurations these CLIs use\. Two re\-runs of the same model on the same problem set therefore differ at the token level even with the same prompts, harness, and protocol parameters\. This captures more variation than a seeded re\-run inside a single inference call, because the spread also reflects session\-level choices: which helper file the agent writes first, whether it tries direct target\-language authoring before reaching for metaprogramming, how aggressively it batches problems within the session, and which debug\-and\-revise loops it falls into\. We therefore use the terms *run* and *session* consistently throughout the paper to denote one such independent end\-to\-end CLI invocation, not a seeded RNG call inside a single request\.
##### Why three sessions per headline cell and two per ablation cell\.
For headline cells in Table [1](https://arxiv.org/html/2606.10933#S2.T1), we report Session 1 as the headline value and use Sessions 2 and 3 as session\-to\-session sanity checks; per\-session counts for all three sessions are in Appendix [B\.7](https://arxiv.org/html/2606.10933#A2.SS7)\. Ablation cells are paired contrasts on top of an established headline \(metaprogramming\-allowed vs\. direct authoring in Section [3\.3](https://arxiv.org/html/2606.10933#S3.SS3), with vs\. without a reference library in Section [3\.4](https://arxiv.org/html/2606.10933#S3.SS4), and varying interpreter\-call and output\-token budgets in Section [3\.5](https://arxiv.org/html/2606.10933#S3.SS5)\); the relevant signal is the within\-agent, within\-language shift induced by the intervention rather than the absolute score, so we ran two independent sessions per ablation cell and report the mean\. In every reported ablation, the qualitative direction of the intervention is consistent across both sessions\.
##### What this means for the headline claims\.
The inter\-agent separations that drive Section [3\.1](https://arxiv.org/html/2606.10933#S3.SS1) and Section [3\.3](https://arxiv.org/html/2606.10933#S3.SS3) of the body are tens of problems out of 80 \(for example 79 vs\. 4 on Brainfuck and 80 vs\. 4 on Befunge\-98\)\. Observed session\-to\-session variation across the three sessions per headline cell is small relative to these gaps \(Appendix [B\.7](https://arxiv.org/html/2606.10933#A2.SS7)\), so the agent ordering reported in the body is stable under independent re\-runs\. The per\-cell binomial Wilson interval on the solved\-out\-of\-80 counts is in Appendix [B\.6](https://arxiv.org/html/2606.10933#A2.SS6)\.
### A\.8 Controlled\-access \(fixed\-budget\) protocol
For a stricter reproducibility\-oriented control we also ran a controlled\-access protocol that caps local runs and hidden submissions explicitly\. Each problem allows at most three local interpreter or verifier calls before a single hidden submission is unlocked \(the agent may submit sooner if it is confident, but cannot make a fourth local run before submitting\), and at most one hidden submission per problem\. Hidden tests, scoring rule, problem order, and workspace isolation are unchanged from the primary protocol; only the local\-run and submission caps are tightened\. The qualitative agent ordering under this stricter protocol matches the primary protocol; absolute scores are lower across the board because frontier agents lose access to the iterative repair loop they normally use\. Per\-cell numbers are recorded in the supplementary CSV\.
### A\.9 Interpreter\-budget ablation
The interpreter\-budget ablation \(Section [3\.5](https://arxiv.org/html/2606.10933#S3.SS5.SSS0.Px1), Figure [4](https://arxiv.org/html/2606.10933#S3.F4) of the body\) caps the number of local interpreter calls per problem at $\{3, 5, 15, 30, \infty\}$ while holding the task substrate, hidden submissions, and scoring rule fixed\. The full per\-budget solved counts on both Brainfuck and Befunge\-98 are the values plotted in body Figure [4](https://arxiv.org/html/2606.10933#S3.F4); the unlimited\-budget endpoints match the headline cells in Table [1](https://arxiv.org/html/2606.10933#S2.T1) of the body \(64 for Opus 4\.6 on each of Brainfuck and Befunge\-98, 12 for Sonnet 4\.6 on Brainfuck and 64 on Befunge\-98, 4 for Haiku 4\.5 on each\)\. The qualitative pattern reported in the body is that Opus 4\.6 gains substantially as the budget grows on both languages and Sonnet 4\.6 gains on Befunge\-98, while Haiku 4\.5 stays near the floor at every budget\. Per\-cell raw counts are emitted by the per\-cell export\.json files under supplementary\_code/experiments/01\_main\_experiments/ and its budget\-restricted variants\.
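The per\-problem caps used in this ablation, and in the controlled\-access protocol of Section A\.8, can be pictured as one budget counter\. The sketch below is hypothetical: the class name RunBudget and its methods are not from the harness, and it only illustrates enforcing the caps in code rather than in the prompt\.

```python
from typing import Optional


class RunBudget:
    """Hypothetical per-problem counter for local runs and hidden submissions."""

    def __init__(self, max_local_runs: Optional[int],
                 max_submissions: int = 3) -> None:
        self.max_local_runs = max_local_runs  # None means unlimited
        self.max_submissions = max_submissions
        self.local_runs = 0
        self.submissions = 0

    def try_run(self) -> bool:
        """Consume one local interpreter call; False once the cap is reached."""
        if (self.max_local_runs is not None
                and self.local_runs >= self.max_local_runs):
            return False
        self.local_runs += 1
        return True

    def try_submit(self) -> bool:
        """Consume one hidden submission; False once the cap is reached."""
        if self.submissions >= self.max_submissions:
            return False
        self.submissions += 1
        return True


# Caps swept by the ablation; None stands for the unlimited setting.
ABLATION_CAPS = [3, 5, 15, 30, None]
```

Under these assumptions, the controlled\-access protocol would correspond to RunBudget\(3, max\_submissions=1\), and the primary protocol to RunBudget\(None\)\.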
### A\.10 Metaprogramming constraints
In metaprogramming\-allowed runs, the agent may write generator programs in a familiar language that emit the target esolang\. In no\-meta runs, the agent must write the target language directly\. Cross\-language generator runs constrain the generator language to Python, JavaScript, or Rust depending on the cell\.
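As a minimal illustration of what a metaprogramming\-allowed run permits, the sketch below is a Python generator that emits Brainfuck printing a fixed ASCII string, together with a checker for the straight\-line subset it emits\. Both function names are hypothetical, and the code is not taken from any agent trace or from the reference library; it only shows the general shape of generating target\-language code from a host language\.

```python
def bf_print(text: str) -> str:
    """Emit loop-free Brainfuck that prints `text` (ASCII only).

    One cell is reused: it is nudged from the previous character's code
    to the next one with + or -, then printed with '.'.
    """
    program, current = [], 0
    for byte in text.encode("ascii"):
        delta = byte - current
        program.append(("+" if delta > 0 else "-") * abs(delta) + ".")
        current = byte
    return "".join(program)


def run_straight_line(code: str) -> str:
    """Evaluate only the three commands bf_print emits: + - and .

    A local check of this kind lets the generator be debugged before any
    hidden submission is spent.
    """
    cell, out = 0, bytearray()
    for c in code:
        if c == "+":
            cell = (cell + 1) % 256
        elif c == "-":
            cell = (cell - 1) % 256
        elif c == ".":
            out.append(cell)
        else:
            raise ValueError(f"unexpected command {c!r}")
    return out.decode("ascii")
```

The emitted program is long but mechanical, which is the trade a generator makes: correctness is argued once in the host language instead of once per hand\-written target program\.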
### A\.11 Isolation
Each run receives its own workspace\. Agents are instructed not to inspect prior runs, sibling language folders, hidden tests, solved artifacts, or transcripts\. This isolation is important because the benchmark is sequential: later problems can legitimately reuse notes and primitives developed earlier in the same run, but not artifacts from other runs\.
### A\.12 System prompts and per\-condition configuration
Each cell of the paper runs under one of three agentic harnesses \(Claude Code, Codex, OpenCode\), each of which automatically reads a project\-level instruction file from the workspace at session start \(CLAUDE\.md for Claude Code, AGENTS\.md for Codex and OpenCode\) on top of its own internal default system prompt\. We ship one such instruction file per language; its content is what the harness adds to its native default\. The four prompts reproduced verbatim in the subsections below are the exact files shipped in the supplementary archive at supplementary\_code/prompts/<lang\>/ \(symlinked into every primary\-protocol cell at experiments/01\_main\_experiments/<harness\>/<model\>/<lang\>/\)\. The harness’s own internal default \(turn\-taking, tool\-use formatting, file\-editing rules\) is left unchanged\. We refer to the file we ship as the *language\-reference prompt*\.
The language\-reference prompt has the same structure for every esolang: a one\-line “start solving now” directive, the benchmark task description, the harness command list \(fetch, run, submit, status, skip\), the scoring rule \(a problem is solved only if all six private hidden tests pass\), the operating limits \(up to three hidden submissions per problem, unlimited run calls, unlimited time, no skipping with submissions remaining\), and a language\-specific reference card\. This single language\-reference prompt is identical across the three Claude agents \(Opus 4\.6, Sonnet 4\.6, Haiku 4\.5\) within a given language; for Codex and OpenCode we use the same content via the AGENTS\.md mechanism\. The four primary\-protocol prompts are reproduced verbatim in Sections [A\.12\.1](https://arxiv.org/html/2606.10933#A1.SS12.SSS1)–[A\.12\.4](https://arxiv.org/html/2606.10933#A1.SS12.SSS4) below\.
Per\-condition deviations from this primary\-protocol prompt are as follows\.
Primary protocol and main results \(Section [3](https://arxiv.org/html/2606.10933#S3)\)\. The language\-reference prompt described above \(verbatim text in Sections [A\.12\.1](https://arxiv.org/html/2606.10933#A1.SS12.SSS1)–[A\.12\.4](https://arxiv.org/html/2606.10933#A1.SS12.SSS4)\)\. No further additions\. The agent receives each problem statement through fetch; everything else is handled by the harness\.
Metaprogramming\-allowed \(Section [3\.3](https://arxiv.org/html/2606.10933#S3.SS3)\)\. Same as the primary protocol above \(the language\-reference prompt is loaded as CLAUDE\.md or AGENTS\.md, on top of the harness’s native default\)\. The agent freely chooses whether to author the target esolang directly or via a host\-language generator; no additional preamble is added\.
No\-meta direct authoring \(Section [3\.3](https://arxiv.org/html/2606.10933#S3.SS3)\)\. The harness removes the agent’s bash access and any tool that could execute a host\-language generator, and a short paper\-authored preamble is added to the system prompt instructing the agent to author the target esolang directly\. The preamble lists the file extensions accepted by the harness for that cell \(e\.g\., \.bf only\) and explicitly notes that any generator script in the workspace at submission time will cause submission failure\. Together, the bash removal and the prompt preamble prevent any host\-language generation pathway\.
Cross\-language generator transfer \(Section [3\.3](https://arxiv.org/html/2606.10933#S3.SS3)\)\. A short preamble specifies which host language is permitted \(Python, JavaScript, or Rust per cell\) and lists the file extensions accepted for the generator\. All other harness behavior is unchanged\.
Distillation: text\-only preamble \(Section [3\.4](https://arxiv.org/html/2606.10933#S3.SS4), condition \(i\)\)\. A condensed prose version of Opus 4\.6’s strategies is preloaded as the system prompt\. The full text is reproduced in Appendix [G\.1](https://arxiv.org/html/2606.10933#A7.SS1)\.
Distillation: reference library \(Section [3\.4](https://arxiv.org/html/2606.10933#S3.SS4), condition \(ii\)\)\. The condition \(i\) preamble plus a reference\_lib/ directory containing a small strategy\-only library distilled from the strong\-agent traces \(Claude Opus 4\.6, with one generic Brainfuck builder pattern from GPT\-5\.4 xhigh’s session\)\. The preamble points the agent at the reference directory and its README\.md\. The library contains generic code\-generation primitives only; no per\-problem generators and no solved benchmark programs are included\. The exact files shipped are listed in Appendix [G\.2](https://arxiv.org/html/2606.10933#A7.SS2), and the full preamble text is reproduced there\.
Interpreter\-budget ablation \(Section [3\.5](https://arxiv.org/html/2606.10933#S3.SS5.SSS0.Px1)\)\. Harness defaults plus a hard cap on local interpreter calls per problem \(3, 5, 15, 30, or unlimited\)\. The cap is enforced by the harness rather than the prompt\.
Token\-efficiency ablation \(Section [3\.5](https://arxiv.org/html/2606.10933#S3.SS5.SSS0.Px2)\)\. Harness defaults; the cell is read out of the primary\-protocol runs by counting API output tokens \(including extended thinking\) on E01–E20 only\.
#### A\.12\.1 Primary\-protocol language\-reference prompt: Brainfuck
```
> **Start solving now.** The session is initialized. Begin by fetching your first problem:
> ‘‘‘
> python harness.py fetch
> ‘‘‘
> Work through all 80 problems sequentially. For each problem: read
> the description, generate your own test cases, use ‘run‘ to test,
> then ‘submit‘ when confident. Do NOT skip without attempting --
> see skip policy below.
# EsoLang-Bench: Brainfuck
You are solving 80 brainfuck programming problems sequentially.
Problems span 4 difficulty tiers: E01-E20 (easy), M01-M20 (medium),
H01-H20 (hard), X01-X20 (extra-hard).
Each problem has 6 hidden test cases you cannot see. No test cases
are provided -- generate your own test inputs from the problem
descriptions.
## Harness Commands
‘‘‘bash
python harness.py fetch # Get next problem
python harness.py run <code_file> --input "..." # Test with custom input (unlimited)
python harness.py submit <problem_id> <file> # Hidden evaluation (max 3 per problem)
python harness.py status # Progress dashboard
python harness.py skip # Skip ONLY after attempting (see policy)
‘‘‘
## Rules
- Solve problems in order. ‘fetch‘ advances to the next.
- Maximum 3 submissions per problem. Use ‘run‘ (unlimited) to test first.
- ‘submit‘ returns PASS/WRONG ANSWER/RUNTIME ERROR per hidden test
-- no inputs or expected outputs shown.
- Problem is solved only if 6/6 hidden tests pass.
- Do not read private test files, reverse-engineer the harness, or
hardcode outputs.
## CRITICAL: No Skipping -- Use All 3 Submissions
- You MUST work on every problem until you either score 6/6 OR use
all 3 submissions. Do NOT move to the next problem with unused
submissions remaining.
- You CANNOT revisit problems later -- once you move on, the problem
is gone forever. Every unused submission is wasted.
- There is NO time limit. You have UNLIMITED time and UNLIMITED test
runs (‘run‘ command). The ONLY limit is 3 submissions per problem.
- For each problem: read the description, write a genuine solution,
test it thoroughly with ‘run‘ using multiple inputs until it
produces correct output, THEN submit.
- Do NOT submit untested code. Do NOT submit placeholder or stub
solutions. Every submission must be a genuine, tested attempt.
- If a submission returns WRONG ANSWER or RUNTIME ERROR: analyze the
failure, try a completely different approach, test it, then use
your next submission.
- Even if a problem seems extremely difficult, you must use all 3
submissions with genuinely different approaches before moving on.
A partial score (1-5/6) is far better than 0/6.
- Do NOT call ‘fetch‘ or ‘skip‘ to move to the next problem while
you still have submissions remaining on the current one.
## Scoring
- Per problem: tests passed / 6
- Overall: problems solved / 80 and total tests passed / 480
## Brainfuck Language Reference
### Commands
| Cmd | Description |
|-----|-------------|
| ‘>‘ | Move pointer right |
| ‘<‘ | Move pointer left |
| ‘+‘ | Increment cell (wraps 255->0) |
| ‘-‘ | Decrement cell (wraps 0->255) |
| ‘.‘ | Output cell as ASCII character |
| ‘,‘ | Read one byte into cell (0 if EOF) |
| ‘[‘ | Jump past matching ‘]‘ if cell is 0 |
| ‘]‘ | Jump back to matching ‘[‘ if cell nonzero |
All other characters are comments.
### Memory Model
- Unbounded tape of unsigned byte cells (0-255), initialized to 0.
- Pointer starts at cell 0; cannot go below 0 (runtime error).
### Essential Patterns
- Zero a cell: [-]
- Move (destructive): [->+<] (cell 0 to cell 1)
- Copy: [->+>+<<]>>[-<<+>>]
- Add cell 1 into 0: >[-<+>]<
- Subtract 1 from 0: >[-<->]<
- Read loop: ,[...,] reads until EOF (0)
- Print decimal: Divide by 10, store remainders, print in
reverse + ASCII 48
### ASCII Quick Reference
‘0‘=48, ‘9‘=57, ‘A‘=65, ‘Z‘=90, ‘a‘=97, ‘z‘=122, space=32,
newline=10, ‘-‘=45
### Common Pitfalls
- Number I/O: Input arrives as ASCII chars (’5’=53), not raw
numbers. Output must also be ASCII digits. This is the #1 error
source.
- Forgetting to zero cells before reuse.
- Cell overflow: 255+1=0, 0-1=255. Can cause infinite loops.
- Pointer tracking: Always keep a written map of which cell holds
what.
- Multi-digit numbers: Parsing and printing require digit-by-digit
handling.
```
#### A\.12\.2 Primary\-protocol language\-reference prompt: Befunge\-98
The header, harness commands, rules, no\-skipping policy, and scoring section are identical to the Brainfuck prompt above \(with “brainfuck” replaced by “befunge\-98”\)\. Only the language reference card differs and is reproduced below\.
```
## Befunge-98 Language Reference
### Program Structure
2D grid. Instruction pointer starts at (0,0) moving right. Wraps at
edges. Ends at ‘@‘.
### Instructions
Stack: 0-9 push digit; a-f push 10-15; " toggle string mode;
: dup; \ swap; $ pop; n clear stack
Arith: + - * / % (pop b, pop a, push a op b)
Compare: ! (NOT); ‘ (greater-than)
Direction: > < ^ v ?(random) [(turn left) ](turn right) r(reverse)
#(skip next) j(jump n)
Branch: _ (right if 0, left if nonzero); | (down if 0, up if
nonzero)
I/O: . print int+space; , print char; & read int;
~ read char (-1 at EOF)
Grid: g get; p put; s store next; ’ fetch next
Flow: @ end; q quit; ; comment toggle; space=nop
### Essential Patterns
- Print string: "!dlroW olleH">:#,_@ (push reversed, loop print)
- Print loop: >:#,_@ (dup, skip print if 0, print char, repeat)
- Read int: & (built-in)
- Conditional: !#v_ (branch on value)
### Common Pitfalls
- ‘.‘ outputs number + space. For clean output, convert to digit
chars and use ‘,‘.
- Stack order: 52- = 5-2=3, not 2-5.
- String mode: "abc" pushes a,b,c in order; they pop as c,b,a.
- Missing ‘@‘: IP wraps and re-executes, hitting step limit.
- ‘~‘ returns -1 at EOF, not 0.
```
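The “Print string” pattern on the card lends itself to generation from a host language\. The helper below is a hypothetical sketch \(the name befunge\_print is not from the paper or the reference library\) that reverses the text into string mode and appends the card’s print loop\. The restriction on repeated spaces reflects the Funge\-98 specification’s handling of contiguous spaces in string mode; whether the benchmark’s interpreter follows that rule is an assumption\.

```python
PRINT_LOOP = ">:#,_@"  # print loop copied from the reference card


def befunge_print(text: str) -> str:
    """Emit a one-line Befunge-98 program that prints `text`."""
    if not text:
        raise ValueError("text must be non-empty")
    if '"' in text:
        # A double quote would toggle string mode off early.
        raise ValueError('text must not contain "')
    if any(not 32 <= ord(ch) <= 126 for ch in text):
        # Keeps the program on one grid row and avoids a 0 cell,
        # which would end the print loop early.
        raise ValueError("printable ASCII only")
    if "  " in text:
        # Funge-98 string mode may collapse runs of spaces (assumption
        # about the interpreter; rejected here to stay on the safe side).
        raise ValueError("consecutive spaces are not supported")
    # Characters are pushed in order and popped in reverse, so the
    # literal is written backwards.
    return '"' + text[::-1] + '"' + PRINT_LOOP
```

For the card’s own example, befunge\_print\("Hello World\!"\) reproduces the program shown under Essential Patterns\.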
#### A\.12\.3 Primary\-protocol language\-reference prompt: Whitespace
The header, harness commands, rules, no\-skipping policy, and scoring section are identical to the Brainfuck prompt above \(with “brainfuck” replaced by “whitespace”\)\. Only the language reference card differs and is reproduced below\.
```
## Whitespace Language Reference
### Encoding
Programs use only: Space (S, ASCII 32), Tab (T, ASCII 9), Linefeed
(L, ASCII 10). All other chars are ignored.
### Instruction Encoding (selected)
Stack: SS<num> push number; SLS dup; SLT swap; SLL discard
Arith: TSSS add; TSST sub; TSSL mul; TSTS div; TSTT mod
Heap: TTS store; TTT retrieve
I/O: TLSS out_char; TLST out_num; TLTS read_char; TLTT read_num
Flow: LSS<lbl> label; LST<lbl> call; LSL<lbl> jump;
LTS<lbl> jz; LTT<lbl> jn (jump if negative);
LTL ret; LLL end
### Key Concepts
- Number encoding: Sign bit (S=+, T=-) + binary digits
(S=0, T=1) + L terminator.
- Heap-based I/O: read_char/read_num store to heap at a popped
address, not the stack. Push address first, then read, then
push address + retrieve to get value onto stack.
- Labels: Sequences of S and T terminated by L. Must be unique.
### Common Pitfalls
- Heap I/O: Read operations store to heap, not stack. Must
retrieve afterward.
- Sign bit required even for positive numbers.
- Always include the end instruction (LLL) or execution runs
off the end.
- Stack order: sub pops b then a, computes a-b.
```
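Because Whitespace source is invisible, the number encoding on the card is usually produced by a script\. The sketch below follows the card’s encoding \(sign, binary digits, linefeed terminator\) and its instruction codes for push, out\_num, and end; the function names are hypothetical, and encoding zero as a single 0 digit is an assumption about what the interpreter accepts\.

```python
SPACE, TAB, LF = " ", "\t", "\n"

OUT_NUM = TAB + LF + SPACE + TAB  # TLST on the reference card
END = LF + LF + LF                # LLL on the reference card


def encode_number(n: int) -> str:
    """Sign (Space=+, Tab=-), binary digits (Space=0, Tab=1), then LF."""
    sign = SPACE if n >= 0 else TAB
    bits = bin(abs(n))[2:]  # "0" for zero (assumed to be accepted)
    digits = "".join(TAB if b == "1" else SPACE for b in bits)
    return sign + digits + LF


def push(n: int) -> str:
    """SS<num>: push a literal number onto the stack."""
    return SPACE + SPACE + encode_number(n)


def print_number_program(n: int) -> str:
    """Push n, print it as a number, and end the program."""
    return push(n) + OUT_NUM + END


def visible(source: str) -> str:
    """Render Whitespace source with S/T/L letters for debugging."""
    return source.replace(SPACE, "S").replace(TAB, "T").replace(LF, "L")
```

When saving such output, writing bytes \(or opening the file with newline=""\) avoids platform newline translation, which would otherwise alter the program\.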
#### A\.12\.4 Primary\-protocol language\-reference prompt: Shakespeare
The header, harness commands, rules, no\-skipping policy, and scoring section are identical to the Brainfuck prompt above \(with “brainfuck” replaced by “shakespeare”\)\. Only the language reference card differs and is reproduced below\.
```
## Shakespeare Language Reference
### Program Structure
Title.
Character, description.
Act I: Description.
Scene I: Description.
[Enter Character1 and Character2]
Character1:
Statement.
### Value System
- Positive nouns (+1): angel, cat, day, flower, hero, joy, king,
rose, summer, sun
- Negative nouns (-1): bastard, beast, coward, death, devil,
famine, hell, pig, plague
- Zero: nothing, zero
- Each adjective doubles: "a big cat"=2, "a big big cat"=4,
"a big big big cat"=8
- Arithmetic: the sum of X and Y, the difference between X and Y,
the product of X and Y, the quotient between X and Y, the
remainder of the quotient between X and Y, the square of X,
twice X
- Pronouns: you / thou / thee = listener’s value;
I / me = speaker’s value
### Statements (all target the LISTENER)
- Assign: You are EXPR. / Thou art EXPR.
- Output char: Speak your mind. (listener’s value as char)
- Output number: Open your heart. (listener’s value as int)
- Read char: Open your mind. (-1 at EOF)
- Read int: Listen to your heart.
- Stack push: Remember EXPR. (onto listener’s stack)
- Stack pop: Recall. (pop listener’s stack into
listener’s value)
### Comparisons and Flow Control
- Am I better than EXPR? (speaker > EXPR)
- Am I worse than EXPR? (speaker < EXPR)
- Am I as good as EXPR? (speaker == EXPR)
- If so, let us proceed to Scene X. (jump if true)
- If not, let us proceed to Scene X. (jump if false)
- Let us proceed to Scene X. (unconditional jump)
### Stage Rules
- Exactly 2 characters on stage for dialogue.
- [Enter X and Y], [Exit X], [Exeunt] (remove all)
- Scene labels use Roman numerals and must be globally unique.
### Common Pitfalls
- I/O targets the LISTENER, not the speaker. This is the #1 bug
source.
- No direct numbers: "You are 72." is INVALID. Use adjective-noun
expressions.
- Stage management: Must have exactly 2 characters on stage.
- Roman numeral scenes must be globally unique across all Acts.
```
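The "no direct numbers" pitfall on the card is the kind of rule a host\-language helper can absorb. The sketch below is illustrative only and is not taken from the paper's artifacts: `power_of_two`, `constant`, and `evaluate` are hypothetical names, and the sketch assumes that an interpreter accepts the adjective "big" before both the positive noun "cat" and the negative noun "pig". It spells an integer as a nested sum of signed powers of two using only vocabulary from the card, and checks the phrase with a small evaluator that understands just the phrases it emits.

```python
# Hypothetical helper (not from the paper): build Shakespeare constants
# from the card's vocabulary, where each adjective doubles a +/-1 noun.

def power_of_two(exp: int, negative: bool = False) -> str:
    """Noun phrase worth +/-2**exp: one 'big' per doubling."""
    noun = "pig" if negative else "cat"  # -1 and +1 nouns from the card
    return "a " + "big " * exp + noun


def constant(n: int) -> str:
    """Expression for n as a nested sum of signed powers of two."""
    if n == 0:
        return "nothing"
    negative = n < 0
    magnitude = abs(n)
    bits = [i for i in range(magnitude.bit_length()) if (magnitude >> i) & 1]
    terms = [power_of_two(i, negative) for i in reversed(bits)]
    expr = terms[-1]
    for term in reversed(terms[:-1]):
        expr = f"the sum of {term} and {expr}"
    return expr


def evaluate(expr: str) -> int:
    """Checker for phrases emitted by constant(); not a full parser."""
    if expr == "nothing":
        return 0
    prefix = "the sum of "
    if expr.startswith(prefix):
        # The left operand is always a plain noun phrase, so the first
        # " and " separates it from the (possibly nested) right operand.
        left, right = expr[len(prefix):].split(" and ", 1)
        return evaluate(left) + evaluate(right)
    words = expr.split()
    sign = 1 if words[-1] == "cat" else -1
    return sign * 2 ** words.count("big")


# Assignment statement for ASCII 'H' (72), addressed to the listener.
assignment = f"You are {constant(72)}."
```

Checking the generated phrase in the host language before it reaches the interpreter mirrors the generate\-then\-verify loop described in the body; a fuller helper would also use "the product of" or "twice" to shorten large constants.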
## Appendix B Additional results
Every per\-cell number reported in this appendix is reproducible from the per\-cell `export.json` files emitted by `python harness.py export`; the supplementary archive ships the harness, the 48 ready\-to\-run cell directories, and a rigorous end\-to\-end test \(`scripts/rigorous_test.sh`\) that exercises the export path without requiring any provider API key\.
### B\.1 Terminal\-Bench 2\.0 and SWE\-Bench Verified vs EsoLang\-Bench scatter \(cliff visualization\)
Figure [6](https://arxiv.org/html/2606.10933#A2.F6) below visualizes two columns of Table [2](https://arxiv.org/html/2606.10933#S3.T2) of the body, Terminal\-Bench 2\.0 and SWE\-Bench Verified, against each agent’s EsoLang\-Bench mean score\. The shaded vertical band in each panel is the agent cluster on the mainstream benchmark \(33\.3 pp wide on Terminal\-Bench 2\.0 and 6\.6 pp wide on SWE\-Bench Verified\), while the vertical extent of the markers shows the 88\.4\-pt EsoLang\-Bench spread of the same six agents\.
Figure 6: Mainstream coding scores cluster while unfamiliar\-language scores separate, on both Terminal\-Bench 2\.0 and SWE\-Bench Verified\. Each marker is one of the six evaluated coding agents\. \(a\) Terminal\-Bench 2\.0 \(vendor\-published\) on the x\-axis; mean EsoLang\-Bench score under our protocol on the y\-axis; shaded vertical band is the 33\.3\-pt TB\-2\.0 cluster\. \(b\) SWE\-Bench Verified on the x\-axis with the same y\-axis; shaded vertical band is the 6\.6\-pt SWE\-V cluster\. Asterisks mark Vals\.ai bash\-tool\-only\-harness numbers used where vendor SWE\-V scores are not published for the GPT\-5\.4 family\.
### B\.2 Main results in raw problems\-solved counts
Table [1](https://arxiv.org/html/2606.10933#S2.T1) in the body reports the four esolang columns as percentages out of 80 problems per language for readability\. Table [7](https://arxiv.org/html/2606.10933#A2.T7) reproduces the same six\-agent results in the underlying problems\-solved\-out\-of\-80 format\.
##### Mainstream\-benchmark sourcing\.
The SWE\-Bench Verified and Terminal\-Bench 2\.0 rows of Table [2](https://arxiv.org/html/2606.10933#S3.T2) use vendor\-published numbers wherever those exist\. We did not re\-run SWE\-Bench Verified or Terminal\-Bench 2\.0 ourselves for the headline numbers; every cell in those two rows is sourced from the public reports listed in Table [6](https://arxiv.org/html/2606.10933#A2.T6) below\. Where no vendor SWE\-V score is published, we use the Vals\.ai third\-party leaderboard \[Vals AI, [2026b](https://arxiv.org/html/2606.10933#bib.bib64)\], which evaluates SWE\-Bench Verified under a published bash\-tool\-only harness; this applies only to the GPT\-5\.4 family\. As a sanity check on the third\-party numbers, we additionally re\-ran SWE\-Bench Verified on GPT\-5\.4 mini and GPT\-5\.4 xhigh under our own harness and recovered scores within a few points of the Vals\.ai cells, so the headline values are stable under independent replication; we use the Vals\.ai numbers in the table for transparency about the source\.
Table 6: Per\-agent sourcing for the SWE\-Bench Verified and Terminal\-Bench 2\.0 rows of Table [2](https://arxiv.org/html/2606.10933#S3.T2)\. Every cell is a public report from the listed source; we did not re\-score any of the mainstream cells ourselves for the headline tables\. The Vals\.ai entries are used only because OpenAI does not publish vendor SWE\-V numbers for the GPT\-5\.4 family\.
Table 7: Main esolang results in raw problems\-solved\-out\-of\-80 format\. Same data as the percentage columns of Table [1](https://arxiv.org/html/2606.10933#S2.T1); multiply each cell by 100/80 to recover the percentage\.
### B\.3 Opus 4\.6 Brainfuck local\-call distribution at budget 30
This subsection breaks down the per\-problem distribution of local interpreter calls used *within the budget\-30 cell of Section [3\.5](https://arxiv.org/html/2606.10933#S3.SS5.SSS0.Px1)*, conditional on the problem being solved in that cell\. It is descriptive only: the headline budget\-3 cell of Figure [4](https://arxiv.org/html/2606.10933#S3.F4) is a separate run with a different cap, and a problem solved here at, say, call 8 would not have been reached in a budget\-3 run because the agent could not have afforded the eighth call there\. The distribution should therefore not be read as predicting the budget\-3 score\.
Among the $k$ solved problems in the budget\-30 Opus 4\.6 Brainfuck cell, the local\-call distribution observed at solve time is: a plurality solve within the first three local calls, with the remainder spread across higher call counts \(one or two solves each at calls 4–9, and isolated solves at calls 12, 13, and 19\)\. The exact per\-problem counts are recorded in the budget\-30 `export.json` for that cell; the body figure plots the per\-cell aggregates, not this within\-cell distribution\.
### B\.4 Cross\-harness evidence
The headline comparison is already cross\-harness by construction\. The three model families we evaluate run under three independently implemented agentic wrappers: Claude Opus 4\.6, Sonnet 4\.6, and Haiku 4\.5 under *Claude Code* \(Anthropic\); GPT\-5\.4 xhigh and GPT\-5\.4 mini under *Codex* \(OpenAI\); Kimi K2\.5 under *OpenCode* \(Moonshot, third\-party\)\. These three wrappers ship different default system prompts, different file\-editing semantics, different turn\-taking conventions, and different shell\-tool surfaces\. We hold the benchmark\-facing operations constant across them \(`fetch`, `run`, `submit`, `status`\) and ship the same per\-language language\-reference prompt \(the `CLAUDE.md` text reproduced in Sections [A\.12\.1](https://arxiv.org/html/2606.10933#A1.SS12.SSS1)–[A\.12\.4](https://arxiv.org/html/2606.10933#A1.SS12.SSS4), loaded as `CLAUDE.md` under Claude Code and as `AGENTS.md` under Codex and OpenCode\)\. The same capability ordering and the same large per\-language separation between frontier and weaker agents \(Table [1](https://arxiv.org/html/2606.10933#S2.T1)\) appear under all three independent wrappers, so the order\-of\-magnitude headline spread is not a single\-wrapper artifact\.
Two additional consistency checks support this interpretation\. First, within the Claude family running entirely under Claude Code, the 80\-point gap between Opus 4\.6 and Haiku 4\.5 on Brainfuck and Befunge\-98 cannot be attributed to per\-wrapper tooling differences, since wrapper and harness are held constant within the family\. Second, on the languages where direct authoring is feasible \(Whitespace and Shakespeare for the strongest two agents\) the between\-family ordering and absolute scores under three different wrappers fall within a narrow band, indicating that wrapper differences are second\-order relative to the capability differences the headline numbers reveal\.
##### Single\-model\-multiple\-wrappers OpenCode check\.
On top of this natural three\-wrapper diversity we ran an explicit single\-model\-multiple\-wrappers control: we re\-ran the strongest agent in each of the Claude and GPT\-5\.4 families \(Opus 4\.6 and GPT\-5\.4 xhigh\) under *OpenCode* on the two diagnostic languages where direct authoring is most fragile and where the body’s metaprogramming finding is most diagnostic \(Brainfuck and Befunge\-98\)\. All other protocol parameters are held fixed: the same EsoLang\-Bench task statements, the same six private hidden tests per problem, the same up\-to\-three\-submissions cap, the same unlimited local interpreter access, and the same per\-language language\-reference prompt loaded as `AGENTS.md`\.
Table 8: Single\-model\-multiple\-wrappers OpenCode check\. Solved problems out of 80 for the strongest agent in each native family on the two diagnostic languages, comparing the native wrapper to OpenCode while holding everything else fixed\. Each one\-problem step on this scale is 1\.25 percentage points; both models drop by 1–2 problems out of 80 under OpenCode, well within the Wilson 95% binomial CI of the native count, and far below the 50–80\-problem separations between frontier and weaker agents in the headline cells\.
The native\-versus\-OpenCode delta is at most −2 problems out of 80 in every cell\. For the two strongest agents on the two diagnostic languages, the OpenCode re\-run lands within the Wilson 95% CI of the native\-wrapper count and preserves the qualitative ordering \(Opus 4\.6 strong on Brainfuck, near\-ceiling on Befunge\-98; GPT\-5\.4 xhigh near\-ceiling on both\)\. Combined with the natural three\-wrapper diversity of the headline comparison, this confirms that the headline ordering and the order\-of\-magnitude per\-language separation between frontier and weaker agents are not artifacts of Claude Code or Codex specifically\.
### B\.5 Kimi K2\.5 under OpenCode
Kimi K2\.5 is run under OpenCode in the headline cells because the Anthropic Claude Code wrapper does not load Kimi K2\.5 as a native target and the OpenAI Codex wrapper does not host third\-party checkpoints under its function\-calling protocol\. The OpenCode runs use the same harness command interface \(`fetch`, `run`, `submit`\) and the same language\-reference prompt \(`AGENTS.md`\) as Codex\. The four headline Kimi K2\.5 cells solve 4/80 on Brainfuck, 5/80 on Befunge\-98, 25/80 on Whitespace, and 2/80 on Shakespeare; these are reproduced as percentages in Table [1](https://arxiv.org/html/2606.10933#S2.T1) of the main text and as raw counts in Table [7](https://arxiv.org/html/2606.10933#A2.T7)\. The headline cells already use the primary metaprogramming\-allowed protocol \(the agent is free to write a host\-language generator if it chooses\), so no separate “meta\-allowed” Kimi K2\.5 variant is reported: the headline numbers *are* the meta\-allowed numbers, and they place Kimi K2\.5 in the low\-performance regime that the body identifies as capability\-gated\.
### B\.6 Confidence intervals
Each headline cell in the main results is the Session 1 solved count \(Appendix [A\.7](https://arxiv.org/html/2606.10933#A1.SS7)\)\. The reported uncertainty for this Session 1 count is a 95% Wilson binomial confidence interval \(Wilson CI\) \[Wilson, [1927](https://arxiv.org/html/2606.10933#bib.bib63)\] on the raw count $k/80$ of that session; a bootstrap over the 80 per\-problem outcomes \(10,000 resamples\) produces quantitatively similar intervals\. Table [1](https://arxiv.org/html/2606.10933#S2.T1) in the body reports the worst\-side Wilson CI half\-width as a single $\pm$ subscript for readability; the full asymmetric Wilson CIs are reproduced in Table [9](https://arxiv.org/html/2606.10933#A2.T9) below\. Each cell shows percentage solved with a separate upper and lower half\-width, so that bounded cells \(near 0% or 100%\) are visible as one\-sided or near\-one\-sided\. Half\-widths are computed from the Wilson score formula on raw counts $k/80$ for per\-language cells and $k/320$ for the pooled EsoLang mean\.
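For readers who want to recompute the intervals, the following is a minimal sketch of the standard Wilson score interval and the asymmetric half\-widths described above. It is not the paper's harness code: the function names are illustrative, and the critical value $z \approx 1.96$ for a 95% interval is hard\-coded as an assumption.

```python
import math


def wilson_interval(k: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for k successes out of n trials."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n and n > 0")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at k = 0 or k = n.
    return max(0.0, center - half), min(1.0, center + half)


def half_widths(k: int, n: int) -> tuple[float, float]:
    """(lower, upper) half-widths around the observed proportion k/n."""
    lo, hi = wilson_interval(k, n)
    p = k / n
    return p - lo, hi - p


def worst_side(k: int, n: int) -> float:
    """Single +/- value: the larger of the two half-widths."""
    return max(half_widths(k, n))


# A near-ceiling cell (79/80) is strongly asymmetric: the lower
# half-width is several times larger than the upper one.
lower, upper = half_widths(79, 80)
```

Because the interval is centred away from the observed proportion near 0% and 100%, reporting both half\-widths keeps the one\-sided shape of bounded cells visible, which a single symmetric $\pm$ value hides.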
Table 9: Asymmetric Wilson 95% binomial confidence intervals on the headline cells of Table [1](https://arxiv.org/html/2606.10933#S2.T1)\.
##### On the independence assumption\.
The Wilson 95% binomial interval treats the 80 per\-problem outcomes within a cell as independent Bernoulli trials\. This assumption is not literally satisfied here: the persistent workspace lets earlier\-problem successes feed reusable primitives into later\-problem attempts \(see Section [3\.2](https://arxiv.org/html/2606.10933#S3.SS2) of the body and Appendix [C\.5](https://arxiv.org/html/2606.10933#A3.SS5)\), so the per\-problem outcomes are positively correlated and a binomial interval is best read as an approximate *per\-cell* dispersion estimate rather than a classical confidence statement on i\.i\.d\. draws\. Two pieces of triangulation guard against this\. First, we additionally run two further independent sessions per headline cell as session\-to\-session sanity checks \(per\-session counts in Appendix [B\.7](https://arxiv.org/html/2606.10933#A2.SS7)\); session\-level variation is the source most directly relevant to the inter\-agent comparisons in Table [1](https://arxiv.org/html/2606.10933#S2.T1)\. Second, the inter\-agent separations that drive the body’s claims are typically tens of problems out of 80 \(for example 79 vs\. 4 on Brainfuck and 80 vs\. 4 on Befunge\-98\), much larger than any plausible widening of the Wilson interval under violations of independence\.
### B\.7 Per\-session solved counts on the four esolangs
Table [1](https://arxiv.org/html/2606.10933#S2.T1) of the body reports the solved count from Session 1 of each model × language cell\. We additionally ran two further independent sessions \(Sessions 2 and 3\) per cell as session\-to\-session sanity checks; sessions are independent CLI re\-invocations of the same agent under the same primary protocol \(Appendix [A\.7](https://arxiv.org/html/2606.10933#A1.SS7)\), not seeded re\-runs\. Per\-session solved counts \(out of 80\) for all three sessions are reported in the four per\-language tables below\.
Across the 24 model × language cells, the maximum solved\-count range across Sessions 1, 2, and 3 is 2 problems \(Sonnet 4\.6 on Befunge\-98 and Opus 4\.6 on Shakespeare\); 11 of the 24 cells have range 0, 11 have range 1, and only 2 have range 2\. All such ranges are well inside the Wilson 95% half\-widths reported in Table [9](https://arxiv.org/html/2606.10933#A2.T9) \(typically 3 to 11 problems on $k/80$\), and the headline agent ordering on every language is preserved across all three sessions\.
Table 10: Per\-session solved counts on Brainfuck \(out of 80\)\. Run 1 reproduces the Brainfuck column of Table [1](https://arxiv.org/html/2606.10933#S2.T1)\.
Table 11: Per\-session solved counts on Befunge\-98 \(out of 80\)\. Run 1 reproduces the Befunge\-98 column of Table [1](https://arxiv.org/html/2606.10933#S2.T1)\.
Table 12: Per\-session solved counts on Whitespace \(out of 80\)\. Run 1 reproduces the Whitespace column of Table [1](https://arxiv.org/html/2606.10933#S2.T1)\.
Table 13: Per\-session solved counts on Shakespeare \(out of 80\)\. Run 1 reproduces the Shakespeare column of Table [1](https://arxiv.org/html/2606.10933#S2.T1)\.
### B\.8 Mainstream coding benchmarks: descriptions of the benchmarks compared in Table [2](https://arxiv.org/html/2606.10933#S3.T2)
Table [2](https://arxiv.org/html/2606.10933#S3.T2) of the body compares the same six agents across three mainstream coding benchmarks \(SWE\-Bench Verified, Terminal\-Bench 2\.0, LiveCodeBench v6\) and the EsoLang\-Bench four\-language mean\. Each benchmark targets a different aspect of coding capability; we describe each below so a reader can interpret why the spreads and SDs differ across them\.
##### SWE\-Bench Verified\.
SWE\-Bench Verified \[Jimenez et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib10)\] is a 500\-instance human\-curated subset of SWE\-Bench focused on real GitHub issues from widely used Python projects\. The agent receives an issue description and the repository state, and must produce a patch that resolves the issue and passes the project’s existing test suite\. We use SWE\-Bench Verified as the primary mainstream\-coding anchor because it has public\-report coverage on all six agents \(vendor\-published where available, Vals\.ai otherwise\) and gives the cleanest like\-for\-like contrast against EsoLang\-Bench\. Per\-agent sources are listed in Table [6](https://arxiv.org/html/2606.10933#A2.T6)\.
##### Terminal\-Bench 2\.0\.
Terminal\-Bench 2\.0 \[Merrill et al\., [2026](https://arxiv.org/html/2606.10933#bib.bib11)\] is the canonical agentic terminal benchmark, comprising 89 hard tasks in computer terminal environments inspired by real workflows\. Tasks span software engineering, security, machine learning, and system administration; each task has a unique environment, a human\-written reference solution, and comprehensive verification tests\. The benchmark mixes coding with file\-system, environment\-configuration, and other tool\-use work, so it tests a broader notion of agentic capability than narrow patch\-generation\. The six agents we evaluate span 33\.3 percentage points on it \(Table [2](https://arxiv.org/html/2606.10933#S3.T2)\)\. Per\-agent sources are listed in Table [6](https://arxiv.org/html/2606.10933#A2.T6)\.
##### LiveCodeBench v6\.
LiveCodeBench v6 \[Jain et al\., [2025](https://arxiv.org/html/2606.10933#bib.bib6)\] is a contamination\-resistant competitive\-programming benchmark whose problems are continuously collected from competitive\-programming platforms \(LeetCode, AtCoder, Codeforces\) and tagged with a release date so models can be evaluated on problems released after their training cutoff\. The v6 release covers problems from May 2023 to April 2025\. Coverage in current frontier vendor reports is partial: the only model in our six\-agent set with an officially reported LiveCodeBench v6 number is Kimi K2\.5 at 85\.0% \[Moonshot AI, [2026](https://arxiv.org/html/2606.10933#bib.bib60)\]; Anthropic and OpenAI do not report LiveCodeBench v6 in the Claude 4\.5/4\.6 or GPT\-5\.4 model cards we cite\. To construct a complete six\-agent column for Table [2](https://arxiv.org/html/2606.10933#S3.T2), we therefore use the Vals\.ai LiveCodeBench v6 leaderboard \[Vals AI, [2026a](https://arxiv.org/html/2606.10933#bib.bib65)\] uniformly for all six agents, including Kimi K2\.5 \(which Vals\.ai records at 83\.9% under its bash\-tool\-only harness, near the Moonshot\-reported 85\.0%\); using one source for all rows keeps the comparison harness\-consistent\.
##### HumanEval and MBPP \(omitted due to saturation\)\.
HumanEval \[Chen et al\., [2021](https://arxiv.org/html/2606.10933#bib.bib1)\] and MBPP \[Austin et al\., [2021](https://arxiv.org/html/2606.10933#bib.bib2)\] are early single\-function generation benchmarks\. Frontier coding agents in the 4\.5/4\.6 generation saturate both at or near ceiling\. Vendors in this generation generally stopped publishing HumanEval and MBPP scores in their model cards because the comparison is uninformative; where older versions of these models did publish numbers, both benchmarks were typically reported at 95%–100% pass@1, with no meaningful separation across the six agents we evaluate\. We therefore do not include HumanEval or MBPP in the body\. We retain the citations in the related\-work section for historical context\.
##### Composite sensitivity check\.
A common\-three composite that averages SWE\-Bench Verified, Terminal\-Bench 2\.0, and LiveCodeBench v6 per agent gives the same qualitative conclusion as each benchmark individually: the same six agents fan out far more on EsoLang\-Bench than on the composite, and the EsoLang\-Bench SD remains several times larger than the composite SD\.
## Appendix C Trace examples and qualitative coding
### C\.1 Successful generator workflow
A representative successful Brainfuck workflow has four stages\. First, the agent writes a generator in a familiar language\. Second, the generator emits a target Brainfuck program and saves it to disk\. Third, the agent runs local tests through the harness interpreter\. Fourth, failures are fixed in the generator, not by hand\-editing the generated Brainfuck\. This creates a reusable abstraction layer for subsequent problems\.
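The four stages can be sketched end to end in a few dozen lines. The example below is a hedged illustration rather than any agent's actual code: `generate_print`, `run_bf`, and `build_and_check` are hypothetical names, and `run_bf` is a stand\-in for the harness interpreter that assumes 8\-bit wrapping cells and a read of 0 at end of input, which may differ from the benchmark's interpreter.

```python
from pathlib import Path


def generate_print(text: str) -> str:
    """Stage 1: host-language generator. Emits Brainfuck that prints
    `text` from one cell, moving by the delta between characters."""
    code, current = [], 0
    for ch in text:
        delta = ord(ch) - current
        code.append(("+" if delta > 0 else "-") * abs(delta) + ".")
        current = ord(ch)
    return "".join(code)


def run_bf(code: str, stdin: str = "", cells: int = 30000) -> str:
    """Stage 3 stand-in for the local interpreter (assumed semantics)."""
    stack, match = [], {}
    for i, c in enumerate(code):  # precompute matching brackets
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            match[i], match[j] = j, i
    tape, ptr, pc, out, inp = [0] * cells, 0, 0, [], iter(stdin)
    while pc < len(code):
        c = code[pc]
        if c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == ",":
            tape[ptr] = ord(next(inp, "\0")) % 256
        elif c == "[" and tape[ptr] == 0:
            pc = match[pc]
        elif c == "]" and tape[ptr] != 0:
            pc = match[pc]
        pc += 1
    return "".join(out)


def build_and_check(text: str, workdir: Path) -> bool:
    """Stages 2-4: save the emitted program, run it locally, compare.
    On a mismatch the fix belongs in generate_print, not in the .bf file."""
    path = workdir / "solution.bf"
    path.write_text(generate_print(text))
    return run_bf(path.read_text()) == text
```

The point of the structure is that the `.bf` file is a build artifact: it is always regenerated from the Python source, so every fix made for one problem carries over to later ones.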
### C\.2 Failure mode: shallow local iteration
Weaker runs often show many local tool calls without a stable intermediate representation\. The agent repeatedly edits target\-language code, tests one example, and submits before constructing robust numeric I/O or state layout primitives\. This can solve trivial problems but tends to fail once hidden tests exercise edge cases\.
### C\.3 Failure mode: strategy without execution
In strategy\-transfer runs, weaker agents receive explicit instructions to use generator\-based solving, decimal representations, tape layouts, and local verification\. The common failure is not rejecting the strategy; it is failing to implement the strategy robustly\. The agent may create a generator but still emit incorrect target code, omit edge\-case tests, or patch generated code by hand\.
### C\.4 Trace selection rule
The transcript excerpts in Appendix [C\.8](https://arxiv.org/html/2606.10933#A3.SS8) are drawn from a single recorded session \(Opus 4\.6 on Brainfuck with metaprogramming allowed, run under the relaxed diagnostic harness described there\)\. Excerpts are selected to illustrate four pre\-specified phenomena \(generator emergence at the first multi\-digit\-arithmetic problem, library composition under cell pressure, sequence\-level revisitation, and substrate\-aware algorithmic substitution\) rather than chosen for narrative effect\. Turn numbers refer to the deduplicated assistant line index in the underlying `.jsonl` transcript\.
### C\.5 Metaprogramming emergence trace: Opus 4\.6 on Brainfuck E04
Section [3\.3](https://arxiv.org/html/2606.10933#S3.SS3) describes how metaprogramming emerges reactively on Brainfuck around E04\. The file\-level evidence from the metaprogramming\-allowed Opus 4\.6 run is summarized below\. Submission sizes are reported in bytes; the “solver” column indicates whether the submission was hand\-written or emitted by a generator script\.
The `bflib.py` interface, constructed between E04 submissions 1 and 2 and reused for the rest of the run, exposes:
- A `BF` class that tracks the tape pointer in Python and emits Brainfuck text incrementally\. Core position\-keyed primitives: `goto(pos)`, `inc(pos, val)`, `zero(pos)`, `set_val(pos, val)`, `read(pos)`, `write(pos)`\.
- Movement and copy primitives: `move_to(src, dst)` \(destructive add\), `move_to2(src, dst1, dst2)` \(destructive duplicate\), `copy(src, dst, tmp)` \(non\-destructive via temp\), `sub_from(src, dst)` \(destructive subtract\)\.
- A `CellAlloc` class for deterministic cell allocation, including `alloc_bcd(ndigits)` for binary\-coded\-decimal numbers \(digits plus sign cell\)\.
- Conditional helpers `check_eq(bf, cell, value, flag, tmp)` and `if_nonzero(bf, cell, body_fn, tmp)` for compiling branches over unsigned\-byte cells\.
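To make the shape of such an interface concrete, here is a minimal sketch of a pointer\-tracking builder covering a subset of the primitives named above. It is a reconstruction from the listed method names only, not the agent's `bflib.py`: the method bodies, the `loop` and `text` helpers, and the small `run` checker \(8\-bit wrapping cells, no input\) are all assumptions.

```python
class BF:
    """Emits Brainfuck while tracking the tape pointer on the Python side."""

    def __init__(self):
        self.code = []
        self.ptr = 0

    def emit(self, s):
        self.code.append(s)

    def goto(self, pos):
        delta = pos - self.ptr
        self.emit(">" * delta if delta >= 0 else "<" * -delta)
        self.ptr = pos

    def inc(self, pos, val=1):
        self.goto(pos)
        self.emit("+" * val if val >= 0 else "-" * -val)

    def zero(self, pos):
        self.goto(pos)
        self.emit("[-]")

    def set_val(self, pos, val):
        self.zero(pos)
        self.inc(pos, val)

    def write(self, pos):
        self.goto(pos)
        self.emit(".")

    def loop(self, pos, body):
        """Hypothetical helper: while cell[pos] != 0, run body().
        Returning to pos before ']' keeps the tracked pointer valid."""
        self.goto(pos)
        self.emit("[")
        body()
        self.goto(pos)
        self.emit("]")

    def move_to(self, src, dst):
        """Destructive add: dst += src, src = 0."""
        self.loop(src, lambda: (self.inc(src, -1), self.inc(dst, 1)))

    def copy(self, src, dst, tmp):
        """Non-destructive copy via tmp: dst = src, src restored."""
        self.zero(dst)
        self.zero(tmp)
        self.loop(src, lambda: (self.inc(src, -1), self.inc(dst, 1),
                                self.inc(tmp, 1)))
        self.move_to(tmp, src)

    def text(self):
        return "".join(self.code)


def run(code, cells=64):
    """Tiny checker (assumed semantics); returns (output, tape)."""
    stack, match = [], {}
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            match[i], match[j] = j, i
    tape, ptr, pc, out = [0] * cells, 0, 0, []
    while pc < len(code):
        c = code[pc]
        if c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == "[" and tape[ptr] == 0:
            pc = match[pc]
        elif c == "]" and tape[ptr] != 0:
            pc = match[pc]
        pc += 1
    return "".join(out), tape


bf = BF()
bf.set_val(0, 65)   # cell 0 = 'A'
bf.copy(0, 1, 2)    # cell 1 = cell 0, using cell 2 as scratch
bf.write(1)
output, tape = run(bf.text())
```

Addressing cells by position rather than by relative moves is what lets later primitives compose safely: each call begins with `goto`, so no primitive depends on where the previous one left the pointer.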
The reactive Brainfuck pattern \(direct authoring until a problem requires multi\-digit arithmetic; library construction at the first failure; generator\-emitted output thereafter\) contrasts with Whitespace, where the generator \(an 84\-line stack\-machine assembler exposing `push`, `dup`, `swap`, `add`, `sub`, `mod`, `store`, `retrieve`, `out_num`, `read_num`, labeled jumps, and `end`\) is built before E01 and used from the first submission onward\. We treat the language\-conditioned emergence pattern as behavioral evidence: it can be read off file artifacts and submission sizes without inferring agent cognition\.
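A stack\-machine assembler of this kind can be very small. The sketch below is a hypothetical illustration, not the agent's 84\-line file: it covers the non\-jump mnemonics named above, uses the standard Whitespace instruction encoding written in the S/T/L notation of the reference card \(space, tab, linefeed\), and omits labeled jumps. The names `OPCODES`, `number`, and `WsAsm` are assumptions.

```python
# Standard Whitespace encodings in S/T/L notation (space, tab, linefeed).
OPCODES = {
    "dup": "SLS", "swap": "SLT",                   # stack manipulation
    "add": "TSSS", "sub": "TSST", "mod": "TSTT",   # arithmetic
    "store": "TTS", "retrieve": "TTT",             # heap access
    "out_num": "TLST", "read_num": "TLTT",         # numeric I/O
    "end": "LLL",                                  # program end
}


def number(n: int) -> str:
    """Sign (S = positive, T = negative), binary digits (S = 0, T = 1),
    then the L terminator. The sign symbol is always present."""
    sign = "T" if n < 0 else "S"
    bits = format(abs(n), "b").replace("0", "S").replace("1", "T")
    return sign + bits + "L"


class WsAsm:
    """Collects instructions; chainable so programs read top to bottom."""

    def __init__(self):
        self.parts = []

    def push(self, n: int):
        self.parts.append("SS" + number(n))
        return self

    def op(self, name: str):
        self.parts.append(OPCODES[name])
        return self

    def readable(self) -> str:
        return "".join(self.parts)

    def source(self) -> str:
        """Real program text: S, T, L mapped to space, tab, linefeed."""
        return self.readable().translate(str.maketrans("STL", " \t\n"))


# Read an integer and print it back, following the heap-based I/O rule:
# push address, read (stores to heap), push address, retrieve, print.
echo = (WsAsm().push(0).op("read_num")
               .push(0).op("retrieve")
               .op("out_num").op("end"))
```

Keeping a readable S/T/L form alongside the real whitespace output makes failed runs debuggable, since the emitted program is otherwise invisible in a terminal or diff.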
### C\.6 GPT\-5\.4 xhigh on Brainfuck E04: generator excerpt
GPT\-5\.4 xhigh’s E04 generator follows the same broad shape as Opus’s `bflib.py` \(a Python class that tracks tape position and emits Brainfuck incrementally\) but with a different surface API\. The first forty lines of the generator class are reproduced below\. The full file is in the supplementary material\.
```
class BF:
    def __init__(self):
        self.code = []
        self.ptr = 0
        self.cells = {}
        self.next_pos = 0

    def alloc(self, name, count=1):
        start = self.next_pos
        for i in range(count):
            key = name if count == 1 else f"{name}{i}"
            self.cells[key] = start + i
        self.next_pos += count
        return start

    def move_to(self, name_or_pos):
        pos = name_or_pos if isinstance(name_or_pos, int) \
            else self.cells[name_or_pos]
        delta = pos - self.ptr
        if delta > 0: self.code.append(">" * delta)
        elif delta < 0: self.code.append("<" * (-delta))
        self.ptr = pos

    def clear(self, cell):
        self.move_to(cell); self.code.append("[-]")

    def add_const(self, cell, value):
        if value == 0: return
        self.move_to(cell)
        self.code.append("+" * value if value > 0 else "-" * (-value))

    def copy(self, src, dst, tmp):
        # non-destructive copy via tmp; restores src
        self.clear(dst); self.clear(tmp)
        self.move_to(src); self.code.append("[")
        ...
```
The cross\-lab agreement is on *structure* rather than surface syntax: a class that owns the tape pointer, a cell\-allocator that makes layouts deterministic, primitive movement and copy operations, and decimal\-arithmetic helpers built on top\.
### C\.7 Cross\-language E04 generators \(artifact\)
This section reproduces excerpts of the actual GPT\-5\.4 xhigh generator programs that emit the Brainfuck solution to problem E04 of EsoLang\-Bench \(signed\-decimal addition with multi\-digit input parsing\) under three host\-language conditions: Python, JavaScript, and Rust\. Per\-cell scores are reported inline in Section [3\.3](https://arxiv.org/html/2606.10933#S3.SS3) of the body\. The full source files \(`gen_E04.py`, `gen_E04.js`, `gen_E04.rs`\) ship in the supplementary archive at `supplementary_code/experiments/04_cross_language_transfer/<host>/brainfuck/` where `<host>` is one of `python`, `javascript`, `rust`\. The same algorithmic shape appears in all three: each generator declares a tape\-cell layout, defines low\-level Brainfuck primitives \(`clear`, `move`, `copy`, `add_const`, `eq_const`, `if_flag`\), and composes those primitives into BCD arithmetic and decimal printing\. The host language differs in syntactic surface only; the conceptual machine the agent is constructing is the same across the three\.
##### Cell layout \(parallel structure across host languages\)\.
Each generator allocates named tape cells and digit arrays at the top of the file; the slot indices are cosmetic, but the named groups \(magnitudes, positive accumulator, negative accumulator, result, sign, comparison flags, carry, scratch\) line up across all three host languages\.
```
# gen_E04.py (Python)
C = 0; A = 1; B = 2; TMP = 3; FLAG = 4; AUX = 5; OUT = 6
# arithmetic helpers (mul10, move_into, parse_*) operate on these
# named cells; the bfasm.BF builder owns the pointer.

// gen_E04.js (JavaScript)
const C = {
  ch: 0, sign: 1, digit: 2, delim: 3, cont: 4,
  eq1: 5, eq2: 6, tmp1: 7, tmp2: 8, ws: 9,
  posBranch: 10, carry: 11, over: 12, ...,
  mag: range(30), pos: range(40), neg: range(50), result: range(60),
};
const b = new BFBuilder();

// gen_E04.rs (Rust)
const C: usize = 0; const NUM2: usize = 1; const IN_NUMBER: usize = 2;
const SIGN_A: usize = 3; const SIGN_B: usize = 4;
const A: [usize; 9] = [5, 6, 7, 8, 9, 10, 11, 12, 13];
const B: [usize; 9] = [14, 15, 16, 17, 18, 19, 20, 21, 22];
const R: [usize; 10] = [23, 24, 25, 26, 27, 28, 29, 30, 31, 32];
```
##### Brainfuck\-emitter primitives\.
The three host languages each carry a small Brainfuck\-emitter object or builder that owns the tape pointer and writes characters into a growing string\. The primitives are essentially identical; the syntax differs\.
```
# Python (uses bfasm.BF builder)
def mul10(bf, cell, tmp):
    bf.clear(tmp); bf.move(cell)
    bf.emit("[-"); bf.move(tmp); bf.emit("++++++++++"); bf.move(cell); bf.emit("]")
    bf.move(tmp); bf.emit("[-"); bf.move(cell); bf.emit("+"); bf.move(tmp); bf.emit("]")

// JavaScript (BFBuilder)
function clearArray(cells) { for (const cell of cells) b.clear(cell); }
function shiftAppendDigit(cells) {
  for (let i = DIGITS - 1; i >= 1; i -= 1) {
    b.clear(cells[i]); b.transfer(cells[i - 1], cells[i]);
  }
  b.clear(cells[0]); b.transfer(C.digit, cells[0]);
}

// Rust (Bf struct, methods on &mut Bf)
fn move_value(&mut self, src: usize, dst: usize) {
    self.clear(dst); self.move_to(src); self.raw("[");
    self.sub(1); self.move_to(dst); self.add(1);
    self.move_to(src); self.raw("]");
}
fn copy_value(&mut self, src: usize, dst: usize, scratch: usize) { ... }
```
##### Top\-level driver\.
After parsing the two signed integers into per\-digit buckets, all three generators perform a magnitude compare and either add the two absolute values \(same sign\) or subtract the smaller from the larger \(opposite signs\), then print the result with a sign character if needed\.
```
# gen_E04.py (top-level)
def main():
    bf = BF()
    parse_first_number(bf)
    parse_second_number(bf)
    output_byte_decimal(bf, A)  # full pipeline: parse, add, print

// gen_E04.js (top-level)
b.input(C.ch);
skipWhitespace();
parseIntIntoBuckets();
skipWhitespace();
parseIntIntoBuckets();
compareArrays(C.pos, C.neg);
b.isNonzero(C.lt, C.negBranch, C.tmp1, C.tmp2);
b.set(C.posBranch, 1);
b.ifFlag(C.negBranch, () => {
  b.clear(C.posBranch);
  b.printConst(C.out, 45); // ASCII '-'
  subtractArrays(C.neg, C.pos);
  printDecimalArray(C.result);
});
b.ifFlag(C.posBranch, () => {
  subtractArrays(C.pos, C.neg);
  printDecimalArray(C.result);
});
process.stdout.write(b.toString());

// gen_E04.rs (top-level)
fn main() {
    let mut bf = Bf::new();
    parse_input(&mut bf);
    compute_sum(&mut bf);
    print_result(&mut bf);
    println!("{}", bf.finish());
}
```
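The sign\-and\-magnitude decomposition that the drivers share can also be written directly in the host language as a reference model against which generated Brainfuck output is compared. The sketch below is an illustrative assumption rather than part of the supplementary archive: every function name is hypothetical, and digit lists stand in for the per\-digit tape buckets.

```python
def parse(text: str) -> tuple[int, list[int]]:
    """Split a signed decimal string into (sign, digits), most
    significant digit first, as the generators do on the tape."""
    sign = -1 if text.startswith("-") else 1
    return sign, strip([int(c) for c in text.lstrip("+-")])


def strip(digits: list[int]) -> list[int]:
    """Drop leading zeros but keep at least one digit."""
    i = 0
    while i < len(digits) - 1 and digits[i] == 0:
        i += 1
    return digits[i:]


def compare(a: list[int], b: list[int]) -> int:
    """Magnitude compare on stripped digit lists: -1, 0, or 1."""
    if len(a) != len(b):
        return 1 if len(a) > len(b) else -1
    return (a > b) - (a < b)


def add_mag(a: list[int], b: list[int]) -> list[int]:
    result, carry = [], 0
    for i in range(1, max(len(a), len(b)) + 1):
        s = carry
        s += a[-i] if i <= len(a) else 0
        s += b[-i] if i <= len(b) else 0
        result.append(s % 10)
        carry = s // 10
    if carry:
        result.append(carry)
    return result[::-1]


def sub_mag(a: list[int], b: list[int]) -> list[int]:
    """a - b on magnitudes; requires a >= b."""
    result, borrow = [], 0
    for i in range(1, len(a) + 1):
        d = a[-i] - borrow - (b[-i] if i <= len(b) else 0)
        borrow = 1 if d < 0 else 0
        result.append(d + 10 * borrow)
    return strip(result[::-1])


def add_signed(x: str, y: str) -> str:
    (sa, a), (sb, b) = parse(x), parse(y)
    if sa == sb:                      # same sign: add magnitudes
        sign, mag = sa, add_mag(a, b)
    elif compare(a, b) >= 0:          # opposite signs: larger minus smaller
        sign, mag = sa, sub_mag(a, b)
    else:
        sign, mag = sb, sub_mag(b, a)
    if mag == [0]:
        sign = 1                      # never print "-0"
    return ("-" if sign < 0 else "") + "".join(map(str, mag))
```

A model like this doubles as a local test oracle: the generator's output can be run on random inputs and compared against `add_signed` before any submission is spent.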
##### Reading the artifact\.
Three observations stand out across the three generators\. First, the host language is being used as a typed scaffolding layer for the Brainfuck tape: named cells, named digit arrays, named flags, and named primitives\. The agent is not producing Brainfuck symbols directly; it is constructing a small domain\-specific assembler in the host language and emitting symbols through that\. Second, the algorithmic decomposition is the same across the three host languages: parse one signed integer, parse the other, magnitude\-compare, branch into add or subtract, print with an optional sign character\. The host language is incidental to that algorithm\. Third, the syntactic differences are visible \(e\.g\. Rust’s closures and ownership, JavaScript’s arrow callbacks for `ifFlag`, Python’s flat function\-style emitters\), but the structure of what the agent ships to the Brainfuck interpreter is unchanged\. This is consistent with the body’s claim that the metaprogramming benefit comes from access to a familiar general\-purpose host language for constructing target programs, not from Python specifically\.
### C\.8 Selected transcript excerpts: Opus 4\.6 on Brainfuck
The following excerpts are verbatim from the recorded Claude Code session transcript \(`claude_transcript_1.jsonl`\) for an Opus 4\.6 Brainfuck run conducted under a relaxed diagnostic harness in which metaprogramming was allowed and the agent could skip unsolved problems and revisit them later; hidden tests, scoring rule, and workspace isolation were unchanged from the primary protocol\. The headline cells in Table [1](https://arxiv.org/html/2606.10933#S2.T1) use the strict primary protocol and do not come from this transcript; we use this single recorded session *only* as a qualitative trace source\. The excerpts illustrate four narratively distinct moments: the emergence of generator\-mediated solving, library composition under cell\-allocation pressure, sequence\-level portfolio behavior \(the relaxed revisit policy is what makes this moment visible\), and substrate\-aware algorithmic substitution\. Turn numbers refer to the ordered, deduplicated assistant/user line index in the transcript\.
##### \(1\) Emergence at E04 *Sum Two Integers*\.
After hand\-authoring E01 through E03 \(programs of 5 to 115 bytes\), the agent fetches E04 and immediately switches strategy:
> Turn 66 \(assistant\)\. “Let me write a Python BF generator to handle this and future problems\.”
The first generator \(single\-byte cells\) submits and scores 2/6\. The agent diagnoses overflow and missing sign handling rather than patching the emitted Brainfuck:
> Turn 86 \(assistant\)\. “Both broken\. The issue is overflow and negative numbers\. I need multi\-byte arithmetic and sign handling\. Let me rewrite the generator with proper multi\-byte support\.”
The rewrite introduces binary\-coded\-decimal cells with a sign byte and emits a 24,500\-byte Brainfuck program that passes 6/6 on submission 2\.
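The overflow diagnosis and the decimal\-digit fix can be illustrated in the host language. The following is a minimal Python sketch, not the agent's generator: it assumes a hypothetical representation (a sign flag plus a list of decimal digits, most significant first), and all function names are illustrative. It follows the decomposition described earlier: parse both operands, compare magnitudes, branch into add or subtract, and print with an optional sign.

```python
def wrap8(x):
    """Model one 8-bit Brainfuck cell: arithmetic wraps modulo 256."""
    return x % 256

def parse_signed(text):
    """Return (is_negative, digits) for a decimal string such as '-42'."""
    text = text.strip()
    return text.startswith("-"), [int(ch) for ch in text.lstrip("+-")]

def strip_zeros(digits):
    """Drop leading zeros but always keep at least one digit."""
    i = 0
    while i < len(digits) - 1 and digits[i] == 0:
        i += 1
    return digits[i:]

def pad(a, b):
    """Left-pad both digit lists with zeros to a common length."""
    n = max(len(a), len(b))
    return [0] * (n - len(a)) + a, [0] * (n - len(b)) + b

def cmp_mag(a, b):
    """Compare magnitudes: -1, 0, or 1."""
    a, b = pad(a, b)
    return (a > b) - (a < b)

def add_mag(a, b):
    """Digit-by-digit addition with carry."""
    a, b = pad(a, b)
    out, carry = [], 0
    for x, y in zip(reversed(a), reversed(b)):
        total = x + y + carry
        out.append(total % 10)
        carry = total // 10
    if carry:
        out.append(carry)
    return out[::-1]

def sub_mag(a, b):
    """Digit-by-digit a - b with borrow; requires |a| >= |b|."""
    a, b = pad(a, b)
    out, borrow = [], 0
    for x, y in zip(reversed(a), reversed(b)):
        d = x + 10 - y - borrow  # adding 10 keeps the intermediate non-negative
        out.append(d % 10)
        borrow = 0 if d >= 10 else 1
    return strip_zeros(out[::-1])

def signed_add(x, y):
    """Add two signed decimal strings without ever exceeding one digit per cell."""
    neg_x, a = parse_signed(x)
    neg_y, b = parse_signed(y)
    if neg_x == neg_y:
        neg, mag = neg_x, add_mag(a, b)
    elif cmp_mag(a, b) >= 0:
        neg, mag = neg_x, sub_mag(a, b)
    else:
        neg, mag = neg_y, sub_mag(b, a)
    mag = strip_zeros(mag)
    if mag == [0]:
        neg = False  # never print -0
    return ("-" if neg else "") + "".join(str(d) for d in mag)
```

For example, `wrap8(200 + 100)` yields 44 because a single cell wraps, whereas `signed_add("200", "100")` yields `"300"`.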
##### \(2\) Library composition and cell\-allocation pressure at X12 *LCM*\.
After M06 produced a GCD primitive and earlier problems produced multiplication, the agent composes them rather than authoring fresh:
> Turn 1063 \(assistant\)\. “LCM$(a,b)=a\cdot b/\mathrm{gcd}(a,b)$\. I have GCD \(M06 approach\) and multiply\. Need BCD division too\. Actually, easier: LCM$(a,b)=a/\mathrm{gcd}(a,b)\cdot b$\. Divide $a$ by gcd first \(smaller intermediate\), then multiply by $b$\.”
Within a few turns the cell layout starts breaking:
> Turn 1067 \(assistant\)\. “The cell allocation is getting out of hand\. Let me use a simpler approach: compute $a/\mathrm{gcd}$ first \(by repeated subtraction of gcd from $a$, counting\), then multiply result by $b$\.”
The capability on display is not a single trick but a cumulative library plus the judgment to back off when cell pressure exceeds what the existing layout supports\.
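The simplified plan from Turn 1067 can be sketched at the algorithm level. This is an illustrative Python sketch only: the transcript does not show how the M06 GCD primitive works, so the subtractive Euclid loop below is an assumption, and the function names are hypothetical. The point is that every step uses only subtraction, comparison, counting, and multiplication.

```python
def gcd_by_subtraction(a, b):
    """Subtractive Euclid for positive integers (assumed form of the GCD primitive)."""
    while a != b:
        if a > b:
            a -= b
        else:
            b -= a
    return a

def quotient_by_counting(a, d):
    """Exact quotient a / d by repeated subtraction; assumes d divides a."""
    count = 0
    while a > 0:
        a -= d
        count += 1
    return count

def lcm(a, b):
    """LCM(a, b) = (a / gcd(a, b)) * b, dividing first to keep intermediates small."""
    g = gcd_by_subtraction(a, b)
    return quotient_by_counting(a, g) * b
```

Dividing before multiplying keeps the largest intermediate value at most the final LCM, which matters when every extra decimal digit costs tape cells.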
##### \(3\) Sequence\-level portfolio behavior\.
The 80\-problem sequence is treated as a portfolio rather than a strict sequential walk\. Mid\-run, the agent batch\-skips problems to preserve forward momentum, then revisits unsolved cells with the now\-mature library:
> Turn 858 \(assistant\)\. “Let me go back and solve some of the skipped problems\. X09 \(Bracket Depth Maximum\) and X12 \(LCM\) look feasible\.”
After a second pass the agent triages by expected return:
> Turn 1398 \(assistant\)\. “23 problems to retry\. Let me prioritize by likely impact: problems where I can improve my score significantly\. I’ll work through them systematically\.”
This is the source of several of the 0/6 $\to$ 6/6 reversals on extra\-hard problems \(*Bracket Depth Maximum*, *Count Inversions*, *Roman to Integer*\): a first\-pass attempt failed, the problem was set aside, and a second\-pass attempt succeeded once the library and primitive set had grown\.
##### \(4\) Substrate\-aware algorithmic substitution\.
On a string\-rotation problem, the agent rejects an elegant idiomatic algorithm in favor of one that composes more cleanly with the primitives it already has:
> Turn 1401 \(assistant\)\. “String rotation: $S_2$ is a rotation of $S_1$ iff $S_2$ is a substring of $S_1+S_1$\. But substring search is hard\. Alternative: try each rotation offset and check equality\.”
This pattern recurs across the run: when a textbook algorithm requires a missing substrate primitive, the agent silently rewrites the algorithm rather than extending the substrate\. We read this as evidence that the metaprogramming layer is itself an object the agent reasons about\.
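The two algorithms named in Turn 1401 can be compared side by side. The Python sketch below is illustrative (the function names are hypothetical and the agent's emitted Brainfuck is not shown): the first version relies on substring search, while the second needs only indexed reads, a modular offset, and an equality check per character.

```python
def is_rotation_by_substring(s1, s2):
    """Textbook test: s2 is a rotation of s1 iff s2 occurs in s1 + s1."""
    return len(s1) == len(s2) and s2 in s1 + s1

def is_rotation_by_offsets(s1, s2):
    """Substitute test: try every rotation offset and compare character by character."""
    if len(s1) != len(s2):
        return False
    n = len(s1)
    if n == 0:
        return True
    for offset in range(n):
        if all(s1[(offset + i) % n] == s2[i] for i in range(n)):
            return True
    return False
```

Both functions agree on every input; the second trades asymptotic elegance for a shape that maps onto fixed cell layouts and explicit loops.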
## Appendix D Contamination and overlap
We do not claim distributional novelty\. We do report the following overlap checks, which together support the narrower claim that these interfaces are low\-ecosystem relative to mainstream programming languages\.
##### Public\-code frequency\.
Querying a standard open code corpus for file extensions associated with each target language gives, relative to Python files, frequency ratios of approximately $10^{-5}$ for Brainfuck, $10^{-6}$ for Befunge\-98 and Whitespace, and $10^{-7}$ for Shakespeare\. The full query protocol and the exact ratios are given in the supplementary material; the qualitative point is that all four esolangs are several orders of magnitude rarer than mainstream programming languages\.
##### Hidden\-test isolation\.
The private hidden tests for every problem in EsoLang\-Bench were authored specifically for the benchmark and are not available to any agent\. The public statement of each problem contains the natural\-language specification and input/output examples, but not the private tests used for grading\.
##### $n$\-gram overlap\.
For a sample of 20 Brainfuck problem statements, maximum $n$\-gram overlap with publicly scraped Brainfuck corpora is dominated by generic tokens \(`+[>]`, `[-]`, numerical output idioms\) rather than by statement text\. We treat this as a weak positive control on statement\-level novelty without claiming distributional novelty\.
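One way such an overlap check could be computed is sketched below in Python. This is an assumption-laden illustration, not the paper's protocol: the value of `n`, the whitespace tokenization, and the definition of overlap (the largest fraction of a statement's n-grams that also appear in any single corpus document) are all hypothetical choices, as are the function names.

```python
def ngrams(tokens, n):
    """Set of contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def max_ngram_overlap(statement, corpus_docs, n=3):
    """Largest share of the statement's n-grams found in any one corpus document."""
    target = ngrams(statement.split(), n)
    if not target:
        return 0.0
    best = 0.0
    for doc in corpus_docs:
        shared = target & ngrams(doc.split(), n)
        best = max(best, len(shared) / len(target))
    return best
```

Inspecting which n-grams are shared, rather than only the score, is what distinguishes generic idioms from copied statement text.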
## Appendix E Per\-language per\-tier solve distribution
For every completed cell, the supplementary archive’s per\-cell `export.json` \(produced by `python harness.py export`\) contains the split of solves across the four difficulty tiers \(easy, medium, hard, extra\-hard\)\. The qualitative pattern across cells is consistent: solves concentrate in easy and medium tiers for weaker agents, and spread more evenly across all four tiers for stronger agents\. On Brainfuck, all thirteen of Opus 4\.6’s extra\-hard\-tier solves in the budget\-30 run use generator\-based construction; none are produced by direct target\-language authoring\.
## Appendix F Sample programs in the four esolangs
This appendix gives short illustrative excerpts from each of the four esolangs we evaluate, drawn from agent submissions\. The point is to make concrete why direct authoring is feasible for some languages and not others, and why metaprogramming emerges asymmetrically in the strong\-agent runs \(Section [3\.3](https://arxiv.org/html/2606.10933#S3.SS3)\)\.
##### Brainfuck \(E03 *Hello Name*\)\.
A small hand\-written program that reads a name and prints “Hello, <name\>\!”\. Direct authoring is feasible for problems below the multi\-digit\-arithmetic threshold; programs are typically tens to hundreds of bytes\.
```
++++++++[>++++++++++<-]>++.<+++[>+++++<-]>.+++++++..+++.
[input loop emits each name character; closing "!" added]
```
##### Befunge\-98 \(E07 *Maximum of two integers*\)\.
Befunge\-98 is a 2D stack\-based language; the program counter moves in cardinal directions through a code grid\. The maximum operator is expressed as a 1D strip with comparison and direction flips\.
```
&&\:\:\`\\!|.@
>\.@
```
The two integers are read with `&`, the comparison `` \:\:\` `` produces a boolean, and the conditional `|` routes execution upward or downward depending on the result\.
##### Whitespace \(E01 *Hello world*\)\.
Whitespace uses only space, tab, and newline characters; all other characters are ignored as comments\. Because the source is invisible to a human reader, throughout this paragraph we render the three significant whitespace characters with the placeholder letters `S` \(space\), `T` \(tab\), and `L` \(linefeed\); the real source contains the literal whitespace characters\. The following placeholder\-rendered program pushes the string `Hi` onto the stack and prints it\.
```
[push 'H' = 72]  SS S TSSTSSS L
[print char]     TL SS
[push 'i' = 105] SS S TTSTSST L
[print char]     TL SS
[end]            LLL
```
Whitespace programs are typically short stack\-machine sequences of `push`, `dup`, `add`, `out_char`, `end`; the strong agents build a small assembler\-style generator before E01 and emit programs from it\.
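An assembler-style generator of the kind described can be very small. The Python sketch below is illustrative and is not taken from any agent's workspace; the names `push`, `OUT_CHAR`, `END`, `print_string`, and `to_source` are hypothetical. It follows the standard Whitespace encoding: a push is space-space, a sign (space for positive, tab for negative), the binary magnitude (space for 0, tab for 1), and a terminating linefeed.

```python
def push(n):
    """Placeholder encoding of 'push n': SS, sign, binary magnitude, L."""
    sign = "T" if n < 0 else "S"
    bits = format(abs(n), "b").replace("0", "S").replace("1", "T")
    return "SS" + sign + bits + "L"

OUT_CHAR = "TLSS"  # pop the top of the stack and print it as a character
END = "LLL"        # terminate the program

def print_string(text):
    """Emit push/print pairs for each character, then the end instruction."""
    return "".join(push(ord(ch)) + OUT_CHAR for ch in text) + END

def to_source(placeholders):
    """Convert S/T/L placeholders into real space, tab, and linefeed characters."""
    return placeholders.translate(str.maketrans({"S": " ", "T": "\t", "L": "\n"}))

hi_program = print_string("Hi")
```

Here `push(72)` returns `"SSSTSSTSSSL"`, since 72 is `1001000` in binary, and `to_source(hi_program)` contains only the three significant whitespace characters.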
##### Shakespeare \(E02 *Echo line*\)\.
Shakespeare programs read as theatrical plays\. Variables are characters \(`Romeo`, `Juliet`\); arithmetic is expressed in dramatic monologue; control flow is in stage directions\. The excerpt below reads a character and outputs it\.
```
[A program preamble names the dramatis personae.]
Act I: Echoing.
Scene I: A reading.
[Enter Romeo and Juliet]
Romeo: Open your mind!
Juliet: You are as good as nothing.
Romeo: Speak your mind!
Juliet: Is your father a coward? If so, let us return to scene I.
```
Shakespeare’s surface form is rule\-bound English text, not a low\-level target\. Direct authoring remains tractable for the strong agents, which is why metaprogramming does not emerge as the dominant strategy on this language\.
## Appendix G Distillation prompts and reference library
This appendix reproduces the materials used in the two distillation conditions of Section [3\.4](https://arxiv.org/html/2606.10933#S3.SS4)\. The same materials live verbatim in the supplementary archive at `supplementary_code/experiments/03_distillation/`: the condition prompts under `prompts/PROMPT_BRAINFUCK.md` and `prompts/PROMPT_BEFUNGE98.md`; the strategy library scaffolds under `reference_lib/<language>/` \(library\-only, no per\-problem generators\); and the 12 ready\-to\-run cells under `text/<model>/<language>/` and `library/<model>/<language>/`\.
### G\.1 Text\-only strategy preamble \(condition \(i\)\)
The system prompt provided to weaker models in the text\-only condition consists of high\-level strategies condensed from Opus 4\.6’s own Brainfuck session transcript\. We read the natural\-language reasoning Opus produced during its successful run and rewrote it as a short prose preamble for the weaker model\. No code, no solved programs, and no per\-problem ground truth are included\. The substantive content of the preamble is reproduced verbatim below; boilerplate harness commands and integrity rules are summarized afterwards\.
```
## Distilled Frontier Strategy Bundle
### 1. Treat Python as a compiler, not a scratchpad.
For non-trivial problems, write Python that generates Brainfuck.
The Python script is your compiler: it should manage cell
allocation, pointer movement, copy/move/clear primitives, branching
patterns, input parsing, and output formatting.
Do not merely concatenate ad-hoc Brainfuck strings. Build a small
local generator library with:
- alloc(name, count) for stable cell layouts.
- move_to(cell), emit(code), clear(cell), set_const(cell, value).
- Non-destructive copy(src, dst, tmp).
- Destructive move(src, dst).
- Boolean helpers: is_zero, is_nonzero, eq_const.
- A safe if_flag(flag, body) pattern that consumes/clears flags.
- String output helpers using one temporary cell and char deltas.
Operational rule: if the problem is more than a tiny fixed-string
transform, default to writing or adapting a Python generator first.
### 2. Build reusable numeric I/O immediately.
Brainfuck failures usually come from ASCII numeric I/O, not the
high-level algorithm. When a problem says "integer", assume
hidden tests may include multi-digit values, zero, negatives,
optional trailing newline or EOF, whitespace separators, and
outputs larger than 255.
Avoid raw byte arithmetic. Use decimal digit arrays / BCD:
- Store each number as sign cell + array of decimal digits.
- Parse input one character at a time.
- Shift digit arrays when appending new digits.
- Add magnitudes digit-by-digit with carry.
- Subtract magnitudes with borrow using the +10 trick to avoid
unsigned underflow.
- Compare signed values by sign first, magnitude second.
- Print with leading-zero suppression and never print -0.
### 3. Keep reusable arithmetic primitives.
Strong runs reused the same components across many problems.
Build and test: signed decimal parse, signed add/subtract,
magnitude compare, min/max selection, decimal output, divmod by 10
for printing and carry propagation, division by 2 via digit scan
for average/halving tasks, multiplication using grade-school digit
loops when raw bytes are unsafe.
### 4. Use simple algorithms that are target-language-friendly.
Choose algorithms that are easiest to compile to Brainfuck, not
necessarily the most elegant in Python. Prefer streaming
transforms for character problems, fixed stable cell layouts over
dynamic pointer tricks, decimal arrays for arbitrary integer work,
bounded arrays and explicit loops over self-modifying pointer
layouts.
### 5. Verification discipline.
Before the one hidden submission, run many local tests: empty
input, single-character/single-digit, multi-digit, large values,
negative/positive mixtures, zero and cancellation cases, inputs
with and without trailing newline, boundary strings.
If local tests fail, fix the generator or library first. Do not
patch random Brainfuck by hand unless the program is tiny.
## Required Startup Ritual Per Problem
1. Decide whether this is a tiny direct Brainfuck task or a
generator task.
2. For generator tasks, start from a local scaffold.
3. Write down the intended cell layout before adding algorithm
logic.
4. For numeric tasks, choose decimal/BCD by default.
5. Run a diverse local test set before the single hidden
submission.
6. If local tests expose a bug, fix the generator/library,
regenerate, and test again before submitting.
```
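To make the first strategy item concrete, the following is a minimal Python sketch of a generator library with the primitives the preamble names (`alloc`, `move_to`, `emit`, `clear`, `set_const`, `copy`, `move`). It is not the shipped `meta_bflib.py` and not any agent's code: the class name `BFGen`, the helper `run_bf`, the 8-bit wrapping cells, and the EOF-reads-as-zero convention are all assumptions made for illustration.

```python
class BFGen:
    """Tiny Brainfuck generator that tracks the data pointer at generation time."""

    def __init__(self):
        self.code, self.cells = [], {}
        self.next_free = 0
        self.ptr = 0

    def alloc(self, name, count=1):
        """Reserve `count` consecutive cells under a stable name."""
        self.cells[name] = self.next_free
        self.next_free += count

    def emit(self, text):
        self.code.append(text)

    def move_to(self, name):
        """Emit pointer moves from the tracked position to the named cell."""
        delta = self.cells[name] - self.ptr
        self.emit(">" * delta if delta >= 0 else "<" * -delta)
        self.ptr = self.cells[name]

    def clear(self, name):
        self.move_to(name)
        self.emit("[-]")

    def set_const(self, name, value):
        self.clear(name)
        self.emit("+" * value)

    def move(self, src, dst):
        """Destructive: dst += src, leaving src at zero."""
        self.move_to(src)
        self.emit("[-")
        self.move_to(dst)
        self.emit("+")
        self.move_to(src)  # loop body ends where it started, so tracking stays valid
        self.emit("]")

    def copy(self, src, dst, tmp):
        """Non-destructive: dst = src, using tmp as scratch."""
        self.clear(dst)
        self.clear(tmp)
        self.move_to(src)
        self.emit("[-")
        self.move_to(dst)
        self.emit("+")
        self.move_to(tmp)
        self.emit("+")
        self.move_to(src)
        self.emit("]")
        self.move(tmp, src)  # restore src from the scratch cell

    def out(self, name):
        self.move_to(name)
        self.emit(".")

    def build(self):
        return "".join(self.code)


def run_bf(code, stdin=""):
    """Reference interpreter for local testing (8-bit cells, EOF reads as 0)."""
    stack, match = [], {}
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            match[i], match[j] = j, i
    tape, ptr, pc, out, inp = [0] * 30000, 0, 0, [], list(stdin)
    while pc < len(code):
        c = code[pc]
        if c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == ",":
            tape[ptr] = ord(inp.pop(0)) if inp else 0
        elif (c == "[" and tape[ptr] == 0) or (c == "]" and tape[ptr] != 0):
            pc = match[pc]
        pc += 1
    return "".join(out)


g = BFGen()
for cell in ("a", "b", "tmp"):
    g.alloc(cell)
g.set_const("a", 72)     # ASCII 'H'
g.copy("a", "b", "tmp")  # b receives 72 and a is preserved
g.out("a")
g.out("b")
program = g.build()
```

Running `run_bf(program)` on the generated program prints `HH`, which confirms that `copy` left the source cell intact. Keeping every loop body pointer-balanced is the design choice that lets the generator track the pointer without simulating the program.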
The remaining sections of the preamble repeat the harness command list \(`init`, `fetch`, `run`, `submit`, `status`, `export`\) and integrity rules \(no parent or sibling directories, no `harness.py` or `harness_state.json` inspection, no web search, no reading of prior generated artifacts\) that already match the harness defaults\. The per\-language preambles \(Befunge\-98, Whitespace, Shakespeare\) follow the same schema with substrate\-specific strategies in place of BCD arithmetic\.
### G\.2 Reference library \(condition \(ii\)\)
The reference\-library condition additionally provides a small, *strategy\-only* library distilled from the strongest agents’ host\-language traces \(Claude Opus 4\.6 and GPT\-5\.4 xhigh\)\. The library is intentionally generic: it contains reusable code\-generation primitives and a compact strategy notes document, but *no per\-problem generators and no solved target\-language artifacts*\. No solved `.bf` or `.b98` programs, no hidden\-test inputs, no expected outputs, and no problem\-specific solution scaffolds are copied across\. The exact files shipped are those under `supplementary_code/experiments/03_distillation/reference_lib/` in the supplementary archive, and the experimental contrast is whether the weaker agent can *build* a working generator on top of this scaffolding rather than copy a finished solution\.
The system prompt opens with the same strategy preamble as condition \(i\), then adds a short “Read First” block that points the agent at the local `reference_lib/` directory and at its `README.md` \(and, for Brainfuck, also at `opus_learning_notes.md`\)\. The two per\-language directories contain:
##### reference\_lib/brainfuck/\.
- `meta_bflib.py`: a generic Brainfuck code\-generation helper library: a `BF` builder class, BCD\-arithmetic helpers, cell\-allocator pattern, and decimal\-print primitives\.
- `gpt5_xhigh_bf_codegen.py`: a stable cell\-layout / builder pattern with a `BFBuilder` dataclass and `alloc`/`clear`/`move_to` primitives, authored by GPT\-5\.4 xhigh in its own session\.
- `opus_learning_notes.md`: a notes document describing state\-tracking discipline, common pitfalls, and cell\-layout discipline, written by Opus during its own session\.
##### reference\_lib/befunge98/\.
- `opus_simulate.py`: a minimal Befunge\-98 simulator \(pointer machine over a 2D grid, stack operations, string mode, basic I/O\), used to verify candidate Befunge programs locally before submitting\.
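For readers unfamiliar with what such a simulator involves, here is a minimal Python sketch of a pointer machine over a 2D grid. It is not the shipped simulator: it covers only a small core subset of instructions (no `p`/`g` self-modification, no fingerprints, no Funge\-98 space collapsing in string mode), assumes non-negative operands for `/` and `%`, and the names `run_befunge` and `MAX_OF_TWO` are hypothetical. On end of input, `&` and `~` reverse direction, as the Funge\-98 specification prescribes.

```python
import re

INT_PATTERN = re.compile(r"\s*(-?\d+)")
DIRECTIONS = {">": (1, 0), "<": (-1, 0), "^": (0, -1), "v": (0, 1)}

def run_befunge(source, stdin="", max_steps=100_000):
    """Run a small Befunge subset and return everything it printed."""
    lines = source.split("\n")
    width = max(len(line) for line in lines)
    grid = [line.ljust(width) for line in lines]
    height = len(grid)
    x, y, dx, dy = 0, 0, 1, 0
    stack, out, pos, string_mode = [], [], 0, False

    def pop():
        return stack.pop() if stack else 0  # an empty stack pops as zero

    for _ in range(max_steps):
        c = grid[y][x]
        if string_mode:
            if c == '"':
                string_mode = False
            else:
                stack.append(ord(c))
        elif c == '"':
            string_mode = True
        elif c in "0123456789":
            stack.append(int(c))
        elif c in "+-*/%`":
            b, a = pop(), pop()
            if c == "+":
                stack.append(a + b)
            elif c == "-":
                stack.append(a - b)
            elif c == "*":
                stack.append(a * b)
            elif c == "/":
                stack.append(a // b if b else 0)
            elif c == "%":
                stack.append(a % b if b else 0)
            else:
                stack.append(1 if a > b else 0)  # backtick: greater-than
        elif c == "!":
            stack.append(0 if pop() else 1)
        elif c in DIRECTIONS:
            dx, dy = DIRECTIONS[c]
        elif c == "_":
            dx, dy = (-1, 0) if pop() else (1, 0)
        elif c == "|":
            dx, dy = (0, -1) if pop() else (0, 1)
        elif c == ":":
            v = pop()
            stack.extend([v, v])
        elif c == "\\":
            b, a = pop(), pop()
            stack.extend([b, a])
        elif c == "$":
            pop()
        elif c == ".":
            out.append(f"{pop()} ")
        elif c == ",":
            out.append(chr(pop()))
        elif c == "#":
            x, y = (x + dx) % width, (y + dy) % height  # skip the next cell
        elif c == "&":
            m = INT_PATTERN.match(stdin, pos)
            if m:
                stack.append(int(m.group(1)))
                pos = m.end()
            else:
                dx, dy = -dx, -dy
        elif c == "~":
            if pos < len(stdin):
                stack.append(ord(stdin[pos]))
                pos += 1
            else:
                dx, dy = -dx, -dy
        elif c == "@":
            return "".join(out)
        x, y = (x + dx) % width, (y + dy) % height
    raise RuntimeError("step limit exceeded")

# Hypothetical test program: prints the larger of two integers.
# It computes d = a - b, then branches on d > 0 (up: print a, down: print a - d = b).
MAX_OF_TWO = "&:&-:0`|\n       >-.@\n       >$.@"
```

With this sketch, `run_befunge(MAX_OF_TWO, "3 9")` returns `"9 "` (the `.` instruction prints the integer followed by a space). The step limit is a practical guard, since a mistaken direction change easily produces an infinite loop.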
The agent is told these files are intentionally provided as local reference scaffolds and may be copied, renamed, edited, or extended\. The integrity rules forbid reading parent or sibling experiment folders, prior solution artifacts, transcripts, private tests, or any other workspace beyond the local `reference_lib/` directory and the harness state file\. A representative excerpt of the GPT\-5\.4 xhigh Brainfuck E04 generator \(used in the cross\-language Brainfuck comparison of Section [3\.3](https://arxiv.org/html/2606.10933#S3.SS3), not part of the distillation library\) is reproduced in Appendix [C\.6](https://arxiv.org/html/2606.10933#A3.SS6)\.
### G\.3 Three\-tier visualization
Figure [7](https://arxiv.org/html/2606.10933#A7.F7) groups the same per\-cell results as Table [3](https://arxiv.org/html/2606.10933#S3.T3) into a three\-tier bar layout \(direct $\to$ distilled strategies $\to$ distilled strategies \+ code\)\. Figure [8](https://arxiv.org/html/2606.10933#A7.F8) re\-renders the same cells as trajectories, which makes the per\-model jump from text to code visually explicit \(Sonnet 4\.6 $12\!\to\!64$ on Brainfuck, GPT\-5\.4 mini $11\!\to\!64$ on Befunge\-98, Haiku 4\.5 essentially flat\)\.
Figure 7: Three\-tier distillation results\. Brainfuck \(left\) and Befunge\-98 \(right\) problems solved out of 80 under three conditions: direct authoring \(no distillation\), distilled strategies as a system\-prompt preamble, and distilled strategies plus the strategy\-only reference library \(Appendix [G\.2](https://arxiv.org/html/2606.10933#A7.SS2)\)\. The jump from the second to the third bar is the size of the code\-vs\-description effect\.

Figure 8: Per\-model distillation trajectory\. Same cells as Figure [7](https://arxiv.org/html/2606.10933#A7.F7), rendered as per\-model trajectories\. The horizontal jump from the strategies marker to the code marker is the lift from sharing runnable primitives \(large for Sonnet 4\.6 and GPT\-5\.4 mini on both languages; near\-zero for Haiku 4\.5\)\.
## Appendix H Extended related work
Section [5](https://arxiv.org/html/2606.10933#S5) of the body covers the most directly adjacent work; this appendix expands the discussion to cover neighboring literatures that informed our setup, design choices, and framing but that were too tangential for the body’s space budget\.
##### Agentic software\-engineering benchmarks\.
Beyond SWE\-Bench Verified \[Jimenez et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib10)\], a recent line of work studies LLM agents that act on full software repositories rather than isolated functions: SWE\-agent \[Yang et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib15)\] introduces an agent–computer interface specifically designed for repository\-scale software work; Agentless \[Xia et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib16)\] demonstrates that careful prompting without an explicit agent loop can be competitive on the same SWE\-Bench targets; AutoCodeRover \[Zhang et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib17)\] formalizes autonomous program improvement over real GitHub issues\. We inherit executable hidden\-test grading from this line but vary the controlled axis: instead of repository complexity, we hold the repository \(and the per\-problem task\) simple and vary the familiarity of the target language\.
##### Multimodal, web, and desktop agents\.
Agentic evaluation also extends beyond software engineering into general computer use: OSWorld \[Xie et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib14)\] grades agents on full desktop tasks; WorkArena \[Drouin et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib20)\] measures agents on realistic enterprise web flows; VisualWebArena \[Koh et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib21)\] and WebVoyager \[He et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib22)\] evaluate visual web agents\. These environments are realistic but mix many factors \(navigation, multi\-application coordination, image grounding, domain familiarity\); our evaluation deliberately keeps the non\-language axes narrow so that variance can be attributed to target\-language adaptation specifically\.
##### Research\-engineering and ML\-engineering agents\.
PaperBench \[Starace et al\., [2025](https://arxiv.org/html/2606.10933#bib.bib12)\] grades agents on replicating ML research from papers, and MLE\-bench \[Chan et al\., [2025](https://arxiv.org/html/2606.10933#bib.bib13)\] grades agents on Kaggle\-style ML\-engineering tasks\. Both are agentic and execution\-graded but test long\-horizon, multi\-skill capability rather than the narrow question of within\-session adaptation to an unfamiliar programming substrate\.
##### Tools, feedback, and intermediate computation\.
A large prior literature shows that explicit tool use, intermediate reasoning, and feedback can lift LLM performance on language and coding tasks: ReAct \[Yao et al\., [2023b](https://arxiv.org/html/2606.10933#bib.bib23)\] interleaves reasoning and acting in a single loop; Toolformer \[Schick et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib24)\] teaches models to call external tools; Reflexion \[Shinn et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib25)\] adds verbal reinforcement on prior failures; Self\-Refine \[Madaan et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib26)\] adds iterative self\-feedback; Tree of Thoughts \[Yao et al\., [2023a](https://arxiv.org/html/2606.10933#bib.bib27)\] adds branched search; PAL \[Gao et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib28)\] and Program of Thoughts \[Chen et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib29)\] use code as the reasoning substrate; self\-debug \[Chen et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib30)\] adds runtime feedback on generated code; and scratchpads \[Nye et al\., [2021](https://arxiv.org/html/2606.10933#bib.bib31)\] externalize intermediate state\. Our metaprogramming\-emergence finding \(Section [3\.2](https://arxiv.org/html/2606.10933#S3.SS2)\) sits adjacent to this line: the strongest agents *spontaneously* use a familiar host language as a structured scratchpad for the unfamiliar target, but our intervention is on the tool\-use *strategy* \(remove the scratchpad, see what breaks\), not on the prompting template itself\.
##### Multilingual code generation and translation\.
TransCoder \[Roziere et al\., [2020](https://arxiv.org/html/2606.10933#bib.bib41)\] and PLBART \[Ahmad et al\., [2021](https://arxiv.org/html/2606.10933#bib.bib42)\] study unsupervised translation between mainstream languages; CodeT5 \[Wang et al\., [2021](https://arxiv.org/html/2606.10933#bib.bib43)\], InCoder \[Fried et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib44)\], CodeGen \[Nijkamp et al\., [2022](https://arxiv.org/html/2606.10933#bib.bib45)\], StarCoder \[Li et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib46)\], and Code Llama \[Roziere et al\., [2023](https://arxiv.org/html/2606.10933#bib.bib47)\] are open code models that defined much of the multilingual coding evaluation surface; IRCoder \[Paul et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib39)\] adds intermediate\-representation training for multilingual robustness; and Twist et al\. \[[2025](https://arxiv.org/html/2606.10933#bib.bib40)\] systematically document LLM bias toward Python across libraries and languages\. Our cross\-host generator experiment \(Python / JavaScript / Rust\) is a small empirical check on whether the metaprogramming benefit is host\-language\-bound or substrate\-bound, and lands closer to the multilingual line than to the model\-architecture line\.
##### Class\-level and pragmatic code\-generation benchmarks\.
ClassEval \[Du et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib54)\] and CoderEval \[Yu et al\., [2024](https://arxiv.org/html/2606.10933#bib.bib55)\] extend execution\-based code evaluation from single\-function targets \(HumanEval, MBPP, APPS\) to richer real\-code structures\. We do not use either directly because both target mainstream languages where the per\-language ecosystem prior is strong; the controlled variable in our setting is exactly the absence of that prior\.
##### Long\-horizon reasoning under static prompting\.
LongCoT \[Motwani et al\., [2026](https://arxiv.org/html/2606.10933#bib.bib56)\] probes long chain\-of\-thought reasoning capability under fixed prompts\. Our setting is closer to long\-horizon *tool\-use* than long\-horizon reasoning per se: the agent’s external state \(workspace, generator file, local test runs\) carries most of the long\-horizon information, not its own monologue\.
##### Benchmark validity, robustness, and contamination\.
A line of work warns that high benchmark scores can be brittle: CheckList \[Ribeiro et al\., [2020](https://arxiv.org/html/2606.10933#bib.bib32)\] introduces behavioral testing for NLP models; Adversarial NLI \[Nie et al\., [2020](https://arxiv.org/html/2606.10933#bib.bib33)\] and CrossFit \[Ye et al\., [2021](https://arxiv.org/html/2606.10933#bib.bib34)\] stress\-test against adversarial or cross\-task generalization; Bowman and Dahl \[[2021](https://arxiv.org/html/2606.10933#bib.bib35)\] discuss what it would take to fix NLU benchmarking\. On contamination specifically, Oren et al\. \[[2024](https://arxiv.org/html/2606.10933#bib.bib36)\] provide black\-box tests for training\-set contamination, Deng et al\. \[[2024](https://arxiv.org/html/2606.10933#bib.bib37)\] investigate contamination in modern LLM benchmarks, and Xu et al\. \[[2024](https://arxiv.org/html/2606.10933#bib.bib38)\] survey the broader landscape\. We do not claim formal distributional novelty for the four target esolangs; Appendix [D](https://arxiv.org/html/2606.10933#A4) reports public\-code prevalence and $n$\-gram overlap analyses motivated directly by this line\.
##### Mainstream\-benchmark sourcing checks\.
For transparency on the mainstream\-benchmark rows of Table [2](https://arxiv.org/html/2606.10933#S3.T2), the Vals\.ai SWE\-Bench Verified leaderboard \[Vals AI, [2026b](https://arxiv.org/html/2606.10933#bib.bib64)\], LiveCodeBench v6 leaderboard \[Vals AI, [2026a](https://arxiv.org/html/2606.10933#bib.bib65)\], and Terminal\-Bench 2\.0 leaderboard \[Vals AI, [2026c](https://arxiv.org/html/2606.10933#bib.bib67)\] were used as third\-party verification of vendor\-published numbers where applicable; per\-agent attribution is in Table [6](https://arxiv.org/html/2606.10933#A2.T6)\.
##### Cognitive\-science framing\.
The view that agents reorganize hard problems by building external structure, rather than solving them entirely “in the head,” has a long history in cognitive science\. Hutchins \[[1995](https://arxiv.org/html/2606.10933#bib.bib61)\] introduced *distributed cognition* and Clark and Chalmers \[[1998](https://arxiv.org/html/2606.10933#bib.bib62)\] the *extended mind* hypothesis\. We do not attempt to evaluate these as theories of LLM cognition\. We use the framing only as a label for the empirical pattern we observe: in our setting, the strongest agents externalize fragile target\-language state into named, reusable host\-language primitives, and the resulting “scaffolding” is itself the locus of capability differences between agents\.
## Appendix I Future work
Constructing a new programming language designed to be genuinely out\-of\-distribution, such as a niche or constructed substrate, would let us test the gap under fully controlled conditions without the contamination concerns inherited by public esolangs\. Extending the same methodology to non\-code environments where tools and external state matter \(scientific workflows, data analysis, theorem proving, interactive web tasks\) would test whether similar hidden capability gulfs appear once surface fluency is removed\. Benchmark families that vary surface unfamiliarity, semantic complexity, tool access, and program length independently would let us attribute the gap to specific factors rather than treat it as monolithic\. Agent\-level diagnostics that measure how local feedback converts into solves, and metrics for adaptive tool use under unfamiliar interfaces, would make agentic adaptation measurable in its own right\. A natural follow\-up to our reference\-library finding is whether agents can persist accumulated structure as local notes or textbooks and build on that knowledge across runs, turning a one\-shot library into long\-horizon scaffolding\. Together these open a research program on out\-of\-distribution agentic adaptation, focused on how agents construct working interfaces to unfamiliar environments using tools, external structure, and intermediate representations\.
## NeurIPS paper checklist
This checklist reflects the current draft\. The anonymous supplementary archive \(harness, interpreters, prompts, all 48 experiment cells, smoke and rigorous end\-to\-end tests\) is included with this submission and is referred to throughout this checklist\.
1. 1\.Claims\.Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: Yes\. Justification: The draft states an empirical claim about agent\-system adaptation under unfamiliar executable interfaces and explicitly avoids a formal distributional\-novelty claim\.
2. 2\.Limitations\.Does the paper discuss limitations of the work? Answer: Yes\. Justification: Section[4](https://arxiv.org/html/2606.10933#S4)discusses training\-data uncertainty, the mixture of unfamiliarity and difficulty, cross\-protocol cells, agent\-wrapper effects, and the artificial nature of esolangs\.
3. 3\.Theory assumptions and proofs\.For each theoretical result, does the paper provide assumptions and proofs? Answer: N/A\. Justification: The paper is empirical and does not claim new theoretical results\.
4. 4\.Experimental result reproducibility\.Does the paper disclose information needed to reproduce the main experimental results? Answer: Yes\. Justification: The methodology and appendix describe the sequential harness, problem order, model wrappers, budget regimes, hidden\-submission rule, and solved\-task scoring rule\. The accompanying anonymous supplementary archive ships the harness, the four esoteric\-language interpreters, the per\-language language\-reference prompts \(Appendix[A\.12](https://arxiv.org/html/2606.10933#A1.SS12)\), the four experiment configurations \(main grid, metaprogramming ablation, distillation, cross\-language transfer\) wired as 48 ready\-to\-run cells, and a smoke test plus rigorous end\-to\-end test that exercise every harness command path without requiring any provider API key\. The harness emits per\-cell JSON exports \(python harness\.py export\) that contain the full per\-problem fetch/run/submit/skip event log used to compute the reported numbers\.
5. 5\.Open access to data and code\.Does the paper provide open access to data and code with reproduction instructions? Answer: Yes\. Justification: The dataset \(EsoLang\-Bench\) is a previously released third\-party artifact, publicly hosted at the canonical URL referenced in Section[2](https://arxiv.org/html/2606.10933#S2)\. The harness, interpreters, prompts, experiment scaffolds, and reproducibility scripts are released under the MIT license in the anonymous supplementary archive accompanying this submission;README\.mdandHOWTO\_RUN\.mddocument the four reviewer paths from a 10\-second smoke test \(no key\) through the rigorous end\-to\-end test \(no key\) to running an agent against any of the 48 cells \(provider key required\)\. The archive uses only the Python standard library plus the third\-partyshakespearelangpackage for the Shakespeare interpreter\.
6. 6\.Experimental setting/details\.Does the paper specify the experimental settings needed to understand the results? Answer: Yes\. Justification: Section[2](https://arxiv.org/html/2606.10933#S2)of the body and Appendix Table[4](https://arxiv.org/html/2606.10933#A1.T4)together specify the primary protocol’s task substrate, problem order, hidden\-test rule, hidden\-submission cap, local interpreter call regime, per\-turn output token budget, workspace isolation, sampling settings, aggregation across runs \(Appendix[A\.7](https://arxiv.org/html/2606.10933#A1.SS7)\), and uncertainty reporting\. Per\-condition deviations from the primary protocol \(no\-metaprogramming variant, cross\-language generator transfer, distillation text\-only and library conditions, interpreter\-budget ablation, token\-efficiency ablation, controlled\-access protocol\) are enumerated in Appendix[A\.12](https://arxiv.org/html/2606.10933#A1.SS12)alongside the verbatim per\-language reference prompts\. Per\-cell raw counts for every reported number are reproducible from theexport\.jsonfiles emitted bypython harness\.py exportin each of the 48 cells shipped in the supplementary archive; the rigorous end\-to\-end test \(scripts/rigorous\_test\.sh\) exercises this export path without requiring any provider API key\.
7. 7\.Experiment statistical significance\.Does the paper report error bars or appropriate uncertainty information? Answer: Yes\. Justification: All four esoteric\-language columns in Table[1](https://arxiv.org/html/2606.10933#S2.T1)\(Brainfuck, Befunge\-98, Whitespace, Shakespeare\) report cells in percentage\-solved form with±\\pm95% binomial Wilson half\-widths over 80 problems per language, as stated in the table caption\. The Reporting paragraph in Section[2](https://arxiv.org/html/2606.10933#S2)states that headline cells report the Session 1 solved count \(with two further independent sessions per cell as sanity checks, tabulated in Appendix[B\.7](https://arxiv.org/html/2606.10933#A2.SS7)\), and that ablation cells are means of two independent sessions\. Full per\-cell asymmetric Wilson95%95\\%confidence intervals are tabulated in Appendix[B\.6](https://arxiv.org/html/2606.10933#A2.SS6); the session\-aggregation protocol is in Appendices[A\.6](https://arxiv.org/html/2606.10933#A1.SS6)and[A\.7](https://arxiv.org/html/2606.10933#A1.SS7)\. The same Wilson interval treatment carries through to the cross\-harness control \(Appendix[B\.4](https://arxiv.org/html/2606.10933#A2.SS4)\), the metaprogramming ablation \(Section[3\.3](https://arxiv.org/html/2606.10933#S3.SS3)\), the distillation cells \(Section[3\.4](https://arxiv.org/html/2606.10933#S3.SS4)\), and the cross\-language transfer cells reported inline in Section[3\.3](https://arxiv.org/html/2606.10933#S3.SS3); none of these report a point estimate without an accompanying interval\.
8. Experiments compute resources\. Does the paper provide compute\-resource information? Answer: Yes\. Justification: Appendix Table [4](https://arxiv.org/html/2606.10933#A1.T4) specifies the per\-turn output token budget, the local interpreter call regime, the number of hidden submissions per problem, and the sampling settings \(provider / wrapper defaults\)\. The token\-efficiency analysis in Section [3\.5](https://arxiv.org/html/2606.10933#S3.SS5) reports cumulative API output tokens per cell on the easy tier\. Local wall\-clock and operator hardware are not reported because all evaluated agents run as managed APIs on vendor infrastructure rather than on local accelerators, so compute is fully described by the per\-cell API output\-token budget plus the interpreter\-call regime above\.
9. Code of ethics\. Does the research conform to the NeurIPS Code of Ethics? Answer: Yes\. Justification: The work evaluates existing coding agents on programming tasks and does not involve human\-subject experiments, private personal data, or model training\.
10. Broader impacts\. Does the paper discuss positive and negative societal impacts? Answer: Yes\. Justification: The evaluation reported here is intended to improve the visibility of capability differences among coding agents in long\-tail settings relevant to real deployments\. Positive impact: clearer evaluations in low\-ecosystem programming environments help practitioners pick and budget agents for internal DSL work, legacy integration, and closed\-source platform development, where current leaderboards are poorly predictive\. Negative impact: the same evaluations can be misused as ranking claims that extend beyond the measured regime; we mitigate this by reporting the primary protocol and the controlled\-access protocol separately and flagging incomplete cells explicitly\. No new model, weights, or dataset capable of offensive use is released in this work, and all evaluated systems are already publicly available\.
11. Safeguards\. Does the paper describe safeguards for responsible release of high\-risk assets? Answer: N/A\. Justification: The paper evaluates existing models and benchmark harnesses rather than releasing a model, exploit, or high\-risk dataset\.
12. Licenses for existing assets\. Are existing assets credited and licenses respected? Answer: Yes\. Justification: The EsoLang\-Bench dataset \[Sharma and Chopra, [2026](https://arxiv.org/html/2606.10933#bib.bib48)\] is a previously released third\-party artifact, cited at its canonical public URL with the dataset paper referenced; we consume it without modification under its public release terms\. Mainstream\-benchmark scores re\-used from public vendor reports \(SWE\-Bench Verified, Terminal\-Bench 2\.0, LiveCodeBench v6\) are sourced per agent in Appendix Table [6](https://arxiv.org/html/2606.10933#A2.T6), with each row attributing the cited Anthropic system card / OpenAI system card / Moonshot release / Vals\.ai third\-party leaderboard \[Anthropic, [2026](https://arxiv.org/html/2606.10933#bib.bib57), OpenAI, [2026a](https://arxiv.org/html/2606.10933#bib.bib58), [b](https://arxiv.org/html/2606.10933#bib.bib59), Moonshot AI, [2026](https://arxiv.org/html/2606.10933#bib.bib60), Vals AI, [2026b](https://arxiv.org/html/2606.10933#bib.bib64), [a](https://arxiv.org/html/2606.10933#bib.bib65), [c](https://arxiv.org/html/2606.10933#bib.bib67)\] as the source\. This submission uses the official NeurIPS 2026 style file without modification\. Code shipped in the supplementary archive \(harness, four esoteric\-language interpreters, prompts, experiment scaffolds, reproducibility scripts, distillation reference library\) is released under the MIT license; the Shakespeare interpreter wraps the third\-party `shakespearelang` package, used unmodified under its own license\.
13. New assets\. Are new assets introduced in the paper documented? Answer: Yes\. Justification: The new assets introduced by this paper are \(i\) the evaluation harness wrapping EsoLang\-Bench \(released in the supplementary archive under `benchmark_harness/` with `README.md` and `HOWTO_RUN.md` as entry points\); \(ii\) the four per\-language language\-reference prompts \(`prompts/<lang>/CLAUDE.md` and `AGENTS.md`, identical content\); \(iii\) the four experiment configurations as 48 ready\-to\-run cells under `experiments/`; and \(iv\) the reference library scaffolds for the distillation condition under `experiments/03_distillation/reference_lib/`\. All are documented in their respective `README.md` files\. Trace excerpts reproduced in Appendix [C](https://arxiv.org/html/2606.10933#A3) are selected by the rule stated in Appendix [C](https://arxiv.org/html/2606.10933#A3) \(single recorded session, four pre\-specified phenomena\)\.
14. Crowdsourcing and human subjects\. Does the paper include details for crowdsourcing or human\-subject work? Answer: N/A\. Justification: The work does not involve crowdsourcing or human\-subject experiments\.
15. IRB approvals\. Does the paper describe IRB approvals or equivalent review for human\-subject work? Answer: N/A\. Justification: The work does not involve human\-subject experiments\.
16. Declaration of LLM usage\. Does the paper describe LLM usage when it is part of the core method? Answer: Yes\. Justification: The evaluated systems are LLM\-based coding agents; the methodology section describes model snapshots, agent wrappers, and harness interaction\.
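
As an illustration of the uncertainty reporting described in checklist item 7, the following is a minimal sketch of a 95% binomial Wilson score interval for a solved count out of 80 problems\. It is not taken from the released harness: the function names `wilson_interval` and `wilson_half_width` and the example count of 40 solved problems are hypothetical, chosen only to show the calculation\.

```python
import math


def wilson_interval(solved: int, total: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    `z` defaults to the two-sided 95% normal quantile.
    Hypothetical helper for illustration; not part of the paper's harness.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= solved <= total:
        raise ValueError("solved must lie in [0, total]")

    p_hat = solved / total
    z2 = z * z
    denom = 1.0 + z2 / total
    # The Wilson centre is pulled toward 0.5 relative to the raw proportion.
    centre = (p_hat + z2 / (2.0 * total)) / denom
    margin = z * math.sqrt(p_hat * (1.0 - p_hat) / total + z2 / (4.0 * total * total)) / denom
    return max(0.0, centre - margin), min(1.0, centre + margin)


def wilson_half_width(solved: int, total: int) -> float:
    """Half of the Wilson interval width, as a proportion in [0, 1]."""
    low, high = wilson_interval(solved, total)
    return (high - low) / 2.0


if __name__ == "__main__":
    # Hypothetical cell: 40 of 80 problems solved (not a number from the paper).
    low, high = wilson_interval(40, 80)
    print(f"solved = {40 / 80:.1%}, 95% Wilson interval = [{low:.1%}, {high:.1%}]")
    print(f"half-width = {wilson_half_width(40, 80):.1%}")
```

Because the Wilson interval is centred on a shrunken estimate rather than on the raw proportion, it is asymmetric around the reported percentage whenever the solved rate is far from 50%, which is consistent with the paper tabulating full asymmetric intervals separately from the half\-widths shown in the main table\.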