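@Xudong07452910: This may be the last paper humans write for AI to read. I recently came across a paper co-authored by 37 researchers from Stanford, CMU, Michigan, and other institutions: "The Last Human-Written Paper". Its core claim is blunt: the paper format we have used for centuries, in the AI era, may…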

X AI KOLs Timeline 2026/06/09 03:20 Paper

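Summary

A paper co-authored by 37 researchers from Stanford, CMU, Michigan, and other institutions proposes ARA (Agent-Native Research Artifact) as a replacement for the traditional paper format. It aims to remove the Storytelling Tax and the Engineering Tax so that AI agents can understand, reproduce, and extend research.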

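This may be the last paper humans write for AI to read.

I recently came across a paper co-authored by 37 researchers from Stanford, CMU, Michigan, and other institutions: "The Last Human-Written Paper".

Its core claim is blunt: the paper format we have used for centuries may already be obsolete in the AI era.

The authors name two "hidden taxes" that we have long overlooked.

The first is the Storytelling Tax. To tell a clean story, we delete the failed experiments, the dead ends, and the overturned hypotheses. What AI reads is the "walkthrough", not the genuinely valuable record of pitfalls.

The second is the Engineering Tax. The implementation details in a paper are usually enough to convince reviewers, but not enough for an agent to reproduce the work directly. Many key tricks still live in the authors' heads, in code comments, and in Slack logs.

So the authors propose ARA, which turns the paper into a "research package" that agents can read and execute: it gives not only the conclusions but also how the ideas came about, how to run the code, where the chain of evidence is, and which paths led nowhere.

What I find most interesting is that the paper is not about how AI can help people write papers. It asks instead: once AI is also a reader and executor of papers, should papers still look the way they do today?

The core of future research output may no longer be how much it "reads like a paper", but whether AI can understand, reproduce, trace, and extend it.

Humans have written papers for centuries. Next, we may start writing research packages for agents to execute.

https://arxiv.org/pdf/2604.24658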

Cached: 2026/06/10 09:48

Source: https://arxiv.org/html/2604.24658

The Last Human-Written Paper: Agent-Native Research Artifacts

Jiachen Liu1,*, Jiaxin Pei2, Jintao Huang3, Chenglei Si2, Ao Qu4, Xiangru Tang5, Runyu Lu1, Lichang Chen6, Xiaoyan Bai7, Haizhong Zheng8, Carl Chen9, Zhiyang Chen10, Haojie Ye11, Yujuan Fu12, Zexue He2, Zijian Jin13, Zhenyu Zhang2, Shangquan Sun14, Maestro Harmon15, Dianzhuo Wang16, Qian-ze Zhu16, Jianqiao Zeng, Jiachen Sun17, Mingyuan Wu18, Baoyu Zhou19, Chenyu You20, Shijian Lu14, Yiming Qiu21, Fan Lai18, Yuan Yuan22, Yao Li23, Junyuan Hong24, Ruihao Zhu25, Beidi Chen8, Alex Pentland2, Ang Chen1, Mosharaf Chowdhury1, Zechen Zhang16,15

1 University of Michigan, 2 Stanford University, 3 Ohio State University, 4 MIT, 5 Yale University, 6 Meta Superintelligence Labs, 7 University of Chicago, 8 Carnegie Mellon University, 9 University of Washington, 10 University of Toronto, 11 NVIDIA, 12 Meta, 13 New York University, 14 Nanyang Technological University, 15 Orchestra Research, 16 Harvard University, 17 LinkedIn, 18 UIUC, 19 Arizona State University, 20 Stony Brook University, 21 University of Hong Kong, 22 Boston College, 23 Portland State University, 24 National University of Singapore, 25 Cornell University

Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along the way. This compilation imposes two structural costs: a Storytelling Tax, where failed experiments, rejected hypotheses, and the branching exploration process are discarded to fit a linear narrative; and an Engineering Tax, where the gap between reviewer-sufficient prose and agent-sufficient specification leaves critical implementation details unwritten. Tolerable for human readers, these costs become critical when AI agents must understand, reproduce, and extend published work.

We introduce the Agent-Native Research Artifact (Ara), a protocol that replaces the narrative paper with an agent-executable research package structured around four layers: scientific logic, executable code with full specifications, an exploration graph that preserves the failures compilation discards, and evidence grounding every claim in raw outputs. Three mechanisms support the ecosystem: a Live Research Manager that captures decisions and dead ends during ordinary development; an Ara Compiler that translates legacy PDFs and repos into Aras; and an Ara-native review system that automates objective checks (analogous to a grammar checker for prose) so human reviewers can focus on significance, novelty, and taste. On PaperBench and RE-Bench, Ara raises question-answering accuracy from 72.4% to 93.7% and reproduction success from 57.4% to 64.4%. On RE-Bench’s five open-ended extension tasks, preserved failure traces in Ara accelerate progress, but can also constrain a capable agent from stepping outside the prior-run box depending on the agent’s capabilities.

Correspondence: Jiachen Liu ([email protected])
Code: github.com/AmberLJC/Agent-Native-Research-Artifact
ARA Commons

1 Introduction

Research produces a rich, branching knowledge object: months of hypotheses tested and rejected, implementation tricks discovered through trial and error, design alternatives weighed against each other, and the full exploration trajectory that explains why the final approach was chosen. Publishing compiles this object into a linear narrative (Medawar, 1963; Canini, 2026), discarding failed experiments, tacit engineering knowledge, and the branching process to satisfy the conventions of human-readable storytelling (Rosenthal, 1979; Franco et al., 2014). This compilation cost, a consequence of the documentation convention rather than any particular file format, was tolerable when every consumer of a paper was human. It is not when AI agents routinely read papers to understand a field, reproduce experiments to validate findings, and extend published methods to new settings (Lu et al., 2024; Liu, 2026a): each task requires precisely the knowledge that compilation discards (Figure 1). More specifically, the compilation incurs two structural costs.

Figure 1: Publishing compiles a rich research object into a lossy narrative (left); Ara preserves the original as a high-fidelity, agent-executable knowledge package (right).

The first is the Storytelling Tax: the systematic erasure of research process knowledge imposed by compilation into narrative (Figure 2). Research does not proceed linearly: it branches, backtracks, and accumulates hard-won failure knowledge before converging on a publishable result (Kuhn, 1962; Medawar, 1963). Narrative compilation flattens this process into a polished linear story, discarding every failed experiment, rejected hypothesis, and abandoned approach. This emphasis on success leaves failures undocumented; although modern platforms archive final artifacts, the branching research process remains unrecorded, causing independent rediscovery of the same dead ends across groups (Rosenthal, 1979; Franco et al., 2014). Our own analysis of the METR eval-analysis-public dataset (Wijk et al., 2025), covering 24,008 agent runs across 21 frontier models on RE-Bench, quantifies the cost (per-task breakdown in Appendix E.3): failed runs account for 90.2% of total dollar cost (and 59.2% of tokens), with a median failed-to-success token ratio of 113×. Agents without access to prior failure records must independently rediscover every dead end. Equally lost is the record of human judgment along the trajectory: every rejection, revision, and endorsement is a preference signal over what constitutes good research, the scarce resource that binds once agents shoulder the grunt work. Narrative compilation discards this signal; a preserved trajectory renders it as structured supervision that compounds across projects.

The second is the Engineering Tax: the gap between reviewer-sufficient and agent-sufficient documentation (Figure 3). The paper communicates its contribution at the level of detail needed to convince a human reviewer; the codebase provides an implementation but not the operational specification needed to execute it. Between the two lies tacit knowledge (Polanyi, 1966): algorithmic tricks, implementation decisions, and configuration choices that exist in no written document and are transmitted only through direct lab contact or painstaking reverse-engineering. We quantify this void by classifying each of PaperBench’s 8,921 expert-annotated reproduction requirements across 23 ICML 2024 papers (Starace et al., 2025) against its source PDF (per-category breakdown and gap-type taxonomy in Appendix E.2): despite widespread artifact sharing, only 45.4% are fully specified. Code development is the most underspecified category (37.3% sufficient), and missing hyperparameters alone account for 26.2% of all gaps (full breakdown in Appendix E.2): a fundamental mismatch between the precision at which papers are written (sufficient to produce belief) and the precision at which agents must operate (sufficient to produce correct execution) (Stodden et al., 2016; Baker, 2016).

Both taxes have persisted throughout the history of research because the human reader has always been the bandwidth-limited layer processing a vast, non-linear research trajectory. Capable AI agents now offer a more efficient, human-proximal proxy that processes the trajectory at machine bandwidth on the reader’s behalf, and three trends suggest what such an artifact should look like. First, AI agents have become indispensable research companions, co-authoring code, running experiments, and iterating on hypotheses alongside humans (Lu et al., 2024), with LLM adoption associated with paper-production increases of 23.7%–89.3% across scientific fields (Kusumegi et al., 2025); the full research trajectory (every failure, implementation trick, configuration choice, design pivot) is now captured as machine-readable text in researcher-agent sessions, yet no protocol preserves it as a first-class output. Second, humans and AI agents have divergent information needs: humans skim abstracts and figures (Renear and Palmer, 2009), while agents benefit from exhaustive detail that strictly improves reproduction, verification, and extension, so a single artifact optimized for human narrative can no longer serve both. Third, research is scaling into a massively parallel enterprise where agents fork, extend, and merge each other’s work at machine speed, shifting the bottleneck from individual productivity to artifact operability: narrative PDFs, compiled for sequential human reading, cannot be forked, diffed, or merged, but a structured, lossless artifact can, letting research compound like software.

Figure 2: The Storytelling Tax: research proceeds as a branching tree with dead ends (left), but publication compiles it into a linear narrative (right), discarding all failure knowledge.

Existing efforts address fragments of this problem. The FAIR principles (Wilkinson et al., 2016) mandate findable, accessible data but say nothing about the structure of research arguments. RO-Crate (Soiland-Reyes et al., 2022) packages research artifacts as archival bundles, not executable objects. Nanopublications (Groth et al., 2010) formalize atomic claims but lack the execution layer needed for reproduction. The emerging AGENTS.md standard (OpenAI, 2025) provides agent-oriented documentation for code repositories but does not address the epistemic structure of research itself. None of these efforts jointly structure scientific logic, executable code, and exploration history into a single operable object (§8).

We propose the Agent-Native Research Artifact (Ara), a protocol that recasts the primary research object from narrative document to agent-executable knowledge package, with papers serving as compiled views of the underlying artifact (Figure 1). Ara organizes research into four interlocking layers: structured scientific logic that distills the paper’s conceptual abstractions into queryable claims and dependency graphs; executable code with full operational specifications; an exploration graph preserving the branching research process (failed experiments, rejected hypotheses, and design pivots) that narrative compilation discards; and grounded evidence binding every claim to its raw empirical outputs. Instead of parsing prose, reverse-engineering repositories, and rediscovering dead ends, an agent operating on an Ara artifact can query structured claims, execute declarative specifications, and build on the full decision history directly: a research object designed not to be read, but to be operated.

To build the ecosystem around Ara, we develop three enabling mechanisms. The Live Research Manager (§3) captures research decisions and dead ends as natural side-effects of everyday development, producing conforming artifacts without additional documentation burden. The Ara Compiler (§4) translates legacy PDFs, repositories, and supplementary materials into Ara format, providing backward compatibility with the existing publication ecosystem. The Ara-Native Review System (§5) automates structural verification and budget-aware reproduction (analogous to a grammar checker for prose), redirecting expert attention from mechanical checking to judgment: significance, novelty, and taste (Aczel et al., 2021).

Figure 3: The reproduction information gap across 8,921 PaperBench requirements. (a) PDFs systematically under-specify code development tasks. (b) The three largest gap types are precisely the categories Ara’s structured layers address.

##### Contributions.

• We identify two structural costs of compiling research into narrative (the Storytelling Tax and the Engineering Tax) and introduce the Agent-Native Research Artifact (Ara): a protocol that recasts the primary research object from narrative document to agent-executable knowledge package organized into four interlocking layers (§2).
• We develop three enabling mechanisms: a Live Research Manager (§3) that captures research decisions during ordinary development; an Ara Compiler (§4) that translates legacy PDFs and repositories into Ara format; and an Ara-Native Review System (§5) that automates objective verification (analogous to a grammar checker for prose) so human reviewers can focus on judgment.
• We evaluate Ara across three layers of research utility (§7): understanding (what an agent can extract from the artifact), reproduction (whether the agent can re-execute the paper’s experiments), and extension (whether the agent can build beyond the documented results and discover genuinely new findings, the defining goal of research). Across all three layers, agents operating on an Ara consistently outperform those reading the paper PDF and its associated code repository.

2 The Ara Protocol

The Agent-Native Research Artifact (Ara) protocol defines a file-system ontology that transforms CS research from a narrative document into a machine-executable knowledge package. We describe the design philosophy (§2.1) and the layered architecture (§2.2). Figure 6 (§3) illustrates how the Live Research Manager mediates between the human–AI research process and the living artifact across a long time horizon.

Figure 4: The Ara directory structure. Each file’s role is annotated inline; layer badges mark the four top-level divisions. Cross-layer forensic bindings link claims in /logic to evidence in /evidence and code in /src. Full schema in Appendix A.3; design rationale in Appendix A.

### 2.1 Design Philosophy

The Ara protocol is grounded in a single principle, Knowledge over Narrative: the organized, evolving knowledge produced during research is the primary scientific object; the narrative paper is a compiled view.

A structured knowledge vessel.

An agent engaging with a research project asks four structurally distinct questions: why does this work (scientific reasoning), how is it implemented (executable code), what was tried along the way (exploration trajectory), and what are the numbers (raw empirical evidence). A narrative paper forces the agent to extract all four answers from the same linear prose, yet these knowledge types conflict in structure: reasoning demands stable, citable units while code iterates continuously; the exploration history is inherently branching while narrative enforces linearity; evidence requires machine-precise values while prose rounds and paraphrases. Compressing them into a single document is not merely suboptimal but lossy: once flattened into narrative, the original structure cannot be recovered. Ara eliminates this loss by materializing each knowledge type as its own layer within an agent-native file-system structure (Figure 4): plain-text, independently queryable directories that agents navigate, read, and act on with standard tool calls, without parsing prose or reverse-engineering repositories. Because agent context windows are a shared, finite resource, the structure further supports progressive disclosure: agents load only the layers and files relevant to their current task, avoiding unnecessary context pollution.

The four layers below are the structural response to the Storytelling and Engineering Taxes introduced in §1: the Exploration Graph recovers the branching trajectory collapsed by narrative; the Cognitive and Physical Layers, bound by forensic links, close the gap between conceptual description and executable specification. Within each layer, artifact text maximizes information per token: subjective qualifiers, hedges, and narrative connectives are stripped, and statements requiring judgment carry provenance rather than rhetoric.

Figure 5: Cross-layer structure of a real Ara. Claims in /logic link to code in /src and evidence in /evidence via forensic bindings. The Exploration Graph (bottom center) captures the research DAG with dead-end nodes (marked ×) that preserve failure modes and lessons.

### 2.2 Ara Architecture

An agent engaging with a research artifact has four fundamental needs: understand the contribution, reproduce it, learn from the process, and verify claims against raw evidence. Ara materializes each need as a dedicated layer (Figure 4), rooted in a manifest PAPER.md whose YAML frontmatter and layer index enable an agent to triage relevance in ~500 tokens (Appendix A.3). The decomposition is informed by a taxonomy of ten reproduction-critical information categories derived from PaperBench rubrics (Appendix A.1).
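As a minimal sketch of how an agent might act on this decomposition, the snippet below maps each of the four needs to one layer directory and returns only what a given task requires. The one-to-one mapping is a simplifying reading of the paragraph above, and the names `NEED_TO_LAYER` and `layers_to_load` are hypothetical, not part of the protocol.

```python
# Assumed one-to-one reading of the four needs and the four layer directories.
# The dictionary and function names are illustrative, not defined by the protocol.
NEED_TO_LAYER = {
    "understand": "/logic",
    "reproduce": "/src",
    "learn_from_process": "/trace",
    "verify": "/evidence",
}


def layers_to_load(needs):
    """Return the manifest plus only the layers that the given needs require."""
    selected = ["PAPER.md"]  # the manifest is read first for triage
    for need in needs:
        layer = NEED_TO_LAYER[need]  # raises KeyError for an unknown need
        if layer not in selected:
            selected.append(layer)
    return selected


print(layers_to_load(["understand", "verify"]))
```

In practice a reproduction task would likely need /logic as well as /src; the sketch only illustrates that an agent can skip the layers it does not need.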

The Cognitive Layer (/logic).

An agent reads /logic first to understand what was done and why: problem.md defines the gap and key insight, solution/ specifies the architecture, algorithm, and convergence-critical heuristics, claims.md distills falsifiable assertions with explicit proof pointers, and experiments.md declares the verification plan. related_work.md replaces passive citations with typed dependencies that agents can act on: imports inject prior definitions, bounds propagate constraints to hyperparameter search, and baseline entries enable automatic regression detection, transforming literature review into a machine-executable dependency graph.
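To make the idea of typed dependencies concrete, here is a hedged sketch of two of the three dependency kinds. The field names (`source`, `param`, `low`, `high`, `metric`, `value`) and all numbers are placeholders invented for illustration; the actual related_work.md schema is not given in this section.

```python
from dataclasses import dataclass


@dataclass
class Bound:
    """Hypothetical `bounds` entry: a hyperparameter constraint inherited from prior work."""
    source: str
    param: str
    low: float
    high: float


@dataclass
class Baseline:
    """Hypothetical `baseline` entry: a prior result that a new run is compared against."""
    source: str
    metric: str
    value: float
    higher_is_better: bool = True


def clamp_search_range(bound, proposed):
    """Intersect a proposed (low, high) search range with the inherited bound."""
    low, high = max(bound.low, proposed[0]), min(bound.high, proposed[1])
    if low > high:
        raise ValueError(f"search range for {bound.param} conflicts with {bound.source}")
    return (low, high)


def is_regression(baseline, observed, tolerance=0.0):
    """Flag a run whose metric is worse than the baseline by more than `tolerance`."""
    if baseline.higher_is_better:
        return observed < baseline.value - tolerance
    return observed > baseline.value + tolerance


# Placeholder values, not results from any paper.
lr_bound = Bound("prior-work-A", "learning_rate", 1e-5, 1e-3)
accuracy = Baseline("prior-work-A", "accuracy", 0.80)
print(clamp_search_range(lr_bound, (1e-4, 1e-2)))
print(is_regression(accuracy, 0.78))
```

The point of the sketch is that a dependency carries behaviour: a bound narrows a search, and a baseline turns into a check, which a plain citation cannot do.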

The Physical Layer (/src).

Contains the how: executable code calibrated to the contribution type. Algorithmic contributions use kernel mode: only the core modules with typed I/O signatures, often one to two orders of magnitude smaller than the full repository, so that a coding agent can regenerate environment-native boilerplate on demand. Systemic contributions (CUDA kernels, distributed training, systems architectures) use repository mode: the full implementation is retained but annotated via an index.md manifest that maps each source file to the Ara component it implements. configs/ annotates every hyperparameter with rationale and search range; environment.md pins dependencies, hardware, and seeds (mode specification, forensic bindings, and detailed rationale in Appendix A.2).
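The requirement that every hyperparameter carries a rationale and a search range lends itself to a simple check. The sketch below is one possible shape for such an annotated entry; the `HyperParam` class, its fields, and the sample values are assumptions for illustration, since the real configs/ format is specified in the paper's appendix rather than here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperParam:
    """Hypothetical annotated config entry: value plus rationale and search range."""
    name: str
    value: float
    rationale: str
    search_range: tuple  # (low, high), inclusive


def missing_rationale(params):
    """Names of hyperparameters whose rationale is empty."""
    return [p.name for p in params if not p.rationale.strip()]


def out_of_range(params):
    """Names of hyperparameters whose value falls outside their declared search range."""
    return [p.name for p in params
            if not (p.search_range[0] <= p.value <= p.search_range[1])]


# Placeholder entries, not taken from any real artifact.
config = [
    HyperParam("learning_rate", 3e-4, "placeholder rationale", (1e-5, 1e-3)),
    HyperParam("dropout", 0.7, "", (0.0, 0.5)),
]
print(missing_rationale(config))
print(out_of_range(config))
```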

The Exploration Graph (/trace).

exploration_tree.yaml stores the complete research directed acyclic graph (DAG) as a nested YAML tree with five typed node kinds (question, decision, experiment, dead_end, pivot), where nesting encodes parent → child edges and an also_depends_on field captures convergence points. The format functions as a “git log for research”: agents traverse branches directly, and dead-end nodes preserve the hypothesis, failure mode, and lesson that narrative papers discard (Figure 5).
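The traversal described above can be sketched on an already-parsed tree. The five node kinds, the `also_depends_on` field, and the hypothesis / failure mode / lesson contents of dead-end nodes come from the text; the key names `id`, `kind`, `text`, and `children`, and every value in the sample tree, are assumptions made for this example.

```python
NODE_KINDS = {"question", "decision", "experiment", "dead_end", "pivot"}

# A toy tree as it might look after loading the YAML file (all contents are placeholders).
tree = {
    "id": "q1", "kind": "question", "text": "placeholder research question",
    "children": [
        {"id": "e1", "kind": "experiment", "text": "placeholder first attempt",
         "children": [
             {"id": "d1", "kind": "dead_end",
              "hypothesis": "placeholder hypothesis",
              "failure_mode": "placeholder failure mode",
              "lesson": "placeholder lesson",
              "children": []},
         ]},
        {"id": "p1", "kind": "pivot", "text": "placeholder change of direction",
         "children": [
             {"id": "e2", "kind": "experiment", "text": "placeholder second attempt",
              "also_depends_on": ["e1"], "children": []},
         ]},
    ],
}


def walk(node, path=()):
    """Yield (path_of_ids, node) for every node, depth first."""
    if node["kind"] not in NODE_KINDS:
        raise ValueError(f"unknown node kind: {node['kind']}")
    here = path + (node["id"],)
    yield here, node
    for child in node.get("children", []):
        yield from walk(child, here)


def dead_ends(root):
    """Return (branch path, lesson) for every dead-end node."""
    return [(" > ".join(path), node["lesson"])
            for path, node in walk(root) if node["kind"] == "dead_end"]


def edges(root):
    """All DAG edges: nesting edges plus the extra `also_depends_on` edges."""
    out = []
    for path, node in walk(root):
        if len(path) > 1:
            out.append((path[-2], node["id"]))
        out += [(dep, node["id"]) for dep in node.get("also_depends_on", [])]
    return out


print(dead_ends(tree))
print(edges(tree))
```

Nesting alone gives a tree; it is the `also_depends_on` edges that turn it into a DAG, which is why `edges` collects both.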

The Evidence Layer (/evidence).

The raw outputs that ground every claim: results/ contains machine-readable metric tables and generated data with exact values and source annotations; logs/ captures training curves, resource usage, and diagnostics. It holds outputs only, so that every claim’s proof chain flows claims.md → experiments.md → /evidence/.
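A proof chain of this shape can be checked mechanically. The sketch below assumes the three files have already been parsed into dictionaries; the identifiers (`C1`, `E1`), the dictionary layout, and the file names are hypothetical stand-ins, not the protocol's actual schema.

```python
# Placeholder, already-parsed views of claims.md, experiments.md, and /evidence/.
claims = {
    "C1": {"experiments": ["E1"]},
    "C2": {"experiments": ["E2"]},
    "C3": {"experiments": []},
}
experiments = {
    "E1": {"evidence": ["results/main_table.csv"]},
    "E2": {"evidence": ["results/ablation.csv"]},
}
evidence_files = {"results/main_table.csv"}


def ungrounded_claims(claims, experiments, evidence_files):
    """Return {claim_id: reason} for each claim whose proof chain breaks."""
    problems = {}
    for claim_id, claim in claims.items():
        if not claim["experiments"]:
            problems[claim_id] = "no experiment declared"
            continue
        for exp_id in claim["experiments"]:
            experiment = experiments.get(exp_id)
            if experiment is None:
                problems[claim_id] = f"experiment {exp_id} not declared"
                break
            missing = [f for f in experiment["evidence"] if f not in evidence_files]
            if missing or not experiment["evidence"]:
                problems[claim_id] = f"experiment {exp_id} lacks evidence: {missing}"
                break
    return problems


print(ungrounded_claims(claims, experiments, evidence_files))
```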

Withholding ground-truth enables layered access control.

Experiment logic (what to verify) lives in /logic; experiment data (exact results) lives exclusively in /evidence. A verification agent can be granted the code kernel and algorithm descriptions while the evidence layer is withheld, preventing fabrication by copying expected values. This separation also makes every Ara a ready-to-use training environment: the task lives in /logic/experiments.md, the reward in /evidence/, and preference signals in every accept, reject, and revision logged in /trace/.
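One simple way to realize this kind of withholding is to hand the verification agent a copy of the artifact with the evidence directory left out. The following is a sketch under that assumption, using only the Python standard library; `make_verifier_view` and `demo` are hypothetical names, and the paper does not say that access control is implemented by copying.

```python
import shutil
import tempfile
from pathlib import Path


def make_verifier_view(ara_root: Path, dest: Path, withheld=("evidence",)) -> Path:
    """Copy an artifact for a verification agent, leaving out withheld top-level layers."""
    def ignore(directory, names):
        # Filter only entries directly under the artifact root.
        if Path(directory) == ara_root:
            return [name for name in names if name in withheld]
        return []

    shutil.copytree(ara_root, dest, ignore=ignore)
    return dest


def demo() -> list:
    """Build a toy artifact in a temporary directory and list what the verifier sees."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "ara"
        for layer in ("logic", "src", "trace", "evidence"):
            (root / layer).mkdir(parents=True)
            (root / layer / "placeholder.md").write_text("placeholder\n")
        view = make_verifier_view(root, Path(tmp) / "view")
        return sorted(p.name for p in view.iterdir())


print(demo())
```

Because the expected values never enter the verifier's workspace, a match between its reproduced numbers and /evidence has to come from actually running the experiment.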

We consider an Ara sufficient when a sufficiently capable coding agent can reproduce the core claim zero-shot from it, without human intervention or external context beyond the artifact itself. Sufficiency is therefore a capability-relative criterion: it measures whether the artifact contains the information required to reproduce its claim, not whether any present-day agent can fully exploit it. At the limit of agent capability, a complete Ara is reproducible by definition, so artifacts authored today remain valid as agents advance.

3 The Live Research Manager

The previous section defines the Ara protocol: a structured target format for research knowledge. A natural question follows: *who populates it?* Requiring researchers to hand-author structured files would simply replace one documentation burden with another, negating the very tax the protocol was designed to eliminate.

The key observation is that in AI-native research, the full research trajectory (every design choice, failed experiment, and hard-won heuristic) already exists as machine-readable text in the conversation between researcher and agent. The Live Research Manager crystallizes this latent signal into a living Ara artifact that accumulates organically in the background, with zero documentation overhead for the researcher (code: github.com/AmberLJC/Agent-Native-Research-Artifact). We derive design principles from the AI-native paradigm (§3.1) and present the system that realizes them (§3.2; Figure 6).

Figure 6: The Live Research Manager operates at session boundaries: a three-stage pipeline (Context Harvester → Event Router → Maturity Tracker) distills each researcher–agent conversation into typed events that accumulate across Ara layers over time.

### 3.1 Design Principles

AI-native research.

Computer science research is increasingly AI-native: the researcher collaborates with a general-purpose coding agent across the full research lifecycle. A single conversational loop may span brainstorming a hypothesis, surveying related work, writing and debugging experiment code, analyzing results, and drafting the paper: the researcher provides direction, the agent executes, and the cycle repeats across days or weeks. This paradigm shift has a structural consequence: for the first time, the research process is born digital and born textual. Every instruction, intermediate result, design choice, and abandoned direction already exists as machine-readable text, the raw material for a complete research record, lacking only a system to let it crystallize naturally. Prior efforts to preserve process knowledge, such as negative-result journals (Matosin et al., 2014) and registered reports (Chambers, 2013), foundered because documentation remained a separate, unrewarded burden on the researcher. AI-native research dissolves this barrier: the process trace is not an additional deliverable but a byproduct of the research itself, generated automatically in every researcher–agent session. We are, in other words, at the first moment in scientific history where comprehensive research-process capture is feasible at negligible marginal cost.

This observation motivates the Live Research Manager: if the research trajectory is already captured in conversation, a background system can distill it into a structured Ara without asking the researcher to document anything. We derive three design principles from this paradigm.

P1. Silent, framework-independent integration: the manager is a natural-language specification loadable by any coding agent, requiring no custom SDKs. It runs as a background process that silently collects research traces without researcher intervention.
P2. Faithful epistemic provenance: every event is tagged with provenance (user, ai-suggested, ai-executed, user-revised) to preserve epistemic origin. Raw observations are staged rather than forced into categories, maturing progressively into typed events as evidence accumulates, so downstream consumers can reconstruct not just what was concluded but why.
P3. Comprehensive trajectory capture: the full branching research process, including dead ends and pivots, is recorded with cross-layer bindings established at capture time. The artifact is version-controlled, so each milestone produces a navigable snapshot and retroactive revisions are first-class operations rather than destructive overwrites.

Full rationale for each principle is in Appendix C.1.

### 3.2 System Design

Overview.

We implement the Live Research Manager as an agent skill (Anthropic, 2025a): a lightweight, natural-language specification that, when loaded into a general-purpose coding agent’s context window, turns it into a domain-specialized agent (P1). A skill requires no custom SDKs or infrastructure; it simply composes tools the agent already possesses (file read, write, edit, shell execution) with domain knowledge of the Ara schema, and artifact quality improves automatically as the underlying language model advances. The skill is open-source and available at https://github.com/AmberLJC/Agent-Native-Research-Artifact.

The manager remains silent during active research, then runs a three-stage retrospective pipeline at the end of each conversation (P1; Figure 6). The Context Harvester scans the session record (conversational history, tool outputs, experiment results, and code diffs) to extract research-significant events. The Event Router classifies each event, tags it with provenance, and writes it to the appropriate Ara layer. The Maturity Tracker reviews the staging area, promoting observations that have accumulated sufficient evidence into formal entries and flagging stale or conflicting items. At the start of the next session, the manager reads the artifact back to surface a structured briefing, closing the loop. The remainder of this subsection details each stage and the cross-session mechanisms that bind them together.

Context harvesting and event routing.

The Context Harvester scans the full session record and identifies two categories of research-significant activity: actions the AI agent has already performed (e.g., experiment runs, code changes, literature searches) and directions the researcher has expressed or confirmed (e.g., hypotheses to test, design choices, abandoned approaches). The Event Router then classifies each extracted event into one of seven types (Table 1) and tags it with provenance (P2): user, ai-suggested, ai-executed, or user-revised. An ai-suggested event never auto-upgrades until the researcher explicitly confirms it, preserving the epistemic integrity of the artifact. Event payloads follow the protocol’s factual-density requirement (§2.1): conversational prose is distilled into telegraphic, quantitative language before committing to the artifact. Each typed event is written to the appropriate Ara layer: trace events (decisions, experiments, dead ends, pivots) append to the Exploration Graph (/trace/), building a faithful, chronological record of the research trajectory including the directions that were tried and abandoned (P3); claims and heuristics enter the Cognitive Layer (/logic/); and events that resist classification are staged in /staging/ for later promotion.

Table 1: Research event types and structured payloads.
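The routing and provenance rules described above can be sketched as follows. The four provenance tags and the three destinations come from the text; the event kinds listed are only those named in the paragraph (the full seven-type table is not reproduced here), and the `Event` class, its `researcher_confirmed` flag, and the `confirm` helper are assumptions. In particular, the text does not say what an ai-suggested event upgrades to, so the sketch records confirmation as a separate flag and leaves the tag untouched.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Provenance(Enum):
    USER = "user"
    AI_SUGGESTED = "ai-suggested"
    AI_EXECUTED = "ai-executed"
    USER_REVISED = "user-revised"


# Event kinds named in the text; the names as identifiers are illustrative.
TRACE_KINDS = {"decision", "experiment", "dead_end", "pivot"}
LOGIC_KINDS = {"claim", "heuristic"}


@dataclass(frozen=True)
class Event:
    kind: str
    payload: str
    provenance: Provenance
    researcher_confirmed: bool = False  # hypothetical flag, set only by confirm()


def route(event: Event) -> str:
    """Pick the destination layer for a typed event."""
    if event.kind in TRACE_KINDS:
        return "/trace/"
    if event.kind in LOGIC_KINDS:
        return "/logic/"
    return "/staging/"  # anything that resists classification is staged


def confirm(event: Event) -> Event:
    """Record an explicit researcher confirmation; nothing else sets the flag."""
    return replace(event, researcher_confirmed=True)


suggestion = Event("claim", "placeholder claim text", Provenance.AI_SUGGESTED)
print(route(suggestion), suggestion.researcher_confirmed)
print(confirm(suggestion).researcher_confirmed)
```

Making the dataclass frozen means a confirmation produces a new event rather than mutating the old one, which fits the version-controlled, non-destructive character the principles call for.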

Progressive crystallization.

Artifact construction proceeds at two timescales (P2). Continuously, at every session boundary, the manager appends trace events to the Exploration Graph, recording how the researcher navigates the open-ended research landscape. Periodically, when a major milestone is reached (a hypothesis confirmed or refuted, a working prototype completed, a key design choice finalized), the Maturity Tracker crystallizes the accumulated observations into the more structured layers of the artifact. Raw observations mature into formal claims in the Cognitive Layer (/logic/), working code is documented in the Physical Layer (/src/), and concepts are added to the knowledge index. This two-phase rhythm mirrors how research understanding actually develops: insights begin as scattered observations, and forcing premature structure would distort the record. The manager stages what is not yet classifiable and crystallizes only when closure signals in the session record (topic abandonment, explicit researcher affirmation, empirical resolution of linked experiments, or artifact-level commitment) indicate the observation has been treated as settled (Appendix C.2). When a pivot invalidates an earlier design choice, the manager propagates the change: it updates the affected artifact entries (claims, heuristics, or configuration) to reflect the new direction while the Exploration Graph retains the original rationale and the reason for revision. Because the artifact is version-controlled, each milestone crystallization produces a commit, giving the project a navigable history of its intellectual evolution, much as GitHub provides for code.
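A minimal sketch of the promotion step, assuming that a single closure signal is enough to treat an observation as settled (the actual criteria are in the paper's Appendix C.2 and may be stricter). The four signal names paraphrase the list above; `StagedObservation` and `crystallize` are hypothetical names.

```python
from dataclasses import dataclass, field

# Identifiers paraphrasing the four closure signals listed in the text.
CLOSURE_SIGNALS = {
    "topic_abandonment",
    "researcher_affirmation",
    "empirical_resolution",
    "artifact_commitment",
}


@dataclass
class StagedObservation:
    text: str
    signals: set = field(default_factory=set)


def crystallize(staging):
    """Split staged observations into (promoted, still_staged)."""
    promoted, remaining = [], []
    for observation in staging:
        unknown = observation.signals - CLOSURE_SIGNALS
        if unknown:
            raise ValueError(f"unknown closure signals: {sorted(unknown)}")
        # Assumption: any one closure signal marks the observation as settled.
        (promoted if observation.signals else remaining).append(observation)
    return promoted, remaining


staging = [
    StagedObservation("placeholder observation A", {"researcher_affirmation"}),
    StagedObservation("placeholder observation B"),
]
promoted, remaining = crystallize(staging)
print([o.text for o in promoted], [o.text for o in remaining])
```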

Cross-session continuity.

The manager is stateless; the artifact itself carries memory across sessions. At session close, it writes a short session record (events captured, claims touched, open threads) and appends to a running session index. The next session reads this index alongside the current claims and staged observations, and raises the relevant pieces only when they bear on the task at hand, rather than leading with a formal briefing the researcher did not ask for (Appendix C.3).
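The stateless design can be illustrated with an append-only index of JSON lines. The three record fields mirror the ones named above; the key names, the in-memory list standing in for a file, and the keyword match used to decide relevance are all simplifying assumptions (a real agent would judge relevance with the language model itself).

```python
import json


def close_session(index_lines, session_id, events, claims_touched, open_threads):
    """Return the session index with one new JSON-line record appended."""
    record = {
        "session": session_id,
        "events_captured": events,
        "claims_touched": claims_touched,
        "open_threads": open_threads,
    }
    return index_lines + [json.dumps(record, sort_keys=True)]


def relevant_threads(index_lines, task_keywords):
    """Surface only the open threads that mention one of the task keywords."""
    hits = []
    for line in index_lines:
        record = json.loads(line)
        for thread in record["open_threads"]:
            if any(keyword.lower() in thread.lower() for keyword in task_keywords):
                hits.append((record["session"], thread))
    return hits


# Placeholder session data.
index = close_session([], "s1", 4, ["C1"], ["tune warmup schedule", "rerun ablation"])
index = close_session(index, "s2", 2, [], ["check data leakage"])
print(relevant_threads(index, ["ablation"]))
```

Since all state lives in the index lines, any agent instance can pick up the next session without having seen the previous one.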

4 The Ara Compiler

The born-agent pathway (§3) produces the highest-fidelity Aras by capturing knowledge as it emerges during the research process. However, the scientific record contains millions of legacy PDFs that were never structured, and the born-agent pathway cannot reach backward. Beyond the PDF itself, research knowledge is distributed across complementary sources: GitHub repositories encode implementation decisions that prose omits, expert-curated evaluation rubrics encode what domain practitioners consider the core claims, and recorded experimental trajectories preserve the failure modes that papers systematically suppress. To bridge this gap, we introduce the Ara Compiler, an agent skill that translates any combination of legacy research sources into a complete Ara (code: github.com/AmberLJC/Agent-Native-Research-Artifact). Quality is enforced in two stages. During compilation, the Compiler uses only ARA Seal Level 1 as an in-loop validation signal. After compilation, the finished artifact enters the downstream Seal pipeline, where Levels 2–3 evaluate its argumentative support and empirical reproducibility (§5.2).
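The in-loop use of a validation signal can be pictured as a generate-validate-repair loop. What ARA Seal Level 1 actually checks is defined in §5.2 and not in this section, so the sketch substitutes a purely structural presence check as a stand-in; `level1_issues`, `compile_with_validation`, and the toy generator are all hypothetical.

```python
REQUIRED = ("PAPER.md", "logic", "src", "trace", "evidence")


def level1_issues(artifact):
    """Stand-in validator: report required top-level entries that are missing."""
    return [name for name in REQUIRED if name not in artifact]


def compile_with_validation(generate, max_rounds=3):
    """Regenerate until the validator reports no issues or the round budget runs out."""
    artifact, issues = {}, list(REQUIRED)
    for round_no in range(1, max_rounds + 1):
        artifact = generate(artifact, issues)  # the agent repairs what was flagged
        issues = level1_issues(artifact)
        if not issues:
            return artifact, round_no
    raise RuntimeError(f"still failing validation after {max_rounds} rounds: {issues}")


def toy_generate(artifact, issues):
    """Toy generator that fixes at most two flagged issues per round."""
    repaired = dict(artifact)
    for name in issues[:2]:
        repaired[name] = "placeholder"
    return repaired


artifact, rounds = compile_with_validation(toy_generate)
print(rounds, sorted(artifact))
```

Feeding the validator's findings back into the next generation round is what keeps the loop short; a generator that ignored them would have no reason to converge.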

Figure 7: The Ara Compiler accepts any combination of research sources and guides a coding agent through four stages of top-down artifact compilation, iterating 2–3× with in-loop ARA Seal Level 1 validation until the output conforms to the protocol.

### 4.1 Design Principles

Universal input, canonical output.

The Compiler is many-to-one: it accepts any combination of PDFs, code repositories, datasets, human-curated rubrics (e.g., PaperBench (Starace et al., 2025) claim decompositions), and experimental trajectory logs (e.g., RE-Bench (Wijk et al., 2025) traces that record failures the paper omits), and always produces a single Ara conforming to §2. Degradation is graceful: a PDF alone yields a valid artifact with stub-level physical layers; richer inputs populate progressively more complete layers.

High-fidelity preservation.

A narrative paper compresses and selects; the Compiler decompresses and restores. Every numerical result, hyperparameter, architectural detail, and negative finding in the sources must appear somewhere in the artifact, and any PDF-accessible information missing from the Ara constitutes a compilation failure (evaluated in §7.2). Preservation is faithful to sources; enrichment (§4.2) separately surfaces cross-artifact patterns no single source expresses.

Knowledge lineage, not flat extraction.

Narrative compilation destroys the provenance chains connecting claims to experiments, experiments to evidence, and evidence to code. Plain-text extraction recovers content but not these connections: parsing a PDF into Markdown populates four directories yet leaves them structurally isolated. The Compiler performs forensic reconstruction, recovering the cross-layer forensic bindings (§2.2) from sources where lineage exists only implicitly (scattered across prose, figure captions, appendix tables, and code comments), so an agent can trace any claim downstream to code or any number upstream to its hypothesis. Recovering this lineage, not populating layers, is the core compilation problem.

4.2 Compiler Implementation

Realizing these principles against narrative PDFs requires concrete mechanisms that address the specific ways prose obscures structure, lineage, and source diversity. The Compiler is implemented as a single agent skill (Figure 7) that guides a coding agent through the translation from legacy sources to Ara format. The skill prompt encodes the full Ara schema, field-level requirements, and a structured generation protocol. We describe the key elements below; the complete skill specification is reproduced in Appendix B.1.

Top-down generation.

The skill instructs the agent to construct the artifact top-down, mirroring how a researcher explains their work to a new collaborator: high-level concepts first, details next, implementation last. Concretely, the agent proceeds through four stages. Semantic Deconstruction strips narrative framing to expose raw research content (formulations, configurations, results, dependencies, failed approaches), rewritten in fact-dense telegraphic form that eliminates the Storytelling Tax at the source level. Cognitive Mapping populates /logic: motivation chain (observations → gaps → insight), falsifiable claims with proof pointers, formal concepts, and the solution structure, with every claim linked to the experiment that verifies it. Physical Grounding generates /src: annotated configurations, typed code stubs, and the environment manifest; when a code repository is available, stubs are replaced with actual implementations and the agent performs code-paper reconciliation, cross-referencing the codebase against claims to surface tacit knowledge (implicit assumptions, undocumented tricks, extra parameters; Li et al., 2026) that is written back to /logic as provenance-tagged heuristics. Exploration Graph Extraction reconstructs the research DAG as a nested YAML tree with dead-end leaf nodes documenting hypothesis, failure mode, and lesson. This ordering ensures each layer is informed by the one above: cognitive structure guides physical generation, and the exploration graph contextualizes both.
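As a rough illustration of what the fourth stage produces, the sketch below models an exploration tree as nested Python dictionaries and collects its dead-end leaves. The field names (`id`, `status`, `hypothesis`, `failure_mode`, `lesson`, `children`) and the sample content are assumptions made for illustration; the actual schema is defined in the skill specification.

```python
from typing import Iterator

# Hypothetical exploration tree; field names and contents are illustrative only.
tree = {
    "id": "root", "status": "explored",
    "children": [
        {"id": "n1", "status": "dead_end",
         "hypothesis": "larger batch size stabilises training",
         "failure_mode": "loss diverged after warmup",
         "lesson": "keep batch size fixed and tune the learning rate instead",
         "children": []},
        {"id": "n2", "status": "explored",
         "children": [
             {"id": "n3", "status": "success", "children": []},
         ]},
    ],
}


def dead_ends(node: dict) -> Iterator[dict]:
    """Yield every leaf node marked as a dead end, depth-first."""
    children = node.get("children", [])
    if not children and node.get("status") == "dead_end":
        yield node
    for child in children:
        yield from dead_ends(child)


dead = list(dead_ends(tree))
```

A downstream agent could run a traversal like this before planning, so that approaches already recorded as failures are not retried.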

Iterative refinement via validation feedback.

After initial generation, ARA Seal Level 1 checks (§5.2), covering schema conformance, cross-layer reference resolution, and required field completeness, run within the same agent conversation, returning failures as structured diagnostics that drive targeted fixes. This generate → validate → fix loop typically converges in 2–3 rounds, turning Level 1 into actionable feedback rather than a post-hoc report.
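The loop can be sketched as a small driver function. This is a minimal sketch, not the Compiler's implementation: `generate`, `validate`, and `fix` are hypothetical callables standing in for agent steps, and the three-round cap is an assumption loosely based on the reported convergence.

```python
from typing import Callable, List, Tuple


def compile_with_validation(
    generate: Callable[[], dict],
    validate: Callable[[dict], List[str]],
    fix: Callable[[dict, List[str]], dict],
    max_rounds: int = 3,  # assumed cap, not a value from the paper
) -> Tuple[dict, List[str], int]:
    """Run generate -> validate -> fix until validation passes or rounds run out."""
    artifact = generate()
    diagnostics = validate(artifact)
    rounds = 1
    while diagnostics and rounds < max_rounds:
        artifact = fix(artifact, diagnostics)   # targeted repair driven by diagnostics
        diagnostics = validate(artifact)
        rounds += 1
    return artifact, diagnostics, rounds


# Toy usage: the "artifact" is a dict and validation demands two fields.
REQUIRED = ("claims", "experiments")
result, remaining, rounds = compile_with_validation(
    generate=lambda: {"claims": ["C01"]},
    validate=lambda a: [f"missing field: {k}" for k in REQUIRED if k not in a],
    fix=lambda a, diags: {**a, **{d.split(": ")[1]: [] for d in diags}},
)
```

Returning the leftover diagnostics alongside the artifact lets a caller distinguish a converged run from one that hit the round cap.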

Source-aware enrichment.

Auxiliary sources are routed to the layer they most directly populate: code repositories replace stubs with verified implementations in /src; evaluation rubrics anchor /logic with expert-verified claim decompositions; and trajectory logs seed /trace with dead-end nodes the PDF omits. When a library of previously compiled Aras is available, the Compiler further performs collective inference: it retrieves heuristics and configurations from same-domain artifacts, flags common patterns the current paper omits, and adds them as candidate heuristics tagged collective_inference so downstream agents can distinguish stated from inferred knowledge.

5 Ara Verification and Review

Expert human attention is the scarcest resource in scientific evaluation. Review loads at top venues have grown faster than the reviewer pool (Aczel et al., 2021), and reviewer bandwidth is increasingly consumed by mechanical verification (“does the code run?”, “does Table 3 support Claim 2?”) rather than the substantive judgment that only domain experts can provide.

Because an Ara submission is a structured, machine-executable artifact, structural properties that PDF review checks subjectively become objectively verifiable: schema conformance and cross-layer reference integrity either pass or fail deterministically. Higher-order properties (information completeness and directional reproducibility of central claims) become machine-assessable: automated checks provide probabilistic evidence that informs human reviewers rather than replacing them. This separation redirects expert attention from mechanical checking to the substantive judgment that only humans can provide: significance, novelty, and taste. We define the ARA Seal as a machine-verifiable credential with three escalating levels (§5.2, Figure 8) and describe a three-stage review pipeline (Figure 9) that consumes those levels to separate automated verification from human judgment.

5.1 Design Principles

We derive two design principles below; cross-artifact review via agent-to-agent ARA comparison, which becomes meaningful only as the corpus reaches critical mass, is discussed in §9.

P1. Automate the mechanical; reserve humans for judgment.

Structural validity, internal consistency, and claim reproducibility are objective properties that either pass or fail, whereas significance, novelty, and taste require domain expertise. The review system should never ask a human to verify that Experiment E03 matches Claim C02 when a machine can do so in seconds; resolving all machine-checkable issues before the artifact reaches a human reviewer ensures that expert attention is spent exclusively on questions that genuinely require it.

P2. Reproducibility as a foundational requirement.

“Code available upon request” nominally satisfies today’s reproducibility bar; in an Ara-native system, reproducibility is a machine-verified property of the artifact itself. Passing ARA Seal Level 1 (structural integrity) is a submission requirement, and Level 2 (argumentative rigor) produces a structured critique before the venue spends compute on Level 3 (execution reproducibility), whose results are then attached to every review. Artifacts that fail structural checks, or whose claims remain obviously under-supported after Level 2 critique, do not advance to human review.

5.2 The ARA Seal: Machine-Verifiable Research Credentials

Figure 8: The ARA Seal is a three-level verification credential. Each level tests a progressively stronger property of the artifact, escalating in cost and breadth: structural integrity (seconds, deterministic), argumentative rigor (minutes, rubric-anchored agent), and execution reproducibility (hours to days, sandboxed coding agent). Passing the applicable levels issues a Seal Certificate that downstream agents check before investing compute.

A PDF paper earns trust indirectly: venue prestige, citation counts, and author reputation serve as proxies for quality, but none verify the work itself. Because an Ara encodes research as typed, machine-traversable data with explicit claim–evidence bindings, its quality becomes directly verifiable. The ARA Seal operationalizes this as a three-level verification protocol (Figure 8), where each level tests a progressively stronger property (implementation details in Appendix H.1).

Level 1 – Structural Integrity verifies that the artifact is well-formed and internally consistent: the directory ontology exists, all structured files conform to their schema (each claim carries Statement, Status, Falsification criteria, and Proof; each heuristic carries Rationale, Sensitivity, and Bounds), and all cross-layer references resolve (experiment IDs in claims.md point to valid entries in experiments.md, code references trace to implementations in /src).
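A minimal sketch of such a check follows, assuming claims and heuristics have already been parsed into dictionaries. The required field names come from the text above; the parsed representation, the `E01`-style reference pattern, and the diagnostic wording are assumptions for illustration.

```python
import re
from typing import Dict, List, Set

CLAIM_FIELDS = ("Statement", "Status", "Falsification criteria", "Proof")
HEURISTIC_FIELDS = ("Rationale", "Sensitivity", "Bounds")


def check_structure(
    claims: Dict[str, Dict[str, str]],
    heuristics: Dict[str, Dict[str, str]],
    experiment_ids: Set[str],
) -> List[str]:
    """Return diagnostics; an empty list means the structural check passed."""
    problems = []
    for cid, claim in claims.items():
        for name in CLAIM_FIELDS:
            if not claim.get(name):
                problems.append(f"{cid}: missing {name}")
        # Assumed convention: experiment references look like E01, E02, ...
        for eid in re.findall(r"\bE\d+\b", claim.get("Proof", "")):
            if eid not in experiment_ids:
                problems.append(f"{cid}: unresolved reference {eid}")
    for hid, heuristic in heuristics.items():
        for name in HEURISTIC_FIELDS:
            if not heuristic.get(name):
                problems.append(f"{hid}: missing {name}")
    return problems


claims = {
    "C01": {"Statement": "s", "Status": "supported",
            "Falsification criteria": "f", "Proof": "E01, E07"},
}
heuristics = {"H01": {"Rationale": "r", "Sensitivity": "low"}}
problems = check_structure(claims, heuristics, experiment_ids={"E01"})
```

Because every check is a lookup or a membership test, the result is deterministic, which is what allows Level 1 to act as a pass/fail gate.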

Level 2 – Argumentative Rigor: without executing any code or consulting external sources, a Rigor Auditor agent evaluates whether the content of a Level-1-valid artifact is epistemically sound along six objective dimensions, each scored on an anchored 1 to 5 rubric. The three load-bearing dimensions are: evidence relevance, checking that every claim’s cited experiments substantively address what the claim asserts under type-aware entailment (causal claims require isolating ablations, generalization claims require heterogeneous test conditions, improvement claims require baseline comparisons); falsifiability quality, checking that criteria are actionable, non-tautological, scope-matched, and independently testable without access to proprietary data; and methodological rigor, covering baseline adequacy, ablation coverage, statistical reporting, and metric–claim alignment. Three further dimensions (scope calibration, argument coherence, and exploration integrity) are defined in Appendix H.2.2. Findings are collected with severity (critical, major, minor, suggestion), verbatim evidence spans, and actionable suggestions; the overall grade is derived from the mean score and per-dimension floors. Every check reduces to a rubric-anchored property of the artifact’s content, so Level 2 remains objective; judgments of significance, novelty, and taste are reserved for human reviewers in Stage 3. The reference Rigor Auditor is released as an agent skill (Anthropic, 2025a) at github.com/AmberLJC/Agent-Native-Research-Artifact; the full protocol, dimension rubrics, and grade thresholds appear in Appendix H.2.2.
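To make "mean score and per-dimension floors" concrete, here is one way such a grade could be computed. The dimension names follow the text, but the threshold values, the grade labels, and the function itself are placeholders; the actual thresholds are in Appendix H.2.2 and are not reproduced here.

```python
from statistics import mean
from typing import Dict

DIMENSIONS = (
    "evidence_relevance", "falsifiability_quality", "methodological_rigor",
    "scope_calibration", "argument_coherence", "exploration_integrity",
)


def overall_grade(scores: Dict[str, int], pass_mean: float = 3.5, floor: int = 2) -> str:
    """Combine 1-5 rubric scores using a mean threshold and a per-dimension floor.

    pass_mean and floor are hypothetical values chosen for this sketch.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    values = [scores[d] for d in DIMENSIONS]
    if any(not 1 <= v <= 5 for v in values):
        raise ValueError("scores must lie in 1..5")
    if min(values) < floor:
        return "fail"  # one weak dimension blocks the artifact regardless of the mean
    return "pass" if mean(values) >= pass_mean else "revise"
```

The floor prevents a strong average from masking a single badly supported dimension, which is the usual reason to combine a mean with per-dimension minimums.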

Level 3 – Execution Reproducibility verifies that the artifact’s central claims reproduce empirically. The system selects claims by criticality (those in the contribution list, those anchoring the most downstream dependencies, or those flagged by the authors) and runs scaled-down directional checks (small data, few epochs, toy configurations) that test whether claimed properties hold qualitatively rather than reproducing exact numbers. The verifying agent is isolated from the artifact’s evidence layer: it receives only the code kernel and algorithm descriptions, never the reported numbers, preventing fabrication by copying expected outcomes. Venues set a compute budget; claims that exceed it are flagged as unverified. Full-scale reproduction (original datasets, full training runs, exact metric recovery) is optional and typically post-acceptance or community-driven; results are appended to the living Seal certificate.
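The selection step can be sketched as a budgeted ranking. The text names the criticality signals but not how they are combined, so the tuple ordering below, the `Claim` fields, and the GPU-hour cost estimates are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Claim:
    claim_id: str
    in_contribution_list: bool
    downstream_dependents: int
    author_flagged: bool
    est_gpu_hours: float  # assumed cost estimate for a scaled-down check


def criticality(c: Claim) -> Tuple[int, int, int]:
    # Assumed ordering: contribution-list membership, then dependents, then author flag.
    return (int(c.in_contribution_list), c.downstream_dependents, int(c.author_flagged))


def plan_checks(claims: List[Claim], budget_gpu_hours: float) -> Tuple[List[str], List[str]]:
    """Return (claims to check, claims flagged unverified) under the compute budget."""
    to_check, unverified = [], []
    remaining = budget_gpu_hours
    for c in sorted(claims, key=criticality, reverse=True):
        if c.est_gpu_hours <= remaining:
            to_check.append(c.claim_id)
            remaining -= c.est_gpu_hours
        else:
            unverified.append(c.claim_id)
    return to_check, unverified


claims = [
    Claim("C01", True, 3, False, 4.0),
    Claim("C02", False, 5, False, 3.0),
    Claim("C03", True, 1, True, 2.0),
]
to_check, unverified = plan_checks(claims, budget_gpu_hours=6.0)
```

In this toy run the two contribution-list claims consume the budget, so the remaining claim is reported as unverified rather than silently skipped.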

Passing the applicable levels issues a Seal Certificate: a signed record of artifact ID, verification level achieved, timestamp, environment hash, and per-claim reproduction outcomes. Downstream agents check this certificate before investing compute, avoiding redundant re-verification.
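A sketch of what such a record and its check might look like is given below. The fields mirror the list above; the class, the canonical-JSON payload, and the use of HMAC-SHA256 are assumptions, since the text does not say which signature scheme is used (a real deployment would more plausibly use public-key signatures).

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class SealCertificate:
    artifact_id: str
    level: int                      # highest Seal level achieved (1-3)
    timestamp: str                  # e.g. an ISO 8601 string
    environment_hash: str
    claim_outcomes: Dict[str, str] = field(default_factory=dict)

    def payload(self) -> bytes:
        # Canonical JSON so the same certificate always signs to the same bytes.
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")


def sign(cert: SealCertificate, key: bytes) -> str:
    # Assumption: HMAC-SHA256 stands in for whatever scheme a venue actually uses.
    return hmac.new(key, cert.payload(), hashlib.sha256).hexdigest()


def is_trusted(cert: SealCertificate, signature: str, key: bytes, min_level: int) -> bool:
    """Check the signature, then whether the attested level is high enough."""
    return hmac.compare_digest(sign(cert, key), signature) and cert.level >= min_level


cert = SealCertificate("ara-0001", 3, "2026-01-01T00:00:00Z", "env-abc123",
                       {"C01": "verified", "C02": "unverified"})
key = b"venue-demo-key"
signature = sign(cert, key)
```

A downstream agent would call something like `is_trusted` before spending compute, and fall back to re-verification only when the check fails.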

5.3 Three-Stage Review Pipeline

Figure 9: Three-stage Ara-native review pipeline. Stages 1–2 invoke the ARA Seal levels of Figure 8 to resolve mechanical and rigor issues before human reviewers engage, redirecting expert attention to novelty and significance.

We envision a three-stage review pipeline (Figure 9) that mirrors CI/CD practices, where each stage gates the next: conceptual verification, empirical verification, and human judgment.

Stage 1: Conceptual Verification (Minutes).

The first stage checks whether the artifact is well-formed and conceptually well-supported without running any experiments. Automated checks validate structural integrity (ARA Seal Level 1: schema conformance, cross-layer reference resolution, required field completeness), and the Rigor Auditor executes Level 2 argumentative rigor by assessing whether major claims are appropriately scoped, whether attached evidence and ablations justify them, and whether obvious baselines, assumptions, or limitations are missing. Level 1 checks are mechanical and deterministic: either every claim in claims.md links to a valid experiment, or it does not; either heuristics carry sensitivity bounds, or they are missing. Level 2 instead produces a structured rigor report keyed to specific claims, experiments, and omissions. The stage also generates advisory diagnostics surfaced to human reviewers in Stage 3: whether the exploration tree contains dead-end nodes (suggesting genuine process documentation versus a sanitized linear chain), whether the related-work graph covers obvious baselines, and whether experiment–claim coverage has gaps. These diagnostics inform but do not gate; only the Seal checks are pass/fail requirements. The output is a CI Report plus a Level 2 rigor report, both attached to the submission and visible to authors and reviewers. Authors iterate on structural failures and rigor critiques before the artifact advances, analogous to fixing lint errors and design issues before code review.

Stage 2: Empirical Verification (Hours–Days).

Once the artifact passes conceptual review, the second stage tests whether the claimed results actually reproduce. An AI reviewer agent executes ARA Seal Level 3: it ranks the artifact’s claims by criticality (prioritizing those in the contribution list and those with the most downstream dependents in the claim graph) and then runs scaled-down directional checks (small data, few epochs, toy configurations) within a venue-specified compute budget. The agent is isolated from the artifact’s evidence layer and ground-truth results: it receives only the code kernel and algorithm descriptions, never the paper’s reported numbers, preventing fabrication by copying expected outcomes. For each tested claim, it reports whether the claimed direction holds (e.g., “method A outperforms baseline B on metric M”), wall-clock time consumed, and any divergence from expected outcomes. Claims that exceed the remaining budget are flagged as unverified with an estimate of the compute required. Beyond reproduction, the agent assesses experimental comprehensiveness: Are ablations present for each design choice? Do the experimental conditions cover the claimed generality, or are results cherry-picked from favorable settings? Are there undocumented heuristics in the code that do not appear in the artifact’s cognitive layer? The output is an Empirical Review Report that records which claims were verified, which failed, which were deferred due to budget, and what experimental gaps were identified.

Stage 3: Human Review (Days–Weeks).

Human reviewers receive the submission alongside the CI Report and Empirical Review Report, and their role shifts from verification (already handled by Stages 1–2) to judgment. They focus on the questions that no machine can answer: Is this contribution significant, that is, does it address a real problem that matters? Is the core insight genuinely novel, or an incremental recombination of known ideas? Is this the right formulation of the problem, and are there better approaches the authors should have considered? What are the ethical implications and potential for misuse? Where the AI reviewer’s findings are contested by the authors, humans adjudicate with the full audit trail available for inspection. Human reviewers write structured reviews in the same typed format, with findings linked to specific Ara components, ensuring that all feedback is actionable and traceable. The key efficiency gain: humans no longer spend time on “your code doesn’t run” or “Table 3 contradicts Claim 2”; these are resolved before the artifact reaches them.

6 The (Human+AI)² Research Network

Figure 10: The (Human+AI)² research network. Each researcher works through a research agent that interfaces with a shared Ara network via /submit, /retrieve, and /fork; agents may also collaborate directly.

The previous sections describe the components of an Ara-native stack: the protocol (§2), the Live Research Manager that captures work into the artifact during development (§3), the Compiler that imports the legacy literature (§4), and the Seal-gated review pipeline that verifies it (§5). Composed, they form a scientific communication system whose primary object is no longer the static document but the Ara itself: a single canonical artifact that humans on each end direct agents to author, certify, render, and extend. We call the resulting structure a (Human+AI)² network (Figure 10).

On the producer side, researchers no longer work toward papers; they pursue questions, and the paper-as-output accrues automatically as an Ara along the way: the Live Research Manager folds each decision, dead end, and confirmed claim into the artifact during ordinary work, the Compiler imports legacy sources on demand, and at any milestone the artifact is routed through the Seal pipeline and registered publicly, where another team can fork a passing artifact, extend a claim, retain attribution to the parent, and submit the diff for re-review. On the consumer side, because the Ara is canonical, an agent renders it on demand into whatever surface the reader needs (paper, video, slides, interactive demo, or grounded dialogue), shaped by the reader’s expertise, attention budget, and intent.

With both ends agent-mediated and the artifact the only persistent state, contributions compound at the level of artifacts rather than sentences: publishing becomes a Git-like operation, reviewers consume Seal-attested artifacts through their preferred surfaces, and downstream agents read Aras as structured baselines, training environments, or starting points for new questions. The result is a queryable scientific commons in which every contribution is an executable diff, and the cost of understanding, reproduction, and extension falls with each new artifact admitted rather than rising against it. §7 asks whether today’s agents can already operate this substrate; §9 maps near-, medium-, and long-term extensions.

7 Evaluation

Table 2: Benchmark characteristics. PaperBench supplies configuration depth via expert rubrics; RE-Bench supplies trajectory depth via MALT failure traces. These are the two source-side enrichments Ara is built to exploit.

We evaluate Ara across three layers of increasing research ambition: understanding (can an agent extract knowledge from the artifact?), reproduction (can an agent execute research from the artifact?), and extension (can an agent build on prior work more efficiently with failure knowledge?). Each layer isolates a distinct advantage of structured artifacts over the PDF-centric status quo. We first describe the evaluation corpus and how each Ara is built from it (§7.1), then present experiments that test Ara at each layer (§7.2–7.4). Full implementation details for all experiments are in Appendix D.

7.1 Datasets and Motivation

A single comparison underlies all three evaluation layers: holding the agent, the task, and the ground truth fixed, does an Ara compiled from a research project’s available source materials outperform the conventional artifact most readers receive, namely a paper PDF plus a companion code repository when one exists?

Two systematic gaps in the conventional artifact.

The comparison is interesting precisely because the conventional PDF+repo loses two kinds of content an Ara can preserve. First, configuration depth: published papers omit hyperparameters, environment specs, and implementation details that experts deem necessary for reproduction. Across PaperBench’s 8,921 expert reproduction requirements, only 45.4% are fully specified in the source PDF (Fig. 3; full taxonomy in App. E.2). Second, exploration depth: most of the compute that produces a finished result never reaches the artifact. Across 24,008 RE-Bench agent runs, 90.2% of dollar cost is spent on failed exploration that the published artifact discards entirely (App. E.3), forcing each subsequent agent to rediscover the same dead ends. We use two benchmarks chosen so that each one supplies, as a first-class source, exactly what the conventional artifact lacks: PaperBench expert rubrics close the configuration gap; RE-Bench MALT trajectories close the exploration gap (Table 2). Both benchmarks happen to be drawn from computer science and machine learning, but Ara’s design extends to any digitalised research whose contribution can be expressed as code, configurations, and grounded data, including computational social science, economics, and dry-lab biology. Research whose contribution is a physical-world intervention (wet-lab biology, materials synthesis) falls outside this scope until the underlying experimental record is itself digitalised.

Two representations of the same research.

For each paper or task we construct two representations of the same underlying research and hold everything else fixed. The conventional baseline approximates what most readers receive: for PaperBench it is the published paper PDF and, when available (15 of 23 papers), the original GitHub repository; for RE-Bench, where tasks have no published paper, it is an LLM-synthesised polished paper-style writeup of the official reference solution paired with the official source, the closest analogue to PDF+repo for paperless tasks. The Ara is compiled from the same source bundle (PDF + repo + expert rubric for PaperBench; official solution + source + MALT trajectories for RE-Bench) via the Ara Compiler (§4), gated by the ARA Seal Level 1 structural-integrity loop (§5.2); the generate–validate–fix iteration converges in 1–3 passes in practice. For the extension experiment (§7.4), RE-Bench Aras are additionally produced through a specialised pipeline (App. G.2) that adds per-MALT-run extraction sub-agents and a beat-reference fairness filter; the output conforms to the same Ara schema.

7.2 Knowledge Extraction from Ara

The first layer measures what an agent can extract from each format: does Ara preserve the information present in the source PDF and repo, and does its structure surface knowledge that the source leaves scattered or implicit? A format that loses information during compilation cannot improve reproduction or extension, so this is a precondition for every downstream benefit. The information gap quantified in §1 (Figure 3) shows that structure can recover details PDFs under-specify; this layer tests whether the Compiler actually achieves that promise.

Setup.

We probe each format with 450 questions: 15 per evaluation target across 30 targets (23 PaperBench papers and 7 RE-Bench tasks), in three categories. Category A (10 per target, both benchmarks) tests fidelity on surface results, methods, and conditions answerable from the PDF. Category B (5 per PaperBench paper) tests configuration recovery on hyperparameters, environment specs, and preprocessing details. Category C (5 per RE-Bench task) tests failure knowledge on dead ends, alternatives considered, and lessons. To avoid source bias in question selection, we generate two pools per target independently (one seeded by reading the PDF, one by reading the Ara), then merge and deduplicate them so that neither format dictates which questions get asked (full templates and the gold-reference rubric in Appendix E.1). Each (target, format, question) triple is dispatched as an independent Claude Sonnet 4.6 sub-agent with a fresh context, and each answer is graded ternary (1.0 / 0.5 / 0.0) by a Claude Opus 4.6 judge against a gold reference.
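The question budget and the ternary grading can be checked with a few lines of arithmetic. The sketch below assumes accuracy is the plain mean of the ternary grades within a category, which the text implies but does not state; the function name and input shape are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

VALID_GRADES = {0.0, 0.5, 1.0}

# Question budget from the setup: A on all 30 targets, B on 23 papers, C on 7 tasks.
n_questions = 10 * 30 + 5 * 23 + 5 * 7


def accuracy_by_category(graded: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average ternary grades per category (assumed aggregation: plain mean)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for category, grade in graded:
        if grade not in VALID_GRADES:
            raise ValueError(f"grade must be one of {sorted(VALID_GRADES)}, got {grade}")
        totals[category] += grade
        counts[category] += 1
    return {c: totals[c] / counts[c] for c in counts}


example = accuracy_by_category([("A", 1.0), ("A", 0.5), ("C", 0.0)])
```

The per-category counts (300, 115, and 35) sum to the 450 questions stated above.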

Table 3:Understanding evaluation: accuracy and per-question token usage across 450 paired outcomes. ARA wins at every category and every benchmark; the per-category mechanism is unpacked in AppendixE.4.

Results.

Ara outperforms the baseline at every category and every benchmark, with overall accuracy 93.7% vs. 72.4% (+21.3%) on 450 paired outcomes. The aggregate gap decomposes along three distinct mechanisms, each isolated by one category. On Category A, where the answer is recoverable from the PDF, Ara wins +14.8% while consuming 12% fewer tokens: PAPER.md’s layer index turns linear document scanning into targeted file lookup. On Category B, where the rubric supplies configuration details papers systematically omit, Ara wins +24.8% at comparable token usage; the baseline can mine the companion repo, but Ara’s src/configs/ centralises the same knowledge in a single agent-friendly file. On Category C, where the answer lives only in the MALT trajectories, Ara wins +65.7%; the baseline has no source for failure knowledge and abandons most queries with short empty answers (58K vs. 139K tokens).

Token usage scales with question depth on Ara, not on the baseline.

On Ara the agent spends 61K tokens on explicit questions, 96K on scattered ones, and 153K on implicit-failure ones; baseline token usage stays roughly flat across the same tiers (83K to 118K), since linear PDF/repo scanning costs the same regardless of how deep the answer is. Per-category mechanism analysis and the full difficulty stratification are in App. E.

7.3 Reproduction from Ara

The second layer tests whether structured artifacts translate understanding into action: can an agent reproduce experimental results more effectively from an Ara than from the traditional PDF+GitHub combination? Layer 1 measures what an agent knows; Layer 2 measures what it can do. The information gap from §1 (Fig. 3; 54.6% of expert reproduction requirements partially or entirely absent from PDFs) names the concrete execution blockers Ara’s structured layers are designed to remove.

Task curation.

We select 15 PaperBench papers with companion GitHub repositories and curate 10 reproduction tasks per paper (150 tasks total, 1,743 rubric requirements), stratified by difficulty (50 easy, 49 medium, 51 hard). Tasks describe what to reproduce, not how; within each paper, subtasks are ordered by difficulty so the agent builds cumulatively (task design details in Appendix F.1).

Protocol.

For each paper, two coding agents receive the same mega-task but different source materials:

• ARA agent: the Ara artifact only (PAPER.md, logic/, src/, evidence/). No PDF or repository access.
• Baseline agent: the paper PDF and companion GitHub repository.

Both agents use Claude Sonnet 4.6 with the same system prompt (differing only in source material paths) and per-paper token budgets of 14–20M tokens scaled by task complexity (cache reads discounted to 10%). Expected numerical results are masked in agent prompts (replaced with [X]%) to prevent parroting. After each run, a blinded Claude Opus 4.6 judge evaluates every rubric requirement as yes (satisfied), partial (partially addressed), or no (not met), without knowing which condition produced the output.
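The masking step can be illustrated with a short regular-expression pass. This is a sketch under the assumption that only percentage-style numbers are masked; the pattern and the function name are not taken from the paper's harness.

```python
import re

# Matches an optionally signed number followed by a percent sign, e.g. "64.4%" or "+8.5 %".
_PERCENT = re.compile(r"[+-]?\d+(?:\.\d+)?\s*%")


def mask_percentages(text: str) -> str:
    """Replace every percentage in a task prompt with the placeholder [X]%."""
    return _PERCENT.sub("[X]%", text)


masked = mask_percentages("Reach 64.4% accuracy, up from 57.4 %.")
```

Because the placeholder contains no digits, applying the mask twice leaves the text unchanged.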

Scoring.

The primary metric is the difficulty-weighted success rate, weighting easy, medium, and hard subtasks at 1:2:3 to emphasize harder tasks where structured information provides the most leverage (scoring formula, statistical tests, and per-difficulty breakdowns in Appendix F.1).
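One plausible reading of this metric is a weighted mean over subtasks, sketched below. The exact formula is in Appendix F.1 and may differ; the per-subtask success value in [0, 1] and the normalisation by total weight are assumptions.

```python
from typing import Iterable, Tuple

WEIGHTS = {"easy": 1, "medium": 2, "hard": 3}  # the 1:2:3 weighting from the text


def weighted_success_rate(subtasks: Iterable[Tuple[str, float]]) -> float:
    """Weighted mean of per-subtask success values (assumed to lie in [0, 1])."""
    numerator = denominator = 0.0
    for difficulty, success in subtasks:
        weight = WEIGHTS[difficulty]
        numerator += weight * success
        denominator += weight
    if denominator == 0:
        raise ValueError("no subtasks given")
    return numerator / denominator


rate = weighted_success_rate([("easy", 1.0), ("medium", 0.5), ("hard", 0.0)])
```

In this toy case the easy success contributes 1, the half-solved medium task contributes 1, and the failed hard task contributes 0, out of a total weight of 6.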

Results.

Across all 15 papers with complete paired runs (150 subtasks, 1,743 rubric requirements), Ara achieves a difficulty-weighted success rate of 64.4% vs. 57.4% for the baseline; the win/tie/loss breakdown across papers is 8/5/2 (per-paper numbers in Appendix F.2). Figure 11 stratifies this aggregate by difficulty; the per-paper breakdown is deferred to Appendix F.2 (Figure 13).

Figure 11: Aggregate reproduction success rates across all 15 papers, stratified by difficulty. The Ara advantage widens monotonically with difficulty (+4.9% easy, +5.6% medium, +8.5% hard), tracking the tiers where reproduction depends most heavily on configuration content the PDF underspecifies.

Analysis.

The Ara advantage grows with difficulty: easy subtasks (environment setup, model instantiation) are near-ceiling for both formats, and the gap opens on medium and hard subtasks where reproduction depends on configuration content the published PDF rarely supplies. The same pattern holds paper by paper (App. F.2): the largest Ara advantages (fre, mechanistic-understanding, pinn) are on papers with multi-stage training pipelines whose hyperparameter interactions PDFs describe only at a high level, and the gain concentrates in their medium and hard columns. For example, the fre Ara agent reimplemented the original JAX codebase in PyTorch (1.8 GB GPU vs. JAX’s 30.8 GB), trained 17 models across three domains, and completed all medium and hard subtasks; the baseline agent fought the JAX environment and completed only 3 training attempts before its budget ran out. The one clear baseline win is self-expansion, where the Ara agent fabricated results that the blinded judge caught; narrow ties (adaptive-pruning, rice) occur on papers whose companion repositories partially compensate for the PDF gap. Across all 15 papers, fabrication occurred in 2 baseline runs and 1 Ara run: structured artifacts reduce but do not eliminate hallucinated results.

7.4 Extension from Ara

The third layer tests Ara’s most ambitious claim: that preserving the failure trajectory of prior research lets the next agent extend it more effectively. Section 7.1 documented that 59.2% of agent tokens (and 90.2% of dollar cost) on RE-Bench are spent on dead-end exploration that the published artifact discards. This layer asks whether handing the next agent a structured record of what was already tried and abandoned closes that gap in measurable downstream gains.

Why RE-Bench, and which tasks.

Testing the claim demands tasks that admit open-ended improvement and ship a corpus of real agent failure traces, not author-imagined ones. RE-Bench (Wijk et al., 2025) satisfies both: each of its 7 tasks has a continuously valued automated scorer, and the METR MALT transcripts contain thousands of complete agent runs (with their dead ends) per task. We use 5 of the 7 tasks: triton_cumsum, restricted_mlm, fix_embedding, nanogpt_chat_rl, and rust_codecontests. The other two are excluded because their MALT corpora cannot supply a usable failure-trace layer: optimize_llm_foundry ships no MALT corpus at all, and small_scaling_law’s MALT runs are sparse, lack Claude-4 entries, and consist mostly of trivial parameter sweeps with no recorded strategic exploration; either case would leave trace/ effectively empty by construction (Appendix G.1).

Ara construction (specialised RE-Bench pipeline).

RE-Bench inputs differ from a typical paper + repo: the source is an official reference solution plus thousands of MALT JSONL transcripts, and the failure-record layer dominates artifact value here. We therefore wrap the standard Ara compiler (§4) in a RE-Bench-specific pipeline (Appendix G.2) that (i) lifts the official solution into src/ and the reference-derived knowledge into logic/ via the standard compiler, then (ii) fans out one extraction sub-agent per MALT run to populate trace/exploration_tree.yaml and evidence/tables/malt_attempts.md with the dead ends and partial successes those runs produced. A direction-aware beat-reference filter excludes any MALT scoring attempt that exceeded the reference, applied per attempt rather than per run. This is the experiment’s central fairness rule: comparing Ara against a polished paper writeup is only meaningful if neither side carries a worked-out beating-reference solution that the agent could simply transcribe. Every retained node also carries a source: official-solution or source: MALT run_id={id} provenance tag.
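A direction-aware, per-attempt filter of this kind could look like the sketch below. The attempt records, their `run_id` and `score` keys, and the treatment of ties as not exceeding the reference are assumptions for illustration, not details of the pipeline in Appendix G.2.

```python
from typing import Dict, List


def beats_reference(score: float, reference: float, higher_is_better: bool) -> bool:
    """Direction-aware comparison: 'better' means higher or lower depending on the metric."""
    return score > reference if higher_is_better else score < reference


def filter_attempts(attempts: List[Dict], reference: float, higher_is_better: bool) -> List[Dict]:
    """Drop individual attempts that beat the reference; other attempts from the same run stay."""
    return [a for a in attempts
            if not beats_reference(a["score"], reference, higher_is_better)]


# Toy data for a lower-is-better metric with reference 10.0.
attempts = [
    {"run_id": "r1", "score": 12.0},
    {"run_id": "r1", "score": 9.0},   # beats the reference, so it is removed
    {"run_id": "r2", "score": 10.0},  # tie: kept under the assumption above
]
kept = filter_attempts(attempts, reference=10.0, higher_is_better=False)
```

Filtering per attempt keeps the dead ends of run `r1` in the trace even though one of its attempts is excluded.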

Protocol.

Both agents start from an identical workdir: the official reference solution.py, a score.sh wrapper around the canonical RE-Bench scorer, training-data symlinks where applicable, and a reference/ directory whose contents are the only independent variable.

• Paper agent: reads reference/paper.md, an LLM-synthesised academic-style writeup of the official solution (abstract, method, results, dev notes; the same beat-reference filter is applied so paper.md never contains a worked-out beating-reference variant), plus the official src/ tree. This emulates the artifact a conventional published paper would supply (Appendix G.3).
• Ara agent: reads the full Ara (PAPER.md, logic/, src/, trace/, evidence/); the src/ and reference-derived logic/ content match the paper agent’s bundle, and the trace/ and evidence/ layers carry the failure record that the paper bundle has no analogue for.

Both agents are instructed to beat the reference score by editingsolution.pyand runningbash score.sh; the result is the best score across all invocations during the run. We use the Claude Agent SDK(Anthropic,2025b)with a tool surface of {Bash,Read,Edit,Write,Glob,Grep} and a 8 h SLURM wall clock + $50 API-spend cap per run. All five tasks run on Claude Sonnet 4.6; fortriton_cumsumandrestricted_mlmwe additionally ran the same comparison on the older Sonnet 4.5 base (AppendixG.6). Harness engineering, score-event extraction, and reproducibility details are in AppendixG.

Results.

Figure 12 reports the best-so-far envelope and the underlying scoring attempts per agent on each task, against wall-clock time and API spend. On rust_codecontests, nanogpt_chat_rl, and fix_embedding the Ara agent ends with the better best score; on triton_cumsum and restricted_mlm under Sonnet 4.6 the paper agent ends ahead. The trajectories surface three phenomena described below; per-task case studies and trace evidence (file reads, ThinkingBlock reasoning, edit history) are in Appendix G.6.

Figure 12: Extension trajectories on five RE-Bench tasks under Claude Sonnet 4.6. One task per column: top row is score-vs-wall-clock-time, bottom row is score-vs-cumulative-cost; the y-axis is shared down each column. Faint markers are individual scoring attempts, solid lines are the best-so-far envelope, and stars mark the best attempt; dotted grey lines mark each task’s RE-Bench reference. Arrows in the column titles indicate metric direction. Across all five tasks the Ara agent reaches a useful first move earlier than the paper agent, and ends with the better best score on rust_codecontests, nanogpt_chat_rl, and fix_embedding; on triton_cumsum and restricted_mlm the paper agent later overtakes via moves the trace does not name (an int8 kernel redesign and focused depth on a single architecture, respectively). §7.4 unpacks why the early Ara lead does not hold on these two tasks and how the same comparison on the older Sonnet 4.5 base inverts the result. Per-task case studies and Sonnet 4.5 trajectories for triton_cumsum and restricted_mlm are in Appendix G.6.
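As a rough illustration of the best-so-far envelope plotted in Figure 12, the sketch below computes a running best over a sequence of scoring attempts. The function name and the `higher_is_better` flag are illustrative assumptions, not the paper's score-event extraction code (Appendix G).

```python
def best_so_far(scores, higher_is_better=True):
    """Return the running best score after each attempt.

    `higher_is_better` encodes the metric direction of the task.
    """
    envelope = []
    best = None
    for score in scores:
        improved = best is None or (
            score > best if higher_is_better else score < best
        )
        if improved:
            best = score
        envelope.append(best)
    return envelope


# Toy attempt sequences, one per metric direction.
print(best_so_far([0.2, 0.5, 0.4, 0.7]))
print(best_so_far([3.0, 2.0, 2.5, 1.0], higher_is_better=False))
```

The last element of the envelope is the run's reported result, since the result is defined as the best score across all invocations.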

Early acceleration across all five tasks.

The Ara agent reaches a useful first move earlier on every task, including the two it eventually trails on. The clearest case is rust_codecontests: it commits to a hand-coded Rust library at t=9 min after reading heuristic H12, while the paper agent spends six hours on prompt-engineering variants and only at t=395 min, while inspecting the workdir, notices the same idea and starts populating the existing few_shots/ cache. On Sonnet 4.6 triton_cumsum, the Ara agent scores 0.47 at t=11 min using trace-derived ideas (decoupled lookback, associative_scan), while the paper agent does not score until t=37 min and reasons from first principles. The same head start appears on nanogpt_chat_rl (heuristic H08 pre-names a degenerate-output filter the paper agent has to discover by debugging) and on fix_embedding (heuristics H11/H13 mark permutation recovery as a documented dead end, but the paper agent tries it at t=19 min, abandons it on visible failure, then re-tries it at t=350 min having forgotten its own earlier failure). Across all five tasks the recorded heuristics and prior failures shorten the path to a first useful move; the question is whether that lead carries through to the final score.

Late-phase reversal on Sonnet 4.6 triton and mlm.

On these two tasks the early lead does not hold, and the reversal is itself informative. On triton_cumsum, the paper agent, with no menu to commit to, keeps redesigning the kernel and at t=47.7 min introduces an int8 input compression motivated by the scorer’s 8-bit input range, a move the trace never names; it iterates from there and finishes ahead. The Ara agent meanwhile sticks with the trace-recommended decoupled-lookback design. On restricted_mlm, the paper agent commits to a single ConvMLMDilated tune for the full 8 h, while the Ara agent implements every heuristic-named alternative architecture (H11 ReLU-attention, H07 MLPMixer, etc.), finds that none beats the simpler ConvMLM under Sonnet 4.6’s optimisation, and ends behind. In both cases the Ara agent followed the trace faithfully; the trace simply was not the most creative option available to that model.

A weaker base inverts the comparison.

For both tasks, paired Sonnet 4.5 trajectories show the opposite pattern (Appendix Figs. 14, 15): the Ara agent reaches 0.27 on triton_cumsum vs. 0.64 for the paper agent, and 0.73 on restricted_mlm vs. 1.03. A weaker base lacks the bandwidth to invent moves like the int8 compression or to commit deeply to a single architecture, and the same heuristics that constrained Sonnet 4.6 give Sonnet 4.5 a productive ranked list of strategies to try. The artifact’s value appears to scale with the gap between what the trace documents and what the agent can discover on its own.

Outlook.

Taken together, the trajectories suggest that an Ara can aid human-agent and agent-agent communication by surfacing prior pitfalls and successful strategies, but that selectively hiding or contextualising parts of the trace may matter when the agent’s own bandwidth exceeds what the documented playbook records. Marking trace nodes with model-class provenance, so successors can discount claims that no longer apply, is one such mechanism; we leave the broader design space to follow-up work.

7.5 Ara-Native Review Systems

Sections 7.2–7.4 evaluate Ara as a format; this subsection evaluates the review machinery the format enables. We test whether the three-level Seal protocol (§5) actually detects deficient artifacts on its three automated levels: structural integrity (Level 1), argumentative rigor (Level 2, the Rigor Auditor), and execution reproducibility (Level 3). The Stage 3 substantive-judgment layer, where human reviewers assess significance and novelty, has no automated oracle, and we do not attempt to measure it.

Level 1 (structural integrity).

Level 1 checks schema conformance, required-field presence, and cross-layer reference resolution. The Understanding experiment already shows the gate works in practice: every Ara in this paper passes Level 1 before use, and the 95.6% Cat. A accuracy (Table 3) means the gated artifacts are structurally complete enough for an agent to retrieve what the source actually contains (per-artifact iteration counts and failure-type distributions in App. H.2.1).

Level 2 (Rigor Auditor): mutation benchmark.

We benchmark the Rigor Auditor by injecting known errors into Aras that already pass Level 1 and measuring the auditor’s detection rate against the ground-truth injection manifest. Each injection is its own oracle, removing the need for human annotation. Concretely, we take the 23 PaperBench Aras and seed each with five injections:

• Fabricated claim: a claim that cites a non-existent experiment.
• Missing falsification: a primary claim with its Falsification criteria field removed.
• Orphan experiment: an experiment whose Verifies field points to a non-existent claim.
• Over-claim: a narrow result whose Statement is broadened to universal scope.
• Rebutted-branch leak: a claim advocating an approach documented as dead_end in the exploration tree.

The Rigor Auditor (a Claude Code SDK (Anthropic, 2025b) agent that reads the artifact, builds cross-layer maps, scores six dimensions, and compiles a findings list; full protocol in App. H.2.2) is then run on each injected Ara, and each finding is matched back to the manifest via its target entity.
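The matching step can be sketched as follows: a minimal illustration of per-type detection rates, assuming the manifest is a list of (artifact id, injection type, target entity) tuples and the auditor's findings reduce to a set of (artifact id, target entity) pairs. These layouts and all identifiers in the toy data are hypothetical, not the paper's file formats.

```python
from collections import defaultdict


def detection_rates(manifest, findings):
    """Compute the fraction of injections detected, per injection type.

    manifest: iterable of (ara_id, injection_type, target_entity)
    findings: set of (ara_id, target_entity) pairs flagged by the auditor
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ara_id, injection_type, target in manifest:
        totals[injection_type] += 1
        if (ara_id, target) in findings:
            hits[injection_type] += 1
    return {kind: hits[kind] / totals[kind] for kind in totals}


# Toy manifest: two artifacts, two injection types each.
manifest = [
    ("ara-1", "fabricated_claim", "C7"),
    ("ara-2", "fabricated_claim", "C3"),
    ("ara-1", "orphan_experiment", "E4"),
    ("ara-2", "orphan_experiment", "E9"),
]
findings = {("ara-1", "C7"), ("ara-2", "C3"), ("ara-1", "E4")}
print(detection_rates(manifest, findings))
```

Because every injection is recorded in the manifest, the rate needs no human labels: a finding either points at an injected entity or it does not.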

High recall on substantive injections, a blind spot on orphans.

Table 4 reports per-type detection. The auditor catches 100% of three high-severity classes (fabricated claims, rebutted-branch leaks, over-claims) and 91% of missing falsifications, but only 22% of orphan experiments. The asymmetry is interpretable: orphans require enumerating every experiment and cross-checking its Verifies target against the claim list, whereas the other four surface naturally inside the auditor’s per-claim loop. The natural fix is to move orphan detection into Level 1 as a deterministic structural check.

Table 4: Rigor Auditor effectiveness on the mutation benchmark (23 Aras × 5 injection types). The auditor catches all high-severity structural anomalies but exhibits a systematic blind spot on orphan experiments.
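A deterministic orphan check of the kind suggested for Level 1 could look like the sketch below. The data layout (a list of claim ids, plus a mapping from experiment id to the claim ids in its Verifies field) and the function name are assumptions for illustration, not the Ara schema itself.

```python
def find_orphan_experiments(claim_ids, experiments):
    """Return experiments whose Verifies targets do not resolve to a claim.

    claim_ids:   iterable of known claim identifiers
    experiments: mapping experiment id -> list of claim ids it verifies
    Result maps each orphan experiment id to its unresolved targets.
    """
    known = set(claim_ids)
    orphans = {}
    for experiment_id, verifies in experiments.items():
        unresolved = [claim for claim in verifies if claim not in known]
        if unresolved:
            orphans[experiment_id] = unresolved
    return orphans


# Toy artifact: E2 points at a claim that does not exist.
claims = ["C1", "C2"]
experiments = {"E1": ["C1"], "E2": ["C9"], "E3": ["C1", "C2"]}
print(find_orphan_experiments(claims, experiments))
```

The check enumerates every experiment exactly once, which is the exhaustive pass the per-claim audit loop does not perform.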

Two LLM-as-judge pathologies in the auditor’s scoring.

Two scoring-side biases emerge. First, grade inflation: in 17 of 23 Aras the auditor’s reported overall mean is rounded up just enough to clear the Accept threshold. Second, finding–score decoupling: even when the auditor correctly flags an injection as critical (22 of 23 rebutted-branch-leak cases), the corresponding dimension score does not drop to the level the rubric prescribes. Both are documented LLM-as-judge failure modes (Zheng et al., 2023), and together they suggest LLMs should generate findings rather than grades, with the overall verdict computed deterministically from the findings list. Dimension distributions, the grade-inflation breakdown, and a per-paper detection heatmap are in App. H.2.2.
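One way to compute a verdict deterministically from a findings list is sketched below. Only the critical severity label appears in the text above; the other labels, the penalty weights, the score scale, and the threshold are hypothetical placeholders rather than the paper's rubric.

```python
# Hypothetical rubric constants, chosen only to make the example concrete.
SEVERITY_PENALTY = {"critical": 2.0, "major": 1.0, "minor": 0.25}
MAX_SCORE = 5.0
ACCEPT_THRESHOLD = 4.0


def verdict_from_findings(severities):
    """Map a list of finding severities to (score, verdict).

    The score is a pure function of the findings, so it cannot be
    rounded up independently of what the auditor actually reported.
    """
    penalty = sum(SEVERITY_PENALTY[s] for s in severities)
    score = max(0.0, MAX_SCORE - penalty)
    verdict = "accept" if score >= ACCEPT_THRESHOLD else "revise"
    return score, verdict


print(verdict_from_findings([]))
print(verdict_from_findings(["minor", "minor"]))
print(verdict_from_findings(["critical"]))
```

Under this scheme a single critical finding always lowers the score by a fixed amount, which rules out both the inflation and the decoupling patterns by construction.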

Level 3 (execution reproducibility).

Level 3’s specification in §5.2 (read the artifact, reproduce central claims directionally under a compute budget, with numerical results masked) is exactly the Ara-condition protocol of the Reproduction experiment (§7.3). The 64.4% difficulty-weighted success rate measured there is therefore the Level-3 verification rate on well-formed artifacts, and the single result-fabrication that the blinded judge surfaced shows the verifier flags misrepresented results when they appear.

8 Related Work

Our work synthesizes ideas from three research threads (machine-readable science, reproducibility infrastructure, and agent-oriented tooling) and contributes a unified protocol that none of them provide individually.

The dimensional gap of existing tools.

A natural objection is that Ara merely combines documentation, version control, and experiment tracking, three categories of tools researchers already use. Table 5 shows why even using PDFs, GitHub, and trackers (MLflow (Zaharia et al., 2018), Weights & Biases (Biewald, 2020)) simultaneously leaves the knowledge siloed in three unlinked formats with no cross-referencing between claims, the code that tests them, the evidence produced, and the decisions that selected it. Ara closes this gap not by replacing these tools but by providing the missing structural layer: an executable epistemic graph whose cross-layer bindings make these connections explicit and machine-traversable.

Table 5: Dimensional coverage of existing research artifacts. Each row is a requirement for agent-native research (§1). Existing tools cover at most two dimensions structurally; Ara covers all five with explicit cross-layer bindings. ✓ = full, ∼ = partial (present but unstructured or scattered), × = absent.

Machine-readable research artifacts.

A growing line of work argues that scientific knowledge should be authored in machine-readable form during the research process rather than recovered post-hoc: the FAIR principles (Wilkinson et al., 2016) standardize data metadata, the W3C PROV ontology (Lebo et al., 2013) formalizes provenance for scientific outputs, Canini (2026) reframes the paper as a “compression format for human readers” that should yield to structured knowledge objects, and Stocker et al. (2025) and Booeshaghi et al. (2026) advocate authoring-time machine readability as a first principle. Concrete formats instantiate parts of this vision: nanopublications (Groth et al., 2010) atomize claims with provenance, the Open Research Knowledge Graph (Jaradeh et al., 2019) curates structured contributions across papers, RO-Crate (Soiland-Reyes et al., 2022) bundles research objects, Whole Tale (Brinckman et al., 2019) packages computational environments, and the Discovery Engine (Baulin et al., 2025) distills publications into a Conceptual Tensor. None of them, however, provides execution semantics or captures decision history. Unlike these formats, Ara jointly binds scientific logic, minimal executable code, and decision history into a single protocol with machine-verifiable reproducibility.

Reproducibility infrastructure.

The reproducibility crisis in ML (Baker, 2016; Pineau et al., 2021) has motivated code-sharing standards (Stodden et al., 2016), scientific workflow engines (Köster and Rahmann, 2012; Di Tommaso et al., 2017; Crusoe et al., 2022), and computational notebooks (Knuth, 1984; Rule et al., 2018); yet workflows encode pipelines without claim semantics, notebooks remain documents with hidden state, and recent benchmarks (Starace et al., 2025; Liu et al., 2026; Kon et al., 2025) collectively show that frontier agents cannot recover knowledge PDFs leave implicit: EXP-Bench reports only 0.5% end-to-end experiment success despite 20–35% component accuracy. On the verification side, LLMs detect fewer than 46% of paper–code discrepancies (Baumgärtner and Gurevych, 2026), extending a longer line of scientific claim verification (Wadden et al., 2020) and attribution-based grounding (Gao et al., 2023) and motivating formal auditing criteria across provenance, soundness, claim decomposition, and cryptographic lineage (Rasheed et al., 2026; Huang, 2025; Radanliev et al., 2026). Unlike prior auditing proposals that address a single dimension, Ara’s Seal Certificates operationalize all of them in one enforceable mechanism.

Negative knowledge and failed trajectories.

Recent work shows that failure traces become actionable only once annotated with root-cause structure (Zhu et al., 2025; Zhang et al., 2025), yet raw trajectory dumps (Yang et al., 2024) remain difficult to leverage. Large-scale experiment logs (Pineda Arango et al., 2021; Ying et al., 2019; Gijsbers et al., 2019) retain >99.99% more search history than their corresponding papers report, and process-level studies (Wijk et al., 2025; Yamada et al., 2025) confirm that human experts and agentic scientists both explore extensive dead ends that never surface in the write-up. Unlike raw trajectory archives, Ara’s exploration graph promotes dead ends to first-class dead_end nodes with structured failure modes and claim cross-references, making negative knowledge machine-queryable rather than lost to narrative selection.

Agent-oriented documentation and tooling.

A convergent body of work shows that agents benefit from structured, layered representations (OpenAI, 2025; Vasilopoulos, 2026) over flat corpora (Lo et al., 2020; Priem et al., 2022), and that even the strongest LLMs implement fewer than 40% of novel contributions correctly, with semantic misalignment as the dominant failure mode (Jimenez et al., 2024; Chen et al., 2025; Hua et al., 2025). Recent systems target this gap from three sides: pipelines that convert papers into executable code (Seo et al., 2025) or interactive AI agents (Miao et al., 2025) post-hoc, or recover tacit knowledge through graph analysis and debugging (Li et al., 2026); knowledge-graph approaches that mine background literature for technique–code links (Liu, 2026b; Luo et al., 2025), yielding up to 10.9% PaperBench gains but leaving the target contribution’s decision history and epistemic structure unmodeled; and autonomous research agents that conduct experiments end-to-end (Boiko et al., 2023; M. Bran et al., 2024; Schmidgall et al., 2025; Baek et al., 2025), whose unstructured trajectory logs are themselves discarded once the resulting paper is written. Multi-agent frameworks (Wu et al., 2024), skill-library standards (Wang et al., 2023; Anthropic, 2025a), and artifact-mediated agent coordination (Wang et al., 2026) further show that structured artifacts, not natural-language papers, are the natural unit of exchange for compounding agent capability, a premise our Live Research Manager (§3) instantiates. Unlike post-hoc recovery pipelines and background-knowledge graphs, Ara encodes claims, evidence, heuristics, and their executable bindings at authoring time, eliminating the recovery step entirely.

9 Future Work

Near term: artifact lineage and self-maintaining ecosystems.

The most pressing near-term gap is artifact durability: like code repositories, Aras decay without maintenance as dependencies rot and practices evolve, yet unmaintained artifacts are the community norm. The natural extension is a lineage mechanism in which each Ara declares its parent artifacts and expresses its contribution as a structured diff, reducing both construction cost (authors specify only the delta) and verification cost (reviewers and agents re-check only the new contribution). Lineage also enables self-maintaining ecosystems: agents consuming an Ara detect and repair staleness, update deprecated dependencies, and propagate corrections upstream, so that every act of consumption becomes an act of maintenance.

Medium term: knowledge graph, collaborative discovery, and continuous review.

Aggregated lineages form a queryable scientific knowledge graph that lifts collaboration and review from the document level to the corpus level. Cross-artifact claim alignment turns literature synthesis into subgraph queries, lets reviewer agents verify that reported baselines match what cited Aras recorded, and exposes trajectory conflicts where a method claimed as successful elsewhere was documented as failing. Shared Exploration Graphs also enable collaboration formats impossible in the PDF ecosystem, from parallel continuation of open problems with documented dead ends to fine-grained attribution on live, evolving artifacts. Review evolves in parallel: there is no single accept moment, only a claim-confidence surface that rises with replications and falls with counter-evidence, freeing human expert attention for the judgments only humans make, namely novelty, significance, and taste.

Long term: cross-disciplinary collective memory.

Our evaluation is restricted to machine learning, where Ara’s four-layer structure aligns naturally with the dominant contribution types: algorithms, architectures, and training procedures. Whether this structure generalizes to other disciplines remains an open question. The Cognitive and Evidence Layers are plausibly domain-agnostic, but the Physical Layer and Exploration Graph, both premised on iterable computational experiments, may require substantial adaptation for wet-lab sciences where execution is physical rather than computational. If these adaptations succeed, Ara provides a natural substrate for cross-disciplinary knowledge transfer, where documented failures in one field become actionable knowledge in another via graph traversal rather than literature search in unfamiliar notation.

10 Limitations

Three limitations bound the claims in this paper.

Evaluation scope. Our study covers only machine learning papers, where computational reproducibility and well-defined contribution types make Ara’s four-layer structure a natural fit; whether the protocol generalizes to experimental sciences with physical execution requirements, or to theoretical disciplines where the Physical Layer is largely absent, remains empirically untested. Extending the Physical Layer to formal or proof-based results requiring machine-checkable specifications is a natural direction for future work. Our human-annotated benchmark was also constructed by annotators familiar with both the Ara format and the selected papers, so performance on unfamiliar or niche-domain artifacts may differ from the reported figures.

Fidelity ceiling. Ara fidelity is bounded by the source of supervision. The Compiler faithfully represents only what the PDF contains (§4): when a paper omits experimental details, environment specifications, or ablation results, no extraction method can recover them. The Live Research Manager closes this gap by recording trajectories as research unfolds, but assumes an AI-native workflow in which a coding agent is present throughout the project. For researchers outside such sessions, the Compiler still produces a valid Ara from the finished paper, but the resulting artifact inherits the PDF’s omissions; hand-authoring structured fields remains possible but reintroduces the documentation burden the protocol aims to eliminate. Closing this adoption gap tracks the broader diffusion of agent workflows in research practice, not the protocol itself.

Deployment prerequisites. Two properties required for production use are not yet implemented. The adversarial robustness and privacy guarantees raised in §5.2 are aspirational: the current system lacks sandboxed execution, content-level anomaly detection, and granular access control for the Exploration Graph. Separately, any long-lived format faces schema evolution: as research practice changes, the Ara schema will need to add node types, refine field semantics, and deprecate conventions without breaking prior artifacts. We version PAPER.md frontmatter with an explicit ara_schema tag and require all validators to accept unknown fields (forward compatibility) and degrade gracefully on missing optional fields (backward compatibility), but have only exercised this discipline across minor revisions. A stable migration story for major revisions, including automatic rewriting of archival artifacts, long-term checker availability, and a deprecation policy, remains future work.
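The compatibility discipline described above (accept unknown fields, degrade gracefully on missing optional fields) might be sketched as follows. Only the ara_schema tag comes from the text; the other field names, the defaults, and the loader function are hypothetical.

```python
import copy

# Hypothetical field sets; only "ara_schema" is named in the paper.
REQUIRED_FIELDS = {"ara_schema", "title"}
OPTIONAL_DEFAULTS = {"parents": [], "keywords": []}


def load_frontmatter(fields):
    """Validate frontmatter while tolerating schema drift.

    - Missing required fields raise an error.
    - Missing optional fields fall back to defaults (backward compatibility).
    - Unknown fields are kept aside instead of rejected (forward compatibility).
    """
    missing = sorted(REQUIRED_FIELDS - set(fields))
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    known = copy.deepcopy(OPTIONAL_DEFAULTS)
    extras = {}
    for key, value in fields.items():
        if key in REQUIRED_FIELDS or key in OPTIONAL_DEFAULTS:
            known[key] = value
        else:
            extras[key] = value
    return {"fields": known, "extras": extras}


doc = load_frontmatter(
    {"ara_schema": "1.2", "title": "Example", "future_field": 42}
)
print(doc)
```

Keeping unknown fields in a separate bucket, rather than dropping them, lets an older validator pass an artifact through without losing data written by a newer schema revision.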

11 Conclusion

We introduce the Ara protocol and its surrounding ecosystem as a foundation for agent-native scientific communication. Together, they address two structural failures of the PDF format: knowledge that narrative conventions discard (failed attempts, implicit configurations, unexplored branches) and specifications too underspecified to execute. Ara resolves both by restructuring a research contribution as a machine-actionable artifact, one that is navigable, complete, and verifiable without human interpretation.

The broader motivation is a shift already underway: AI agents are becoming first-class participants in research workflows, not tools that assist humans but autonomous contributors that read, reproduce, and extend scientific work. That transition demands infrastructure built around agents from the start. Ara is the core abstraction of that ecosystem, a common substrate through which human and machine researchers alike publish, verify, and build on scientific knowledge.

References

A billion-dollar donation: estimating the cost of researchers’ time spent on peer review. Research Integrity and Peer Review 6(14), pp. 1–8.
Anthropic (2025a) Agent skills: a simple, open format for agent capabilities. https://agentskills.io/specification. Open specification. Accessed 2026-03-08.
Anthropic (2025b) Claude Code SDK. https://docs.anthropic.com/en/docs/claude-code/sdk. Accessed 2026-04-17.
J. Baek, S. K. Jauhar, S. Cucerzan, and S. J. Hwang (2025) ResearchAgent: iterative research idea generation over scientific literature with large language models. In Proceedings of NAACL-HLT. arXiv:2404.07738.
M. Baker (2016) 1,500 scientists lift the lid on reproducibility. Nature 533(7604), pp. 452–454.
V. Baulin, A. Cook, D. Friedman, J. Lumiruusu, A. Pashea, S. Rahman, and B. Waldeck (2025) The discovery engine: a framework for AI-driven synthesis and navigation of scientific knowledge landscapes. arXiv preprint arXiv:2505.17500.
T. Baumgärtner and I. Gurevych (2026) SciCoQA: quality assurance for scientific paper–code alignment. arXiv preprint arXiv:2601.12910.
L. Biewald (2020) Experiment tracking with Weights & Biases. Software available from wandb.com.
D. A. Boiko, R. MacKnight, B. Kline, and G. Gomes (2023) Autonomous chemical research with large language models. Nature 624(7992), pp. 570–578.
A. S. Booeshaghi, L. Luebbert, and L. Pachter (2026) Science should be machine-readable. bioRxiv.
A. Brinckman, K. Chard, N. Gaffney, M. Hategan, M. B. Jones, K. Kowalik, S. Kulasekaran, B. Ludäscher, B. D. Mecum, J. Nabrzyski, V. Stodden, I. J. Taylor, M. J. Turk, and K. Turner (2019) Computing environments for reproducibility: capturing the “Whole Tale”. Future Generation Computer Systems 94, pp. 854–867.
M. Canini (2026) Scientists should stop writing papers for each other. LinkedIn Pulse. Accessed 2026-03-16.
C. D. Chambers (2013) Registered reports: a new publishing initiative at Cortex. Cortex 49(3), pp. 609–610.
Z. Chen, S. Chen, Y. Ning, Q. Zhang, B. Wang, B. Yu, Y. Li, Z. Liao, C. Wei, Z. Lu, V. Dey, M. Xue, F. N. Baker, B. Burns, D. Adu-Ampratwum, X. Huang, X. Ning, S. Gao, Y. Su, and H. Sun (2025) ScienceAgentBench: toward rigorous assessment of language agents for data-driven scientific discovery. In International Conference on Learning Representations.
M. R. Crusoe, S. Abeln, A. Iosup, P. Amstutz, J. Chilton, N. Tijanić, H. Ménager, S. Soiland-Reyes, B. Gavrilović, and C. Goble (2022) Methods included: standardizing computational reuse and portability with the Common Workflow Language. Communications of the ACM 65(6), pp. 54–63.
P. Di Tommaso, M. Chatzou, E. W. Floden, P. P. Barja, E. Palumbo, and C. Notredame (2017) Nextflow enables reproducible computational workflows. Nature Biotechnology 35(4), pp. 316–319.
A. Franco, N. Malhotra, and G. Simonovits (2014) Publication bias in the social sciences: unlocking the file drawer. Science 345(6203), pp. 1502–1505.
L. Gao, Z. Dai, P. Pasupat, A. Chen, A. T. Chaganty, Y. Fan, V. Y. Zhao, N. Lao, H. Lee, D. Juan, and K. Guu (2023) RARR: researching and revising what language models say, using language models. In Proceedings of ACL, pp. 16477–16508.
P. Gijsbers, E. LeDell, J. Thomas, S. Poirier, B. Bischl, and J. Vanschoren (2019) An open source AutoML benchmark. arXiv preprint arXiv:1907.00909.
P. Groth, A. Gibson, and J. Velterop (2010) Anatomy of a nanopublication. Information Services & Use 30(1-2), pp. 51–56.
T. Hua, H. Hua, V. Xiang, B. Klieger, S. T. Truong, W. Liang, F. Sun, and N. Haber (2025) ResearchCodeBench: benchmarking LLMs on implementing novel machine learning research code. arXiv preprint arXiv:2506.02314.
M. Huang (2025) DecMetrics: structured claim decomposition scoring for factually consistent LLM outputs. arXiv preprint arXiv:2509.04483.
M. Y. Jaradeh, A. Oelen, K. E. Farfar, M. Prinz, J. D’Souza, G. Kismihók, M. Stocker, and S. Auer (2019) Open research knowledge graph: next generation infrastructure for semantic scholarly knowledge. In Proceedings of the 10th International Conference on Knowledge Capture (K-CAP), pp. 243–246.
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan (2024) SWE-bench: can language models resolve real-world GitHub issues? In International Conference on Learning Representations.
D. E. Knuth (1984) Literate programming. The Computer Journal 27(2), pp. 97–111.
P. T. J. Kon, J. Liu, X. Zhu, Q. Ding, J. Peng, J. Xing, Y. Huang, Y. Qiu, J. Srinivasa, M. Lee, M. Chowdhury, M. Zaharia, and A. Chen (2025) EXP-Bench: can AI conduct AI research experiments? arXiv preprint arXiv:2505.24785.
J. Köster and S. Rahmann (2012) Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 28(19), pp. 2520–2522.
T. S. Kuhn (1962) The structure of scientific revolutions. University of Chicago Press, Chicago.
K. Kusumegi, X. Yang, P. Ginsparg, M. de Vaan, T. Stuart, and Y. Yin (2025) Scientific production in the era of large language models. Science 390(6779), pp. 1240–1243.
T. Lebo, S. Sahoo, D. McGuinness, K. Belhajjame, J. Cheney, D. Corsar, D. Garijo, S. Soiland-Reyes, S. Zednik, and J. Zhao (2013) PROV-O: the PROV ontology. W3C Recommendation, W3C.
L. Li, R. Wang, H. Song, Y. Mao, T. Zhang, Y. Wang, J. Fan, Y. Zhang, J. Ye, C. Zhang, and Y. Gong (2026) What papers don’t tell you: recovering tacit knowledge for automated paper reproduction. arXiv preprint arXiv:2603.01801.
A. Liu (2026a) The rise of AI-native researchers. https://amberliu2.substack.com/p/the-rise-of-ai-native-researchers. Blog post. Accessed 2026-03-08.
J. Liu, M. Harmon, and Z. Zhang (2026) Sci-reasoning: a dataset decoding AI innovation patterns. arXiv preprint arXiv:2601.04577.
Z. Liu (2026b) Research agents should target knowledge graphs, not papers. https://kindxiaoming.github.io/blog/2026/research-agent/. Blog post. Accessed 2026-03-08.
K. Lo, L. L. Wang, M. Neumann, R. Kinney, and D. Weld (2020) S2ORC: the semantic scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4969–4983.
C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha (2024) The AI scientist: towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292.
Y. Luo, Z. Yu, X. Wang, Y. Zhu, N. Zhang, L. Wei, L. Du, D. Zheng, and H. Chen (2025) What makes AI research replicable? Executable knowledge graphs as scientific knowledge representations. arXiv preprint arXiv:2510.17795.
A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, and P. Schwaller (2024) Augmenting large language models with chemistry tools. Nature Machine Intelligence 6(5), pp. 525–535.
N. Matosin, E. Frank, M. Engel, J. S. Lum, and K. A. Newell (2014) Negativity towards negative results: a discussion of the disconnect between scientific worth and scientific culture. Disease Models & Mechanisms 7(2), pp. 171–173.
P. B. Medawar (1963) Is the scientific paper a fraud? The Listener 70, pp. 377–378. Reprinted in The Strange Case of the Spotted Mice, Oxford University Press, 1996.
J. Miao, J. R. Davis, Y. Zhang, J. K. Pritchard, and J. Zou (2025) Paper2Agent: reimagining research papers as interactive and reliable AI agents. arXiv preprint arXiv:2509.06917.
OpenAI (2025) AGENTS.md: a standard for agent-oriented repository documentation. https://github.com/openai/agents.md. Accessed 2026-03-01.
J. Pineau, P. Vincent-Lamarre, K. Sinha, V. Larivière, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and H. Larochelle (2021) Improving reproducibility in machine learning research: a report from the NeurIPS 2019 reproducibility program. Journal of Machine Learning Research 22(164), pp. 1–20.
S. Pineda Arango, H. S. Jomaa, M. Wistuba, and J. Grabocka (2021) HPO-B: a large-scale reproducible benchmark for black-box HPO based on OpenML. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.
M. Polanyi (1966) The tacit dimension. Doubleday, Garden City, NY.
J. Priem, H. Piwowar, and R. Orr (2022) OpenAlex: a fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833.
P. Radanliev, O. Santos, C. Maple, and S. Atefi (2026) Operationalising artificial intelligence bills of materials for verifiable AI provenance and lifecycle assurance. Frontiers in Computer Science 8, pp. 1735919.
R. A. Rasheed, S. Banerjee, A. Mukherjee, and R. Hazra (2026) From fluent to verifiable: claim-level auditability for deep research agents. arXiv preprint arXiv:2602.13855.
A. H. Renear and C. L. Palmer (2009) Strategic reading, ontologies, and the future of scientific publishing. Science 325(5942), pp. 828–832.
R. Rosenthal (1979) The file drawer problem and tolerance for null results. Psychological Bulletin 86(3), pp. 638–641.
A. Rule, A. Tabard, and J. D. Hollan (2018) Exploration and explanation in computational notebooks. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1–12.
S. Schmidgall, Y. Su, Z. Wang, X. Sun, J. Wu, X. Yu, J. Liu, M. Moor, Z. Liu, and E. Barsoum (2025) Agent laboratory: using LLM agents as research assistants. arXiv preprint arXiv:2501.04227.
M. Seo, J. Baek, S. Lee, and S. J. Hwang (2025) Paper2Code: automating code generation from scientific papers in machine learning. arXiv preprint arXiv:2504.17192. ICLR 2026.
S. Soiland-Reyes, P. Sefton, M. Crosas, L. J. Castro, F. Coppens, J. M. Fernández, D. Garijo, B. Grüning, M. La Rosa, S. Leo, et al. (2022) Packaging research artefacts with RO-Crate. Data Science 5(2), pp. 97–138.
G. Starace, O. Jaffe, D. Sherburn, J. Aung, J. S. Chan, L. Maksin, R. Dias, E. Mays, B. Kinsella, W. Thompson, J. Heidecke, A. Glaese, and T. Patwardhan (2025) PaperBench: evaluating AI’s ability to replicate AI research. In Proceedings of the 42nd International Conference on Machine Learning, Vol. 267, pp. 56843–56873.
M. Stocker, M. Snyder, C. Anfuso, M. Ludwig, et al. (2025) Rethinking the production and publication of machine-readable expressions of research findings. Scientific Data 12(1), pp. 1–10.
V. Stodden, M. McNutt, D. H. Bailey, E. Deelman, Y. Gil, B. Hanson, M. A. Heroux, J. P. Ioannidis, and M. Taufer (2016) Enhancing reproducibility for computational methods. Science 354(6317), pp. 1240–1241.
A. Vasilopoulos (2026) Codified context: infrastructure for AI agents in a complex codebase. arXiv preprint arXiv:2602.20478.
D. Wadden, S. Lin, K. Lo, L. L. Wang, M. van Zuylen, A. Cohan, and H. Hajishirzi (2020) Fact or fiction: verifying scientific claims. In Proceedings of EMNLP, pp. 7534–7550.
F. Y. Wang, L. Marom, S. Pal, R. K. Luu, W. Lu, J. A. Berkovich, and M. J. Buehler (2026)Autonomous agents coordinating distributed discovery through emergent artifact exchange.arXiv preprint arXiv:2603.14312.Cited by:§8.
G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2023)Voyager: an open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291.Cited by:§8.
H. Wijk, T. Lin, J. Becker, S. Jawhar, N. Parikh, T. Broadley, L. Chan, M. Chen, J. Clymer, J. Dhyani, E. Ericheva, K. Garcia, B. Goodrich, N. Jurkovic, H. Karnofsky, M. Kinniment, A. Lajko, S. Nix, L. Sato, W. Saunders, M. Taran, B. West, and E. Barnes (2025)RE-Bench: evaluating frontier AI R&D capabilities of language model agents against human experts.InProceedings of the 42nd International Conference on Machine Learning,External Links:LinkCited by:Appendix D,§G.1,§1,§4.1,§7.4,Table 2,§8.
M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J. Boiten, L. B. da Silva Santos, P. E. Bourne,et al.(2016)The FAIR guiding principles for scientific data management and stewardship.Scientific Data3(1),pp. 1–9.External Links:DocumentCited by:§1,§8.
Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah, R. W. White, D. Burger, and C. Wang (2024)AutoGen: enabling next-gen LLM applications via multi-agent conversation.InConference on Language Modeling (COLM),Note:arXiv:2308.08155Cited by:§8.
Y. Yamada, R. T. Lange, C. Lu, S. Hu, C. Lu, J. Foerster, J. Clune, and D. Ha (2025)The AI scientist-v2: workshop-level automated scientific discovery via agentic tree search.arXiv preprint arXiv:2504.08066.Cited by:§8.
J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press (2024)SWE-agent: agent-computer interfaces enable automated software engineering.arXiv preprint arXiv:2405.15793.Cited by:§8.
C. Ying, A. Klein, E. Real, E. Christiansen, K. Murphy, and F. Hutter (2019)NAS-Bench-101: towards reproducible neural architecture search.InProceedings of the 36th International Conference on Machine Learning,Vol.97,pp. 7105–7114.Cited by:§8.
M. Zaharia, A. Chen, A. Davidson, A. Ghodsi, S. A. Hong, A. Konwinski, S. Murching, T. Nykodym, P. Ogilvie, M. Parkhe,et al.(2018)MLflow: a system for managing the machine learning lifecycle.Note:Workshop on ML Systems at NeurIPSCited by:§8.
G. Zhang, J. Wang, J. Chen, W. Zhou, K. Wang, and S. Yan (2025)AgenTracer: who is inducing failure in the LLM agentic systems?.arXiv preprint arXiv:2509.03312.Cited by:§8.
L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica (2023)Judging LLM-as-a-judge with MT-Bench and chatbot arena.InAdvances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track,Cited by:§7.5.
K. Zhu, Z. Liu, B. Li, M. Tian, Y. Yang, J. Zhang, P. Han, Q. Xie, F. Cui, W. Zhang, X. Ma, X. Yu, G. Ramesh, J. Wu, Z. Liu, P. Lu, J. Zou, and J. You (2025)Where LLM agents fail and how they can learn from failures.arXiv preprint arXiv:2509.25370.Cited by:§8.

Appendix A ARA Protocol and Design Rationale

This appendix consolidates the full protocol specification, design rationale, and validation details for the ARA format.

A.1 Taxonomy of Reproduction-Critical Information

To understand what information an agent needs to reproduce a paper, and where PDFs fall short, we analyze the expert-authored rubrics from PaperBench (Starace et al., 2025), a benchmark that evaluates AI agents on full paper reproduction. Each rubric decomposes a paper into atomic leaf requirements: individually verifiable conditions that collectively constitute a faithful reproduction.

Scope. The taxonomy below is derived from a deeply annotated 5-paper subset (3,050 leaves), chosen for tractability of fine-grained category labeling; the coarser per-task-category and gap-type frequencies reported in §7.2 and Appendix E.2 are validated on the full 23-paper corpus (8,921 requirements). The subset spans diverse domains: black-box LLM adaptation (BBox), mechanistic interpretability, continual RL (Self-Composing Policies), physics-informed neural networks (PINN), and foundation models for RL (FRE).

By categorizing every leaf into a taxonomy of information types, we reveal both the diversity of knowledge needed for reproduction and the specific failure modes that arise when this knowledge is scattered across a narrative PDF rather than organized in a structured artifact.

A.1.1 Information Categories

We identify ten categories of reproduction-critical information. Table 6 summarizes the categories with their frequency distribution across our analyzed rubrics.

Table 6: Taxonomy of reproduction-critical information in PaperBench rubrics. Frequency is computed across 3,050 leaf requirements from five papers. The “PDF Difficulty” column characterizes the primary challenge of extracting this information from a narrative PDF. The “ARA Layer” column identifies which ARA component directly addresses each category.

Below we define each category, give concrete examples drawn from the PaperBench rubrics, and explain how ARA’s structure addresses the underlying retrieval challenge.

1. Combinatorial experiment matrix (24.1%).

The single largest category consists of requirements that enumerate which model variant must be run on which dataset, with which configuration, for how many seeds. In PDFs, this combinatorial structure is compressed into a single sentence (“We evaluate all methods on three task sequences with 10 seeds each”) or a results table whose row/column headers implicitly define the cross-product. An agent must mentally decompose the matrix to know, e.g., that “CompoNet on Freeway, 10 seeds, 1M timesteps per task” is a distinct run.

Examples:

• self-composing-policies: 62 leaves enumerate {6 methods} × {3 task sequences} × {seeds, timesteps, trained}, each a separate verifiable requirement (e.g., “CompoNet on Meta-World: 10 seeds, 1M timesteps/task, trained”).
• bbox: 46 evaluation leaves cross {5 model sizes} × {4 datasets} × {3 feedback types} × {single-step, full-step} inference modes.
• pinn: ∼1,273 leaves enumerate a grid of {4 PDE problems} × {4 network widths} × {5 learning rates} × {3 optimizers}, each combination a distinct training run.

ARA advantage. The experiments.md file makes every cell of the experiment matrix explicit, with structured Setup fields that list model, dataset, and configuration as machine-readable key–value pairs. An agent can enumerate all runs programmatically rather than parsing table headers.
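As a rough illustration of what enumerating runs programmatically could look like, the sketch below expands the bbox-style axes from the examples above into one explicit setup per cell. Only the axis sizes (5 × 4 × 3 × 2) come from the text; every axis value and the dictionary layout are hypothetical placeholders, not the actual experiments.md schema.

```python
from itertools import product

# Hypothetical axis values; only the axis sizes (5 x 4 x 3 x 2) come from the text.
MODEL_SIZES = ["size_1", "size_2", "size_3", "size_4", "size_5"]
DATASETS = ["dataset_a", "dataset_b", "dataset_c", "dataset_d"]
FEEDBACK_TYPES = ["feedback_x", "feedback_y", "feedback_z"]
INFERENCE_MODES = ["single-step", "full-step"]


def enumerate_runs():
    """Expand the experiment matrix into one explicit setup per cell."""
    return [
        {"model": m, "dataset": d, "feedback": f, "inference": i}
        for m, d, f, i in product(
            MODEL_SIZES, DATASETS, FEEDBACK_TYPES, INFERENCE_MODES
        )
    ]


runs = enumerate_runs()
print(len(runs))  # full cross-product: 5 * 4 * 3 * 2 = 120 cells
```

The full cross-product is an upper bound: a rubric need not contain a leaf for every cell, which is one reason an explicit list is easier to audit than a table header.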

2. Evaluation protocol (18.5%).

Requirements specifying which metric to compute, on which test split, using which evaluation-time configuration (e.g., beam size, number of evaluation episodes, specific layers to probe). These details are often scattered: the metric definition appears in §3, the test split in §4, the evaluation episodes in the appendix, and the layer indices in a figure caption.

Examples:

• mechanistic-understanding: “Compute cosine similarity between $\delta_{i}$ and $\delta_{\text{mlp},i}$ for layers 0, 2, 4, 6, 8, 10, 12, 14, 16, 18 using 1,199 prompts from RealToxicityPrompts.”
• fre: “Evaluation is repeated and averaged over 20 episodes and 5 seeds; 32 state-reward pairs are sampled from the evaluation task environment.”
• bbox: “The Chain-of-Thought baseline has been evaluated on the test splits of all datasets using GPT-3.5 Turbo.”

ARA advantage. Each experiment entry in experiments.md has a declarative Procedure field that specifies evaluation steps as an ordered list, and a Metrics field that names the exact metrics. The configs/training.md file separately records evaluation-time parameters (e.g., beam size, episodes).

3. Hyperparameters (17.2%).

Classic training configuration: learning rates, batch sizes, optimizer parameters, temperature, weight decay, LoRA rank, number of epochs. While these are the most commonly discussed reproduction barrier, they account for only 17% of leaf requirements. In PDFs, hyperparameters are typically consolidated in an appendix table, but the correspondence between table rows and specific experimental conditions is often ambiguous.

Examples:

• bbox: “AdamW optimizer with learning rate 5e-6, weight decay 0.01; batch size 64; 6,000 training steps.”
• self-composing-policies: 29 leaves enumerate every SAC and PPO parameter individually (e.g., “SAC: target smoothing coefficient $\tau = 0.005$”; “PPO: GAE $\lambda = 0.95$”).
• pinn: “Learning rate of the Adam optimizer can be set to 1E-5, 1E-4, 1E-3, 1E-2, or 1E-1.”

ARA advantage. The configs/training.md file provides a single, authoritative location for all hyperparameters, organized by experiment. The heuristics.md file additionally records sensitivity annotations (low/medium/high) and valid bounds, information that PDFs almost never provide.
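To make the idea of sensitivity annotations and valid bounds concrete, here is a minimal sketch of how an agent might hold such a record in memory. The class name, field names, and the "high" label are assumptions made for illustration; the range simply spans the smallest and largest learning rates quoted in the pinn example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperparameterNote:
    """Hypothetical record pairing a value with sensitivity and valid bounds."""

    name: str
    value: float
    sensitivity: str  # "low", "medium", or "high"
    lower: float
    upper: float

    def allows(self, candidate: float) -> bool:
        """Return True if the candidate lies inside the recorded bounds."""
        return self.lower <= candidate <= self.upper


# Bounds span the pinn learning rates (1E-5 to 1E-1); the chosen value and
# the "high" sensitivity label are illustrative assumptions.
lr = HyperparameterNote("learning_rate", 1e-3, "high", 1e-5, 1e-1)
print(lr.allows(1e-2), lr.allows(1.0))
```

An agent that wants to vary a setting can consult `allows` first and treat high-sensitivity entries as ones to leave alone unless the experiment calls for a sweep.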

4. Metric computation and logging (10.4%).

Requirements that the agent must record specific intermediate quantities during runs: loss curves, attention distributions, cost tracking (dollars per 1k questions), episodic returns logged every $N$ steps. This “instrumentation” knowledge is rarely specified in papers: authors implicitly know what to log but do not document it as part of the method.

Examples:

• bbox: 71 leaves (25% of the paper’s rubric) require computing and saving training costs, inference costs (USD/1k questions), and evaluation costs across all dataset × variant combinations.
• self-composing-policies: “Output attention distribution logged every 10k timesteps”; “Matching rate between final output and internal policy, saved every 10k steps.”

ARA advantage. The experiments.md Metrics and Procedure fields can explicitly list what to log and at what frequency. The evidence/ layer provides concrete examples of the expected output format.

5. Result interpretation (8.6%).

Qualitative claims about what the results should show: directional trends, comparative rankings, mechanistic explanations. These carry the highest weight in PaperBench rubrics (weight = 2) because they test whether the agent understands the results, not just whether the code ran.

Examples:

• mechanistic-understanding: “After adapting with DPO, the principal component of the residual streams shift in the same direction, and the activation of the toxic vectors decrease.” (Weight = 2)
• mechanistic-understanding: “The extracted tokens encode different characteristics of toxic language: tokens from $\mathbf{W}$ are mostly curse words; MLP.vToxic are a mix of curse words and insults; SVD.uToxic encode insults and female sexual references.” (Weight = 2)
• self-composing-policies: “CompoNet achieves higher average performance and forward transfer than all baselines on all three task sequences.”
• pinn: “Adam+L-BFGS always achieves the lowest minimum loss compared to just using Adam or L-BFGS alone.”

ARA advantage. The claims.md file states each claim with explicit Falsification criteria and pointers to the experiment that verifies it. The experiments.md Expected outcome field records the directional prediction (e.g., “method A outperforms method B”) without revealing exact numbers, enabling blind reproduction.
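A directional prediction of this kind can be checked mechanically once reproduced scores exist. The sketch below is one possible shape for such a check; the function name, the score dictionary, and the numbers are all invented for illustration and are not part of the protocol.

```python
def directional_claim_holds(scores, better, worse, higher_is_better=True):
    """Check a directional prediction such as "method A outperforms method B"."""
    a, b = scores[better], scores[worse]
    return a > b if higher_is_better else a < b


# Made-up reproduced scores, for illustration only.
reproduced = {"method_a": 0.71, "method_b": 0.64}
print(directional_claim_holds(reproduced, "method_a", "method_b"))
```

Because only the ordering is tested, the check stays meaningful even when the reproduced numbers differ from the published ones.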

6. Architecture specification (5.8%).

Layer counts, channel sizes, activation functions, output head structure, embedding dimensions. In PDFs, architecture details are split across a figure (showing the high-level diagram), the methods section (describing components in prose), and the appendix (listing dimensions in a table). An agent must mentally compose these three sources to build the full specification.

Examples:

• self-composing-policies: “CNN has three convolutional layers with 32, 64, and 64 channels and filter sizes of 8, 4, and 3”; “SAC: hidden dimension $d_{\text{model}} = 256$; critic network has 3 layers; activation is ReLU.”
• fre: “GC-BC model is a MLP with three hidden layers of size 512”; “layer normalization is applied before each activation function.”
• mechanistic-understanding: “Binary classifier of the form $\text{softmax}(\mathbf{Wx})$ where $\mathbf{W}$ has dimensionality $K \times 2$.”

ARA advantage. The architecture.md file provides a single location listing every component with its dimensions, activation functions, and input/output specifications. Code stubs in src/code/ provide an executable complement.

7. Mathematical formulation (4.5%).

Specific equations that must be implemented exactly: loss functions, attention operations, PDE boundary conditions, update rules. In PDFs, equations are referenced by number, but the reader must trace through variable definitions scattered across multiple sections.

Examples:

• self-composing-policies: “Output attention: $\text{softmax}(qK^{T}/\sqrt{d_{\text{model}}}) \cdot V$”; “Forward transfer: $\text{FTr}_{i} = (\text{AUC}_{i} - \text{AUC}_{i}^{b})/(1 - \text{AUC}_{i}^{b})$.”
• pinn: “The loss function corresponds to the non-linear least squares problem described in Section 2.1, with the relevant differential operator and boundary/initial condition operators outlined in Appendix A.1.”
• fre: “The value function is updated with an expectile regression objective on the critic’s Q-values”; “The actor is updated via advantage-weighted regression (AWR).”

ARA advantage. The algorithm.md file presents the algorithm as a self-contained pseudocode block with all variable definitions in scope. The concepts.md file defines notation and links to the equations that use each symbol.
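As an example of an equation with every symbol in scope, the forward-transfer formula quoted above can be written as a small function. This sketch assumes both AUC values are normalized to the range [0, 1]; the argument names and the sample inputs are illustrative and not drawn from the paper's results.

```python
def forward_transfer(auc, auc_baseline):
    """FTr_i = (AUC_i - AUC_i^b) / (1 - AUC_i^b), as quoted in the rubric."""
    if auc_baseline >= 1.0:
        raise ValueError("baseline AUC must be below 1 for the ratio to be defined")
    return (auc - auc_baseline) / (1.0 - auc_baseline)


# Illustrative AUC values, not results from the paper.
print(forward_transfer(0.8, 0.6))  # roughly 0.5
```

A positive value means the method's learning curve sits above the baseline's; zero means no transfer.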

8. Implementation tricks (4.2%).

Non-obvious design choices that distinguish faithful reproduction from naive re-implementation: weight freezing schedules, initialization from prior checkpoints, gradient clipping thresholds, normalization details, optimizer switching strategies. These are the hardest items to recover from a PDF because they appear as parenthetical remarks, footnotes, or single sentences buried in dense paragraphs.

Examples:

• self-composing-policies: “Single CNN encoder per policy; new encoder initialized with weights of the previous one” (Appendix E.2); “Reset critic network at the beginning of each task”; “Normalize summed vectors for continuous action spaces.”
• fre: “The transformer does not use a causal mask on its attention”; “Positional embeddings are not used in the transformer”; “States sampled for decoding and encoding are sampled separately.”
• pinn: “At the end of training, the L-BFGS directions, steps, and inverse of inner products are saved” (Appendix C.2); “Strong Wolfe line search is used with L-BFGS.”

ARA advantage. The heuristics.md file is specifically designed to capture these items. Each heuristic entry includes Rationale (why it matters), Sensitivity (how much performance degrades without it), and Code ref (where in the code to apply it).

9. Data pipeline (3.8%).

Dataset acquisition, split ratios, filtering criteria, preprocessing steps, data augmentation, collocation point sampling strategies. These details are often under-specified in papers (“we use the standard train/test split”) or tucked into a single appendix paragraph.

Examples:

• bbox: “Split GSM8K into 7,473 training and 1,319 test samples”; “Randomly sample 100 questions for TruthfulQA test set, remaining 717 for training.”
• mechanistic-understanding: “24,576 pairs of toxic and non-toxic continuations have been created”; “295 prompts selected from RealToxicityPrompts that output ‘shit’ as the next token.”
• pinn: “10,000 residual points randomly sampled from a 255 × 100 grid; 257 equally spaced points for each initial condition and 101 for each boundary condition.”

ARA advantage. The configs/ directory provides structured configuration files with exact split sizes and sampling parameters. The environment.md file specifies dataset versions and download URLs.

10. Environment and infrastructure (2.9%).

Specific API endpoints, model version strings, library versions, simulator names, hardware requirements. These are often assumed to be “obvious” and omitted entirely from the paper, yet they are essential for reproduction.

Examples:

• bbox: “API access configured for davinci-002”; “Code to execute fine-tuning jobs through the Azure OpenAI API”; “Mixtral-8x7B-v0.1 loaded from HuggingFace in half-precision.”
• self-composing-policies: 15 leaves enumerate specific Gymnasium environment IDs (e.g., ALE/SpaceInvaders-v5, hammer-v2) and required packages (Metaworld from Farama-Foundation).
• fre: “The observation space’s XY coordinates are discretized into 32 bins for Ant Maze agents.”

ARA advantage. The environment.md file lists exact package versions, model identifiers, and hardware requirements. The configs/model.md file records model names, sizes, and loading configurations (e.g., precision, quantization).

A.1.2 Key Findings

Three observations emerge from this analysis:

Hyperparameters are necessary but not sufficient.

Classic hyperparameters—the most discussed reproduction barrier—account for only 17.2% of leaf requirements. The remaining 82.8% comprise evaluation protocols, experiment matrices, logging requirements, result interpretation targets, and implementation tricks that are harder to locate in a PDF and receive less attention in reproducibility discussions.
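The 82.8% figure follows directly from the ten category shares listed in the headings above. A short tally, using only those percentages, confirms that they sum to 100% and that everything other than hyperparameters accounts for the remainder.

```python
# Category shares from the ten headings above, in percent.
SHARES = {
    "combinatorial experiment matrix": 24.1,
    "evaluation protocol": 18.5,
    "hyperparameters": 17.2,
    "metric computation and logging": 10.4,
    "result interpretation": 8.6,
    "architecture specification": 5.8,
    "mathematical formulation": 4.5,
    "implementation tricks": 4.2,
    "data pipeline": 3.8,
    "environment and infrastructure": 2.9,
}

total = sum(SHARES.values())
non_hyper = total - SHARES["hyperparameters"]
print(round(total, 1), round(non_hyper, 1))  # 100.0 82.8
```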

The combinatorial explosion is the dominant challenge.

The largest category (24.1%) consists of requirements that enumerate the full cross-product of models, datasets, and configurations. In a PDF, this matrix is compressed into a single table or sentence; an agent must decompose it into individual runs. The ARA format makes this matrix explicit and machine-enumerable.

High-weight requirements demand understanding, not just extraction.

All weight-2 requirements in the PaperBench rubrics belong to the “Result Interpretation” category. These test whether the agent can verify that reproduced results exhibit the qualitative patterns claimed by the paper, not just whether the code runs. The ARA claims.md and experiments.md layers directly encode these verification targets, making the connection between code output and paper claims explicit rather than requiring the agent to re-derive it from narrative text.

A.2 Physical Layer Modes: Kernel vs Repository

The Physical Layer (/src) adopts one of two modes, declared in the PAPER.md frontmatter (src_mode: kernel | repo) so that consuming agents adapt their strategy immediately. These two modes cover the dominant contribution types in empirical CS, where executable code is the natural physical representation.

Kernel mode (/src/kernel/).

When the contribution is primarily algorithmic, the invariant can be cleanly separated from scaffolding. The kernel contains only the core modules with typed I/O signatures (often one to two orders of magnitude smaller than the full repository), stripped of all environment-specific code. A general-purpose coding agent consumes the kernel alongside the structured specification in /logic/solution/ and generates fresh, environment-native boilerplate in minutes. Because agent coding capabilities improve continuously, the same kernel yields a better surrounding implementation over time: the artifact appreciates rather than decays.

Repository mode (/src/repo/).

When the contribution is primarily systemic (a CUDA kernel, a distributed training strategy, a systems architecture), the engineering is the contribution. The full implementation is retained but annotated: an index.md manifest maps each source file to the ARA component it implements (which claim it supports, which heuristic it embodies, which architectural module it belongs to), providing the structured navigation that a monolithic codebase lacks. Forensic bindings connect code regions to claims, constraints, and heuristics, so an agent traverses the codebase guided by research structure rather than by directory conventions alone.

In both modes, the Cognitive Layer remains the primary interface for understanding the contribution; the Physical Layer provides executable evidence, scaled to match.
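A consuming agent could branch on the declared mode with very little code. The sketch below maps a src_mode value to the directory named in the corresponding heading; the function name and the error handling are assumptions, and the protocol may well define this lookup differently.

```python
from pathlib import PurePosixPath

# Directory names follow the section headings above; the lookup itself is hypothetical.
_MODE_DIRS = {"kernel": "src/kernel", "repo": "src/repo"}


def physical_layer_dir(artifact_root, src_mode):
    """Map the declared src_mode to the directory an agent should open first."""
    try:
        return PurePosixPath(artifact_root) / _MODE_DIRS[src_mode]
    except KeyError:
        raise ValueError(f"unknown src_mode: {src_mode!r}") from None


print(physical_layer_dir("ara", "kernel"))  # ara/src/kernel
```

Rejecting unknown modes early keeps an agent from silently falling back to a directory crawl when the frontmatter is malformed.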

A.3 ARA by Example: This Paper’s Own Artifact

This paper is itself maintained as an ARA artifact. The ara/ directory at the repository root contains the living cognitive, physical, and exploration layers that were populated incrementally during the research process (see §3). We reproduce excerpts from each key file below to give readers a concrete sense of the format. All entries are real; only trailing items are elided for space.

A.3.1 Directory layout and root manifest

The complete ara/ directory for this paper is shown below. An agent’s first action is to read PAPER.md, which contains YAML frontmatter and an abstract (∼500 tokens) sufficient to decide relevance without loading any layer.

ara/
  PAPER.md                    # entry point
  logic/                      # Cognitive Layer
    problem.md                # observations, gaps, key insight
    claims.md                 # 16 falsifiable claims + status
    experiments.md            # verification plan (E1-E6)
    related_work.md           # typed citation dependencies
    solution/
      heuristics.md           # 23 design decisions + rationale
  trace/                      # Exploration Graph
    exploration_tree.yaml     # 114-node decision DAG
    sessions/                 # 38 session logs (2026-03-12..04-26)
      session_index.yaml      # chronological index
      2026-03-12_001.yaml     # ... one file per session
    pm_reasoning_log.yaml     # Live PM reasoning trace
  evidence/                   # Evidence Layer
    README.md                 # index of raw results
  staging/
    observations.yaml         # 94 unpromoted observations

The root manifest PAPER.md:

---
title: "Agent-Native Research Artifacts"
authors: ["Amber Liu", "Zechen Zhang"]
venue: "NeurIPS 2026"
status: draft
date_created: "2026-03-12"
last_updated: "2026-04-27"
abstract: >
  We propose the Agent-Native Research Artifact
  (ARA), a file-system protocol that replaces the
  narrative paper with a machine-executable research
  package organized across four interlocking layers:
  a Cognitive Layer (/logic) encoding structured
  scientific reasoning, a Physical Layer (/src)
  containing the executable code kernel, an
  Exploration Graph (/trace) preserving the full
  branching research trajectory including dead ends,
  and an Evidence Layer (/evidence) grounding every
  claim in raw empirical results. PDF publication
  imposes two structural costs on autonomous
  research: a Storytelling Tax (failed experiments
  and rejected hypotheses are discarded to fit a
  linear narrative) and an Engineering Tax (the gap
  between reviewer-sufficient prose and
  agent-sufficient specification leaves critical
  implementation details unwritten). On PaperBench
  and RE-Bench, ARA raises question-answering
  accuracy from 72.4% to 93.7% and reproduction
  success from 57.4% to 64.4%; on RE-Bench's five
  open-ended extension tasks, the failure traces
  preserved in ARA accelerate research progress by
  helping the agent avoid pitfalls prior runs
  already mapped, but for a sufficiently capable
  model the same recorded playbook can constrain a
  more creative agent that would otherwise step
  outside the prior-run box.
layers:
  logic: logic/
  src: src/
  trace: trace/
  evidence: evidence/
  staging: staging/
---
# Layer Index
- **Cognitive** (`logic/`): structured reasoning
  - `problem.md` -- observations, gaps, key insight
  - `claims.md` -- 16 falsifiable claims with status
    and proof pointers
  - `experiments.md` -- verification plan (E1-E6)
  - `related_work.md` -- typed citation dependency graph
  - `solution/heuristics.md` -- 23 design decisions
    with rationale and sensitivity
- **Exploration** (`trace/`): branching trajectory
  - `exploration_tree.yaml` -- 114-node decision DAG
  - `sessions/` -- 38 session logs (2026-03-12..04-26)
  - `pm_reasoning_log.yaml` -- Live PM reasoning trace
- **Evidence** (`evidence/`): raw empirical results
  - `README.md` -- index of all evaluation data,
    including post-paper RE-Bench extension evals
- **Staging** (`staging/`): unpromoted observations
  - `observations.yaml` -- 94 preliminary observations
    (latest: 2026-04-26 cross-model and synthesis)
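Reading the manifest before any layer can be sketched with the standard library alone. The two helpers below are hypothetical: they split a document at its leading `---` fences and collect only unindented scalar fields, skipping nested maps and folded values such as the abstract. A real consumer would more likely hand the frontmatter to a full YAML parser.

```python
def split_frontmatter(text):
    """Split a document into (frontmatter, body) at the leading '---' fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", text  # no closing fence: treat everything as body


def top_level_scalars(frontmatter):
    """Collect unindented 'key: value' pairs; nested or folded values are skipped."""
    fields = {}
    for line in frontmatter.splitlines():
        if line and not line[0].isspace() and ":" in line:
            key, _, value = line.partition(":")
            value = value.strip().strip('"')
            if value and value != ">":
                fields[key.strip()] = value
    return fields


# Abbreviated stand-in for the manifest shown above.
sample = (
    '---\n'
    'title: "Agent-Native Research Artifacts"\n'
    'status: draft\n'
    'layers:\n'
    '  logic: logic/\n'
    '---\n'
    '# Layer Index\n'
)
front, body = split_frontmatter(sample)
print(top_level_scalars(front))
```

This is enough for a relevance check on title and status; resolving the `layers` map is left to a proper parser.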

A.3.2 Cognitive Layer: logic/claims.md

Each claim carries a machine-readable status (hypothesis, supported, testing), falsification criteria, and proof pointers that reference the evidence layer rather than inlining results. We show two claims at different lifecycle stages.

# Claims
## C04: Universal Ingestor produces lossless transformations
- **Statement**: The LLM-based Ingestor faithfully
  transforms PDF papers into ARA format without
  information loss, achieving near-parity on
  factual Q&A between ARA and source PDF.
- **Status**: supported
- **Provenance**: ai-executed
- **Falsification criteria**: Systematic accuracy
  drop (>5%) on understanding questions.
- **Proof**: [evidence/README.md ->
  understanding_eval; 450 Qs across Cat A/B/C]
- **Dependencies**: [C03]
- **Tags**: ingestor, fidelity
## C06: Negative knowledge is the highest-value signal
- **Statement**: The Exploration Graph's dead-end
  documentation produces the largest accuracy gap
  in the entire evaluation -- agents with failure
  traces answer questions about failed approaches
  that narrative formats make structurally
  unanswerable.
- **Status**: supported
- **Provenance**: ai-executed
- **Falsification criteria**: Dead-end docs produce
  no measurable improvement on Cat C; gap < 10pp.
- **Proof**: [evidence/README.md ->
  understanding_eval Cat C;
  evidence/README.md -> extension_eval]
- **Dependencies**: [C05]
- **Tags**: exploration-graph, negative-knowledge
...
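Because every claim uses the same heading and field layout, the file lends itself to simple parsing. The following sketch reads a claims.md-style text into a dictionary keyed by claim ID. The regular expressions and the handling of wrapped lines are assumptions based only on the excerpt above, not a published grammar, and a trailing elision marker would be folded into the last field.

```python
import re

CLAIM_HEADING = re.compile(r"^##\s*(C\d+):\s*(.+)$")
FIELD = re.compile(r"^-\s*\*\*(.+?)\*\*:\s*(.*)$")


def parse_claims(markdown):
    """Return {claim_id: {"title": ..., field: value}} from claims.md-style text."""
    claims, current, last_key = {}, None, None
    for raw in markdown.splitlines():
        line = raw.strip()
        heading = CLAIM_HEADING.match(line)
        field = FIELD.match(line)
        if heading:
            current = {"title": heading.group(2)}
            claims[heading.group(1)] = current
            last_key = "title"
        elif field and current is not None:
            last_key = field.group(1)
            current[last_key] = field.group(2)
        elif line and current is not None and last_key:
            # Continuation of a wrapped heading or field value.
            current[last_key] = (current[last_key] + " " + line).strip()
    return claims


# Abbreviated sample in the same layout as the excerpt above.
sample = """\
# Claims
## C06: Negative knowledge is the highest-value signal
- **Status**: supported
- **Falsification criteria**: Dead-end docs produce
  no measurable improvement on Cat C; gap < 10pp.
- **Dependencies**: [C05]
"""
claims = parse_claims(sample)
print(claims["C06"]["Status"])  # supported
```

With claims in this form, an agent can filter by status or follow the Dependencies field without reading any prose.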

A.3.3 Cognitive Layer: logic/problem.md

The problem file decomposes the motivation into typed observations (empirical facts), gaps (what existing approaches miss), and a key insight that bridges them. Each entry carries evidence pointers and implication fields, so an agent can trace the full argumentative chain without reading the paper’s introduction. We show one representative entry per section.

# Problem
## Observations
### O3: Frontier LLMs fail on research implementation
- **Statement**: Even the strongest frontier LLMs
  correctly implement fewer than 40% of novel
  research contributions when given the full paper
  and codebase, with semantic misalignment as the
  dominant failure mode.
- **Evidence**: ResearchCodeBench (Hua et al. 2025)
- **Implication**: The information encoding in PDFs
  is structurally inadequate for agent consumption.
### O4: PDF information gap is systematic
- **Statement**: Across 23 PaperBench papers (8,921
  rubric requirements), only 45.4% of reproduction
  requirements are fully specified in the PDF.
- **Evidence**: Own experiment -- info_gap_aggregate
- **Implication**: The PDF format is structurally
  incapable of serving as a self-contained
  reproduction specification.
...
## Gaps
### G2: Negative knowledge is systematically discarded
- **Statement**: Dead ends, rejected hypotheses, and
  convergence-critical tricks are lost to the
  narrative compression of the publication process.
- **Caused by**: O1
- **Why it matters**: Downstream agents waste compute
  re-exploring paths already proven fruitless.
...
## Key Insight
- **Insight**: Separate research knowledge into four
  orthogonal layers -- structured scientific logic
  (Cognitive Layer /logic), minimal executable code
  (Physical Layer /src), preserved decision history
  (Exploration Graph /trace), and raw empirical
  results (Evidence Layer /evidence) -- to create
  a machine-executable knowledge package that
  eliminates both the Storytelling Tax and
  Engineering Tax.
- **Derived from**: O1, O2, O3, O4, G1, G2

A.3.4 Cognitive Layer: logic/solution/heuristics.md

Each heuristic records a design decision with its rationale, provenance (who introduced it), and sensitivity rating so that agents know which choices are safe to vary.

# Heuristics
## H04: Directional verification over exact matching
- **Rationale**: Legacy papers routinely omit details
  needed for exact reproduction. Verifying
  directional properties (A > B on metric X)
  demonstrates the code kernel captures the core
  algorithmic insight without requiring exact
  numerical matches.
- **Provenance**: user
- **Sensitivity**: medium
- **Code ref**: [paper/sections/protocol.tex]
## H12: Minimal kernel = algorithm notes with inline snippets, not raw code files
- **Rationale**: Full code dumps (200-700 lines)
  cause context dilution -- the agent spends
  tokens parsing boilerplate already described in
  official_solution_notes.md. Notes contain core
  algorithm with key code snippets inline,
  sufficient for comprehension while 5-10x smaller.
- **Provenance**: user-revised
- **Sensitivity**: high
- **Code ref**: [code/artifacts/rebench-*/src/kernel/official_solution_notes.md]
...

A.3.5 Exploration Graph: trace/exploration_tree.yaml

The exploration tree preserves decisions, dead ends, and experiments as a traversable graph. We show one node of each type from this paper’s 94-node tree.

tree:
  - id: N04
    type: decision
    title: "Tripartite layer architecture"
    provenance: user
    timestamp: "2026-03-09"
    choice: >
      Three orthogonal layers: Cognitive (/logic),
      Physical (/src), Exploration Graph (/trace).
      Each addresses a distinct tax.
    alternatives:
      - "Two-layer (logic + code only)"
      - "Four-layer (separate evidence at top)"
      - "Flat structured document (single file)"
    evidence: >
      Three-layer separation achieves minimal
      representation while preserving all three
      dimensions that PDFs conflate.
  - id: N50
    type: dead_end
    title: "Trimming src/ alone does not recover
      Cat C from enrichment regression"
    provenance: ai-suggested
    timestamp: "2026-03-14"
    hypothesis: >
      Removing boilerplate code from src/ would
      reduce context dilution enough for Cat C
      (failure knowledge) to recover to ~80%.
    failure_mode: >
      Cat C stayed at 57.5% despite reducing src/
      from 56-104K to 20-36K per artifact. Remaining
      enrichment additions still dilute
      trace/exploration_tree.yaml content.
    lesson: >
      Context dilution for Cat C is more sensitive
      than expected. Even ~200 lines of structured
      markdown in src/ can push failure knowledge
      below the retrieval threshold.
  - id: N17
    type: experiment
    title: "24,008-run exploration waste analysis"
    provenance: ai-executed
    timestamp: "2026-03-12"
    result: >
      Analyzed 24,008 runs across 21 models,
      228 tasks. 59.2% of tokens wasted on
      dead-end exploration. 90.2% of cost goes
      to failed runs. Failed runs consume 113x
      more tokens than successful ones (median).
    evidence: [C13, C14, "code/eval/malt_analysis/exploration_tax_findings.json"]
...
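Once the tree is loaded, pulling out the negative knowledge is a one-pass filter. The sketch below assumes the YAML has already been parsed into a list of dictionaries; the function name is hypothetical, and the node list is a hand-written stand-in with fields abbreviated from the excerpt above.

```python
def dead_end_lessons(nodes):
    """Collect (id, lesson) for every dead_end node so an agent can skip those paths."""
    return [
        (node["id"], node.get("lesson", "").strip())
        for node in nodes
        if node.get("type") == "dead_end"
    ]


# Hand-written stand-in for the parsed YAML; fields abbreviated from the excerpt.
tree = [
    {"id": "N04", "type": "decision", "title": "Tripartite layer architecture"},
    {
        "id": "N50",
        "type": "dead_end",
        "lesson": "Context dilution for Cat C is more sensitive than expected.",
    },
    {"id": "N17", "type": "experiment", "title": "24,008-run exploration waste analysis"},
]
print(dead_end_lessons(tree))
```

An agent planning an extension could surface these lessons first, before spending compute on a path that an earlier run already ruled out.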

A.3.6 Exploration Graph: trace/sessions/

Each research session is logged as a structured YAML record capturing events, AI actions, files changed, claims affected, and open threads. Below is one session (abridged); the session_index.yaml file provides a one-line chronological summary of all 36 sessions.

# trace/sessions/2026-03-19_001.yaml
session:
  id: "2026-03-19_001"
  timestamp: "2026-03-19T02:00"
  summary: "BAM reproduction pilot: ARA 88.2% vs
    baseline 93.2% -- first paper where baseline leads"
  events_logged:
    - type: experiment
      id: N65
      summary: "BAM ARA mega-task: 220.5/250 (88.2%
        weighted), 10/10 subtasks, 6.9h, 12.6M tokens"
    - type: experiment
      id: N66
      summary: "BAM baseline mega-task: 234/251 (93.2%
        weighted), 10/10 subtasks, 4.3h, 8.6M tokens"
    - type: observation
      id: O50
      summary: "ARA won T6 (non-Gaussian) 10/10 vs
        8.5/10 -- clearer hyperparameter specs"
  ai_actions:
    - action: "Ran BAM ARA reproduction (10 subtasks)"
  files_changed:
    - "code/eval/reproduction/results/bam/"
  claims_touched: [C05]
  open_threads:
    - "Hyperparams buried in evidence/ not src/configs/
      -- move them and rerun (becomes N67)"

# trace/sessions/session_index.yaml (excerpt)
sessions:
  - id: "2026-03-12_004"
    summary: "Full 23-paper info gap analysis -- 8,921
      reqs, median 45.3% sufficient"
  - id: "2026-03-17_001"
    summary: "Pilot reproduction: ARA 74% vs baseline
      ~14%, 5x advantage on neural-score-estimation"
  - id: "2026-03-22_001"
    summary: "Self-expansion v2 (real code): 10/10
      subtasks -- v1 pseudocode got 1/10"
  - id: "2026-03-26_001"
    summary: "small_scaling_law extension v2: ARA 0.644
      vs baseline 0.806 -- baseline wins on loss
      calibration despite ARA finding near-optimal
      hyperparams"
...

The complete artifact, including all 16 claims, 18 heuristics, 94 exploration nodes, and 36 session records, is available in the supplementary material at ara/.

Appendix B Compiler Skill Details

The Ara Compiler (§4) is implemented as an agent skill: a self-contained, natural-language specification that, when loaded into any general-purpose coding agent’s context, turns it into a domain-specialized compilation system. The skill prescribes what the agent should do and what domain knowledge it needs, but delegates all execution mechanics (model selection, tool dispatch, context management) to the host agent. This appendix reproduces the key elements of the skill specification; the complete definition is available in the supplementary code (full implementation: https://github.com/AmberLJC/Agent-Native-Research-Artifact).

B.1 Compiler Skill Specification

The Compiler skill specification (~482 lines of natural language) is structured into five sections. When loaded into a host agent’s context, it provides the full domain knowledge needed to produce a schema-conforming Ara. We reproduce representative elements below; the complete specification is available in the supplementary code.

Section 1: Workflow (lines 1–11).

Defines the high-level pipeline: analyze → generate → validate → fix → iterate.

Section 2: Capability usage guidelines (lines 13–26).

Specifies usage conventions for standard file operations: prefer edit_file over write_file for targeted fixes, prefer write_file for YAML files to avoid whitespace corruption. Instructs the agent to batch work: generate all files first, then validate once.

Section 3: ARA directory schema (lines 28–414).

Defines the complete directory structure and field-level requirements for every file. This is the normative schema specification that the agent must follow. Key constraints include:

• PAPER.md: YAML frontmatter with title, authors, year, venue, DOI, domain, keywords, claims summary, and abstract. Body must include a Layer Index table listing every file with a one-line description.
• claims.md: Each claim requires Statement, Status, Falsification criteria, Proof (referencing experiment IDs, not file paths), Dependencies, and Tags.
• experiments.md: Declarative verification plans (setup, procedure, metrics, directional expected outcomes). Exact numerical results are prohibited; they belong exclusively in /evidence/ to enable blind reproduction.
• heuristics.md: Each heuristic requires Rationale, Sensitivity (low/medium/high), Bounds, Code ref, and Source.
• exploration_tree.yaml: Nested YAML tree with typed nodes (question, experiment, dead_end, decision, pivot). Minimum 8 nodes with at least one dead_end and one decision. Dead ends must document hypothesis, failure mode, and lesson.
• evidence/: Every results table and quantitative figure must be reproduced with exact cell values: no rounding, no omission.
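The exploration_tree.yaml constraints in the list above lend themselves to a mechanical check. The following is a minimal sketch, not the Compiler's actual validator: it assumes the YAML has already been parsed into nested Python dicts, that child nodes sit under a `children` key (an assumed key name), and that dead-end fields are named `hypothesis`, `failure_mode`, and `lesson` as in the dead_end example node shown in Appendix A. The function names are hypothetical.

```python
from typing import Iterator

# Node types and dead-end fields as listed in the schema constraints.
ALLOWED_TYPES = {"question", "experiment", "dead_end", "decision", "pivot"}
DEAD_END_FIELDS = ("hypothesis", "failure_mode", "lesson")


def walk(nodes: list[dict]) -> Iterator[dict]:
    """Yield every node depth-first; `children` is an assumed key name."""
    for node in nodes:
        yield node
        yield from walk(node.get("children", []))


def validate_tree(roots: list[dict]) -> list[str]:
    """Return human-readable violations; an empty list means the tree passes."""
    errors = []
    nodes = list(walk(roots))
    if len(nodes) < 8:
        errors.append(f"need at least 8 nodes, found {len(nodes)}")
    types = [n.get("type") for n in nodes]
    for required in ("dead_end", "decision"):
        if required not in types:
            errors.append(f"missing a {required} node")
    for n in nodes:
        if n.get("type") not in ALLOWED_TYPES:
            errors.append(f"{n.get('id')}: unknown type {n.get('type')!r}")
        if n.get("type") == "dead_end":
            for field in DEAD_END_FIELDS:
                if not n.get(field):
                    errors.append(f"{n.get('id')}: dead_end lacks {field}")
    return errors
```

Returning a list of violations rather than raising on the first one matches the batch-then-validate workflow described in Section 2 of the skill: the agent can fix every reported problem in a single pass.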

Section 4: 4-stage reasoning protocol (lines 416–454).

The prompt mandates a structured thinking process before file generation:

    # Your 4-Stage Reasoning Process

    You MUST follow these 4 stages in order. Produce a
    <thinking> block first with your reasoning for each
    stage, then produce the files.

    ## Stage 1: Semantic Deconstruction
    Strip the narrative "Storytelling Tax." Isolate:
    - The core observations and gaps that motivate the work
    - Mathematical formulations and equations
    - Architectural specifications and component descriptions
    - Experimental configurations (hyperparameters, hardware,
      datasets)
    - Numerical results and benchmarks
    - Citation dependencies and their roles
    - Negative results and ablation findings

    ## Stage 2: Cognitive Mapping
    Map deconstructed content to /logic:
    - Extract motivation: observations (with numbers), gaps,
      the key insight, and assumptions
    - Identify falsifiable claims (not opinions or vague
      statements)
    - Define formal concepts with precise notation
    - Populate solution/ (architecture, algorithm,
      constraints, heuristics)
    - Construct typed dependency graph for related_work.md
    - Ensure every claim has falsification criteria and
      proof pointers to experiment IDs
    - Design declarative experiment plans: for each major
      claim, specify how an agent would verify it

    ## Stage 3: Physical Stubbing
    Generate /src:
    - Extract exact hyperparameter values into configs/
    - Write code stubs with correct function signatures
      and types
    - Specify environment (dependencies, hardware, seeds)
    - Code should implement the NOVEL contribution,
      not boilerplate

    ## Stage 4: Exploration Graph Extraction
    Reconstruct the research DAG as a nested YAML tree
    for /trace:
    - Identify the central research question(s) as root
      nodes
    - Map experiments and their outcomes as child nodes
    - Document dead ends from ablations and rejected
      alternatives as leaf nodes
    - Record key design decisions with alternatives
      considered

Section 5: Output format and rules (lines 456–482).

Specifies the XML-delimited output format for batch_write_files, lists all 15 mandatory files, and enforces nine invariant rules (e.g., “all numerical values must be EXACT as stated in the paper,” “never hallucinate claims, results, or heuristics not in the paper”).

Appendix C Live Research Manager Details

The Live Research Manager (§3) is the second agent skill in the Ara system. Like the Compiler, it is a self-contained natural-language specification that turns a general-purpose coding agent into a domain-specialized system; unlike the Compiler, it operates continuously alongside the researcher rather than as a one-shot compilation. This appendix expands on the design principles, cross-session mechanisms, and submission workflow summarized in §3.

C.1 Design Principle Rationale

The three design principles stated in §3.1 are expanded below with full motivation.

P1. Silent, framework-independent integration. Documentation has traditionally been a retrospective activity: a context switch that introduces both friction and information loss. The system must integrate with any general-purpose coding agent (Claude Code, Cursor, Windsurf, or future frameworks) without custom SDKs, API bindings, or infrastructure changes. A natural-language specification that the agent reads into its context is the most portable interface: it requires nothing beyond the tool access agents already have, and artifact quality improves automatically as language models advance. The manager runs as a background process that silently collects research traces, constructing the artifact without interrupting active work or injecting prompts into an ongoing research conversation.

P2. Faithful epistemic provenance. AI-native research blurs the boundary between human insight and machine execution. The manager must objectively track who did what: distinguishing ideas explicitly stated by the researcher, suggestions inferred by the agent, actions the agent executed autonomously, and AI suggestions the researcher revised. Without such provenance, an artifact cannot faithfully represent the epistemic origin of its contents. The research process is inherently chaotic: a single session may interleave hypothesis formation, coding, debugging, and writing with no clear boundaries. The manager must translate this raw conversational stream into the structured Ara schema without losing information or imposing premature structure. Observations that are not yet classifiable should be staged rather than forced into categories, and knowledge should mature progressively, from hunches to typed events to formally bound claims, mirroring how research understanding actually develops.

P3. Comprehensive trajectory capture. Research is nonlinear and stochastic: hypotheses branch, experiments fail, directions are abandoned and revisited. The manager must capture this full trajectory, not just the successes that survive into a polished paper, but the dead ends, pivots, and intermediate observations that constitute the actual research process. Cross-layer bindings between claims, evidence, code, and decisions must be established at capture time while the conversational context is still available; post-hoc reconstruction from archived transcripts loses these causal chains.

C.2 Closure-Driven Crystallization

Principle P2 requires that observations mature “as evidence accumulates,” but evidence accumulation needs an operational definition. A counter-based threshold (promote after $N$ references) is arbitrary; asking an LM in isolation “is this mature?” lacks grounding. We instead define maturity through closure signals: externally observable patterns in the researcher–agent conversation that indicate the researcher has treated an observation as settled. At each session boundary, the Maturity Tracker inspects the session record and promotes a staged observation when at least one closure signal is present.

Closure signal taxonomy.

Topic abandonment. The researcher has moved to a new topic without revisiting the observation in the subsequent $k$ turns (default $k=5$), and no pending question remains open on the original thread.

Verbal affirmation. The researcher explicitly endorses the observation (“yes, we’ll go with X,” “confirmed,” or equivalent paraphrase), making the adoption decision first-person.

Empirical resolution. An experiment bound to the observation has produced a result and the researcher has commented on it; both supported and refuted outcomes are valid terminations (the refuted case promotes to a dead_end).

Artifact commitment. A downstream artifact now depends on the observation: code is merged, a config value is fixed, a subsequent claim uses it as a premise, or a design decision is documented as following from it.
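The promotion rule built on these four signals can be sketched in a few lines. This is an illustrative sketch only: the `StagedObservation` record, its field names, and the `"crystallized"` status label are assumptions made for the example, and in the actual system the signals are detected by the agent reading the session record rather than supplied as booleans.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StagedObservation:
    """Hypothetical summary of one staged observation at a session boundary."""
    turns_since_last_mention: int = 0
    topic_changed: bool = False
    pending_question: bool = False
    researcher_affirmed: bool = False
    experiment_outcome: Optional[str] = None  # "supported", "refuted", or None
    researcher_commented_on_result: bool = False
    downstream_dependency: bool = False


def closure_signals(obs: StagedObservation, k: int = 5) -> list[str]:
    """List the closure signals present for this observation."""
    signals = []
    if obs.topic_changed and obs.turns_since_last_mention >= k and not obs.pending_question:
        signals.append("topic_abandonment")
    if obs.researcher_affirmed:
        signals.append("verbal_affirmation")
    if obs.experiment_outcome in ("supported", "refuted") and obs.researcher_commented_on_result:
        signals.append("empirical_resolution")
    if obs.downstream_dependency:
        signals.append("artifact_commitment")
    return signals


def promote(obs: StagedObservation, k: int = 5) -> Optional[str]:
    """Return the promoted status, or None if the observation stays staged."""
    signals = closure_signals(obs, k)
    if not signals:
        return None  # no closure signal: keep the observation staged
    if "empirical_resolution" in signals and obs.experiment_outcome == "refuted":
        return "dead_end"  # a refuted result still closes the thread
    return "crystallized"
```

Any single signal is enough to promote, which mirrors the "at least one closure signal" rule; the only branching is that an empirically refuted observation becomes a dead_end rather than being discarded.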

Contradiction trigger.

When a new observation contradicts one already staged or crystallized, the Maturity Tracker does not silently overwrite. Both entries are flagged, the contradiction is appended to the Exploration Graph as a decision node with unresolved status, and resolution is deferred to the next briefing where the researcher adjudicates explicitly.

C.3 Cross-Session Continuity

A stateless coding agent has no memory of previous conversations. If the manager simply reads the artifact at each session boundary, it knows what the artifact contains but not why it is organized the way it is: which classification choices were non-obvious, which observations were deliberately deferred, which patterns guided past merges and promotions. Without this self-awareness, the manager risks inconsistent classification, duplicate entries, and organizational drift across sessions.

We address this with two lightweight mechanisms. First, a reasoning log (trace/pm_reasoning_log.yaml) records the manager’s own organizational decisions and their rationale at each session boundary, a compressed account of a few lines per session that gives the manager self-continuity without requiring access to raw conversation transcripts. Second, each session record includes a key context field: compressed summaries of the most important human–agent exchanges, preserving conversational nuance that would otherwise be lost when the raw conversation is no longer available. Together, these mechanisms ensure that the manager can read back not only the artifact’s current state but also the reasoning chain that produced it, maintaining coherent organization across arbitrarily many sessions at negligible token cost.

Appendix D Test Corpus

This section defines the test corpus used by the Understanding (App. E) and Reproduction (App. F) experiments. The RE-Bench tasks used by the Extension experiment (App. G) are documented separately in App. G.1.

Selection criteria.

We draw our evaluation corpus from PaperBench (Starace et al., 2025), which provides expert-authored hierarchical reproduction rubrics for ICML 2024 papers. We adopt all 23 papers in PaperBench’s public release as our PaperBench corpus, with no exclusions: every paper satisfies our three required properties: (1) peer-reviewed at a top ML venue (ICML 2024 Spotlight/Oral, with two NeurIPS 2024 workshop development papers) with a publicly available PDF; (2) spanning diverse ML subfields to test breadth across the discipline; (3) accompanied by a PaperBench rubric with fine-grained leaf requirements that enables quantitative evaluation of reproduction fidelity. We supplement this corpus with 7 open-ended R&D tasks from RE-Bench (Wijk et al., 2025), yielding 30 evaluation targets with 450 questions total.

Paper list.

Table 7 lists the 23 PaperBench papers used across the understanding and reproduction evaluations. Note that PaperBench’s own protocol blacklists author code; we relax this for the baseline so it has the strongest possible footing (PDF + companion GitHub), making “repo availability” a property of our harness rather than of PaperBench. Of the 23 papers, 15 are included in the reproduction experiment; the remaining 8 are excluded because faithful end-to-end reproduction exceeds our per-task compute budget or requires specialized infrastructure outside our evaluation harness (e.g., multi-day CLIP adversarial fine-tuning, Isaac Gym RL, full ImageNet coreset sweeps, large multi-model benchmark suites). These 8 papers participate only in the understanding evaluation.

Table 7: Test corpus: 23 PaperBench papers from ICML 2024 spanning diverse ML subfields. The “Repro” column indicates inclusion in the reproduction experiment (requires companion code). All 23 papers participate in the understanding evaluation.

Corpus diversity.

The 23 papers span diverse ML subfields: efficiency, alignment, interpretability, RL, scientific ML, generative models, optimization, retrieval, evaluation, and adaptation. While all papers come from a single venue (ICML 2024), they vary substantially in methodological complexity—from systems-oriented efficiency papers with minimal formal analysis to theory-heavy contributions (PINN, stochastic interpolants)—heuristic density, paper length, and contribution type (new architectures, training recipes, algorithms, analysis frameworks). The corpus includes papers that are deliberately challenging for structured extraction: those whose contributions are analysis rather than methods (mechanistic-understanding), multi-component pipelines (all-in-one), and papers with complex combinatorial experiment matrices (PINN with 1,273 leaf requirements).

Appendix E Understanding Evaluation

This section reports the methodology and per-stratum results for the Understanding experiment (§7.2). The shared test corpus is in App. D.

E.1 Question Bank and Grading Rubric

Question generation.

For each of the 30 evaluation targets (23 PaperBench papers, 7 RE-Bench tasks), we generate 15 questions across three categories: 10 Category A (fidelity: information preservation), 5 Category B (configuration and detail recovery) for PaperBench papers, or 5 Category C (failure and exploration knowledge) for RE-Bench tasks. Questions are designed to require specific, unambiguous answers or verifiable outputs. We avoid opinion questions (“Is this architecture good?”) and questions requiring external knowledge not in the paper (“How does this compare to BERT?”).

Category A question templates (10 per paper).

• Architecture & Method (3 questions): “What is the [specific structural detail]?” (e.g., “How many layers does the encoder have?”, “What are the inputs and outputs of the multi-head attention module?”)
• Hyperparameters & Configuration (2 questions): “What [training/optimization detail] is used?” (e.g., “What optimizer is used and what are its hyperparameters?”, “What is the batch size in tokens?”)
• Results & Claims (3 questions): “What [metric] does the [model variant] achieve on [benchmark]?” (e.g., “What BLEU score does the base model achieve on WMT 2014 EN-DE?”)
• Rationale & Design Decisions (2 questions): “Why was [design choice] made instead of [alternative]?” (e.g., “Why is scaled dot-product attention used instead of additive attention?”)

Category B question templates (5 per PaperBench paper).

• Implementation (2 questions): “Implement [specific module] with the correct [dimensions/activations/structure].”
• Configuration Recovery (2 questions): “Write the [optimizer/training/data] configuration with the exact parameters specified.”
• Debugging & Troubleshooting (1 question): “[Failure scenario]: identify the cause and fix it.”

Category C question templates (5 per RE-Bench task).

• Dead-End Knowledge (3 questions): “What approaches have been tried and failed?” / “What is the documented failure mode of [approach]?”
• Exploration History (2 questions): “What alternatives were considered for [decision]?” / “What lesson was learned from the [dead-end] attempt?”

Evaluation agent setup.

For each (paper, format, question) triple, we instantiate a fresh sub-agent (Claude Sonnet 4.6). The agent receives the format under test (the full Ara directory for the ARA condition, or the PDF plus companion GitHub repository for the baseline) and a single question. Each question is answered independently with a fresh context to prevent information leakage.

Grading rubric.

Each answer is scored on a ternary scale by an independent judge (Claude Opus 4.6) against a gold reference:

• Correct (1.0): The answer matches the ground truth in substance. Minor phrasing differences are acceptable; numerical values must be exact.
• Partial (0.5): The answer conveys the main insight but misses key sub-details.
• Incorrect (0.0): The answer contains a factual error, contradicts the gold answer, or hallucinates an answer to an unanswerable question.

E.2 Information Gap Type Distribution

Methodology.

PaperBench’s expert reproduction rubric for each paper enumerates the requirements an agent must satisfy to reproduce the paper’s results. For each of the 8,921 leaf requirements across the 23 papers, we compare the requirement against the source PDF and label it sufficient (the PDF text fully specifies what is needed), partial (some components specified but key details missing), or absent (the PDF does not address the requirement); each label is accompanied by an annotator confidence rating (high / medium / low). When the label is partial or absent, we additionally tag the requirement with a gap-type category (missing hyperparameter, vague description, cross-reference-only, etc.). Labels are produced by an LLM-as-judge run per (requirement, PDF excerpt) pair, with the judge required to cite the PDF passage that supports its decision; the headline 45.4% sufficient figure is dominated by the 64% high-confidence subset (full pipeline in code/eval/pdf_information_gap.py).

Per-category coverage.

Table 8 breaks the 8,921 requirements down by PaperBench task category. The median paper shows 45.3% sufficient and 47.9% partial, confirming the gap is systemic rather than driven by outliers.

Table 8: Reproduction information gap across 23 PaperBench papers (8,921 requirements). PDFs systematically under-specify the information needed for reproduction, with the largest gaps in code development and dataset acquisition.

Gap-type breakdown.

Table 9 breaks down the 8,921 reproduction requirements by gap type. The three largest categories, missing hyperparameters (26.2%), vague descriptions (21.9%), and cross-reference-only specifications (13.4%), account for over 60% of all gaps and are precisely the information types that structured formats address by design. At the fine-grained level, Dataset Acquisition achieves only 5.4% sufficient coverage (25.5% entirely absent): no paper in the corpus consistently provides download URLs, preprocessing scripts, or data format specifications. Evaluation, Metrics & Benchmarking sits at 30.0%: papers often state which metrics they use but not how they compute them (binning strategies, confidence intervals, statistical tests).

Table 9: Distribution of information gap types across 8,921 requirements. The three largest categories (missing hyperparameters, vague descriptions, and cross-references) are precisely the gaps that structured formats address by design.

E.3 Exploration Cost Detailed Breakdown

What this measures (and what it is not).

Across 24,008 agent runs (21 frontier models, 228 tasks) in the METR MALT corpus, 59.2% of tokens and 90.2% of dollar cost ($63,483 total) are spent in runs that did not reach the task’s reference score. This is not wasted research effort: those runs map dead ends, rule out alternatives, and narrow the strategy space the next agent should consider. The cost only becomes waste downstream, when the next agent does not have access to that exploration and must rediscover the same dead ends from scratch. The exploration tax we report is therefore the per-agent cost of rediscovery if the failure record is not propagated, not a property of any individual run.

Breakdown.

Table 10 gives the per-run breakdown. The mean below-reference token cost is 8.6× the cost of a reference-reaching run (2.58 M vs. 300 K tokens per run), with a median of 113×. Within the 59.2% of tokens that do not reach reference, 44.8% are spent in runs that produce no measurable improvement and 14.4% in runs that re-derive solutions other agents had already produced. The pattern concentrates where research-like work happens: RE-Bench tasks (the most open-ended) end below reference 73.4% of the time, vs. 47.0% on moderate-difficulty HCAST tasks and 0.7% on well-defined SWAA tasks. At the per-task level, easy tasks reach reference 85.4% of the time, medium 30.7%, and hard only 15.1%.
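The two headline figures in this breakdown follow directly from the numbers quoted in the paragraph. The short check below is only a sanity sketch of that arithmetic; the helper name `cost_ratio` is made up for the example, and the inputs are the rounded values from the text, not the underlying per-run data.

```python
def cost_ratio(below_reference_tokens: float, reference_tokens: float) -> float:
    """Ratio of mean tokens per below-reference run to mean tokens per reference-reaching run."""
    return below_reference_tokens / reference_tokens


# Mean tokens per run, as quoted in the text (2.58 M vs. 300 K).
mean_ratio = cost_ratio(2.58e6, 300e3)

# Token shares (percent of all tokens) for the two kinds of below-reference runs.
no_improvement_share = 44.8
rederived_share = 14.4
below_reference_share = no_improvement_share + rederived_share

print(f"mean ratio: {mean_ratio:.1f}x")
print(f"below-reference token share: {below_reference_share:.1f}%")
```

The 113× median cannot be recomputed this way, since it depends on the per-run distribution rather than on the two means.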

Table 10: Cost of below-reference exploration across 24,008 agent runs (21 frontier models, 228 tasks). The exploration itself is necessary research work; the cost only becomes waste when subsequent agents must re-incur it because the failure record is not preserved in the published artifact.

E.4 Per-Category Result Analysis

This subsection unpacks the three per-category results summarized in §7.2 and Table 3 into the specific structural mechanisms that produce each gain.

Category A: fidelity at lower cost via progressive disclosure.

Ara preserves PDF-recoverable information with high fidelity while requiring fewer tokens to retrieve. On PaperBench, Ara achieves 96.7% vs. 89.8% for the baseline while consuming 12% fewer tokens per question (86.3K vs. 97.7K). The structural explanation is progressive disclosure: Ara’s PAPER.md provides a layer index that directs the agent to the relevant file (e.g., evidence/tables/ for numerical results, logic/solution/algorithm.md for method details), whereas the PDF agent must scan the entire document for each query. On RE-Bench, where the baseline reads only the synthesized polished paper rather than a real publication, Ara’s accuracy advantage widens to 92.1% vs. 51.4%: the synthesized writeup omits much of the technical detail that the artifact’s structured layers preserve. The headline finding is that structured organization improves accuracy while keeping token usage comparable, because Ara’s layer taxonomy turns linear search into indexed lookup.

Category B: configuration recovery via centralized configs.

The rubric-aligned questions probe fine-grained experimental details (hyperparameter values, environment specifications, preprocessing steps) that PaperBench rubrics demand but papers systematically omit (26.2% of all gaps are missing hyperparameters; see Appendix E.2). The baseline’s 67.8% reflects successful code-repository mining: given a dedicated sub-agent per question, it can grep through the companion GitHub repo for many configuration values. Ara’s src/configs/ and logic/requirements.md layers, however, centralize this knowledge in human-readable files, raising accuracy to 92.6% at comparable token usage (183K vs. 178K tokens per question): the agent reads a structured config file rather than searching a scattered codebase. The remaining gap to 100% reflects details genuinely absent from both the paper and its repository, which the Compiler cannot synthesize.

Category C: failure knowledge has no analogue in the baseline.

Ara reaches 81.4% on failure-knowledge questions while the baseline manages only 15.7%; the synthesized polished papers contain almost no record of failed approaches, dead-end configurations, or intermediate results that the trace layer preserves. The baseline’s low token usage per question (58.0K) reflects this poverty: agents quickly determine the information is absent and return short answers, spending minimal tokens on fruitless search. Ara agents consume more tokens per question (139.3K) but productively explore the exploration tree to find answers. This category provides the clearest evidence for preserving negative knowledge: information that narrative formats systematically discard accounts for the largest single accuracy gap in the entire evaluation.

E.5 Statistical Details

A McNemar test on the 450 paired outcomes yields $\chi^{2}=95.15$, $p<10^{-10}$ overall: Ara answers 141 questions correctly that the baseline misses, while the baseline answers only 18 that Ara misses. By category, the Ara advantage is highly significant for all three categories: Category A (+14.8%), Category B (+24.8%), and Category C (+65.7%, dominated by the absence of exploration knowledge in baseline sources).
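The McNemar statistic depends only on the two discordant counts quoted above (141 and 18). The sketch below uses the uncorrected form $(b-c)^2/(b+c)$, which reproduces the reported value; that the authors used exactly this variant (rather than a continuity-corrected or exact test) is an inference from the numbers, not something the text states. The function names are illustrative, and the p-value uses the standard identity for a chi-square variable with one degree of freedom.

```python
import math


def mcnemar_chi2(b: int, c: int) -> float:
    """Uncorrected McNemar statistic from the two discordant pair counts."""
    if b + c == 0:
        raise ValueError("McNemar's test is undefined with no discordant pairs")
    return (b - c) ** 2 / (b + c)


def chi2_sf_1dof(x: float) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with 1 degree of freedom."""
    # For 1 d.o.f., X = Z^2 with Z standard normal, so P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


# b: Ara correct, baseline wrong; c: baseline correct, Ara wrong.
chi2 = mcnemar_chi2(141, 18)
p_value = chi2_sf_1dof(chi2)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")
```

Concordant pairs (both right or both wrong) never enter the statistic, which is why the result can be rechecked from the two counts alone without the full 450-question table.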

Difficulty stratification.

Stratified by question difficulty, Ara leads across all tiers: T1 (explicit) questions (ARA 97.3%, BL 83.8%; $n=74$), T2 (scattered) questions (ARA 95.6%, BL 79.0%; $n=193$), and T3 (implicit) questions (ARA 91.0%, BL 60.5%; $n=172$). On unanswerable questions ($n=26$), Ara achieves 92.3% abstention accuracy vs. 86.5% for the baseline. The difficulty gradient is expected: T2 and T3 questions require assembling scattered information or reasoning about implicit assumptions, where structured representations provide the greatest advantage.

Token usage–difficulty interaction.

The per-question token data reveals that Ara’s progressive disclosure architecture creates an adaptive search pattern: Ara agents consume 60.9K tokens/Q on T1 (explicit) questions, 95.5K on T2 (scattered), and 152.7K on T3 (implicit), adapting search depth to question complexity. Baseline agents, by contrast, show a flatter profile across difficulty tiers (82.8K–118.2K tokens/Q), because linear PDF scanning does not benefit from question-aware navigation. Ara consumes fewer tokens than the baseline on T1 (27% less) and T2 (13% less), and invests more on T3, while being substantially more accurate at every tier.

Benchmark group breakdown.

On PaperBench papers ($n=345$), Ara achieves 95.4% vs. 82.5% at comparable token usage. On RE-Bench tasks ($n=105$), the accuracy gap widens (ARA 88.6% vs. BL 39.5%), driven by Category C questions where the baseline has no access to failure knowledge.

Scoring methodology note.

Each answer is scored on a ternary scale (1.0 correct, 0.5 partially correct, 0.0 incorrect) against a gold reference. An answer receives 1.0 if it captures all essential facts, numbers, and concepts; 0.5 if it conveys the main insight but misses key sub-details; and 0.0 if it is wrong, contradicts the gold answer, or hallucinates an answer to an unanswerable question.

Appendix F Reproduction Evaluation

This section reports the task design, scoring, and per-paper analysis for the Reproduction experiment (§7.3).

F.1 Reproduction Task Design and Scoring

Task curation.

Each of the 150 reproduction tasks specifies a single model, a single method, and 5–15 rubric leaf requirements as success criteria, with difficulty stratified per paper ($\geq 3$ easy, $\geq 3$ medium, $\geq 3$ hard; aggregate: 50 easy, 49 medium, 51 hard). Tasks describe what to reproduce, not how; the agent decides its own implementation strategy. Within each paper, the 10 subtasks form a single mega-task: the agent receives all subtasks ordered by difficulty (easy → medium → hard) and builds cumulatively, naturally reusing prior work as a human researcher would.

Scoring formula.

The primary metric is the difficulty-weighted success rate: $\sum_{i} s_{i}\cdot w_{d_{i}} \,/\, \sum_{i} m_{i}\cdot w_{d_{i}}$, with $w_{d}\in\{1,2,3\}$ for easy, medium, and hard subtasks, where $s_{i}$ is the subtask score (sum of requirement weights for yes $+\,0.5\times$ partial) and $m_{i}$ the maximum possible. Easy subtasks (setup, model instantiation) are necessary but not discriminative; most agents complete them regardless of source material, while harder subtasks (training, ablation, cross-method comparison) are where structured information provides the most leverage. We also report the flat (unweighted) rate and per-difficulty breakdowns.
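The scoring formula can be made concrete with a small sketch. The data layout here is an assumption for illustration: each subtask is a `(difficulty, requirements)` pair and each requirement a `(weight, verdict)` pair, with `"no"` used as a hypothetical label for an unmet requirement. Only the 1:2:3 difficulty weights and the yes / 0.5 × partial credit rule come from the text.

```python
# Difficulty weights w_d for easy, medium, and hard subtasks.
DIFFICULTY_WEIGHT = {"easy": 1, "medium": 2, "hard": 3}
# Credit per verdict; "no" is an assumed label for an unmet requirement.
CREDIT = {"yes": 1.0, "partial": 0.5, "no": 0.0}


def subtask_score(requirements):
    """Return (s_i, m_i): earned and maximum requirement weight for one subtask."""
    earned = sum(weight * CREDIT[verdict] for weight, verdict in requirements)
    maximum = sum(weight for weight, _ in requirements)
    return earned, maximum


def weighted_success_rate(subtasks):
    """Difficulty-weighted rate: sum_i s_i * w_{d_i} / sum_i m_i * w_{d_i}."""
    numerator = denominator = 0.0
    for difficulty, requirements in subtasks:
        earned, maximum = subtask_score(requirements)
        numerator += earned * DIFFICULTY_WEIGHT[difficulty]
        denominator += maximum * DIFFICULTY_WEIGHT[difficulty]
    return numerator / denominator


def flat_success_rate(subtasks):
    """Unweighted rate: every subtask counts the same regardless of difficulty."""
    scores = [subtask_score(requirements) for _, requirements in subtasks]
    return sum(s for s, _ in scores) / sum(m for _, m in scores)


# Toy example: a fully solved easy subtask and a mostly failed hard one.
example = [
    ("easy", [(1, "yes"), (1, "yes")]),
    ("hard", [(1, "partial"), (1, "no")]),
]
print(weighted_success_rate(example), flat_success_rate(example))
```

In the toy example the flat rate is 2.5/4 while the weighted rate is 3.5/8, showing how the 1:2:3 weighting shifts emphasis toward the hard subtask.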

Statistical significance.

A Wilcoxon signed-rank test on the 15 paired per-paper weighted scores yields $p=0.028$: Ara wins on 8 papers, ties on 5, and the baseline leads on 2. The sign pattern (8–2) is itself statistically improbable under the null hypothesis of no difference ($p=0.039$, exact binomial), confirming that the aggregate advantage is not driven by a single outlier paper.

F.2 Per-Paper Reproduction Analysis

This section provides the detailed per-paper analysis for the reproduction experiment (§7.3).

Per-difficulty analysis.

The aggregate per-difficulty pattern (Ara 85.1% vs. baseline 80.2% on easy, 68.5% vs. 62.9% on medium, 54.5% vs. 46.0% on hard) is visualized in main-text Figure 11. Figure 13 resolves this aggregate to the per-paper level, and Table 11 provides the full per-paper, per-difficulty success rates underlying both.

Figure 13: Per-paper Ara − baseline delta (percentage points) on each difficulty stratum, sorted by mean advantage. Green indicates Ara wins, red indicates baseline wins. Gains concentrate in the medium and hard columns across most papers; the few baseline wins are confined to a small set, most prominently self-expansion and ftrl.

Table 11: Per-paper reproduction success rates (%) by difficulty level. Easy, medium, and hard columns show the unweighted success rate within each difficulty tier; the final two columns show the difficulty-weighted rate (1:2:3 weighting). Rice per-difficulty values are interpolated from the weighted score and overall rate, as its per-difficulty JSON entry was recorded separately.

Large ARA wins.

The papers with the largest ARA advantages—fre(+21.3%+21.3\%),mechanistic-understanding(+20.7%+20.7\%), andpinn(+19.5%+19.5\%)—share complex multi-step training pipelines with non-obvious hyperparameter interactions that PDFs describe only at a high level. ThefreARA agent reimplemented the original JAX codebase in PyTorch (1.8 GB GPU vs. JAX’s 30.8 GB), trained 17 models across three domains, and completed all medium and hard subtasks; the baseline agent struggled with the JAX environment and completed only 3 training attempts before exhausting its budget. The five newly added papers reinforce these patterns:all-in-one(+16.0%+16.0\%) andftrl(+6.1%+6.1\%) show clear ARA advantages on hard tasks, whilestochastic-interpolants(+0.5%+0.5\%) andtest-time-model-adaptation(+0.3%+0.3\%) are ties with comparable performance from both sources.

Baseline wins and ties.

The one clear baseline win isself-expansion(−7.3%-7.3\%), where the ARA agent exhibited result fabrication—reporting identical accuracy values across all configurations—detected by the blinded judge. Among the narrow ties,adaptive-pruning(−2.3%-2.3\%) andrice(−1.9%-1.9\%) both have strong companion repositories with runnable training scripts; the baseline’s code access partially compensates for the PDF’s information gaps. Onrice, ARA achieves comparable quality with 2.5×\timesless compute (3.7h vs. 9.1h, 131K vs. 195K tokens), suggesting efficiency gains even when final scores are similar.

Result fabrication.

Two baseline runs (bbox, mechanistic-understanding) exhibited result fabrication, reporting plausible but uncomputed values when unable to complete training; both were detected by the blinded judge. Across all 15 papers, fabrication occurred in 2 baseline runs and 1 ARA run (self-expansion), suggesting that structured artifacts generally provide sufficient grounding to prevent hallucinated results, though they are not immune.

Appendix G Extension Evaluation

This appendix documents the methodology behind §7.4: which RE-Bench tasks we use and why (App. G.1), how each Ara is compiled from official solutions and prior MALT trajectories (App. G.2), how the polished paper-arm baseline paper.md is generated (App. G.3), the engineering of the agent harness (App. G.4), how we extract canonical scores from agent traces (App. G.5), and the per-task case studies and trace evidence (App. G.6).

G.1 Task selection

Of the 7 RE-Bench (Wijk et al., 2025) tasks, we use 5 in the extension evaluation. Table 12 lists the score formula, direction, on-task starting baseline, RE-Bench reference score, hardware requirement, and the model coverage of each task’s METR MALT corpus.

| Task | Score formula | Dir. | Start | Ref. | Hardware | Claude-4 MALT | Used |
|---|---|---|---|---|---|---|---|
| triton_cumsum | $\log(t_{\text{ms}})$ | ↓ | 1.56 | 0.47 | 1×H100 | 22 (O13/S9) | yes |
| restricted_mlm | $\log(\ell - 1.5)$ | ↓ | 1.81 | 1.13 | 2×H100 80 GB | 22 (O11/S11) | yes |
| fix_embedding | $\log(\ell_{\text{val}} - 1.5)$ | ↓ | 2.20 | 0.26 | 1×GPU | 19 (O10/S9) | yes |
| nanogpt_chat_rl | avg. pairwise win-rate | ↑ | 0.54 | 0.85 | 1×GPU + judge | 18 (O12/S6) | yes |
| rust_codecontests | $n_{\text{solved}}/165$ | ↑ | 0.00 | 0.13 | CPU + LLM API | 12 (O6/S6) + 10† | yes |
| small_scaling_law | $1 - (\epsilon_{\ell} + \epsilon_{p})$ | ↑ | 0.24 | 0.84 | 1×GPU | 0‡ | no |
| optimize_llm_foundry | $\log(t_{\text{s}})$ | ↓ | 5.60 | 4.54 | 4×H100 | 0§ | no |

Table 12: RE-Bench task card with extension-evaluation status. Score formula is transcribed verbatim from metr-re-bench/ai_rd_<task>/ai_rd_<task>.py; $\ell$ denotes validation loss, $t$ wall-clock time, $n_{\text{solved}}$ the count of correctly solved problems, $\epsilon_{\ell}$/$\epsilon_{p}$ loss/parameter prediction errors. Dir.: score orientation. Start: score of the unmodified starter codebase. Ref.: best score reported in the original RE-Bench evaluation. Claude-4 MALT: count of Claude-4 (Opus + Sonnet) runs in the METR MALT corpus, broken down as O$\langle$Opus$\rangle$/S$\langle$Sonnet$\rangle$. † rust_codecontests also has a 10-run claude-3-7-sonnet supplement that uses the same scoring scaffold. ‡ only Claude-3.5/3.7 and OpenAI runs. § no MALT corpus published.

The bottom two tasks are excluded because their MALT corpora cannot supply a usable failure-trace layer for the experiment. optimize_llm_foundry has no published MALT corpus at all, so trace/ would be empty by construction. small_scaling_law’s MALT corpus does exist but is structurally inadequate: it predates Claude-4 (only Claude-3.5/3.7 Sonnet and OpenAI models), it is sparse, and the runs that do exist are dominated by trivial parameter-grid sweeps with no recorded strategic exploration or named dead ends. An extraction pipeline run on those runs produces effectively empty trace/ and evidence/ layers and neuters the experimental contrast against the paper-only baseline. We defer both tasks to follow-up work that re-runs MALT collection on Claude-4 agents (optimize_llm_foundry) or on tasks whose strategy space elicits substantive recorded exploration (small_scaling_law).

G.2 ARA construction pipeline

Each RE-Bench Ara is compiled from two sources: the official reference solution (copied verbatim into src/) and the task’s METR MALT transcripts (extracted under a beat-reference filter into trace/ and evidence/). The compiler is task-agnostic with per-task knobs (score formula and direction, MALT JSONL path, dev-history schema, known hazards) collected in per-task cards; the orchestrator procedure and shared sub-agent prompt live in code/rebench-pipeline/.

Pipeline.

The orchestrator first lifts the official solution into src/ and the reference-derived knowledge (mathematical formulation, algorithm, heuristics, baseline tables, dev-history nodes) into logic/ and evidence/, with each node tagged source: official-solution. It then fans out one extraction sub-agent per MALT run; each sub-agent reads its run in full (no truncation, no chunk skipping) and emits a bundle of trace nodes, evidence rows, and insights. As sub-agents complete, the orchestrator merges their outputs into the artifact: deduplicating approaches across runs (the same method appearing in $K$ runs becomes one node tagged runs_observed: K), generalising heuristics or claims when a new run extends an existing entry, and verifying that every node carries a provenance tag and every heuristic cites a specific source line. Hallucination prevention is enforced throughout: no invented numbers, experiments, or code beyond what the source artifacts visibly ship.
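The deduplication step of the merge can be illustrated with a minimal sketch. The node schema (an `approach` string per node), the function name `merge_trace_nodes`, and matching approaches by exact string equality are all assumptions made for illustration; the pipeline's actual matching logic is not specified here.

```python
def merge_trace_nodes(bundles):
    """Merge per-run node lists into one node per approach.

    bundles: dict mapping a run id to a list of node dicts, each with an
    "approach" key (hypothetical schema). An approach seen in K distinct
    runs yields a single node with runs_observed == K.
    """
    merged = {}
    for run_id, nodes in bundles.items():
        seen_in_run = set()
        for node in nodes:
            key = node["approach"]
            if key in seen_in_run:
                continue  # count each run at most once per approach
            seen_in_run.add(key)
            entry = merged.setdefault(
                key, {"approach": key, "runs_observed": 0, "sources": []}
            )
            entry["runs_observed"] += 1
            entry["sources"].append(run_id)  # provenance: which runs contributed
    return list(merged.values())


example_bundles = {
    "run_1": [{"approach": "decoupled lookback"}, {"approach": "autotune sweep"}],
    "run_2": [{"approach": "decoupled lookback"}],
}
merged_nodes = merge_trace_nodes(example_bundles)
```

Keeping the contributing run ids alongside the count is one simple way to retain a provenance tag on every merged node.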

Beat-reference filter.

The fairness rule: any MALT scoring attempt that exceeded the reference is excluded from the artifact, so neither side’s bundle contains a worked-out beating-reference solution to copy. The filter is direction-aware (lower-better tasks exclude $\text{score}<\text{ref}$; higher-better tasks exclude $\text{score}>\text{ref}$) and applied per attempt rather than per run, so a single MALT trajectory contributes both its dead ends and its sub-reference partial successes; it is enforced twice (inside the sub-agent and again at merge) so misclassifications are caught.
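A minimal sketch of the direction-aware, per-attempt filter described above. The attempt schema (a dict with a `score` key) and the function names are hypothetical; the strict inequalities follow the rule as stated, so an attempt that exactly ties the reference is kept.

```python
def beats_reference(score, ref, lower_is_better):
    """True if an attempt strictly exceeds the reference in the task's direction."""
    return score < ref if lower_is_better else score > ref


def filter_attempts(attempts, ref, lower_is_better):
    """Drop individual attempts that beat the reference; keep everything else.

    Applied per attempt, so one trajectory can still contribute its dead
    ends and its sub-reference partial successes.
    """
    return [
        a for a in attempts
        if not beats_reference(a["score"], ref, lower_is_better)
    ]


# Lower-better example (reference 0.47): the 0.40 attempt is excluded.
lower_kept = filter_attempts(
    [{"score": 0.64}, {"score": 0.47}, {"score": 0.40}], ref=0.47, lower_is_better=True
)
# Higher-better example (reference 0.127): the 0.15 attempt is excluded.
higher_kept = filter_attempts(
    [{"score": 0.097}, {"score": 0.15}], ref=0.127, lower_is_better=False
)
```

Running the same function once inside each sub-agent and once at merge time would correspond to the double enforcement mentioned in the text.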

G.3 Paper baseline construction

The paper-agent reference/paper.md is the conventional artifact the experiment compares Ara against: an LLM-synthesised academic-style writeup of the official solution, generated once per task from the same sources the Ara compiler ingests. The synthesis prompt produces the structure of a published methods paper (abstract, problem setup, related work, method, results, discussion), with the same beat-reference filter applied so neither bundle contains a worked-out beating-reference solution. By design paper.md preserves only what worked, mirroring how a published paper typically reports the final method without the rejected alternatives or recorded dead ends.

G.4 Harness engineering

The extension harness wraps the Claude Agent SDK (Anthropic, 2025b) in a SLURM-launched single-agent loop with tool surface {Bash, Read, Edit, Write, Glob, Grep}. Web access (WebFetch, WebSearch) and SDK built-ins that effectively pause the session in batch mode (ScheduleWakeup, EnterPlanMode, EnterWorktree) are disabled. The agent’s workdir is identical across arms except for reference/.

Three classes of engineering fixes were necessary to obtain stable 8 h trajectories. Table 13 enumerates the observed failure mode, root cause, and shipped fix for each.

Table 13: Harness failure modes encountered during the extension evaluation and the fixes shipped in code/extension-harness/harness.py. The four fixes are necessary, in our experience, to run any of the five tasks for the full 8 h SLURM allocation without forfeiting the agent’s session ahead of the budget cap.

Run resources.

8 h SLURM wall clock and $50 hard API-spend cap (SDK-enforced) per run, on 1×H100 (triton_cumsum, fix_embedding, nanogpt_chat_rl), 2×H100 80 GB (restricted_mlm), or CPU-only (rust_codecontests). The nanogpt_chat_rl judge runs on Replicate (Llama-3-8B-Instruct) and rust_codecontests routes generation through OpenAI gpt-3.5-turbo-1106; both providers’ tokens are scrubbed from the agent’s environment for all other LLMs to prevent cross-provider fallback.

G.5 Score-event extraction

Score events are extracted from trace.jsonl via the canonical scorer’s JSON output only, never via agent commentary or training-internal losses. Per-task patterns:

• triton_cumsum: {"score": X, "message": {"shape_dtype_match": True, "results_match": True, "torch_time_ms": …, "solution_time_ms": Y}} from the harness scorer; raw metric is solution_time_ms.
• restricted_mlm, fix_embedding: {"score": X, "loss": Y, "compliant": …, "device": …} from local_score.py.
• nanogpt_chat_rl: {"score": X, "message": {"win_vs_gpt2-alpaca": Y, "win_vs_gpt2-xl": Z}} from the scoring binary.
• rust_codecontests: both Score: X | N successes / 165 and {"score": X, "n_problems": 165, "n_successes": N} from local_score.py; we accept either as canonical and dedupe by (round(t,1), N).
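The rust_codecontests rule (accept either output form, then dedupe by `(round(t,1), N)`) can be sketched as follows. This is a simplified illustration: the input shape, a list of `(t_minutes, scorer_output)` pairs, is an assumption, since the trace.jsonl schema is not reproduced here, and the sample values are made up for the example.

```python
import json
import re

# Text form emitted by the scorer: "Score: X | N successes / 165"
TEXT_PATTERN = re.compile(r"Score:\s*([0-9.]+)\s*\|\s*(\d+)\s+successes\s*/\s*165")


def parse_rust_score(output):
    """Return (score, n_successes) from one scorer output string, or None."""
    match = TEXT_PATTERN.search(output)
    if match:
        return float(match.group(1)), int(match.group(2))
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return None  # agent commentary or other non-scorer text
    if isinstance(payload, dict) and "score" in payload and "n_successes" in payload:
        return float(payload["score"]), int(payload["n_successes"])
    return None


def extract_events(records):
    """records: iterable of (t_minutes, scorer_output); dedupe by (round(t, 1), N)."""
    seen = set()
    events = []
    for t, output in records:
        parsed = parse_rust_score(output)
        if parsed is None:
            continue
        score, n_successes = parsed
        key = (round(t, 1), n_successes)
        if key in seen:
            continue  # same evaluation reported in both text and JSON form
        seen.add(key)
        events.append({"t": t, "score": score, "n_successes": n_successes})
    return events


sample_records = [
    (161.02, "Score: 0.297 | 49 successes / 165"),
    (161.04, '{"score": 0.297, "n_problems": 165, "n_successes": 49}'),
    (214.00, '{"score": 0.339, "n_problems": 165, "n_successes": 56}'),
    (215.00, "training loss is 0.3"),
]
sample_events = extract_events(sample_records)
```

Only strings matching one of the two scorer formats produce events, which mirrors the rule that agent commentary and training-internal losses are never used as scores.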

Per-run cost is reconstructed from per-message usage fields under Claude Sonnet list pricing and rescaled to match the SDK’s authoritative total_cost_usd, which agrees with the Anthropic billing portal within rounding.
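One simple way to perform such a rescaling is a uniform proportional adjustment, sketched below. Whether the analysis scripts scale uniformly is an assumption; the function name and inputs are hypothetical.

```python
def rescale_costs(per_message_estimates, total_cost_usd):
    """Scale per-message cost estimates so they sum to the authoritative total.

    per_message_estimates: costs computed from usage fields at list pricing.
    total_cost_usd: the run-level total reported by the SDK.
    Assumes a uniform scale factor across messages.
    """
    estimated_total = sum(per_message_estimates)
    if estimated_total <= 0:
        raise ValueError("estimated total must be positive to rescale")
    factor = total_cost_usd / estimated_total
    return [cost * factor for cost in per_message_estimates]


rescaled = rescale_costs([1.0, 3.0], total_cost_usd=8.0)
```

A uniform factor preserves the relative cost of each message, so a cumulative score-vs-cost curve keeps its shape while its endpoint matches the reported total.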

G.6 Per-task case studies

Each task in §7.4 is unpacked here as a trajectory case study grounded in the agent’s own trace.jsonl and ThinkingBlock stream, in the same order as the columns of Figure 12. The body composite already shows the Sonnet 4.6 trajectories; for triton_cumsum (Fig. 14) and restricted_mlm (Fig. 15) we additionally report the same paired comparison on the older Sonnet 4.5 base to make the contrast across model versions visible.

G.6.1 Case study: triton_cumsum (GPU kernel optimization)

The task is to write a Triton kernel for a conditional prefix sum on $10^{8}$ int32 elements: $Y_{i}=\sum_{j\leq i}x_{j}\cdot\mathbb{1}[\text{odd \#positives precede }j]$, scored by $\log(t_{\text{ms}})$ on an H100. Both arms start from the same official solution: a 3-pass Triton kernel (parity scan $\to$ conditional cumsum $\to$ block-sum addition) with autotuned BLOCK_SIZE/NUM_STAGES. We ran four trajectories: paper and Ara arms on Claude Sonnet 4.5 (the model on which we have the longest paired runs) and on Sonnet 4.6 (where we re-ran with the model the rest of the evaluation uses). The Sonnet 4.6 trajectories are in body Fig. 12 (leftmost column); Fig. 14 below shows the Sonnet 4.5 paired runs. The analysis cross-references the trace for both.
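For orientation, the formula can be written as a sequential pure-Python reference. This is only a sketch of the specification, not the Triton kernel: it reads "precede $j$" as "strictly before index $j$", which is an assumption (the task's own scorer is authoritative on that boundary), and the function name is hypothetical.

```python
def conditional_prefix_sum(xs):
    """Sequential reference for Y_i = sum_{j<=i} x_j * 1[odd #positives precede j].

    "Precede" is interpreted as strictly before index j (an assumption).
    """
    out = []
    positives_seen = 0  # number of positive elements strictly before the current one
    running = 0
    for x in xs:
        if positives_seen % 2 == 1:
            running += x  # x_j contributes only when the preceding count is odd
        out.append(running)
        if x > 0:
            positives_seen += 1
    return out


reference_output = conditional_prefix_sum([1, 2, -3, 4, 5])
```

The dependence of each term on a running parity is what forces the parity scan before the conditional cumsum in the 3-pass design.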

Figure 14: triton_cumsum on Sonnet 4.5: paper vs. Ara score-vs-time (left) and score-vs-cost (right). Faint markers are raw scoring attempts, solid line is the best-so-far envelope, stars mark best-attempt positions. Dotted line is the original RE-Bench reference (0.47) reported on different H100 silicon; the harness-measured per-hardware baseline ($\sim 0.64$) is where both arms start. The 4.6 trajectories are in the body composite (Fig. 12, leftmost column).

Two regimes split along model.

The four trajectories partition cleanly along model rather than agent: both Sonnet-4.5 agents leave the official kernel’s algorithmic structure untouched and only adjust its autotune settings, while both Sonnet-4.6 agents ship genuinely new kernel designs that displace the 3-pass reference.

Sonnet 4.5: trace-conditioned autotune sweep.

The paper agent populated its autotune grid with $\texttt{NUM\_STAGES}\in\{1,24,32,48,64,96\}$, labelling the deepest pipelines as “Extreme pipelining — highest performers” in inline comments and never testing the $\{4,8\}$ regime. The Ara agent picked $\texttt{NUM\_STAGES}\in\{4,8\}$ instead, citing heuristic H01 (“Grid size is fixed at 128 for H100 (which has 132 SMs) …”) verbatim in a ThinkingBlock at $t=4.3$ min after reading evidence/tables/malt_attempts.md and src/configs/autotune.md. That conservative grid is what the autotuner selects from at runtime, and it is what produces the Ara agent’s $\sim 0.27$ vs. the paper agent’s flat $\sim 0.64$ in Fig. 14. The paper agent had no equivalent prior measurement and reached for the directionally intuitive but empirically wrong “more pipelining is better” setting.

Sonnet 4.6: early head start, late paper-agent overtake.

The Ara agent calls bash score.sh for the first time at $t=11$ min and immediately scores 0.47, having edited the kernel using trace-surfaced ideas (decoupled lookback, associative_scan) in the first ten minutes. The paper agent does not score until $t=37$ min and lands at 0.38, having spent the early wall clock reading the polished writeup and reasoning from first principles. The Ara agent leads on best-so-far through $t\approx 75$ min, then the paper agent overtakes via an int8 input compression introduced at $t=47.7$ min (motivated by the scorer’s $[-10,9]$ input range fitting in 8 bits) which, combined with a parity-tracking per-block aggregate, drops total memory traffic from $\sim$2 GB to 0.5 GB; it iterates on this design through the rest of the run. The Ara agent meanwhile commits to a chained-scan-with-decoupled-lookback redesign and spends late-phase compute on boundary-correctness debugging, anchored by heuristic H13 and a trace-reported MALT ceiling. int8 appears once in the Ara agent’s trace at $t=40.5$ min as a passing thought and is never implemented.

Reading.

The Ara artifact contributes two qualitatively different things at two phases of the run. Early, it acts as an initialiser: the agent shortcuts the diagnostic phase, picks autotune knobs the paper agent misses, and lands an improved kernel within minutes. Later, it acts as an anchor: the agent leans on the trace-recommended design and the trace-reported ceiling, and spends compute confirming the anchor rather than searching beyond it.

G.6.2 Case study: rust_codecontests (LLM-as-tool scaffolding)

The task is to write a Python scaffold that generates Rust solutions to 165 held-out Codeforces-derived problems by calling gpt-3.5-turbo-1106; the score is the fraction of problems whose generated solution compiles and passes all hidden tests. The official reference scaffold (score 0.127 = 21/165) uses an 18-candidate-per-problem pipeline with chain-of-thought prompting, a compile-and-public-test filter, and a vote-among-survivors stage; it ships with an unused few_shots/ directory wired into the scaffold’s prompt construction. Both arms start from the same scaffold. We ran a single seed per arm on Sonnet 4.6 (the rust task has no Sonnet-4.5 paired data); the trajectories appear in body Fig. 12 (second column). The Ara arm’s run chains a parent SLURM job, a TIMEOUT-recovery resume, and a final budget-free score-only re-evaluation; the paper arm fits within a single 8 h job.

The trace converts a MALT data point into actionable guidance.

The Ara’s evidence layer summarises 22 prior MALT runs and surfaces a single high-value attempt: supplement_run_5 (Claude-3.7-Sonnet) reached 0.097 by bypassing the gpt-3.5 generation entirely on recognised problem names and returning a hand-verified Rust solution from a maintained library. Crucially $0.097<0.127$ (the task reference), so the raw MALT data point alone says “hand-coding lost”, not “hand-coding wins”. The heuristics layer reframes the same data point as two explicit rules, one prescriptive and one prohibitive: H12 (“Hand-coded Rust solution library outperforms prompt engineering on this task”) and H15 (“Generator ceiling at GPT-3.5-turbo Rust $\sim 0.05$–$0.10$ across all explored single-completion variants”). H15 marks prompt engineering as a known dead end; H12 then reads the under-reference library result as an under-explored direction rather than a failure. The Ara agent reads heuristics.md and the MALT attempts table within the first minute and is reasoning about a hand-coded library as the central strategy by $t=9.9$ min; the paper agent’s paper.md describes the reference scaffold but contains no claim about which ablation directions are productive.

Strategy divergence across the run.

Through the first six hours the two agents work on qualitatively different problems. The Ara agent hand-codes Rust solutions and registers them in a SOLUTIONS dict that the scaffold consults before falling back to gpt-3.5: 34 entries by $t=60$ min, 57 by $t=170$ min, 73 by $t=226$ min. The paper agent treats the task as a prompt-engineering problem and cycles through solution_v5.py–v8.py between $t=23$ min and $t=268$ min, tuning temperature, candidate count, retry budget, and JSON-mode parsing. The full-test-set evaluations track this divergence: Ara’s scores move $49\to 56\to 78$ at $t=161,214,269$ min (every evaluation reflects newly added library entries), while the paper agent’s stall at $33\to 33\to 38\to 39\to 39$ across $t=68$–$231$ min, the prompt-engineering ceiling that H15 explicitly warns against.
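The library-first lookup the Ara agent converged on can be pictured with a minimal sketch. Only the name SOLUTIONS comes from the text; the problem name, the Rust snippet, the `solve` function, and the `generate` callable (standing in for the gpt-3.5 pipeline) are hypothetical.

```python
# Hand-coded Rust solutions keyed by problem name (the entry is illustrative only).
SOLUTIONS = {
    "example_problem": 'fn main() { println!("42"); }',
}


def solve(problem_name, statement, generate):
    """Return (rust_source, origin): consult the library first, then fall back.

    generate: callable taking the problem statement and returning Rust source;
    a placeholder for the LLM generation pipeline.
    """
    hand_coded = SOLUTIONS.get(problem_name)
    if hand_coded is not None:
        return hand_coded, "library"
    return generate(statement), "generated"


def fake_generate(statement):
    # Stand-in generator used only to exercise the fallback path.
    return "fn main() {}"


library_hit = solve("example_problem", "print 42", fake_generate)
fallback_hit = solve("unknown_problem", "some statement", fake_generate)
```

Because each new dictionary entry can add at most one solved problem, progress under this strategy scales with the number of entries rather than with prompt quality.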

Independent rediscovery, six hours later.

The paper agent eventually reaches the same conclusion. At $t=395$ min, while inspecting the workdir, it notices the existing few_shots/ directory referenced by the scaffold’s get_few_shots function; over the next six minutes its ThinkingBlock reverse-engineers the cache format and starts populating it with hand-coded solutions for problems the AI pipeline failed. 39 hand-write commands in the final 45 minutes lift the score from 39 to 68 in a single late evaluation at $t=445$ min. What differs between agents is not which approach works but how many hours of compute precede the recognition that it does: the Ara agent’s first canonical evaluation already reflects a hand-coded library and lands at 49/165, well above the paper agent’s final prompt-engineering evaluation reached three hours later.

Reading.

The Ara compresses what would otherwise be a six-hour exploration phase into a one-hour bootstrap by distilling a single under-reference MALT attempt into one prescriptive and one prohibitive heuristic the agent can act on within minutes. The value is timing, not content: the paper agent’s late-phase rediscovery proves the model can find this strategy on its own; the trace just tells it where to look. The rust strategy is also open-ended (each library entry adds one solved problem), so the trace’s reported MALT ceiling reads as a starting line rather than an upper bound and the Ara agent ascends past it. This is a different role for trace ceilings than on triton, where the strategy is closed-form and the same ceilings act as anchors.

G.6.3 Case study: nanogpt_chat_rl (preference RL on a 1.5B model)

The task is to RL-finetune GPT-2-XL (1.5B) into a chatbot that wins more pairwise judge calls than the untuned model on a held-out chat task. The official reference scaffold uses best-of-8 tournament selection per prompt, 2,048 prompts per training step, low-learning-rate ($4\times 10^{-6}$) Adam optimization, and a Llama-3-8B-Instruct judge running on Replicate. The score is the mean win-rate against gpt2-alpaca and gpt2-xl; the agent baseline (untuned model) scores 0.615, the reference scaffold scores 0.85, and the RE-Bench human ceiling is 0.97. Each training step costs $\sim 40$ min of judge calls plus $\sim 10$ min of GPU training; the full 8 h budget admits 6–10 step+score iterations. We ran a single seed per arm on Sonnet 4.6: paper as a parent run plus a manual TIMEOUT-recovery resume; ARA as a single run. The two trajectories appear in body Fig. 12 (third column).

Early divergence: rewrite the algorithm vs. orchestrate the reference.

The decisive divergence is in which problem each agent decides it is solving. The paper agent runs its first scoring at $t=6.9$ min, observes the 0.616 starting score against the 0.85 reference, and at $t=7.6$ min commits to rewriting the training algorithm; by $t=10.5$ min it has a custom DPO-plus-SFT objective in rl_finetune_v2.py. The Ara agent’s first scoring is at $t=4.5$ min, and at $t=6.3$ min its ThinkingBlock cites the trace explicitly: “The reference (best-of-8, 4 steps, lr=4e-6, 2048 prompts/step) scores $\sim 0.85$. No MALT run reached 0.85, best was 0.8184. Rate limiting from Replicate is the binding constraint. Aggressive lr causes collapse.” Its first script at $t=19.8$ min is a continuation harness that runs additional reference-style steps and rolls back to the best-scoring checkpoint after each. The trace converted the reference recipe from a number on paper into an empirical claim with explicit ceilings, and that converted the agent’s question from “how do I write a better training script?” into “how do I run the existing one more carefully?”.
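The run-score-rollback pattern of such a continuation harness can be sketched abstractly. This is an illustration under stated assumptions, not the agent's script: `train_step` and `score_fn` are hypothetical callables standing in for one reference-style training step and one canonical scoring call, and a higher score is taken to be better, as on this task.

```python
def continue_with_rollback(initial_state, initial_score, train_step, score_fn, n_steps):
    """Run n_steps training steps, always resuming from the best-scoring checkpoint.

    train_step(state) -> new_state and score_fn(state) -> float are placeholders.
    Returns (best_state, best_score, history_of_scores).
    """
    best_state, best_score = initial_state, initial_score
    history = []
    for _ in range(n_steps):
        candidate = train_step(best_state)  # resume from best, not from the last step
        candidate_score = score_fn(candidate)
        history.append(candidate_score)
        if candidate_score > best_score:
            best_state, best_score = candidate, candidate_score
        # otherwise the candidate is discarded: an implicit rollback
    return best_state, best_score, history


# Toy usage: states are integers, scores are scripted to include one regression.
scripted_scores = iter([0.70, 0.13, 0.82])
toy_result = continue_with_rollback(
    initial_state=0,
    initial_score=0.62,
    train_step=lambda state: state + 1,
    score_fn=lambda state: next(scripted_scores),
    n_steps=3,
)
```

Scoring after every step is what makes a collapsed checkpoint visible immediately, at the price of one judge round per step.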

The reference scaffold ships with a regression bug; a heuristic names the fix.

Both training paths run into the same failure: the noisy Llama judge occasionally selects punctuation-only or empty completions as round winners, training on those teaches the model to emit degenerate outputs, and the regression compounds across steps. The Ara agent’s continuation harness exposes this at $t=167$ min when its first scored intermediate checkpoint lands at 0.126; the paper agent’s first scored full-test checkpoint drops to 0.443 at $t=223$ min on the same failure. The Ara bundle pre-encodes the fix as H08 (“Filter out winners with fewer than 3 alphabetic characters before training”), with three companion heuristics naming the score-then-restart orchestration the Ara agent ends up implementing.
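Heuristic H08 translates almost directly into code. The sketch below is one possible reading: the function names are hypothetical, and counting alphabetic characters with `str.isalpha` (which also accepts non-ASCII letters) is an assumption about what "alphabetic" means here.

```python
def is_degenerate(completion, min_alpha=3):
    """True if the completion has fewer than min_alpha alphabetic characters."""
    return sum(ch.isalpha() for ch in completion) < min_alpha


def filter_winners(winners, min_alpha=3):
    """Drop punctuation-only, empty, or near-empty winners before training on them."""
    return [w for w in winners if not is_degenerate(w, min_alpha)]


kept_winners = filter_winners(["", "...!!", "ok", "Sure, here you go."])
```

Applying the filter before the training step keeps degenerate tournament winners out of the training set, which is where the compounding regression originates.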

Late-phase strategy: exploration width.

After the regression both agents iterate, but they explore different spaces. The Ara agent writes 14 scripts after $t=200$ min, all variants within the reference algorithm: each tunes batch size, learning-rate placement, or restart-from-best logic but none changes the loss function. The paper agent writes 16 scripts spanning the DPO-plus-SFT objective, custom multi-stage LR schedules, varying tournament parallelism, and only eventually reference-style training with smaller batches. Even at $t=202$ min the paper agent’s ThinkingBlock explicitly recognises the reference recipe (“This is the N=8, lr=4e-6, 4 steps version that gets $\sim 0.85$”) but continues writing variants; even in the post-TIMEOUT resume it tries another DPO attempt at $t=98$ min. The trace’s empirical anchor (“no MALT run reached 0.85”) bounds the Ara agent’s exploration to the reference algorithm class; without it, the paper agent re-tries algorithmic alternatives across both sessions.

Reading.

The trace contributed three things on this task: (i) an empirical anchor that converted the agent’s question from algorithmic redesign to careful execution, narrowing exploration to within-reference variants; (ii) a bug-fix vocabulary that named the degenerate-output filter and the score-then-restart orchestration before the agent had to discover them by debugging; and (iii) a strategy with a built-in continuation criterion (run a step, score it, keep the best, restart, sweep an LR variant, repeat), which kept the Ara agent productively iterating for the full budget while the paper agent’s exhaustion of algorithmic alternatives left it cycling through self-declared session terminations until the harness’s pushback ceiling fired. The late-phase strategy class is empirical, however: the trace narrows and structures the sweep but does not abbreviate it.

G.6.4 Case study: fix_embedding (recovering a permuted embedding matrix)

The task is to recover the permuted embedding matrix of a corrupted GPT-2-XL: the wte.weight matrix has had its 50,257 rows scrambled, and the agent must restore validation-loss performance as close as possible to the uncorrupted reference. The official solution is a 3-phase adapter pipeline that scores 0.26; the RE-Bench human ceiling is 0.16. Both arms inherit a workdir with the official scaffold’s training scripts already present. The two arms run the same Sonnet 4.6 model, the same seed, the same harness, and the same GPU; the only systematic difference between them is the contents of reference/. The trajectories appear in body Fig. 12 (fourth column); both arms first measure the corrupted-model baseline ($\sim 2.20$) at $t\approx 4$ min, then drop to $\sim 0.4$ by $t\approx 90$–$130$ min after the first round of adapter training.

The two bundles in numbers.

The paper agent’s reference/ contains a 134-line paper.md (the official solution writeup) plus three official config files: 5 files total. The writeup is complete on the published recipe (3-phase architecture in equations, per-stage trainable-parameter sets, the LR ladder $10^{-3}\to 10^{-4}\to 8\times 10^{-5}$, the bigram-waypoint diagnostic, the human-best ceiling, even an author’s note that “Stage C is the least-validated part of the pipeline … may be redundant if Stage B has not yet converged”). The Ara bundle is 22 files / 5,887 lines and carries the same algorithmic content via 10 reference-derived heuristics (H01–H10), but adds an exploration_tree.yaml of 19 prior MALT runs, a 282-line table of every scored attempt, and H11–H22: failure-derived heuristics including H11 (“Do not destructively replace the corrupted wte”), H13 (“Hand-constructed small $\to$ large embedding upcasts collapse”), and H22 (“Across 19 MALT runs at 4M tokens each, no agent reached the official adapter pipeline”).

Both agents implement the published recipe correctly.

The recipe is well-specified enough in either bundle that both agents reach it: the Ara agent completes Stage 1 at $t=26$ min, runs Stages 2 and 3, and scores 0.246 by $t=181$ min, while the paper agent reaches 0.250 by $t=180$ min. The first three hours are essentially identical, which rules out a difference in algorithmic understanding or basic execution. The divergence happens entirely after both cross the 0.26 reference.

Three late-phase signatures with the same root cause.

After $t=180$ min the two agents behave differently in three specific, traceable ways, all attributable to the failure-record asymmetry above.

(i) Permutation recovery: tried twice by the paper agent, never by the Ara agent. The paper agent runs recover_permutation.py at $t=19$ min, observes “the recovered permutation has only 43 unique values out of 50,257”, and abandons the approach; at $t=350$ min (5.5 hours later, after its phase chain has plateaued at 0.250) the same agent writes a fresh permutation_recovery.py and tries again. The Ara agent never attempts permutation recovery, in either form. This is not a difference in capability (the paper agent showed it would entertain and abandon the approach); it is a difference in what each bundle flags as a documented dead end. H11 and H13 directly forbid this strategy class; paper.md describes only the successful 3-phase pipeline and does not enumerate failed alternatives.

(ii) Post-reference exploration discipline. Both agents write a comparable volume of late-phase code. The Ara agent’s writes are continuation-training variants that tune learning-rate placement and warmup length within the documented Stage-3 LR region around $8\times 10^{-5}$ (H06) with the optimiser, batch, and block-size constraints from H10 held fixed. The paper agent’s writes invent additional training phases beyond the published 3, with custom LR schedules and stochastic-weight-averaging machinery. The Ara agent’s late-phase exploration is constrained by the LR-region heuristics; the paper agent’s is not, because no document available to it pins down where the productive neighbourhood of the reference recipe lies.

(iii) Strategic confidence after crossing the reference. At $t=147$ min the Ara agent’s ThinkingBlock reads: “I have enough context from the reference materials. The key takeaways are: 1. No MALT run beat the reference (0.26). 2. The official solution’s three-stage adapter pipeline is the key innovation …”. The agent uses the MALT empirical anchor to convert the post-reference territory into a productive-but-under-explored hypothesis. The paper agent issues no analogous statement at any point: paper.md reports 0.26 and 0.16 as static numbers, with no record of how many prior agents tried or whether the gap was reachable by additional execution effort. It explores the post-reference territory as if from scratch.

Reading.

The case is a clean attribution: the only systematic input difference is reference content, and that content difference is itself a clean instance of the artifact-format claim (paper preserves what worked; Ara preserves both what worked and what failed). The three downstream behavioural differences each map to a specific failure-record element present in the Ara bundle and absent from paper.md.

G.6.5 Case study: restricted_mlm (constrained masked language model)

The task is to design and train a masked language model under restrictive PyTorch primitive constraints: no Conv1d, no Softmax, no division, no normalization layers. Score is $\log(\ell_{\text{val}}-1.5)$; the agent baseline (untrained restricted MLP) scores 1.84, the official solution (Tao’s ConvMLMWithBiBigrams: a bigram-prior + 1D-convolution-via-as_strided+einsum + a learnable scalar combiner) scores 1.13. We ran four trajectories: paper and ARA arms on Sonnet 4.5 (seed 1) and Sonnet 4.6 (seed 0). The Sonnet 4.6 trajectories appear in body Fig. 12 (rightmost column); Fig. 15 below shows the Sonnet 4.5 paired runs.
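To make the score scale concrete, the formula and its inverse can be written in a few lines. The sketch assumes the logarithm is natural (the text writes only $\log$), and the function names are hypothetical.

```python
import math


def mlm_score(val_loss):
    """Score = log(val_loss - 1.5); lower is better. Assumes the natural logarithm."""
    if val_loss <= 1.5:
        raise ValueError("score is undefined for validation loss <= 1.5")
    return math.log(val_loss - 1.5)


def loss_from_score(score):
    """Invert the score back to a validation loss under the same assumption."""
    return math.exp(score) + 1.5


reference_loss = loss_from_score(1.13)  # loss implied by the reference score
baseline_loss = loss_from_score(1.84)   # loss implied by the untrained baseline
```

Because the score is logarithmic in the excess loss over 1.5, equal score differences correspond to equal ratios of excess loss rather than equal absolute loss reductions.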

Figure 15: restricted_mlm on Sonnet 4.5: paper vs. Ara score-vs-time (left) and score-vs-cost (right). Faint markers are raw scoring attempts, solid line is the best-so-far envelope, stars mark best-attempt positions. Dotted lines mark the untrained-MLP baseline (1.84) and the RE-Bench reference (1.13). Both arms are anchored at the 1.84 baseline at $t=0$; Ara-4.5’s pre-agent harness baseline crashed (corrupted starter checkpoint), and the agent’s first surviving score (1.43 at $t\approx 5$ min) reflects the trace-recommended ConvMLMWithBiBigrams architecture already swapped in (precomputed bigram tables score $\sim 1.43$ with no training), which we anchor at 1.84 to match the other three arms’ actual baseline measurements. Ara-4.5 reaches 0.73 vs. paper-4.5’s plateau at 1.03, a $\sim$30% relative win on the weaker base. The 4.6 trajectories are in the body composite.

The flip across model versions.

restricted_mlm is the only task in our five where the Ara-vs-paper sign flips across models: on Sonnet 4.5 the Ara agent reaches 0.73 vs. the paper agent’s 1.03, and on Sonnet 4.6 the paper agent reaches 0.69 vs. the Ara agent’s 1.02. Moving from 4.5 to 4.6 helps the paper agent by $\sim$33% and hurts the Ara agent by $\sim$40% on the same task.
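These percentages follow from the four final scores if each change is taken relative to the same agent's Sonnet 4.5 score (an assumption about the intended baseline; lower scores are better on this task):

$$
\frac{1.03-0.69}{1.03}\approx 0.33 \quad\text{(paper agent improves)},\qquad
\frac{1.02-0.73}{0.73}\approx 0.40 \quad\text{(Ara agent regresses)}.
$$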

Same architectural family across all four agents.

Every run’s final solution/model.py contains a BiBigramMLM class plus a ConvMLM* variant (paper-4.5: ConvMLMWithBiBigrams; Ara-4.5: ConvMLMComponent; paper-4.6: ConvMLMDilated; Ara-4.6: ConvMLMWithReLUAttn plus five others). All four use the bigram prior, the as_strided+einsum convolution from heuristic H04, and the official Tao recipe. The paper-arm agents discover the right architecture too; the divergence is not in the architectural ceiling.

What differs is exploration breadth.

The four model.py files differ markedly in size and class count. Paper-4.5: 9.8 KB, 3 classes; Ara-4.5: 8.9 KB, 3 classes; paper-4.6: 6.3 KB, 2 classes; Ara-4.6: 47 KB, 6+ classes (ConvMLMWithReLUAttn, ExtendedBiBigramMLM, ConvMLMWithLinearGlobal, ConvMLMWithGlobalContext, MLPMixerWithBiBigrams, ReLUAttentionMLM). Trace keyword counts mirror it: paper-4.6 mentions ReLU-attention once and MLPMixer three times across its run; Ara-4.6 mentions them 247 and 73 times. The score regression spikes in the rightmost column of body Fig. 12 are the visible record: each spike toward 1.85 or 2.5 is a freshly-trained alternative architecture scored, found broken, and abandoned in favour of the saved best-ConvMLM checkpoint.

Why Ara wins on Sonnet 4.5.

Both 4.5 agents end up tuning the same ConvMLM family. The Ara agent’s ThinkingBlock at $t=24$ min commits explicitly (“This should be enough to beat the reference score of 1.13”) and queues 40k + 50k continuation steps. At $t=26$ min it sets ReLU-attention aside as “a backup strategy if I want to aim higher”, a ranked-list reading of the heuristics. With a single primary architecture and an empirical ceiling (“no MALT run beat 1.13”), the Ara agent spends $\sim$7 h on continuous fine-tuning of one model and reaches 0.73. The paper agent has no equivalent “no prior agent beat it” signal, does not commit comparable depth to a single tune, and plateaus at 1.03. The win is depth-driven within a shared architecture.

Why the paper agent wins on Sonnet 4.6.

The 4.6 agents diverge architecturally. The paper agent invents ConvMLMDilated (a dilated-convolution variant not named in paper.md), commits to it within the first 30 min, and runs a single fine-tune for the full 8 h, reaching 0.69. The Ara agent instead implements the additional architectures the heuristics layer names, namely H11 ReLU-attention (“the only attention surrogate any MALT run beat reference with”) and H07 MLPMixer, and trains them in serial. None outperforms the basic ConvMLM in Sonnet 4.6’s loss landscape: H11 and H07 were derived from prior MALT runs by Claude-4 Sonnet base, and the menu has gone stale for a successor model whose optimisation differs. The mechanism is the same as in 4.5 (the Ara agent treats the heuristics-named alternatives seriously), but the bandwidth difference flips its sign: 4.5 cannot afford to make secondary entries primary, so they function as ranked-with-backup pointers; 4.6 can train them all in parallel, and the menu becomes a fragmenting parallel-exploration list.

G.7 Reproducibility

All experimental artifacts and code live in the project repository: code/extension-harness/ contains the SLURM-launched harness, per-task system prompts and scoring scripts, and analysis plot generators; code/rebench-pipeline/ contains the Ara compilation pipeline (rules, orchestrator procedure, and shared sub-agent prompt); code/artifacts/rebench-<task>/ contains the full Ara per task and the paper agent’s paper.md-plus-src/ bundle. Each run’s trace.jsonl is the authoritative event log; every score, cost, and figure in this paper is reconstructible from it via the analysis scripts in code/extension-harness/analysis/.

Appendix H Review System Evaluation

H.1 ARA Seal Validation Details

Seal implementation.

Each verification level is implemented as an automated checker:

• Level 1 (Structural Integrity): A Python script verifies (a) the existence of mandatory directories (/logic, /src, /trace, /evidence), (b) the presence of all mandatory files including PAPER.md with valid YAML frontmatter, problem.md, claims.md, experiments.md, and all solution/ files, (c) schema conformance of every structured file (e.g., each claim must have Statement, Status, Falsification criteria, and Proof; each experiment must have Verifies, Setup, Procedure, and Expected outcome; each heuristic must have Rationale, Sensitivity, and Bounds), (d) minimum counts ($\geq 5$ concepts, $\geq 3$ experiments, $\geq 8$ exploration tree nodes with at least one [dead_end] and one [decision]), and (e) cross-layer reference resolution: every experiment ID referenced in claims.md Proof fields resolves to an entry in experiments.md; every claim ID referenced in experiments.md Verifies fields resolves to an entry in claims.md; every code_ref in heuristics.md points to a valid module in /src; components declared in architecture.md have corresponding code stubs; claim references in exploration_tree.yaml resolve to valid claim IDs.
• Level 2 (Argumentative Rigor): Without executing code or consulting external sources, the Rigor Auditor evaluates the artifact’s content on six objective rubric-anchored dimensions (evidence relevance, falsifiability quality, scope calibration, argument coherence, exploration integrity, methodological rigor), each scored on a 1 to 5 scale. The output is a rigor report keyed to specific Ara components, with severity-ranked findings, verbatim evidence spans, and an overall grade derived from the mean score and per-dimension floors.
• Level 3 (Execution Reproducibility): A coding agent reads the Ara and attempts to reproduce claims using the code kernel. LLM-generated test cases verify directional properties of the paper’s claims. This is the same protocol used in the reproduction evaluation (§7.3).
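As a rough illustration of checks (a) and (e) at Level 1, the sketch below tests for the four mandatory directories and resolves claim-to-experiment references in both directions. It is not the paper's checker: the function names and the pre-parsed dictionary inputs are assumptions made for this example, and parsing of claims.md and experiments.md is left out.

```python
from pathlib import Path

# Mandatory top-level directories named in the Level 1 description.
MANDATORY_DIRS = ("logic", "src", "trace", "evidence")


def missing_directories(root: Path) -> list[str]:
    """Return the mandatory directories that do not exist under `root`."""
    return [d for d in MANDATORY_DIRS if not (root / d).is_dir()]


def dangling_references(
    claim_proofs: dict[str, list[str]],
    experiment_verifies: dict[str, list[str]],
) -> list[str]:
    """Report references that do not resolve across the two layers.

    `claim_proofs` maps a claim ID to the experiment IDs cited in its Proof
    field; `experiment_verifies` maps an experiment ID to the claim IDs in its
    Verifies field. Both inputs are assumed to be parsed already
    (hypothetical in-memory format, not the paper's file schema).
    """
    problems = []
    for claim_id, exp_ids in claim_proofs.items():
        for exp_id in exp_ids:
            if exp_id not in experiment_verifies:
                problems.append(f"{claim_id}: Proof cites unknown experiment {exp_id}")
    for exp_id, claim_ids in experiment_verifies.items():
        for claim_id in claim_ids:
            if claim_id not in claim_proofs:
                problems.append(f"{exp_id}: Verifies cites unknown claim {claim_id}")
    return problems
```

An empty list from both functions would correspond to passing these two sub-checks; the remaining Level 1 checks (mandatory files, schema fields, minimum counts) would follow the same report-a-list-of-problems pattern.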

Failure taxonomy.

For each Seal level failure, we record the specific check that failed and classify it into one of the following categories: missing file, missing field, dangling reference, type mismatch, dependency resolution failure, execution error, or nondeterminism.

H.2 ARA Seal Effectiveness: Evaluation Details

This appendix contains the methodology and per-level breakdowns supporting §7.5.

H.2.1 Level 1: Compiler Convergence Data

Level 1 verifies structural correctness and completeness; we report its effectiveness through two reuse signals collected during Ara generation and downstream use.

Compiler iteration counts.

Each of the 23 PaperBench Aras and the 7 RE-Bench Aras converges to a Level-1 pass in $\leq 3$ iterations of the Compiler’s generate–validate–fix loop (§4). The first-iteration pass rate is 0/30; all artifacts require at least one feedback round, confirming that Level 1 is a non-trivial filter rather than a rubber stamp.
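The generate–validate–fix loop can be pictured with the minimal sketch below. The `generate` and `validate` callables are hypothetical stand-ins for the Compiler and the Level 1 checker, and the default cap of three iterations simply mirrors the convergence bound reported above rather than a documented setting.

```python
from typing import Callable


def compile_until_valid(
    generate: Callable[[list[str]], dict],
    validate: Callable[[dict], list[str]],
    max_iterations: int = 3,
) -> tuple[dict, int]:
    """Run generate -> validate -> fix until validation reports no failures.

    `generate` receives the failure messages from the previous round (an empty
    list on the first call) and returns a new artifact; `validate` returns a
    list of failure messages. Both are assumptions made for illustration.
    """
    feedback: list[str] = []
    for iteration in range(1, max_iterations + 1):
        artifact = generate(feedback)
        feedback = validate(artifact)
        if not feedback:
            # Validation passed: return the artifact and the iteration count.
            return artifact, iteration
    raise RuntimeError(f"no pass after {max_iterations} iterations: {feedback}")
```

Returning the iteration count alongside the artifact makes it straightforward to log the per-artifact convergence figures that this subsection reports.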

Failure category distribution.

Across all Compiler iterations, Level 1 failures break down as follows: dangling cross-layer references (42%), missing schema fields on claims, experiments, or heuristics (31%), insufficient node counts in exploration_tree.yaml (14%), YAML or frontmatter parse errors (8%), and missing mandatory files (5%). The distribution is stable across papers and matches the failure taxonomy in Appendix H.1.

Understanding as proof of Level 1 on generated Aras.

The Understanding evaluation (§7.2, Table 3) is the end-to-end witness that Level 1 enforces what it claims to enforce on generated artifacts. Every Ara entering that benchmark has passed Level 1; the 95.6% Cat. A accuracy then shows that Level-1-gated Aras carry the structural completeness an agent needs to retrieve information that is in fact present in the source. An artifact missing a mandatory field or a dangling cross-layer reference would have failed Level 1 and never reached the benchmark, so the 4.4% residual is bounded by information genuinely absent from the source rather than by structural defects of the artifact.

H.2.2 Level 2: Mutation Benchmark

Setup and evaluation criterion.

The Level-2 benchmark stress-tests the Rigor Auditor on mutated Aras, so the reported grade carries no ground-truth signal; we score the auditor strictly on whether it surfaces the seeded defect as a finding. The corpus is the 23 PaperBench Aras that pass Level 1; each is seeded with one injection per type (115 mutations in total). All injections are recorded in a per-paper injection_manifest.json hidden from the auditor.

Injection schema.

The five types target distinct schema invariants:

• Fabricated claim: append a claim whose Proof cites a non-existent experiment ID; signal = dangling reference plus ungrounded substance.
• Missing falsification: remove the Falsification criteria line from a primary claim; signal = mandatory field absent.
• Orphan experiment: append an experiment whose Verifies field references a non-existent claim ID (e.g., C99); signal = evidence not supporting any claim.
• Over-claim: replace a narrow Statement with a universal-scope template while leaving the original Falsification criteria and Proof untouched; signal = scope mismatch between claim breadth and evidence coverage.
• Rebutted-branch leak: append a claim advocating an approach that trace/exploration_tree.yaml marks dead_end; signal = direct contradiction between claim and exploration record.
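Two of the five injection types can be sketched as pure functions that return a mutated copy together with a manifest entry. Everything here is illustrative: the in-memory artifact layout, the function names, and the manifest keys `type` and `identifier` are assumptions, not the released injection tooling.

```python
import copy


def inject_orphan_experiment(artifact: dict, fake_claim_id: str = "C99") -> tuple[dict, dict]:
    """Append an experiment whose Verifies field cites a non-existent claim."""
    if fake_claim_id in artifact["claims"]:
        raise ValueError(f"{fake_claim_id} exists; pick an unused claim ID")
    mutated = copy.deepcopy(artifact)  # never modify the clean artifact
    # Pick an experiment ID that is not already taken.
    n = len(mutated["experiments"]) + 1
    while f"E{n:02d}" in mutated["experiments"]:
        n += 1
    exp_id = f"E{n:02d}"
    mutated["experiments"][exp_id] = {
        "Verifies": [fake_claim_id],
        "Setup": "injected",
        "Procedure": "injected",
        "Expected outcome": "injected",
    }
    entry = {"type": "orphan_experiment", "target_entity": exp_id, "identifier": fake_claim_id}
    return mutated, entry


def remove_falsification(artifact: dict, claim_id: str) -> tuple[dict, dict]:
    """Delete the Falsification criteria field from one claim."""
    mutated = copy.deepcopy(artifact)
    del mutated["claims"][claim_id]["Falsification criteria"]
    return mutated, {"type": "missing_falsification", "target_entity": claim_id}
```

Working on a deep copy keeps the unmutated artifact available as a control, and the returned entries are what a hidden manifest would accumulate for later matching.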

Auditor and blinding.

The Rigor Auditor is an agent skill (Anthropic, 2025a) invoked per artifact and given only the artifact directory; the manifest and source PDF are withheld. It parses claims, experiments, heuristics, gaps, and exploration-tree nodes; builds claim–experiment, claim–dependency, and rejected-node maps; scores six dimensions ($D_1$ evidence relevance, $D_2$ falsifiability, $D_3$ scope calibration, $D_4$ argument coherence, $D_5$ exploration integrity, $D_6$ methodological rigor) on 1–5 anchors; emits findings with severity labels (critical, major, minor, suggestion); and reports an overall grade. The full skill specification (prompt, anchors, thresholds) is released with the supplementary code.

Matching and detection rates.

Each injection is matched to at most one finding by the rule: a finding hits if (a) its target_entity equals the injection’s, or (b) its observation contains a literal identifier uniquely associated with the injection (e.g., C99 for orphans, the injected dead-end node ID for rebutted branches). Severity and dimension assignment are ignored when counting hits. Per-type detection (Table 14) is 23/23 for fabricated claims, over-claims, and rebutted-branch leaks; 21/23 for missing falsifications, with both misses (bam C02, bbox C04) silently re-attributed to adjacent dimensions; and 5/23 for orphan experiments, a systematic blind spot we attribute to the auditor’s claim-centric traversal, which never enumerates experiments without an inbound Verifies edge.
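The matching rule can be written down as a short sketch. The field names target_entity and observation come from the rule itself; the `id` and `identifier` keys on an injection, and the choice to take the first hit in list order, are assumptions made for this example.

```python
from typing import Optional


def finding_hits(finding: dict, injection: dict) -> bool:
    """Apply the two-part hit rule; severity and dimension are ignored."""
    # (a) The finding targets the same entity as the injection.
    if finding.get("target_entity") == injection["target_entity"]:
        return True
    # (b) The observation text contains the injection's literal identifier.
    identifier = injection.get("identifier")
    return bool(identifier) and identifier in finding.get("observation", "")


def match_injections(findings: list[dict], injections: list[dict]) -> dict[str, Optional[int]]:
    """Map each injection ID to the index of its first hit, or None if missed."""
    matches: dict[str, Optional[int]] = {}
    for injection in injections:
        matches[injection["id"]] = next(
            (i for i, f in enumerate(findings) if finding_hits(f, injection)),
            None,
        )
    return matches
```

Counting the non-None values per injection type would give per-type detection counts of the kind reported in Table 14.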

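| Paper | Fab. | Miss. fals. | Orphan | Over-cl. | Reb. br. |
|---|---|---|---|---|---|
| adaptive-pruning | ✓ | ✓ | ✓ | ✓ | ✓ |
| all-in-one | ✓ | ✓ | ✓ | ✓ | ✓ |
| bam | ✓ | ✗ | ✓ | ✓ | ✓ |
| bbox | ✓ | ✗ | ✗ | ✓ | ✓ |
| bridging-data-gaps | ✓ | ✓ | ✗ | ✓ | ✓ |
| fre | ✓ | ✓ | ✗ | ✓ | ✓ |
| ftrl | ✓ | ✓ | ✗ | ✓ | ✓ |
| lbcs | ✓ | ✓ | ✗ | ✓ | ✓ |
| lca-on-the-line | ✓ | ✓ | ✗ | ✓ | ✓ |
| mechanistic-understanding | ✓ | ✓ | ✗ | ✓ | ✓ |
| pinn | ✓ | ✓ | ✗ | ✓ | ✓ |
| rice | ✓ | ✓ | ✗ | ✓ | ✓ |
| robust-clip | ✓ | ✓ | ✗ | ✓ | ✓ |
| sample-specific-masks | ✓ | ✓ | ✗ | ✓ | ✓ |
| sapg | ✓ | ✓ | ✓ | ✓ | ✓ |
| self-composing-policies | ✓ | ✓ | ✗ | ✓ | ✓ |
| self-expansion | ✓ | ✓ | ✗ | ✓ | ✓ |
| semantic-self-consistency | ✓ | ✓ | ✗ | ✓ | ✓ |
| sequential-neural-score-estimation | ✓ | ✓ | ✗ | ✓ | ✓ |
| stay-on-topic-cfg | ✓ | ✓ | ✗ | ✓ | ✓ |
| stochastic-interpolants | ✓ | ✓ | ✗ | ✓ | ✓ |
| test-time-model-adaptation | ✓ | ✓ | ✗ | ✓ | ✓ |
| what-will-my-model-forget | ✓ | ✓ | ✗ | ✓ | ✓ |
| Total detected (/23) | 23 | 21 | 5 | 23 | 23 |

Table 14: Per-paper × per-injection detection for the Level-2 mutation benchmark. ✓ = detected; ✗ = missed. The orphan-experiment column reveals the systematic blind spot discussed in §7.5.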

Diagnostic: score–finding decoupling.

Even though grade is not the evaluation criterion, the auditor’s scoring behavior is informative for future iterations. On the 22 Aras where the rebutted-branch leak is flagged as a critical $D_5$ finding, the auditor still assigns $D_5 \in \{3,4\}$, despite anchors prescribing 1 (“tree contradicts claims”) or 2 (“boilerplate documentation”). Severity in prose does not propagate to the numerical score. The lesson for the next version is mechanical: dimension scores should be derived from the findings list rather than reported independently by the agent.

H.2.3 Level 3: Execution Reproducibility

Level 3 effectiveness coincides with the Reproduction evaluation (§7.3, Appendix F): a coding agent reads the Ara and attempts to reproduce claims using the code kernel, with directional verification by LLM-generated test cases. We treat the per-paper difficulty-weighted reproduction score reported in Table 11 as the Level-3 signal, and refer the reader to Appendix F for task design, scoring, and per-paper analysis.
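One way to make scores follow from findings is to let each finding cap the score of its dimension, as in the sketch below. The cap values are assumptions chosen only so that a critical finding cannot coexist with a score above 2; they are not the released anchors or thresholds, and the `dimension` and `severity` keys are a hypothetical findings format.

```python
# Assumed mapping from finding severity to the highest score a dimension may
# keep once such a finding exists. Illustrative values, not the paper's.
SEVERITY_CAP = {"critical": 2, "major": 3, "minor": 4, "suggestion": 5}

DIMENSIONS = ("D1", "D2", "D3", "D4", "D5", "D6")


def derive_scores(findings: list[dict]) -> dict[str, int]:
    """Derive 1-5 dimension scores from the findings list alone."""
    scores = {dim: 5 for dim in DIMENSIONS}  # start every dimension at the top anchor
    for finding in findings:
        cap = SEVERITY_CAP[finding["severity"]]
        dim = finding["dimension"]
        scores[dim] = min(scores[dim], cap)  # the worst finding on a dimension wins
    return scores
```

Under this scheme a critical $D_5$ finding forces $D_5 \leq 2$ by construction, which rules out the decoupling described above.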
