AI Trading's Alpha Singularity: Emergent Market Reasoning through Agent-to-Agent Self-Evolution

arXiv cs.AI Papers

Summary

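The paper introduces Sealed Joint Search (SJS) and the Agora system, in which five specialized LLM agent classes collaborate to evolve alpha factors together with the metrics that score them. On a 91-day CSI 1000 holdout, Agora reaches a portfolio Sharpe of +1.87, ahead of the strongest baseline (+1.334 at a favorable seed, -0.755 on the cross-seed mean) on point estimates from a single-seed run. The two metrics the system discovers behave like emergent properties rather than designed components.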

arXiv:2606.29194v1 Announce Type: new Abstract: Automated alpha mining holds the scoring function fixed and varies the search algorithm over it. A search that converges against a fixed scorer overfits whatever the scorer cannot penalize, a primary cause of the out-of-sample generalization gap. We treat the scoring function as a search artifact alongside the alpha factors and study what conditions make this joint search admissible. Sealed Joint Search (SJS) is a framework: a set of structural conditions on information flow in an autonomous-discovery system that prevent joint search from collapsing into self-confirmation while keeping the evaluator sealed. Conditions cover role decomposition, typed inter-role communication, provenance-sealed reads, versioned stores, and substrate-local promotion. Agora tests SJS empirically: five LLM agent classes communicate via three channels, evolving eight skill libraries, with alpha libraries built on AlphaGen operators. Three evaluators write reports aggregated into one brief, carrying forward disagreement instead of voting. We run Agora for 100 rounds on CSI 1000 and evaluate on a 91-day 2026 holdout sealed from all LLM inputs. Agora achieves holdout Sharpe +1.87; best baseline +1.334 at favorable seed and -0.755 cross-seed mean. Pre-loading Agora's two metrics into a frozen-library ablation recovers only +0.40 of the +2.25 Sharpe gap, and adding PPO without library evolution worsens the gap. The two metrics emerge rather than being designed. Caveats: single-seed run, short-side concentrated signal, intended for long-short.

Cached at: 06/30/26, 05:33 AM

# Emergent Market Reasoning through Agent-to-Agent Self-Evolution
Source: [https://arxiv.org/html/2606.29194](https://arxiv.org/html/2606.29194)
## AI Trading’s Alpha Singularity: Emergent Market Reasoning through Agent\-to\-Agent Self\-Evolution

Siyuan Liu \(Panda AI, liusiyuan@pandaai\.online\), Bingjun Liu \(Panda AI, liubingjun@pandaai\.online\)

###### Abstract

Automated alpha mining holds the scoring function fixed and varies the search algorithm over it\. A search that converges against a fixed scorer overfits whatever the scorer cannot penalize, which we argue is a primary cause of the out\-of\-sample generalization gap\. We treat the scoring function as a search artifact alongside the alpha factors and study what conditions make this joint search admissible\. Sealed Joint Search \(SJS\) is the framework we propose: a small set of structural conditions on the information flow inside an autonomous\-discovery system that prevent the joint search from collapsing into self\-confirmation while keeping the external evaluator sealed\. The conditions cover role decomposition, typed inter\-role communication, provenance\-sealed reads, persistent versioned artifact stores, and a substrate\-local promotion rule\. Agora is the system we build to test SJS empirically: five role\-specialized LLM agent classes communicating through three typed channels, jointly evolving eight skill libraries, with the alpha\-side libraries built on the AlphaGen operator vocabulary\. Three independent evaluator instances write narrative reports that the orchestrator aggregates into a single brief, so adjudication carries forward disagreement rather than collapsing it to a vote\. We run Agora for 100 outer rounds on CSI 1000 and evaluate on a 91\-day 2026 holdout sealed from every LLM\-facing input\. Agora reaches a holdout portfolio Sharpe of \+1\.87; the strongest baseline reaches \+1\.334 at a favorable seed and −0\.755 on the cross\-seed mean\. Pre\-loading the two metrics that Agora discovers into a frozen\-library ablation recovers only \+0\.40 of the \+2\.25 Sharpe gap, and adding PPO without library evolution makes the gap worse\. The two metrics behave like emergent properties of the system rather than designed components\.
Two caveats: the full Agora run is single\-seed, and the signal is short\-side concentrated and therefore intended for long\-short deployment\.

## 1 Introduction

Autonomous discovery problems share a common shape: a search procedure proposes candidate artifacts, a scoring function ranks them, and an external evaluator adjudicates the result on data the search never touched\. Quantitative alpha mining is one instance\. Autonomous theorem proving, autonomous program synthesis, and autonomous experimental design are others\. In alpha mining, the methods used to find return\-predictive expressions have evolved through three generations: hand\-crafted factor models \(Fama–French \[[8](https://arxiv.org/html/2606.29194#bib.bib8), [9](https://arxiv.org/html/2606.29194#bib.bib9)\], momentum \[[13](https://arxiv.org/html/2606.29194#bib.bib13), [4](https://arxiv.org/html/2606.29194#bib.bib4)\], the Alpha101 catalog \[[14](https://arxiv.org/html/2606.29194#bib.bib14)\]\); symbolic regression and genetic programming \[[16](https://arxiv.org/html/2606.29194#bib.bib16), [34](https://arxiv.org/html/2606.29194#bib.bib34), [53](https://arxiv.org/html/2606.29194#bib.bib53), [6](https://arxiv.org/html/2606.29194#bib.bib6)\]; and RL\-based search that casts the expression tree as a sequential decision \[[51](https://arxiv.org/html/2606.29194#bib.bib51), [35](https://arxiv.org/html/2606.29194#bib.bib35), [29](https://arxiv.org/html/2606.29194#bib.bib29)\]\.

Across these three generations and across other autonomous\-discovery domains, the scoring function is fixed at design time\. The search procedure cannot detect when its own objective has been overfit to the training segment, and any reported gain is conditional on a hyperparameter that no participant revises\. With the multiple\-testing concerns documented in \[[2](https://arxiv.org/html/2606.29194#bib.bib2), [10](https://arxiv.org/html/2606.29194#bib.bib10), [15](https://arxiv.org/html/2606.29194#bib.bib15), [22](https://arxiv.org/html/2606.29194#bib.bib22)\], this is a sharp constraint: when the bottleneck is the choice of objective rather than the choice of expression, better search procedures cannot help\.

Making the scoring function itself a search artifact is one response\. Existing self\-evolving systems \[[40](https://arxiv.org/html/2606.29194#bib.bib40), [23](https://arxiv.org/html/2606.29194#bib.bib23), [32](https://arxiv.org/html/2606.29194#bib.bib32), [12](https://arxiv.org/html/2606.29194#bib.bib12), [21](https://arxiv.org/html/2606.29194#bib.bib21), [52](https://arxiv.org/html/2606.29194#bib.bib52)\] address the resulting collapse in two ways: either by holding a criterion fixed \(and inheriting its blind spots\) or by querying a strong external evaluator on every proposal \(and paying the sample\-cost and leakage cost that follow\)\. Both routes leave the joint\-search problem itself unaddressed\.

This paper introduces a third regime: Sealed Joint Search \(SJS\)\. The criterion is searched, the external evaluator is sealed, and the connection between them is mediated by substrate\-level constraints that no participating agent controls\. SJS is a set of structural conditions on the search procedure, not a system\. The five conditions are: F1 decomposed proposal, in which proposers and adjudicators are separate roles with no shared state; F2 typed inter\-role communication, with no free\-form chat between roles; F3 a provenance\-sealed substrate, in which every record is tagged with its data segment and reads are filtered by role and content type; F4 persistent versioned artifact stores, role\-owned, with a state machine over builtin / trial / accepted / rejected; and F5 a substrate\-local promotion rule that is non\-self\-judging and outcome\-grounded\. A boundary condition fixes the external evaluator before the run and keeps it inaccessible to every role\.

To validate SJS empirically, we instantiate it as Agora, an A2A LLM system in which five role\-specialized agent classes \(nine clients with isolated contexts\) communicate through three typed channels and jointly evolve eight skill libraries on a sealed substrate\. The substrate routes each agent’s output into a shared typed record that later agents read without seeing each other’s contexts\. Adjudication in this realization is deliberation: independent evaluator instances produce narrative reports that the substrate aggregates rather than reducing to a single vote, so dissenting opinion enters the next round as evidence\. The alpha\-side libraries \(operator, RL\-network, RL\-algorithm, reward\) are built on AlphaGen \[[51](https://arxiv.org/html/2606.29194#bib.bib51)\]; the metric, topic, rubric, and meta\-rubric libraries are seeded with deliberately sparse builtin entries\. Promotion runs against training\-segment Sharpe with $|\rho| > 0.3$\. The 2026 holdout is loaded once after the run completes and is sealed from every LLM\-facing input\.
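The promotion test mentioned above can be sketched as follows. This is a minimal illustration, assuming that $\rho$ is a Pearson correlation between a candidate metric's scores and the realized training\-segment Sharpe of the same candidates; the excerpt states only the threshold, so the function names, the choice of Pearson, and the sample numbers are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def should_promote(metric_scores, train_sharpes, threshold=0.3):
    """Assumed rule: promote only if |rho| exceeds the threshold."""
    return abs(pearson(metric_scores, train_sharpes)) > threshold

# Made-up numbers: one metric score and one train Sharpe per candidate alpha.
informative = should_promote([0.1, 0.4, 0.2, 0.9], [0.2, 0.8, 0.3, 1.5])
uninformative = should_promote([1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0])
```

Note that the rule never touches holdout data: both inputs come from the training segment, which is what keeps the promotion step substrate\-local.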

The central research question is the decomposition of holdout performance under an SJS realization: how much of any out\-of\-sample gain is attributable to \(a\) library evolution, \(b\) the A2A decomposition itself, \(c\) the LLMs in isolation, and \(d\) the underlying RL or symbolic search machinery\. We address this through a 100\-round Agora run on CSI 1000 against seven baselines \(genetic programming, pure PPO, single\-LLM one\-shot, iterative single\-LLM, Alpha101, a frozen\-library ablation, and random search\) plus three within\-Agora ablations\.

#### Contributions\.

- Sealed Joint Search \(SJS\), a methodological framework for joint search over signals and scoring functions under a sealed external evaluator\. The framework consists of five structural conditions \(F1–F5\) on the information flow of the search procedure plus a boundary condition on the evaluator\. SJS is independent of any particular optimizer, language model, or operator language\.
- Agora, an instantiation of SJS as an A2A LLM system, satisfying F1–F5 with five role\-specialized agent classes, three typed channels, eight skill libraries, and an empirical promotion rule against train\-segment Sharpe\. Agora is evaluated on a sealed 91\-day 2026 holdout of CSI 1000\.
- Empirical evidence about SJS, demonstrated on Agora\. Agora reaches holdout portfolio Sharpe \+1\.87, ahead of seven baselines on point estimates\. Across 100 rounds, evolution emerged in exactly one of the eight skill libraries \(metric\), with two promoted entries that the system was not designed to produce\. Neither metric was produced by any single agent: each emerged from aggregate promotion evidence across rounds\. Ablation isolates an upper bound of \+2\.25 Sharpe units of total contribution from the joint F4 \+ F5 mechanism \(persistence \+ substrate\-local promotion\), of which the static value of two LLM\-discovered metrics accounts for \+0\.40\. PPO relay added to a frozen system is harmful, indicating that asymmetric upgrades of the search procedure without co\-evolving scoring are counterproductive\.
- Open\-source release\. We release Agora’s implementation, all seven baseline configurations under a common evaluation harness, the leakage audit, and the per\-round registry snapshots for the eight skill libraries\.

## 2 Related Work

#### Symbolic alpha generation\.

Three generations of methods have searched the space of arithmetic expressions over OHLCV variables for ones that correlate with forward returns\. Hand\-crafted formulas \[[14](https://arxiv.org/html/2606.29194#bib.bib14), [8](https://arxiv.org/html/2606.29194#bib.bib8), [9](https://arxiv.org/html/2606.29194#bib.bib9), [4](https://arxiv.org/html/2606.29194#bib.bib4), [13](https://arxiv.org/html/2606.29194#bib.bib13)\] predate any automated search; AutoAlpha \[[53](https://arxiv.org/html/2606.29194#bib.bib53)\] and AlphaEvolve \[[6](https://arxiv.org/html/2606.29194#bib.bib6)\] introduced evolutionary search over operator trees \[[16](https://arxiv.org/html/2606.29194#bib.bib16), [34](https://arxiv.org/html/2606.29194#bib.bib34)\], and AlphaGen \[[51](https://arxiv.org/html/2606.29194#bib.bib51)\] replaced the genetic engine with a Maskable PPO policy \[[35](https://arxiv.org/html/2606.29194#bib.bib35), [31](https://arxiv.org/html/2606.29194#bib.bib31)\]\. Deep symbolic regression \[[29](https://arxiv.org/html/2606.29194#bib.bib29)\] forms a parallel line within symbolic search\. Across these methods, the fitness function \(typically rank IC against a forward return\) is fixed at design time and only the search procedure over expression trees varies\. Our baselines B1 \(a gplearn adaptation of the AlphaGen reference\) and B2 \(AlphaGen pure PPO\) are drawn from the most recent two generations\. The multiple\-testing concerns documented in \[[10](https://arxiv.org/html/2606.29194#bib.bib10), [2](https://arxiv.org/html/2606.29194#bib.bib2)\] motivate the conservative attribution strategy we adopt in Section [5](https://arxiv.org/html/2606.29194#S5)\.

#### LLMs for code generation in finance\.

Recent work prompts language models to emit alpha expressions or trading code, building on the general code\-generation literature \[[5](https://arxiv.org/html/2606.29194#bib.bib5), [38](https://arxiv.org/html/2606.29194#bib.bib38), [3](https://arxiv.org/html/2606.29194#bib.bib3), [27](https://arxiv.org/html/2606.29194#bib.bib27), [1](https://arxiv.org/html/2606.29194#bib.bib1)\] and on finance\-tuned models \[[44](https://arxiv.org/html/2606.29194#bib.bib44), [47](https://arxiv.org/html/2606.29194#bib.bib47)\]\. FinMem \[[50](https://arxiv.org/html/2606.29194#bib.bib50)\] layers structured memory on top of a single agent, and TradingAgents \[[46](https://arxiv.org/html/2606.29194#bib.bib46)\] arranges multiple LLM roles around a trading task\. These systems hold the scoring function fixed and use the LLM to write better $f$\. We include the simplest two variants of this setup, single\-LLM one\-shot \(B3\) and single\-LLM iterative against IC feedback \(B4\)\. Agora differs from all of these: it treats the scoring function itself as a search artifact under an empirical promotion rule, not as a fixed objective the LLM is asked to optimize\.

#### Self\-improving LLM agents\.

Outside finance, several systems let an LLM edit its own toolset\. Voyager builds a Minecraft skill library \[[40](https://arxiv.org/html/2606.29194#bib.bib40)\], Eureka writes RL reward functions \[[23](https://arxiv.org/html/2606.29194#bib.bib23)\], ADAS searches over agent designs \[[12](https://arxiv.org/html/2606.29194#bib.bib12)\], FunSearch discovers mathematical constructions \[[32](https://arxiv.org/html/2606.29194#bib.bib32)\], the AI Scientist runs an end\-to\-end research loop \[[21](https://arxiv.org/html/2606.29194#bib.bib21)\], and STaR \[[52](https://arxiv.org/html/2606.29194#bib.bib52)\] bootstraps reasoning chains from a model’s own outputs\. The reasoning and self\-correction building blocks \[[48](https://arxiv.org/html/2606.29194#bib.bib48), [49](https://arxiv.org/html/2606.29194#bib.bib49), [42](https://arxiv.org/html/2606.29194#bib.bib42), [36](https://arxiv.org/html/2606.29194#bib.bib36), [24](https://arxiv.org/html/2606.29194#bib.bib24)\] and tool\-use mechanisms \[[33](https://arxiv.org/html/2606.29194#bib.bib33), [41](https://arxiv.org/html/2606.29194#bib.bib41)\] are studied separately\. Agora adopts the basic shape \(LLM proposes Python, system sandboxes and validates, accepted code joins the runtime\), and adds two elements specific to alpha discovery: a sealed evaluation substrate that distinguishes the segment used to evolve metrics from the segment used to evaluate them, and an empirical promotion rule that admits a candidate metric only when its predictions correlate with realized train Sharpe\.

#### Multi\-agent and agent\-to\-agent LLM systems\.

Several frameworks coordinate multiple LLM instances on a shared task: MetaGPT models a software\-development hierarchy \[[11](https://arxiv.org/html/2606.29194#bib.bib11)\], AutoGen provides a general conversation\-based framework \[[45](https://arxiv.org/html/2606.29194#bib.bib45)\], CAMEL \[[18](https://arxiv.org/html/2606.29194#bib.bib18)\] and ChatDev \[[30](https://arxiv.org/html/2606.29194#bib.bib30)\] use role\-playing dialogues, and Generative Agents \[[28](https://arxiv.org/html/2606.29194#bib.bib28)\] demonstrate emergent dynamics over a shared memory\. AgentBench \[[20](https://arxiv.org/html/2606.29194#bib.bib20)\] evaluates role\-specialized agents on a range of tasks, and capability surveys \[[19](https://arxiv.org/html/2606.29194#bib.bib19)\] catalog the design space\. The coordination substrate in most of these systems is a chat history that the participating agents read directly\. We use the term agent\-to\-agent \(A2A\) for the narrower setting in which each agent runs on an isolated context \(its own client, message history, and system prompt\) and all cross\-agent information is forced through a typed substrate that the agents do not control\. Agora is A2A in this sense: the LLM Wiki and the directed advisory bundles are typed channels with a role\-scoped read filter, and no agent can read another’s full context\. With this isolation, the substrate filters holdout\-segment information at the channel level rather than by instruction; shared\-history multi\-agent designs cannot enforce that constraint\.

#### Statistical inference and out\-of\-sample evaluation\.

The data\-leakage and out\-of\-sample\-evaluation literature predates the LLM\-agent literature and supplies the inferential tools we use: Newey–West HAC standard errors \[[26](https://arxiv.org/html/2606.29194#bib.bib26)\] for serially correlated daily returns, the reality\-check and data\-snooping framework of \[[43](https://arxiv.org/html/2606.29194#bib.bib43)\], the predictive\-accuracy test of \[[7](https://arxiv.org/html/2606.29194#bib.bib7)\], and the leakage taxonomy of \[[15](https://arxiv.org/html/2606.29194#bib.bib15), [25](https://arxiv.org/html/2606.29194#bib.bib25)\]\. The deflated Sharpe and backtest\-overfitting diagnostics of \[[2](https://arxiv.org/html/2606.29194#bib.bib2), [22](https://arxiv.org/html/2606.29194#bib.bib22)\] inform our reporting choices\. In the LLM\-agent setting, contamination is a structural property of any pipeline in which agents can read each other’s contexts; forbidding agents from reading the holdout by instruction is not sufficient, and the substrate must enforce the constraint\.
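For readers unfamiliar with the Newey–West correction named above, the following is a minimal sketch of a HAC standard error for the mean of a return series, using the standard Bartlett\-kernel weights. The lag choice, the function name, and the toy series are assumptions for illustration; the excerpt does not say which lag length or implementation the authors use.

```python
import math

def newey_west_se_of_mean(returns, lags):
    """HAC standard error of the sample mean with Bartlett weights."""
    n = len(returns)
    if n < 2 or not 0 <= lags < n:
        raise ValueError("need n >= 2 and 0 <= lags < n")
    mean = sum(returns) / n
    e = [r - mean for r in returns]  # demeaned series

    def autocov(j):
        # Sample autocovariance at lag j, normalized by n.
        return sum(e[t] * e[t - j] for t in range(j, n)) / n

    long_run_var = autocov(0)
    for j in range(1, lags + 1):
        weight = 1.0 - j / (lags + 1)  # Bartlett kernel
        long_run_var += 2.0 * weight * autocov(j)
    return math.sqrt(long_run_var / n)

# Made-up, positively autocorrelated series (runs of +1 and -1).
series = [1, 1, 1, -1, -1, -1] * 2
naive_se = newey_west_se_of_mean(series, lags=0)
hac_se = newey_west_se_of_mean(series, lags=1)
```

With `lags=0` the function reduces to the usual standard error under independence. On this positively autocorrelated toy series the HAC estimate is larger, which is the direction of correction that matters for serially correlated daily returns.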

#### Open\-ended search\.

Expanding a system’s search vocabulary can outperform optimizing within a fixed one; this observation has a long history in evolutionary computation \[[17](https://arxiv.org/html/2606.29194#bib.bib17), [37](https://arxiv.org/html/2606.29194#bib.bib37), [39](https://arxiv.org/html/2606.29194#bib.bib39)\]\. Agora inherits the basic idea, in that artifacts the system produces \(new metrics, new operators, new rubrics\) become part of the search space for later rounds, but applies it under the sealing constraints imposed by financial out\-of\-sample evaluation \[[22](https://arxiv.org/html/2606.29194#bib.bib22)\]\. Open\-ended vocabulary expansion under a sealed evaluation segment does not appear in this literature\.

## 3 Methodology

### 3\.1 Methodological Reorientation

For thirty years, the methodology of automated discovery has had a fixed shape\. A search algorithm proposes candidate artifacts, a scoring function rates them, and a held\-out evaluator decides whether the search produced something of value\. Methodological progress means improving the search algorithm\. The scoring function is inherited from prior work, and the held\-out evaluator is a passive recipient of the final output\. Genetic programming, reinforcement learning, and prompt\-engineered language\-model proposers are all methodologically commensurable in this sense: they are different search algorithms operating on the same three\-component template\.

We argue that this template has run out of methodological room\. Improvements to the search algorithm under a fixed scoring function amount to faster exploitation of the scoring function’s blind spots\. The empirical evidence is uniform across decades of factor\-mining, program\-search, and policy\-learning literature: when the scoring function is held fixed, the search converges to artifacts that are maximally aligned with whatever the scoring function fails to penalize \[[2](https://arxiv.org/html/2606.29194#bib.bib2), [10](https://arxiv.org/html/2606.29194#bib.bib10), [15](https://arxiv.org/html/2606.29194#bib.bib15)\]\. The limit of search\-algorithm research, under a fixed scorer, is the limit of overfitting that scorer\.

Making the scorer itself a learned object appears in pieces in the recent self\-evolving\-LLM literature \[[40](https://arxiv.org/html/2606.29194#bib.bib40), [23](https://arxiv.org/html/2606.29194#bib.bib23), [32](https://arxiv.org/html/2606.29194#bib.bib32), [12](https://arxiv.org/html/2606.29194#bib.bib12), [21](https://arxiv.org/html/2606.29194#bib.bib21), [52](https://arxiv.org/html/2606.29194#bib.bib52)\] and in the multi\-agent\-LLM literature \[[11](https://arxiv.org/html/2606.29194#bib.bib11), [45](https://arxiv.org/html/2606.29194#bib.bib45), [18](https://arxiv.org/html/2606.29194#bib.bib18), [30](https://arxiv.org/html/2606.29194#bib.bib30), [28](https://arxiv.org/html/2606.29194#bib.bib28)\]\. The pieces have not been combined into a methodology, because the obstacle is not mechanical\. It is conceptual\. A search procedure that proposes both the artifact and the scorer collapses trivially: any procedure with control over both objects can satisfy itself by relaxing the second to admit the first\. Multiplying agents, layering critique loops, or querying a stronger evaluator at training time do not address the collapse; they relocate it\.

This obstacle dissolves under a methodological reorientation\. The object of methodological study is no longer the search algorithm\. It is the *information topology* of the search: which roles produce information of which type, which roles are permitted to consume it, what fidelity it has when it crosses a boundary, and what state persists across the boundary\. Under this reorientation, generalization is a property of the topology, not of any participant’s behavior\. The substrate becomes a first\-class methodological concern, on equal footing with the optimizer that runs on it\.

The methodology we propose, *Sealed Joint Search* \(SJS\), is the analytical framework that follows from this reorientation\. SJS does not specify a search algorithm\. It specifies the information topology under which a joint search over $(f,g)$ pairs admits a non\-degenerate solution\. Section [3\.2](https://arxiv.org/html/2606.29194#S3.SS2) formalizes the framework and Section [3\.3](https://arxiv.org/html/2606.29194#S3.SS3) states the five topological properties that characterize the sealed regime\. Section [3\.6](https://arxiv.org/html/2606.29194#S3.SS6) introduces the *agent\-to\-agent* \(A2A\) realization of SJS, which instantiates the framework on existing language models\. Section [3\.7](https://arxiv.org/html/2606.29194#S3.SS7) introduces Agora, the A2A system on which we report empirical evidence; further implementation detail is in Section [4](https://arxiv.org/html/2606.29194#S4)\.

### 3\.2 Information\-Topology Formulation of Joint Search

Let $\mathcal{F}$ be a space of candidate artifacts and $\mathcal{G}$ a space of candidate scoring functions\. Let $u$ be an external evaluator on a sealed segment $\mathcal{D}_{\text{holdout}}$\. Joint search optimizes

$$(f^{\star}, g^{\star}) = \arg\max_{(f,g)} \; u(f, \mathcal{D}_{\text{holdout}}),$$

under structural constraints on which we focus\.

A search procedure is, abstractly, a directed graph\. Its nodes are *loci*: producers \(of $f$, of $g$\), adjudicators, and stores of artifacts\. Its edges are typed information channels, each labeled with a content type, a fidelity, and a provenance tag\. Its temporal structure is a schedule that determines when each node may write to each outgoing edge and when each may read from each incoming edge\. Under this view, the search algorithm is the local update rule at a single node; the scoring function is the local update rule at another\. Generalization is determined by the global graph, not by any single rule\.

The collapse described in Section [3\.1](https://arxiv.org/html/2606.29194#S3.SS1) is a topological property: it occurs whenever the producer of $f$ and the producer of $g$ share an unfiltered edge\. Under such a topology, no choice of local update rules at either node prevents the search from relaxing $g$ to admit any $f$\. Conversely, generalization\-respecting topologies have a consistent shape across instantiations\. SJS is the characterization of that shape\.
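A toy illustration of this collapse, under heavy simplifying assumptions: artifacts are two named strings with a made\-up quality value, and the scorer family is parameterized by a single hypothetical "leniency" knob. None of these names come from the paper; the sketch only shows why a procedure that controls both $f$ and $g$ can report a perfect score for a poor artifact.

```python
# Hypothetical toy: two named artifacts with made-up quality values.
QUALITY = {"f_good": 0.9, "f_bad": 0.1}

def make_scorer(leniency):
    """A family of scorers g; higher leniency inflates every score toward 1.0."""
    return lambda f: min(1.0, QUALITY[f] + leniency)

def collapsed_joint_search(artifacts, leniencies):
    """One procedure controls both f and g and maximizes its own score g(f)."""
    return max(
        ((f, lam) for f in artifacts for lam in leniencies),
        key=lambda pair: make_scorer(pair[1])(pair[0]),
    )

def sealed_search(artifacts, fixed_leniency=0.0):
    """The proposer of f cannot touch g, so the score still tracks quality."""
    g = make_scorer(fixed_leniency)
    return max(artifacts, key=g)

f_self, lam_self = collapsed_joint_search(["f_bad", "f_good"], [0.0, 0.5, 1.0])
self_score = make_scorer(lam_self)(f_self)
f_sealed = sealed_search(["f_bad", "f_good"])
```

In the collapsed search, the weak artifact already reaches the maximum score once the scorer is relaxed far enough, so the procedure has no reason to prefer the better one. Holding the scorer out of the proposer's reach restores the ordering.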

### 3\.3 The Sealed Topology

A topology is *sealed* when its edge structure satisfies five properties simultaneously\. We state each property in topological terms; alternative realizations satisfying the same property are admissible\.

#### P1\. Edge\-typed asymmetric delivery\.

Every edge carries a content type and a recipient role\. The content that traverses an edge is determined by the recipient, not by the sender\. The same source node, writing what it considers a single output, produces distinct payloads on edges to distinct recipients; the differences are computed by the substrate at the edge, not by the source\. Topologies in which the producing node controls what its recipients see, including chat\-style topologies in which a message is broadcast verbatim, fail this property\.
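A minimal sketch of recipient\-determined delivery, assuming the substrate holds a per\-edge allow\-list of fields. The content types, role names, and field names here are hypothetical and are not taken from Agora.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    content_type: str
    fields: dict

# Hypothetical edge filters, owned by the substrate rather than the sender:
# (content type, recipient role) -> fields that recipient may see.
EDGE_FILTERS = {
    ("evaluation_report", "proposer"): {"summary"},
    ("evaluation_report", "orchestrator"): {"summary", "scores", "dissent"},
}

def deliver(record, recipient_role):
    """Compute the payload at the edge; the sender has no say in the result."""
    allowed = EDGE_FILTERS.get((record.content_type, recipient_role))
    if allowed is None:
        return None  # no such edge exists for this recipient
    return {k: v for k, v in record.fields.items() if k in allowed}

report = Record(
    "evaluation_report",
    {"summary": "stable on train", "scores": [0.4, 0.6], "dissent": "E2 disagrees"},
)
to_proposer = deliver(report, "proposer")
to_orchestrator = deliver(report, "orchestrator")
to_stranger = deliver(report, "unknown_role")
```

The source writes one record, yet the two recipients receive different payloads. A chat\-style broadcast would correspond to a filter that passes every field to every role, which is the case the property rules out.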

#### P2\. Bounded\-capacity inter\-temporal edges\.

Time is partitioned into rounds\. Edges that cross a round boundary have a capacity bounded by the substrate, irrespective of how much the source node generated\. Topologies in which agents thread chat histories or accumulate context across rounds do not have this property and admit unbounded inter\-temporal information flow\. Generalization\-respecting topologies route across\-round influence through a small number of typed records that the substrate, not the agents, controls\.
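One simple way to picture a bounded inter\-temporal edge is a substrate\-side cap on how many records, and how many characters per record, survive a round boundary. The specific limits and the function name below are assumptions for illustration only.

```python
def carry_across_round(records, max_records=2, max_chars=80):
    """Substrate-side cap: keep the newest records and clip each one,
    regardless of how much text the agents produced during the round."""
    newest = records[-max_records:] if max_records > 0 else []
    return [text[:max_chars] for text in newest]

# Made-up round output: two very long records and one short one.
round_output = ["a" * 500, "b" * 500, "c" * 10]
next_round_input = carry_across_round(round_output)
```

Whatever the agents write, the next round sees at most `max_records * max_chars` characters, so the cross\-round channel has a capacity the agents cannot raise.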

#### P3\. Provenance\-determined read scope\.

Every record carries a provenance tag identifying the data segment it was derived from\. Each reading role has a read scope expressed as a set of \(content type, provenance\) pairs that role is authorized to consume\. A record outside any role’s read scope is unreachable to that role\. The substrate enforces the scope at read time\. Topologies in which roles share a global read context, including most multi\-agent chat designs, fail this property\. The held\-out segment is sealed precisely when no role on the artifact\-proposal path has holdout\-derived records in its read scope\.
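The read\-scope check described above can be sketched as a filter over \(content type, provenance\) pairs. The role names, content types, and segment tags are hypothetical; in particular, `offline_evaluator` stands in for whatever non\-proposal role is allowed to touch holdout\-derived records.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    content_type: str
    provenance: str  # data segment the record was derived from
    payload: str

# Hypothetical read scopes: role -> set of (content type, provenance) pairs.
READ_SCOPE = {
    "proposer": {("brief", "train"), ("metric_report", "train")},
    "offline_evaluator": {("metric_report", "train"), ("metric_report", "holdout")},
}

def read(store, role):
    """Enforce the scope at read time; out-of-scope records are unreachable."""
    scope = READ_SCOPE.get(role, set())
    return [r for r in store if (r.content_type, r.provenance) in scope]

def holdout_is_sealed(read_scope, proposal_path_roles, holdout_tag="holdout"):
    """True when no role on the proposal path can read holdout-derived records."""
    return all(
        prov != holdout_tag
        for role in proposal_path_roles
        for (_ctype, prov) in read_scope.get(role, set())
    )

store = [
    Record("brief", "train", "round 12 brief"),
    Record("metric_report", "holdout", "post-run report"),
]
visible_to_proposer = read(store, "proposer")
sealed = holdout_is_sealed(READ_SCOPE, ["proposer"])
leaky = holdout_is_sealed(READ_SCOPE, ["proposer", "offline_evaluator"])
```

Because sealing is a property of the scope table, it can be audited without inspecting any agent's prompt or behavior.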

#### P4\. Skill stores as the locus of evolution\.

A topology is *evolutionary* only if there is a designated locus where its accumulated competence resides\. Wiki records accumulate \(P3\), briefs refresh \(P2\), and edge filters fire \(P1\), but none of these alone changes what the search procedure can do in subsequent rounds\. The methodological claim is that self\-evolution requires a separate class of node, the *skill store*, whose contents are persistent, executable, and directly consumed by a producer node’s local update rule\. Without this class of node, the topology has memory but no causal effect on future rounds\. With it, the search procedure’s terminal state at round $T$ differs from its terminal state at round $T-1$, and the difference is recorded in the substrate as a named transition\. Skill stores are also where decomposition’s writability constraint lives: each store is owned by exactly one producer node, and no other node can write to it\. A topology with shared writable skill stores reintroduces the collapse described in Section [3\.1](https://arxiv.org/html/2606.29194#S3.SS1) through the back door: a producer of $g$ that writes to the proposer\-of\-$f$’s skill store can implicitly relax $f$’s search space\.
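The single\-owner writability constraint can be sketched as an append\-only store that rejects writes from any role other than its owner. The class, role names, and entry contents are hypothetical and do not reflect Agora's actual storage layer.

```python
class SkillStore:
    """Append-only, versioned store that only its owning role may write to."""

    def __init__(self, name, owner):
        self.name = name
        self.owner = owner
        self._versions = []  # history of (entry_name, source_code)

    def write(self, role, entry_name, source_code):
        if role != self.owner:
            raise PermissionError(f"{role} cannot write to {self.name}")
        self._versions.append((entry_name, source_code))
        return len(self._versions)  # 1-based version number

    def latest(self, entry_name):
        for name, code in reversed(self._versions):
            if name == entry_name:
                return code
        raise KeyError(entry_name)

metric_store = SkillStore("metric", owner="metric_designer")
v1 = metric_store.write("metric_designer", "m1", "def m1(x): return x")
v2 = metric_store.write("metric_designer", "m1", "def m1(x): return 2 * x")

try:
    metric_store.write("alpha_proposer", "m1", "def m1(x): return 1.0")
    foreign_write_blocked = False
except PermissionError:
    foreign_write_blocked = True
```

Keeping the history append\-only means every change to the store is a named, inspectable event, which matches the requirement that differences between rounds be recorded in the substrate.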

#### P5\. Closure\-induced state transitions\.

Persistent state in a skill store advances only through transitions computed by the substrate from records the substrate produced for other reasons\. The transitions do not consult the external evaluator \(which would defeat sealing\) and are not adjudicated by any node with a stake in the outcome \(which would defeat decomposition\)\. The transition rule’s outcome variable is necessarily internal to the closed search\. The methodology accepts this circularity as a structural property of sealed joint search and treats its consequences in Section [6](https://arxiv.org/html/2606.29194#S6)\.
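A minimal sketch of such a substrate\-computed transition, using the four state names that appear elsewhere in the paper. The transition table is an assumption: the excerpt names the states but does not say which moves are legal, so this sketch lets only trial entries move and treats builtin, accepted, and rejected as fixed.

```python
from enum import Enum

class State(Enum):
    BUILTIN = "builtin"
    TRIAL = "trial"
    ACCEPTED = "accepted"
    REJECTED = "rejected"

# Assumed transition table: only trial entries can change state.
ALLOWED = {State.TRIAL: {State.ACCEPTED, State.REJECTED}}

def advance(state, evidence_passes):
    """Substrate-side transition. `evidence_passes` is a boolean the substrate
    derives from its own training-segment records, never from the holdout
    and never from the agent that proposed the entry."""
    target = State.ACCEPTED if evidence_passes else State.REJECTED
    if target not in ALLOWED.get(state, set()):
        return state  # illegal move: state is left unchanged
    return target

promoted = advance(State.TRIAL, evidence_passes=True)
dropped = advance(State.TRIAL, evidence_passes=False)
untouched = advance(State.BUILTIN, evidence_passes=False)
```

The function takes no argument through which a proposing agent could vouch for its own entry, which is the sense in which the rule is not self\-judging.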

### 3\.4 Implications of the Topological Reorientation

The reorientation has three consequences that algorithm\-centered methodologies do not yield\.

First, generalization analysis becomes *compositional*\. Properties of a sealed topology are inherited by any local update rule that respects the edges; the reverse is not true\. A topology analysis tells the practitioner which classes of leakage are structurally impossible regardless of the optimizer in use\. Algorithm\-level analysis covers only whether a specific optimizer satisfies a specific seal under specific assumptions about its behavior\.

Second, the methodological objects of study transfer across domains\. A sealed topology designed for joint search over alpha factors and their scorers can, with no change to its edge structure, host a joint search over theorems and proof tactics, over programs and their tests, or over experimental designs and their analysis plans\. The local update rules at the loci change; the topology does not\. The instantiation choices that vary across domains \(which LLM, which operator language, which outcome variable in the P5 rule\) are decoupled from the topology that makes joint search admissible\.

Third, ablations target edges rather than agents\. The ablation we report in Section [5](https://arxiv.org/html/2606.29194#S5) disables transitions on certain edges \(the P5 transition rule on the P4 skill stores\) while leaving all other edges and all node\-local update rules unchanged\. Under an algorithm\-centered methodology this is a strange experiment\. Under a topology\-centered methodology it is the direct ablation: it isolates which edge of the graph the empirical performance depends on\.

### 3\.5 Two Timescales of Topological Self\-Evolution

Under SJS, self\-evolution operates on the topology, not on the nodes\. Within a round, the topology’s state is read\-only to the node update rules; nothing a node computes during a round can modify the substrate visible to another node before the round closes\. Across rounds, the substrate updates: new records accumulate in the typed stores, the bounded inter\-temporal edges refresh from the just\-completed round’s adjudicator output, and the skill\-store transition rules fire under the P5 closure condition\. Because every state change is a named substrate record, the channel that carries learning is also the channel a reviewer can inspect\.

### 3\.6 The Agent\-to\-Agent Realization of SJS

SJS specifies a topology but does not say what populates the loci\. The framework admits multiple realization patterns: classical multi\-process RL ensembles, federated optimization with disjoint workers, or, in the regime we study, large\-language\-model agents holding isolated contexts and communicating through typed substrate\-mediated edges\. We name this last realization the *agent\-to\-agent* \(A2A\) realization of SJS, and argue that it fits the topology’s properties for three reasons\.

First, an LLM call is a suitable *locus*\. Each call begins from an unchanged prior, executes a stateless function from prompt to output, and consumes no information beyond its supplied context\. LLM agents that communicate only through the substrate satisfy the disjoint\-state requirement of P1’s producer\-decomposition condition by construction\. Multi\-process RL ensembles can satisfy P1 too, but require active work to prevent gradient leakage and parameter sharing; LLM agents satisfy it for free\.

Second, LLM message\-passing fits P2 and P3\. LLM prompts are typed strings: an agent’s input is a structured payload, not a fragment of memory\. The substrate can therefore filter the payload at the edge \(P1’s asymmetric delivery, P3’s role\-scoped read scope\) and bound the cross\-round capacity \(P2\) without modifying any agent’s internal mechanism\. This payload discipline lets an adjudicator role be realized as a panel of independent LLM instances whose narrative reports the substrate aggregates into a single brief\. Each panel instance writes a report; disagreement among instances enters the next round as evidence rather than being collapsed by majority vote\. The same operations on a multi\-process RL ensemble would require explicit serialization layers between processes; LLM agents already operate on serialized typed payloads\.

Third, code\-as\-an\-output makes LLM agents suitable producers for the P4 skill stores\. An LLM agent that emits Python code for a new operator, a new scoring metric, or a new evaluator rubric writes into a versioned skill store the same way a software engineer commits to a versioned codebase\. Classical optimizers can populate P4 stores too \(a learned reward function in RL is one such artifact\), but the artifacts are typically continuous parameters rather than discrete, executable, named entries\. LLM agents produce the kind of artifact that the four\-state machine \(*builtin*/*trial*/*accepted*/*rejected*\) was designed for\.

### 3\.7 Agora: An A2A System for Alpha Discovery

We test the A2A realization on a concrete domain: alpha discovery on Chinese A\-share equities, with the sealed external evaluator $u$ realized as a fixed layered backtest\. The system, named *Agora*, populates the SJS topology with five LLM agent roles distributed across nine independent clients \(the alpha\-miner and evaluation\-miner are the producers of $f$ and $g$; the research\-report is the upstream advisory; two adjudicator panels of three instances each handle adjudication\), eight typed skill stores \(four owned by the alpha\-miner and seeded from the AlphaGen \[[51](https://arxiv.org/html/2606.29194#bib.bib51)\] operator vocabulary and Maskable PPO infrastructure \[[35](https://arxiv.org/html/2606.29194#bib.bib35), [31](https://arxiv.org/html/2606.29194#bib.bib31)\]; the metric store is seeded with four builtins and is the store in which evolution is observed\), three communication channels \(directed advisory, bounded cross\-round briefs, and a typed shared LLM Wiki\), and a substrate\-local promotion rule on training\-segment Sharpe\. Figure [1](https://arxiv.org/html/2606.29194#S3.F1) renders the topology\.

![Refer to caption](https://arxiv.org/html/2606.29194v1/x1.png)

Figure 1: The Sealed Joint Search topology realized by Agora as an agent\-to\-agent system\. Two producer loci \(alpha\-miner, evaluation\-miner\) propose $f$ and $g$; an upstream advisory locus \(research\-report\) emits a single bundle that the substrate routes to the two producers under distinct content filters \(P1\)\. Two adjudicator panels of three instances each write narrative reports that the orchestrator aggregates into bounded cross\-round briefs \(P2, dashed edges\)\. The LLM Wiki is the typed shared memory through which every locus reads and writes under a role\-scoped, provenance\-tagged filter \(P3, dotted edges\)\. Eight skill stores, four owned by the alpha\-miner \(seeded from AlphaGen \[[51](https://arxiv.org/html/2606.29194#bib.bib51)\]\) and four distributed across the evaluation\-miner, the research\-report, and the adjudicator panels, are the locus of evolution \(P4\); their state advances under the closure rule of P5 \($|\rho| > 0.3$ promote, $|\rho| < 0.05$ demote, against the alpha\-miner’s training\-segment Sharpe\)\. The external evaluator $u$ sits outside the sealed substrate and is consulted once at the end of the run\.

Section [4\.3](https://arxiv.org/html/2606.29194#S4.SS3) gives the full instantiation parameters \(which LLM, how many clients per role, the exact filter rules, the round schedule, the promotion thresholds\)\.

### 3\.8 Scope and Open Questions

SJS specifies the information topology of admissible joint search\. It does not specify which loci to populate, how many adjudicator instances to run, which optimizer to use at each producer locus, which operator language to express $\mathcal{F}$ in, or which outcome variable to use in the closure\-induced transition rule\. Agora’s specific choices \(five roles, nine LLM clients, two panels of three instances each, AlphaGen operators, training Sharpe as the P5 outcome variable\) are instantiation parameters rather than methodological commitments\.

Two open questions are intrinsic to the methodology rather than to any instantiation\. The closure\-induced transition rule \(P5\) must reference an internal outcome variable, and the strength of the self\-evolution signal is bounded by how informative that variable is about the external evaluator $u$\. SJS makes this dependence explicit but does not eliminate it\. Producer decomposition \(P1\) is necessary by the argument in Section [3\.2](https://arxiv.org/html/2606.29194#S3.SS2) but is difficult to ablate empirically without dismantling the rest of the topology\. Section [6](https://arxiv.org/html/2606.29194#S6) returns to both\.

## 4 Experimental Setup

We compare Agora against seven baselines on the CSI 1000 universe under a sealed 91\-day holdout\.

### 4\.1 Data and Evaluation Windows

All methods operate on the CSI 1000 dynamic\-universe panel sourced from RiceQuant post\-adjusted OHLCV\. The panel covers 2014\-10\-17 – 2026\-05\-27\. Daily auxiliary fields include `combo_mask` \(tradable membership\), `limit_up_filter` \(price\-limit binary mask\), and the next\-day open price used as the execution reference\. The temporal split is fixed across all methods\. The training segment runs from 2014\-10 – 2019\-12 and contains 1277 trading days\. The test segment runs from 2020\-01 – 2025\-12 and contains 1461 trading days\. The holdout segment runs from 2026\-01 – 2026\-05 and contains 91 trading days\. The split is decided before any method is run and is identical for training, evaluator promotion, top\-$K$ selection, and final scoring\. Top\-30 selection across every method uses the train\-segment information coefficient\.

### 4\.2 External Backtest

The external utility $u$ of Section [3](https://arxiv.org/html/2606.29194#S3) \(condition C5\) is realized by the function `simple_backtest_layered`: a 10\-decile sort with 5\-day rebalance on next\-day open, after costs\. Transaction cost is 9 basis points one\-way \(double\-sided 0\.04% commission plus 0\.05% stamp tax\)\. The function is imported from a frozen module that no method, baseline or Agora, may modify\. It returns rank IC, IR, long\-short Sharpe, annualized return, maximum drawdown, decile monotonicity, average turnover, and excess statistics against an equal\-weight CSI 1000 benchmark\.

### 4\.3 Agora Implementation

Agora’s instantiation parameters as the A2A realization of SJS are given below; the mapping from SJS properties to design choices is in Section [3\.6](https://arxiv.org/html/2606.29194#S3.SS6), and full implementation details are in Appendix [A](https://arxiv.org/html/2606.29194#A1)\.

#### Roles and clients \(P1 producers, adjudicators, advisory\)\.

Five roles are realized as nine independent LLM clients, all backed by `claude-sonnet-4-6` \[[1](https://arxiv.org/html/2606.29194#bib.bib1)\]\. The research\-report role \(one client\) provides the upstream advisory and owns the topic and macro\-regime libraries\. The alpha\-miner role \(one client\) is the producer of $f$ and owns the operator, RL\-network, RL\-algorithm, and reward libraries\. The evaluation\-miner role \(one client\) is the producer of $g$ and owns the metric library\. The alpha\-evaluator and factor\-metrics\-evaluator roles each run as panels of three independent instances; each instance carries its own copy of a rubric or meta\-rubric library\. The panels do not vote\. Each instance writes a narrative report and the orchestrator aggregates the three reports into a single brief; disagreement across the three reports enters the brief rather than being resolved before it\. Every client runs on a private message history and a private system prompt, with no read access to any other client’s context\.
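
The no\-vote aggregation can be pictured with a small Python sketch\. It assumes, purely for illustration, that each narrative report also carries a structured `verdicts` mapping; the report schema, the `aggregate_reports` name, and the `agreed`/`disputed` keys are hypothetical and not taken from Agora\.

```python
from collections import defaultdict


def aggregate_reports(reports):
    """Merge panel reports into one brief without voting.

    Each report is a dict {"instance": str, "verdicts": {candidate: label}};
    this schema is a hypothetical simplification of a narrative report.
    """
    by_candidate = defaultdict(dict)
    for report in reports:
        for candidate, label in report["verdicts"].items():
            by_candidate[candidate][report["instance"]] = label

    brief = {"agreed": {}, "disputed": {}}
    for candidate, labels in by_candidate.items():
        if len(set(labels.values())) == 1:
            brief["agreed"][candidate] = next(iter(labels.values()))
        else:
            # Keep every instance's position instead of taking a majority.
            brief["disputed"][candidate] = dict(labels)
    return brief
```

A candidate on which two instances say one thing and the third another lands under `disputed` with all three positions attached, so the disagreement itself is what the next round receives\.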

#### Three communication channels\.

Inter\-role information moves through three typed channels\.

Channel A is the within\-round directed advisory edge of P1\. The research\-report emits a single advisory bundle per round, which the substrate routes to the alpha\-miner and the evaluation\-miner under distinct content filters\. Macro\-regime labels and dated references are stripped from the alpha\-miner’s payload before delivery, since the alpha\-miner targets the training segment and regime labels correlate with later segments\. The evaluation\-miner receives the advisory unfiltered\.

Channel B is the bounded cross\-round edge of P2\. After each adjudicator panel writes its three reports in a round, the orchestrator aggregates them into one brief\. Two briefs are carried across the round boundary: the alpha\-evaluator brief and the factor\-metrics\-evaluator brief\. In the next round, the alpha\-evaluator brief is delivered to both the alpha\-miner and the research\-report, while the factor\-metrics\-evaluator brief is delivered to the evaluation\-miner only\. The asymmetry is enforced at the substrate, not by agent instruction: the factor\-metrics evaluator has been exposed to test\-segment numbers, and routing its brief to the research\-report would reintroduce a path from test\-segment evidence into the alpha search\.

Channel C is the persistent shared memory of P3, realized as the LLM Wiki\. The Wiki is partitioned into eight typed sections \(factors, evaluators, reports, concepts, failures, verdicts, sources, regimes\)\. Every role reads through a role\-scoped filter that determines which fields of a record are visible\. Roles on the alpha\-proposal path \(research\-report, alpha\-miner, alpha\-evaluator\) cannot see test\- or holdout\-derived numerical fields; the evaluation\-miner and factor\-metrics\-evaluator can see test\-derived fields at coarse fidelity\.
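
A minimal Python sketch of the substrate\-side routing for Channels A and B follows\. The routing table mirrors the delivery rules stated above; the bundle fields \(`macro_regime`, `notes`\), the date pattern, and all function names are assumptions made for the example\.

```python
import re

# Assumed shape of a "dated reference": YYYY-MM or YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}(?:-\d{2})?\b")


def filter_advisory_for_alpha_miner(bundle):
    """Channel A: drop macro-regime labels and notes containing dates."""
    kept = {k: v for k, v in bundle.items() if k != "macro_regime"}
    kept["notes"] = [n for n in bundle.get("notes", [])
                     if not DATE_PATTERN.search(n)]
    return kept


def route_advisory(bundle):
    """One advisory bundle in, two role-specific payloads out."""
    return {
        "alpha-miner": filter_advisory_for_alpha_miner(bundle),
        "evaluation-miner": dict(bundle),  # delivered unfiltered
    }


# Channel B: which roles receive each cross-round brief.
BRIEF_ROUTES = {
    "alpha-evaluator": ("alpha-miner", "research-report"),
    "factor-metrics-evaluator": ("evaluation-miner",),
}


def route_brief(source_panel, brief):
    """Deliver a brief only to the roles listed for its source panel."""
    return {role: brief for role in BRIEF_ROUTES[source_panel]}
```

Keeping the routes in a table owned by the substrate, rather than in any agent’s prompt, is what makes the asymmetry a property of the edge and not of agent behavior\.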

#### Forced\-segment evaluation \(P3 enforcement\)\.

When the alpha\-miner’s candidates are first scored for the alpha\-evaluator panel, the substrate forces the scoring segment to be the training segment and tells the alpha\-miner the same\. The test\-segment scoring of the same candidates happens later, for the evaluation\-miner’s separate use; the alpha\-miner never sees those numbers\. Where the evaluation\-miner must condition on a test\-derived signal in order to act, the substrate delivers a four\-bucket categorical label against a fixed threshold rather than a real value, realizing the coarse\-categorical fidelity reduction of P3\.
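
The fidelity reduction can be sketched in Python as a role\-scoped read that hides or coarsens fields by provenance\. The bucket edges, the bucket labels, and the `(value, provenance)` record layout are hypothetical; the source states only that the label has four buckets and is computed against a fixed threshold\.

```python
from bisect import bisect_right

# Hypothetical edges and labels; only the bucket count comes from the text.
BUCKET_EDGES = (-0.5, 0.0, 0.5)
BUCKET_LABELS = ("poor", "weak", "fair", "strong")

ALPHA_PATH_ROLES = {"research-report", "alpha-miner", "alpha-evaluator"}


def coarse_label(value):
    """Map a real-valued test-segment statistic to one of four categories."""
    return BUCKET_LABELS[bisect_right(BUCKET_EDGES, value)]


def read_record(record, role):
    """Role-scoped read over a record of {field: (value, provenance)}."""
    visible = {}
    for name, (value, provenance) in record.items():
        if provenance == "holdout":
            continue  # sealed from every role in this sketch
        if provenance == "test":
            if role in ALPHA_PATH_ROLES:
                continue  # alpha-proposal path never sees test fields
            value = coarse_label(value)  # others see a category only
        visible[name] = value
    return visible
```

For example, a record holding a training IC and a test Sharpe yields only the training IC for the alpha\-miner, while the evaluation\-miner additionally receives the test Sharpe as a category label\.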

#### Skill libraries \(P4 stores\)\.

Agora maintains eight typed skill libraries: `operator`, `RL-network`, `RL-algorithm`, `reward`, and `metric` on the alpha\-side cluster, plus `topic`, `rubric`, and `meta-rubric`\. The libraries ship with 64 builtin skills total\. Each entry is in one of four states: *builtin* \(immutable, ships with the system\), *trial* \(proposed at runtime\), *accepted* \(promoted from trial\), and *rejected* \(soft\-deleted\)\. Each library has exactly one owning role; no other role can write to it\. The alpha\-miner’s four libraries are seeded with the AlphaGen \[[51](https://arxiv.org/html/2606.29194#bib.bib51)\] operator vocabulary and Maskable PPO infrastructure \[[35](https://arxiv.org/html/2606.29194#bib.bib35), [31](https://arxiv.org/html/2606.29194#bib.bib31)\], which makes Agora’s signal search compatible with the AlphaGen baseline \(B2 below\)\. The metric library is seeded with four builtins \(rank IC, information ratio, score stability, turnover penalty\) and is the library in which agent\-driven evolution is observed in our experiments\.
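
One way to picture the four\-state lifecycle and the single\-owner write rule is the Python sketch below\. The class and method names are illustrative, and the *trial* to *rejected* transition is an assumption: the paper names the four states but spells out only promotion and demotion\.

```python
from enum import Enum


class State(Enum):
    BUILTIN = "builtin"
    TRIAL = "trial"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


# Promotion and demotion are named in the text; trial -> rejected is assumed.
ALLOWED = {
    (State.TRIAL, State.ACCEPTED),
    (State.ACCEPTED, State.TRIAL),
    (State.TRIAL, State.REJECTED),
}


class SkillLibrary:
    def __init__(self, name, owner, builtins):
        self.name = name
        self.owner = owner
        self.entries = {skill: State.BUILTIN for skill in builtins}

    def propose(self, role, skill):
        """Only the owning role may add an entry; it starts in trial."""
        if role != self.owner:
            raise PermissionError(f"{role} cannot write to {self.name}")
        if skill in self.entries:
            raise ValueError(f"{skill} already exists")
        self.entries[skill] = State.TRIAL

    def transition(self, skill, new_state):
        """Substrate-side state change; builtin entries never move."""
        old = self.entries[skill]
        if (old, new_state) not in ALLOWED:
            raise ValueError(f"{old.value} -> {new_state.value} not allowed")
        self.entries[skill] = new_state
```

Because no pair in `ALLOWED` starts from `BUILTIN`, builtin entries stay immutable without a separate check\.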

#### Code\-proposal sandbox\.

A code\-bearing proposal enters a library only after passing three filters: a static restriction on imports and language constructs, a dynamic dry\-run on synthetic data with a bounded compute budget that rejects exceptions or non\-finite outputs, and, for RL\-algorithm proposals, a smoke\-training pass that must not diverge\.
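
The first of the three filters, the static restriction, could look roughly like the following Python sketch built on the standard `ast` module\. The allow\-list of imports and the list of banned calls are hypothetical placeholders; the source does not enumerate Agora’s actual restrictions, and the dry\-run and smoke\-training filters are not shown\.

```python
import ast

ALLOWED_IMPORTS = {"math", "numpy"}  # hypothetical allow-list
BANNED_CALLS = {"eval", "exec", "open", "__import__", "compile"}


def static_check(source):
    """Return a list of violations; an empty list means the filter passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        # Collect the top-level module name of every import statement.
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            names = []
        problems += [f"import of {n!r}" for n in names
                     if n not in ALLOWED_IMPORTS]

        # Reject direct calls to dynamic-execution and file builtins.
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_CALLS):
            problems.append(f"call to {node.func.id!r}")

        # Reject dunder attribute access such as obj.__class__.
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            problems.append(f"dunder attribute {node.attr!r}")
    return problems
```

A static pass of this kind inspects the syntax tree without executing the proposal, which is why it can run before the bounded dry\-run\.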

#### Promotion rules \(P5\)\.

Two routes are admitted\. The empirical route is automatic: every time a metric scores an alpha that is later backtested, the substrate logs a pair \(metric score, training Sharpe\)\. After at least three observations, the running absolute correlation $|\bar{\rho}|$ is computed\. Promotion from *trial* to *accepted* fires when $|\bar{\rho}| > 0.3$; demotion from *accepted* back to *trial* fires when $|\bar{\rho}|$ falls below $0.05$\. The empirical rule runs at the end of every round\. The agent\-initiated route lets the evaluation\-miner file an explicit promotion request, which the factor\-metrics\-evaluator panel adjudicates in its next\-round brief\.
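
The empirical route can be sketched in a few lines of Python\. The sketch assumes that the running correlation is a Pearson correlation over all logged pairs \(the source does not say which estimator is used\); the function names and the string state labels are illustrative\.

```python
import math

PROMOTE_ABOVE = 0.3
DEMOTE_BELOW = 0.05
MIN_OBSERVATIONS = 3


def abs_correlation(pairs):
    """Absolute Pearson correlation of (metric score, training Sharpe) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no evidence either way
    return abs(sxy / math.sqrt(sxx * syy))


def next_state(state, pairs):
    """Apply the empirical promotion rule at the end of a round."""
    if len(pairs) < MIN_OBSERVATIONS:
        return state
    rho = abs_correlation(pairs)
    if state == "trial" and rho > PROMOTE_ABOVE:
        return "accepted"
    if state == "accepted" and rho < DEMOTE_BELOW:
        return "trial"
    return state
```

Values of the absolute correlation between 0\.05 and 0\.3 leave the state unchanged, so the two thresholds form a hysteresis band that keeps an entry from oscillating on small changes\.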

#### Round execution sequence\.

A round runs in a fixed sequence with no inner retry loop\. The research\-report agent issues its advisory; the alpha\-miner proposes candidates conditioned on the prior round’s alpha\-evaluator brief and an optional PPO seed pool; the alpha\-evaluator panel scores the candidates on the training segment and writes three reports; the evaluation\-miner proposes a metric\-library action conditioned on the prior round’s factor\-metrics brief; the candidates are re\-scored on the test segment under the new metric set, composited into a portfolio, and run through the per\-alpha layered backtest; the factor\-metrics\-evaluator panel writes three reports on the backtest result\. The orchestrator then commits: alphas above the backtest threshold persist, the metric library’s predictive correlations update and auto\-promotion fires, the operator, network, reward, RL\-algorithm, and macro\-regime libraries’ usage statistics update, and the two adjudicator briefs are aggregated for the next round\. Wiki commits are written last\. Persistent state advances only at the commit step\. The within\-round dynamics are reproducible from the starting state; learning is carried entirely by the across\-round updates to the Wiki, the two briefs, and the library status registers\.

#### Run configuration\.

The reported run uses 100 outer rounds, the `claude-sonnet-4-6` backbone for every role, and an RTX 5090 GPU for the optional PPO relay\. Each round uses roughly 45 LLM calls, yielding $\sim 4500$ calls total, plus 100 PPO relay episodes at 15000 timesteps each\. Wall\-clock is roughly 60 hours\. The alpha database after round 100 contains 94 unique alphas\. The reported Agora composite is the equal\-weight $z$\-score of the top\-30 alphas selected by training\-segment IC\.

### 4\.4 Baselines

Seven baselines span symbolic search, deep RL, prompt\-only LLMs, and ablations of Agora itself\. All consume the same data, the same split, and the same external backtest\.

#### B1: Genetic programming \(gplearn\)\.

Adapted from the official AlphaGen reference implementation \[[51](https://arxiv.org/html/2606.29194#bib.bib51)\]\. Population 500 \(reduced from 1000\), 20 generations \(reduced from 40\)\. Initialization tree depth $(2, 6)$\. Tournament size 100\. Crossover, sub\-tree mutation, hoist mutation, and point mutation rates are 0\.3, 0\.1, 0\.01, and 0\.1\. The token\-length cap is 20: trees beyond the cap are assigned fitness $-1.0$\. Fitness is per\-day rank IC averaged over the train segment via `calc.calc_single_IC_ret`\. Random seed is 42\. The top\-$K$ pool is extracted as `Counter(cache).most_common(top_k)`\. A small parser bypass via `eval(key)` handles expressions such as `EMA(Constant(-2.0), 20)` that AlphaGen’s `parse_expression` rejects\.

#### B2: AlphaGen pure PPO\.

`scripts/train_alphagen_ppo.py` is invoked directly\. The configuration uses 500 000 total timesteps, pool capacity 30, an LSTM shared\-net of 2 layers with $d_{\text{model}} = 128$ and dropout 0\.1, and MaskablePPO from \[[35](https://arxiv.org/html/2606.29194#bib.bib35), [31](https://arxiv.org/html/2606.29194#bib.bib31)\]\. The reward is training\-segment IC against `Ref(open,-6)/Ref(open,-1)-1`\. The device is `cuda:0` and the seed is 42\. Wall\-clock is $\sim$5 hours on the RTX 5090\.

#### B3: Single LLM, one\-shot\.

A single `claude-sonnet-4-6` call at temperature 0\.9 and `max_tokens=8000`\. The system prompt enumerates the AlphaGen operator set, fixes the 5\-day open\-to\-open target, and prohibits future\-leaking constructs\. The user message is "design 50 alpha factor expressions, output JSON with key alphas"\. No iteration, no feedback\.

#### B4: Single LLM, iterative\.

B3 extended to 10 rounds of 20 alphas per round\. After each round, the top\-3 and bottom\-3 by train\-segment IC are appended to the next user message\. Pearson correlations between expressions are computed via `batch_pearsonr` on the `evaluate_alpha` output\.

#### B5: Alpha101\.

The 30\-alpha OHLCV\-only subset of WorldQuant Alpha101 \[[14](https://arxiv.org/html/2606.29194#bib.bib14)\], namely `alpha_1` through `alpha_30`\. No training\. `alpha_15` through `alpha_30` are evaluated with the `indneutralize`/`IndClass` steps skipped, since the required industry classifier is not part of the public dataset\.

#### B6: Frozen libraries\.

The Agora repository is cloned to a sibling directory; `wiki/` and `runs/` are emptied; every `registry.json` is reset to the 64 builtin skills with 0 trial and 0 accepted entries\. Every library’s `add`, `modify`, `promote`, `demote`, and `auto_manage` method is monkey\-patched to a no\-op\. The run sets `enable_incremental_ppo=False`\. The run lasts 5 outer rounds, $\sim$1\.5 hours each\. B6 is the ablation that isolates the contribution of substrate\-local promotion \(C4\): channels A–C and the forced\-segment evaluator are intact, only the libraries cannot evolve\.

#### B7: Random search\.

3000 random expression trees, maximum depth 4, sampled uniformly from the AlphaGen operator set with the same constants and lookback set $\Delta t \in \{1, 5, 10, 20, 40\}$\. Fitness is identical to B1 \(per\-day rank IC averaged over the train segment\)\. The top\-30 are selected by signed IC\.

#### Omitted baseline \(DSR\)\.

Deep Symbolic Regression \[[29](https://arxiv.org/html/2606.29194#bib.bib29)\] was originally planned as an additional baseline\. We replaced it with B7 for two reasons: DSR is dominated by AlphaGen\-PPO in the AlphaGen paper’s own ablations \[[51](https://arxiv.org/html/2606.29194#bib.bib51)\], and installing the `dso` package on Windows requires a Cython/TF toolchain that is not part of the reference environment\.

### 4\.5 Per\-Alpha and Portfolio Metrics

For each candidate alpha $f$, the external backtest returns the following per\-alpha quantities\. The per\-alpha information coefficient $\text{IC}(f)$ is the time\-series mean of cross\-sectional Spearman rank correlation between $f_t$ and the next\-period return\. The per\-alpha Sharpe is the annualized ratio of mean to standard deviation of the long\-short decile return series of $f$\. Decile monotonicity is the Spearman rank correlation between the decile index $1, \dots, 10$ and the realized mean returns of the deciles, taking values in $[-1, 1]$\. Average turnover is the mean across rebalance dates of the L1 norm of weight changes divided by 2\.
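
The per\-alpha definitions above translate into short Python functions\. This is a dependency\-free sketch, not the code behind `simple_backtest_layered`; the function names and the list\-of\-lists input layout are assumptions\.

```python
import math


def _ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    norm = math.sqrt(sum((a - mx) ** 2 for a in rx)
                     * sum((b - my) ** 2 for b in ry))
    return cov / norm if norm else 0.0


def rank_ic(signal_by_day, return_by_day):
    """Time-series mean of the daily cross-sectional Spearman correlation."""
    daily = [spearman(s, r) for s, r in zip(signal_by_day, return_by_day)]
    return sum(daily) / len(daily)


def decile_monotonicity(decile_mean_returns):
    """Spearman correlation between decile index and realized mean return."""
    index = list(range(1, len(decile_mean_returns) + 1))
    return spearman(index, decile_mean_returns)


def average_turnover(weights_by_rebalance):
    """Mean over rebalance dates of the L1 weight change divided by 2."""
    changes = [
        sum(abs(b - a) for a, b in zip(prev, cur)) / 2
        for prev, cur in zip(weights_by_rebalance, weights_by_rebalance[1:])
    ]
    return sum(changes) / len(changes)
```

Dividing the L1 change by 2 counts each unit of weight moved once rather than once on the sell side and once on the buy side\.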

For each method, the reported portfolio is formed by equal\-weight z\-score combination of the top\-30 alphas selected on train\-segment IC\. Portfolio quantities \(Sharpe, IC, excess return\) are computed by passing the composite signal back through the same external backtest\. Excess statistics are computed against the equal\-weight CSI 1000 benchmark return series\.
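
A minimal Python sketch of the composite for a single day follows, assuming each alpha is a list of per\-stock values with no missing entries\. The names `zscore` and `composite` and the use of the population standard deviation are illustrative choices that the source does not specify\.

```python
import statistics


def zscore(values):
    """Cross-sectional z-score of one alpha on one day."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def composite(alpha_values, train_ic, top_k=30):
    """Equal-weight z-score composite of the top_k alphas by train IC.

    alpha_values: {alpha name: per-stock values on one day}
    train_ic:     {alpha name: train-segment IC}
    """
    chosen = sorted(train_ic, key=train_ic.get, reverse=True)[:top_k]
    scored = [zscore(alpha_values[name]) for name in chosen]
    n_stocks = len(scored[0])
    return [sum(s[i] for s in scored) / len(scored) for i in range(n_stocks)]
```

Standardizing before averaging keeps an alpha with a large raw scale from dominating the composite\.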

### 4\.6 Significance Testing

All comparisons are between Agora and one baseline at a time on the holdout window 2026\-01 – 2026\-05\.

#### Primary test\.

For each comparison method $X$, define the daily difference of composite portfolio log\-returns $d_t = r^{\text{Agora}}_t - r^{X}_t$\. Under the one\-sided null $H_0: \mathbb{E}[d_t] \leq 0$, we report a Newey–West HAC $t$\-statistic \[[26](https://arxiv.org/html/2606.29194#bib.bib26)\] with lag $L_{\max} = 5$ and Bartlett kernel:

$$\hat{\sigma}^{2}_{\text{NW}} \;=\; \hat{\gamma}_{0} \;+\; 2\sum_{L=1}^{5}\left(1 - \tfrac{L}{L_{\max}+1}\right)\hat{\gamma}_{L}, \qquad t_{\text{NW}} \;=\; \frac{\bar{d}}{\hat{\sigma}_{\text{NW}}/\sqrt{T}},$$

where $\hat{\gamma}_{L}$ is the sample autocovariance of $d_t$ at lag $L$\. The lag is chosen to match the 5\-day rebalance, which is the dominant serial\-correlation horizon in the difference series\. The portfolio test is the appropriate inferential statement here: the 30 per\-alpha Sharpes share the same 91\-day window and the same universe and are correlated through common factor exposures, so a cross\-sectional test that treats them as i\.i\.d\. is anti\-conservative\.
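
The statistic can be computed directly from the difference series\. The Python sketch below follows the formula above with a $1/T$ normalization of the autocovariances; that normalization and the standard\-normal reference distribution used for the $p$\-value are assumptions, since the source specifies neither\.

```python
import math


def newey_west_t(d, max_lag=5):
    """HAC t-statistic for the mean of d using a Bartlett kernel."""
    T = len(d)
    mean = sum(d) / T
    dev = [x - mean for x in d]

    def autocov(lag):
        # Sample autocovariance at the given lag, normalized by T.
        return sum(dev[t] * dev[t - lag] for t in range(lag, T)) / T

    variance = autocov(0)
    for lag in range(1, max_lag + 1):
        weight = 1.0 - lag / (max_lag + 1)  # Bartlett weight
        variance += 2.0 * weight * autocov(lag)
    return mean / math.sqrt(variance / T)


def one_sided_p(t_stat):
    """Upper-tail p-value under a standard normal approximation."""
    return 0.5 * math.erfc(t_stat / math.sqrt(2.0))
```

Under the normal approximation, $t = 2.58$ corresponds to a one\-sided $p$ of about 0\.005, and with `max_lag=0` the function reduces to the ordinary $t$\-ratio with a $1/T$ variance\.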

#### Secondary descriptive tests\.

Two further tests are reported as descriptive supplements rather than inferential statements about the portfolio\. A two\-sample bootstrap with 10 000 resamples is run on the difference of medians of the per\-alpha Sharpe distributions of Agora and $X$\. A one\-sided Mann–Whitney $U$ test is run on the same per\-alpha Sharpe distributions\. These tests measure whether the median quality of an Agora alpha differs from the median quality of an $X$ alpha, not whether the portfolios differ\.
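
Both descriptive tests can be sketched with the Python standard library\. The resampling scheme \(independent resampling of each pool\) and the function names are assumptions; the $p$\-value for $U$ is omitted because the source does not state which reference distribution it uses\.

```python
import random
import statistics


def bootstrap_median_diff(a, b, n_resamples=10_000, seed=0):
    """Bootstrap median(a) - median(b); returns (point, lower, upper).

    The bounds are the 2.5% and 97.5% quantiles of the resampled differences.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(statistics.median(resampled_a)
                     - statistics.median(resampled_b))
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    point = statistics.median(a) - statistics.median(b)
    return point, lower, upper


def mann_whitney_u(a, b):
    """U statistic for a: pairs with a_i > b_j, ties counted as one half."""
    return sum((x > y) + 0.5 * (x == y) for x in a for y in b)
```

`mann_whitney_u` ranges from 0 to `len(a) * len(b)`; a value at the upper end means every alpha in the first pool outranks every alpha in the second\.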

## 5 Results

### 5\.1 Headline Comparison

Table [1](https://arxiv.org/html/2606.29194#S5.T1) reports the comparison of Agora against the seven baselines on the 5\-month holdout segment \(2026\-01 to 2026\-05\) that no LLM observed during training\. Each method produces a top\-30 alpha selection under a single rule \(highest train\-segment IC\), and each top\-30 set is equal\-weight $z$\-score\-composited and scored by the same layered backtest \(10 deciles, 5\-day rebalance, after\-cost 9 bps one\-way\)\. Agora’s composite attains a holdout portfolio Sharpe of $+1.87$, the highest of any method\. The closest single\-seed competitor is AlphaGen\-PPO \(B2\) at $+1.334$ \(seed=42\), a margin of $+0.54$ Sharpe units\. The cross\-seed mean for B2 is materially lower and is reported in Section [5\.4](https://arxiv.org/html/2606.29194#S5.SS4)\. Figure [2](https://arxiv.org/html/2606.29194#S5.F2) shows the comparison as a bar chart; Figure [3](https://arxiv.org/html/2606.29194#S5.F3) shows the cumulative long/short NAV trajectories on the holdout window\.

The per\-alpha median Sharpe \(a method\-level summary insensitive to portfolio construction\) tells the same story: Agora at $+1.06$, Alpha101 \(B5\) at $+0.541$ as the closest non\-Agora value, and four of the seven baselines negative\. Agora also leads on portfolio IC \($+0.0894$\) and decile monotonicity \($+0.285$\), and on the long\-short annualized return \($+48.4\%$, roughly 13 percentage points above B2’s $+35.4\%$\)\.

![Refer to caption](https://arxiv.org/html/2606.29194v1/x2.png)

Figure 2: Holdout portfolio Sharpe ratio for each method\. Top\-30 alphas are equal\-weight $z$\-score\-composited and scored by the same layered backtest; selection uses train\-segment IC\. Agora \($+1.87$\) leads the seven baselines\. The closest single\-seed competitor is AlphaGen\-PPO at seed=42 \($+1.334$\); the cross\-seed B2 mean is $-0.755$ \(Section [5\.4](https://arxiv.org/html/2606.29194#S5.SS4)\)\. The frozen\-libraries ablation B6 \($-0.38$\) sets an upper bound of $+2.25$ Sharpe units on the combined contribution of skill\-library evolution and the PPO relay; Section [5\.8](https://arxiv.org/html/2606.29194#S5.SS8) decomposes this gap\.

![Refer to caption](https://arxiv.org/html/2606.29194v1/x3.png)

Figure 3: Cumulative long/short NAV on the holdout segment \(2026\-01\-06 to 2026\-05\-27, 91 trading days, after\-cost, rebased to 1\.0 at day 1\) under the train\-IC selection rule\. Agora terminates at $+10.5\%$, AlphaGen\-PPO \(seed=42\) at $+6.7\%$, and every other baseline at or below 1\.0\. The Agora trajectory is the steadiest of the eight, consistent with its higher portfolio IC \($+0.089$\) and decile monotonicity \($+0.285$\) reported in Table [1](https://arxiv.org/html/2606.29194#S5.T1)\.

Table 1: Holdout \(2026\-01 to 2026\-05, never seen by any LLM\) comparison of Agora against seven baselines\. Top\-30 alphas selected by train\-segment IC and scored by the same layered backtest \(10 deciles, 5\-day rebalance, after\-cost\)\. Per\-alpha numbers are medians across the 30 alphas; portfolio numbers are for the equal\-weight $z$\-score composite\. “Ann\. Return” is the long\-short annualized return of the composite \(top\-decile minus bottom\-decile, daily compounding\), the standard market\-neutral return metric for a decile\-rank long\-short portfolio\. Single\-seed numbers are reported for parity with the AlphaGen reporting convention; cross\-seed results appear in Table [3](https://arxiv.org/html/2606.29194#S5.T3) and Newey\-West HAC tests in Section [5\.5](https://arxiv.org/html/2606.29194#S5.SS5)\. Agora leads every baseline on long\-short Sharpe, annualized return, portfolio IC, and decile monotonicity\. The full 100\-round Agora run was performed at a single seed; full\-system seed variance is uncharacterized \(Section [5\.4](https://arxiv.org/html/2606.29194#S5.SS4)\)\.

### 5\.2 Decile Decomposition

Table [1](https://arxiv.org/html/2606.29194#S5.T1) reports the long\-short annualized return as the portfolio\-level return metric, which is the appropriate number for a decile\-rank long\-short composite\. We disclose the per\-decile breakdown for completeness, since the long\-short return averages over a top decile and a bottom decile that contribute very differently\. Agora’s signal is short\-side concentrated\. The bottom\-decile \(G1\) annualized return is $-27.39\%$ on the holdout segment; the top\-decile \(G10\) annualized return is $+6.46\%$; the long\-short annualized return is $+48.4\%$, the highest of any method and 13 percentage points above B2’s $+35.4\%$\. The equal\-weighted benchmark returned $+9.75\%$ over the same window\. Agora’s long\-only top decile thus earns a positive $+6.46\%$ in absolute terms but trails the benchmark by 3\.29 percentage points, while B2’s top decile beats the benchmark by roughly 2\.5 percentage points \($+12.3\%$ vs\. $+9.75\%$\)\. The long\-short ordering reverses this: B2’s weaker short side gives it long\-short Sharpe and IC below Agora’s despite a stronger long leg\. The signal reported here is intended for long\-short or short\-extension deployment, the setting in which the headline numbers of Table [1](https://arxiv.org/html/2606.29194#S5.T1) apply\.

### 5\.3 Robustness to Top\-$K$ Choice

Table [2](https://arxiv.org/html/2606.29194#S5.T2) reports holdout portfolio Sharpe for $K \in \{10, 20, 30, 50\}$ under the same train\-IC selection rule\. Agora leads at every $K$\. The margin is small at $K = 10$ \($+3.78$ for Agora vs\. $+3.45$ for B6, a margin of $+0.33$\) and large at $K = 30$ \($+1.87$ vs\. $-0.38$, a margin of $+2.25$\)\. Three baselines \(B5, B6, B7\) are positive at $K = 10$ but flip negative at $K = 30$, indicating that their top\-10 alphas carry the portfolio while alphas 11–30 dilute it\. Agora retains positive Sharpe at $K = 30$\. The result suggests that evolution increases the depth of the usable alpha pool rather than the quality of the very best alphas\.

Table 2: Robustness of holdout portfolio Sharpe to the choice of top\-$K$\. Agora leads at every $K$, but the gap shrinks substantially at $K = 10$: the top\-10 alphas of B6 \(frozen libraries\) achieve Sharpe $+3.45$ vs\. Agora’s $+3.78$\. The Agora advantage at $K = 30$ comes primarily from the *depth* of the alpha pool \(alphas 11–30 of Agora remain useful; alphas 11–30 of B6 dilute the portfolio sharply, dropping it from $+3.45$ to $-0.38$\)\. This is consistent with the evolution narrative: the metric\-library evolution increases the depth of usable alphas without changing the very best 10\.

### 5\.4 Cross\-Seed Robustness

The headline numbers in Table [1](https://arxiv.org/html/2606.29194#S5.T1) are single\-seed results for every method, consistent with the AlphaGen reporting convention\. Because B2 \(AlphaGen\-PPO\) is the strongest non\-Agora baseline and B6 \(frozen libraries\) carries the central ablation result, B2 was re\-run at two additional seeds \(0 and 1\) on top of seed=42, and B6 at one additional seed \(0\)\. The full 100\-round Agora run was performed at a single seed\. A single full Agora run consumes approximately 60 GPU\-hours plus 4,500 LLM calls, and the budget for this draft did not allow a second seed\. Agora’s full\-system seed variance is therefore uncharacterized and is treated as a limitation in Section [6](https://arxiv.org/html/2606.29194#S6)\.

Table 3: Multi\-seed summary for the strongest non\-Agora baseline \(B2 AlphaGen\-PPO\) and the ablation \(B6 frozen libraries\)\. Each row is mean $\pm$ standard deviation across the seeds we ran \(results reported earlier in the paper use seed=42\)\. PPO’s portfolio Sharpe is highly seed\-unstable: a swing of more than $4$ Sharpe units is observed between seeds, with mean below zero\. The frozen\-libraries ablation has lower variance but still falls far short of Agora’s $+1.87$ portfolio Sharpe at both seeds\.

B2’s portfolio Sharpe swings from $+1.334$ \(seed=42\) to $-3.06$ \(seed=0\) on the same holdout segment under the same training configuration: a 4\.4 Sharpe\-unit gap attributable to the random seed of the PPO learner alone\. The mean across the three seeds is $-0.755$ with standard deviation $2.20$\. Under any cross\-seed reporting convention, B2’s mean Sharpe falls roughly 2\.6 units below Agora’s single\-seed value of $+1.87$\. B6 exhibits smaller variance \($\sigma = 0.563$ across two seeds, mean $+0.019$ under the train\-IC selection rule\); its mean falls 1\.85 Sharpe units short of Agora’s headline\.

The headline $+1.87$ is one draw from a stochastic process whose variance is not measured\. A second full\-system Agora run is the most important deferred experiment \(Appendix [A\.9](https://arxiv.org/html/2606.29194#A1.SS9)\)\. The three B2 seeds already reported establish that the seed=42 headline is not representative of B2’s mean behavior\.

### 5\.5 Significance Testing

The primary significance test is a one\-sided Newey\-West HAC $t$\-test \[[26](https://arxiv.org/html/2606.29194#bib.bib26)\] on the daily difference $d_t = r^{\text{Agora}}_t - r^{X}_t$ between Agora’s composite portfolio return and each baseline’s, at lag 5, testing $H_0: \mathbb{E}[d_t] \leq 0$\. Table [4](https://arxiv.org/html/2606.29194#S5.T4) reports the statistic, $p$\-value, and verdict for each comparison\.

Table 4: Newey\-West HAC $t$\-tests of Agora versus each baseline on holdout daily portfolio return differences \(lag 5, 91 daily observations, one\-sided $H_0: \mathbb{E}[d_t] \leq 0$\)\.

Two of the seven comparisons reach conventional significance: Agora versus B1 at $\alpha = 0.01$ \($t = +2.58$, $p = 0.005$\), and Agora versus B7 at $\alpha = 0.05$ \($t = +1.70$, $p = 0.045$\)\. The B4 comparison is on the borderline \($t = +1.64$, $p = 0.051$\)\. The remaining four comparisons \(B2, B3, B5, B6\) fail to reject $H_0$ on the 91\-day window\. The Agora\-vs\-B2 comparison in particular yields $t = +0.19$, $p = 0.42$, despite a 0\.54\-Sharpe\-unit point\-estimate gap, because the within\-window variance of the daily portfolio return difference is too large for a 91\-day sample to distinguish\.

Table 5: Annualized portfolio Sharpe with Newey-West HAC 95% CIs (kernel lag 5). Computed from the real long/short cumulative NAV series on the 91-day holdout window. With only 91 trading days, individual Sharpe CIs are wide for every method; the within-paper relative ordering is more reliable than the absolute levels. The pairwise Newey-West $t$-test of Agora vs. each baseline on the daily return *difference* (also at lag 5) is the primary significance statistic and is reported alongside.

The individual NW 95% confidence intervals reported in Table [5](https://arxiv.org/html/2606.29194#S5.T5) on the underlying portfolio Sharpe values are wide for every method on this 91-day window: Agora at $[-2.12, +6.10]$, B2 at $[-1.97, +4.37]$, B6 at $[-4.27, +3.65]$. Most of these intervals include zero. The relative ordering across methods (Agora point estimate above every baseline; B2 and B3 the only two with positive point estimates; five baselines below zero) is more reliable on this short window than the absolute levels. A larger holdout window is the cleanest path to sharpening the NW HAC conclusions and is left for future work.

The cross-seed comparison (Section [5.4](https://arxiv.org/html/2606.29194#S5.SS4)) is the stronger separator between Agora and B2. Agora’s single-seed Sharpe is $+1.87$; B2’s three-seed mean is $-0.755$ ($\sigma = 2.20$); the $+2.6$-unit gap on the cross-seed mean is more than four times the across-seed Sharpe standard deviation observed in Agora’s nearest known comparator, B6 ($\sigma = 0.563$). The Newey-West test on a single 91-day window does not capture this seed dimension.

Table 6: Paired bootstrap significance tests of Agora versus each baseline. Statistic: difference in median per-alpha holdout Sharpe ratio. $10{,}000$ resamples; 95% CIs are quantile intervals on the bootstrap distribution of the difference. The Mann-Whitney $U$ test is one-sided (Agora $>$ baseline). All differences are positive and significant at the 5% level.

The paired bootstrap and Mann-Whitney $U$ test on per-alpha Sharpe values in Table [6](https://arxiv.org/html/2606.29194#S5.T6) are reported as descriptive statistics on the shape of the alpha-pool distributions. They treat the 30 per-alpha Sharpe values as i.i.d., which they are not (same 91-day window, same universe, correlated factor exposures), and the resulting $p$-values are anti-conservative. Figure [4](https://arxiv.org/html/2606.29194#S5.F4) visualizes the underlying distributions.
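The bootstrap statistic described for Table 6 can be sketched as below. This is a minimal illustration under stated assumptions: "paired" is interpreted as applying the same resampled indices to both alpha pools, the interval is a plain 2.5%/97.5% quantile interval, and the function name `paired_bootstrap_median_diff` and the input values are hypothetical rather than taken from the paper.

```python
import numpy as np


def paired_bootstrap_median_diff(a, b, n_resamples=10_000, seed=0):
    """Return (point estimate, CI low, CI high) for median(a) - median(b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape or a.ndim != 1:
        raise ValueError("a and b must be 1-D arrays of equal length")
    rng = np.random.default_rng(seed)
    n = a.size
    # Paired resampling: the same resampled indices are applied to both pools.
    idx = rng.integers(0, n, size=(n_resamples, n))
    diffs = np.median(a[idx], axis=1) - np.median(b[idx], axis=1)
    lo, hi = np.quantile(diffs, [0.025, 0.975])
    point = float(np.median(a) - np.median(b))
    return point, float(lo), float(hi)


# Hypothetical per-alpha Sharpe values (30 per method), not the paper's data.
rng = np.random.default_rng(1)
agora_like = rng.normal(1.0, 1.0, size=30)
baseline_like = rng.normal(0.0, 1.0, size=30)
point, lo, hi = paired_bootstrap_median_diff(agora_like, baseline_like)
print(f"median difference = {point:+.2f}, 95% CI = [{lo:+.2f}, {hi:+.2f}]")
```

Resampling alphas this way inherits the i.i.d. assumption criticized above, so the resulting interval is best read as descriptive.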

![Refer to caption](https://arxiv.org/html/2606.29194v1/x4.png)

Figure 4: Per-alpha holdout Sharpe distribution for the top-30 alphas of each method, selected by train-segment IC. Agora median is $+1.06$. AlphaGen-PPO (seed=42) is the only baseline with median above zero. These distributions are descriptive; the primary significance test uses portfolio-level returns (Section [5.5](https://arxiv.org/html/2606.29194#S5.SS5)).

### 5.6 Per-Baseline Failure Modes

Each baseline fails in a different way, diagnostic of a specific architectural choice\.

#### B1 \(Genetic Programming\)\.

The gplearn search converged to a single algebraic class: 30 mathematically equivalent variants of $(\mathit{volume} + \mathit{high})/c$ for various negative constants $c$. Training IC is positive ($+0.0384$, $t = 14.9$); even holdout IC remains $+0.037$. Holdout Sharpe is $-1.74$ and decile monotonicity is $-0.73$: the predictor sign reverses out-of-sample. A single algebraic factor with no diversity is dominated by the regime in which it lives.

#### B2 \(AlphaGen pure PPO\)\.

PPO at seed=42 finds a respectable single-seed portfolio (Sharpe $+1.334$), but the per-alpha IC distribution straddles zero (median $-0.021$, with eleven of thirty alphas having $\mathrm{IC} > 0$). The portfolio compensates via decile bucketing, and the alphas are not all directionally consistent. Across three seeds the portfolio Sharpe ranges from $-3.06$ to $+1.334$, and the mean falls to $-0.755$ ($\sigma = 2.20$). The seed=42 headline is therefore a favourable draw rather than a stable result.

#### B3 \(Single LLM, one\-shot\)\.

Portfolio Sharpe of $+0.657$ is misleadingly close to AlphaGen-PPO, but portfolio IC is $-0.052$ and the per-alpha Sharpe median is $+0.086$, near zero. The composite gets lucky on this regime. Without a feedback mechanism there is no reason to expect generalization.

#### B4 \(Single LLM, iterative\)\.

Across ten rounds the LLM’s training IC climbs steadily from $+0.038$ to $+0.053$. Holdout per-alpha Sharpe median is $-0.676$ and holdout monotonicity is $-0.70$, indicating reversed bucketing. The single-agent iterative configuration learns to game the training metric.

#### B5 \(Alpha101\)\.

The handcrafted set was authored against a different market structure roughly a decade ago. On the 2026 holdout window, holdout Sharpe is $-0.82$ and excess return is $-22\%$. Nothing in the set adapts to the low-dispersion 2026 regime.

#### B6 \(Frozen\-library ablation\)\.

B6 shares Agora’s architecture (five agents, LLM closed loop, Wiki, evaluator panel). With mutation disabled and only the 64 builtin skills available, holdout Sharpe falls to $-0.379$, which is $2.25$ Sharpe units below Agora. B6 also disables the PPO relay, conflating two mechanisms; the decomposition ablations of Section [5.8](https://arxiv.org/html/2606.29194#S5.SS8) disentangle them.

#### B7 \(Random search\)\.

After 3,000 random expression trees, the top-30 by training IC achieves holdout Sharpe $-0.683$. Uniform sampling over the operator vocabulary trails every directed method.

### 5.7 Locus of Evolution

Agora carries eight skill libraries. Across R1 to R100, evolution concentrated in exactly one of them. Table [7](https://arxiv.org/html/2606.29194#S5.T7) reports the state of each library at R100.

Table 7: Skill-library state after R100. The metric library is the only library to receive substantial evolution; the operator, network, reward, RL-algorithm, topic, rubric, and meta-rubric libraries remained at their builtin sets.

The `evaluation_miner` agent proposed 80 candidate metrics across R1 to R100. Two of them were promoted from trial to accepted.

`monotonicity_score_v1` (R21, `avg_pred_corr` $= +0.47$ at promotion, $+0.158$ at R100, $n_{\mathrm{obs}} = 32$) is the Spearman correlation between decile rank and decile annualized return. It encodes the constraint that IC alone is inadequate and that monotonic decile bucketing is required. Promotion was agent-initiated: the evaluation-miner issued an explicit `promote` action at R21, which the factor-metrics-evaluator panel approved on qualitative grounds. The cumulative `avg_pred_corr` at R100 is $+0.158$, below the $|0.3|$ auto-management threshold; the metric became less correlated with downstream Sharpe in later rounds as the alpha pool composition shifted. The empirical promotion mechanism would not have promoted this metric on the R100 evidence; the implication for the promotion rule is recorded as a limitation in Section [6](https://arxiv.org/html/2606.29194#S6).
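The definition above (Spearman correlation between decile rank and decile annualized return) can be sketched in a few lines. This is an illustrative reading of the description, not the metric's actual source: the function name `monotonicity_score` is hypothetical, the decile returns are made-up values, and ties between decile returns are assumed absent.

```python
import numpy as np


def monotonicity_score(decile_returns):
    """Spearman correlation between decile rank (1..K) and decile return.

    Assumes decile_returns[0] is the lowest-signal decile and that no two
    deciles have exactly the same return (no tie handling).
    """
    r = np.asarray(decile_returns, dtype=float)
    decile_rank = np.arange(1, r.size + 1)
    # Rank of each decile's return, 1 = lowest return.
    return_rank = r.argsort().argsort() + 1
    # Spearman correlation is the Pearson correlation of the two rank vectors.
    return float(np.corrcoef(decile_rank, return_rank)[0, 1])


# Hypothetical annualized returns for ten deciles, lowest signal first.
example = [-0.20, -0.12, -0.08, -0.10, -0.02, 0.00, -0.01, 0.03, 0.02, 0.05]
print(f"monotonicity = {monotonicity_score(example):+.3f}")
```

A perfectly increasing decile profile scores $+1$ and a perfectly reversed one scores $-1$, which is why strongly negative values elsewhere in this section are read as reversed bucketing.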

`excess_drawdown_penalty_v1` (R50, `avg_pred_corr` $= +0.557$ at promotion, $n_{\mathrm{obs}} = 26$) applies a non-linear penalty when relative-to-benchmark drawdown is worse than $-20\%$, plus a turnover-cost discount. It encodes the constraint that high IC paired with high turnover is unprofitable after costs. Promotion was empirical: the metric crossed the $|\texttt{avg\_pred\_corr}| > 0.3$ threshold with $n_{\mathrm{obs}} = 26 \geq 3$ and was auto-promoted. The post-hoc holdout correlation $r = +0.41$ ($n = 26$) is consistent with the train-Sharpe correlation observed at promotion.
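The empirical promotion check described in the two paragraphs above reduces to a threshold test, sketched here under the assumption that it is exactly $|\texttt{avg\_pred\_corr}| > 0.3$ with at least 3 observations. The `TrialMetric` class and `passes_auto_promotion` function are hypothetical names; the numeric values are the ones quoted in the text, with $n_{\mathrm{obs}} = 32$ assumed to apply to the R100 reading. The agent-initiated route is not modeled.

```python
from dataclasses import dataclass


@dataclass
class TrialMetric:
    name: str
    avg_pred_corr: float  # cumulative correlation with downstream Sharpe
    n_obs: int            # number of observations behind that correlation


def passes_auto_promotion(metric, corr_threshold=0.3, min_obs=3):
    """Empirical route only: strong enough correlation on enough evidence."""
    return abs(metric.avg_pred_corr) > corr_threshold and metric.n_obs >= min_obs


drawdown_at_r50 = TrialMetric("excess_drawdown_penalty_v1", 0.557, 26)
monotonicity_at_r100 = TrialMetric("monotonicity_score_v1", 0.158, 32)

print(passes_auto_promotion(drawdown_at_r50))       # True
print(passes_auto_promotion(monotonicity_at_r100))  # False
```

The second result mirrors the observation that the R100 evidence alone would not have promoted `monotonicity_score_v1`.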

The remaining 78 trial metrics did not pass the auto-management threshold; they remain in the library and are sampled with $30\%$ probability under the $\varepsilon$-greedy policy. Both promoted metrics are standard quantitative-finance constructs rather than novel inventions. The contribution is the rediscovery process from a deliberately sparse builtin set of four (IC, IR, stability, turnover).

Figure [5](https://arxiv.org/html/2606.29194#S5.F5) shows the cumulative count of alphas passing the backtest threshold across R1 to R100, with the two promotion events marked.

![Refer to caption](https://arxiv.org/html/2606.29194v1/x5.png)

Figure 5: Evolution timeline for Agora across R1 to R100. Blue line: cumulative alphas passing the backtest threshold. Red bars: per-round pass rate. Green dashed lines mark the two promotion events (`monotonicity_score_v1` at R21, `excess_drawdown_penalty_v1` at R50). The cumulative pass curve rises in rounds following each promotion.

Table 8: Cost sensitivity of the Agora top-30 composite on the holdout segment. Default cost ($1\times$) is 0.04% commission (double side) plus 0.05% stamp tax = 9 bps one-way. The layered backtest is re-run with a parametrized cost rate; all other configuration (10 deciles, 5-day rebalance, train-IC selection of top-30) is held constant. Long-short (LS) Sharpe remains positive across all cost multipliers tested; the long-only top-decile remains marginal at $1\times$ and deteriorates rapidly with cost.

Cost sensitivity for the Agora composite is reported in Table [8](https://arxiv.org/html/2606.29194#S5.T8) for cost multipliers $1\times$, $2\times$, $3\times$, $5\times$ on the default 9 bps one-way rate. The long-short Sharpe stays between $+1.87$ and $+1.92$ across all multipliers because long and short turnover are similar and the cost drag affects both legs. The long-only top-decile Sharpe drops from $+0.345$ at $1\times$ to $-0.064$ at $5\times$. The signal is tractable as a long-short strategy under realistic cost assumptions and fragile long-only.

### 5.8 Ablation Decomposition

The B6 ablation in Section [5.1](https://arxiv.org/html/2606.29194#S5.SS1) disables both skill-library evolution and the PPO relay. The $+2.25$ Sharpe gap between Agora ($+1.87$) and B6 ($-0.379$) is therefore an upper bound on the combined contribution of these two mechanisms together with any third mechanism (most prominently round count). Two decomposition ablations isolate the components.

#### B6 \+ augmented builtins\.

The two LLM-discovered metrics (`monotonicity_score_v1` and `excess_drawdown_penalty_v1`) are pre-loaded into the builtin set of the otherwise-frozen-library pipeline, which then runs for 5 outer rounds. This measures the static value of pre-knowing the right metrics. Result: holdout Sharpe $+0.02$, versus B6’s $-0.379$, a $+0.40$ Sharpe-unit lift.

#### B6 \+ PPO relay\.

Skill libraries remain frozen but the incremental PPO relay (15,000 steps per round) is re-enabled, for 5 rounds. This isolates the contribution of the PPO relay alone in the absence of evolution. Result: holdout Sharpe $-1.18$, which is $0.80$ Sharpe units worse than B6 (a change of $-0.80$). The PPO relay alone appears actively harmful in this configuration. The result rests on a single 5-round run; B6’s across-seed standard deviation under the train-IC rule is $\sigma = 0.563$, so a single B6+relay observation could plausibly fall within a couple of $\sigma$ of B6’s mean. The qualitative direction (the relay alone is not helpful) is the more reliable claim; the magnitude $-0.80$ awaits seed replication.

Table 9: Ablation decomposition. B6 variants run 5 outer rounds (B6+aug additionally at 20 rounds), Agora 100 outer rounds. The round-count confound is partially controlled by the 20-round B6+aug row; the inner-loop budget remains uncontrolled.

| Configuration | Holdout Sharpe | Notes |
| --- | --- | --- |
| B6 frozen, 5 rds | $-0.379$ | Baseline ablation |
| B6 + augmented builtins | $+0.02$ | Static-metric value: $+0.40$ |
| B6 + augmented builtins, 20 rds | $-1.96$ | More rounds, frozen libs, drift down |
| B6 + PPO relay, 5 rds | $-1.18$ | Relay alone: $-0.80$ (preliminary, $n = 1$) |
| Agora | $+1.872$ | Complete system |
| $\Delta$ Total | $+2.25$ | Agora $-$ B6 (5 rds) |
| $\Delta$ Static-metric | $+0.40$ | B6+aug (5 rds) $-$ B6 (5 rds) |
| $\Delta$ Agora $-$ B6+aug (5 rds) | $+1.85$ | Evolution + relay + round count combined |
| $\Delta$ Agora $-$ B6+aug (20 rds) | $+3.83$ | Same combination, controlling for round count |

Four findings follow from Table [9](https://arxiv.org/html/2606.29194#S5.T9). (i) The static value of knowing the two metrics is $+0.40$ Sharpe, less than 20% of the $+2.25$ total gap. Pre-coding the two metrics into the builtin set on day one closes less than a fifth of the gap. (ii) The PPO relay alone is harmful rather than helpful in our single observation ($-0.80$ Sharpe vs. B6, magnitude preliminary). RL training without concurrent evolution of the metrics that score the trained alphas does not reproduce Agora’s gain. (iii) Running B6+aug for longer makes it worse: $-1.96$ at 20 rounds vs. $+0.02$ at 5 rounds, single seed each. With libraries frozen, the LLM closed loop cannot update its evaluation criteria in response to new candidates, and the alpha pool drifts toward signals that score well under the fixed metric set without generalizing. This rules out a simple round-count explanation for the Agora-vs-B6+aug gap. (iv) The remaining gap (Agora minus B6+aug at 20 rounds, $+3.83$) is not attributable to round count and is associated with metric-library evolution beyond the two pre-loaded metrics and its interaction with the PPO relay.
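The differences quoted in Table 9 and in findings (i) to (iv) follow directly from the five holdout Sharpe values. The short script below re-derives them as a consistency check; the dictionary keys and the `delta` helper are illustrative names, and the only inputs are the Sharpe values reported in the table.

```python
# Holdout Sharpe values as reported in Table 9.
sharpe = {
    "b6_frozen_5": -0.379,
    "b6_aug_5": 0.02,
    "b6_aug_20": -1.96,
    "b6_relay_5": -1.18,
    "agora": 1.872,
}


def delta(a, b):
    """Sharpe difference a - b, rounded to two decimals as in the table."""
    return round(sharpe[a] - sharpe[b], 2)


total_gap = delta("agora", "b6_frozen_5")          # Agora - B6
static_metric = delta("b6_aug_5", "b6_frozen_5")   # pre-loaded metrics only
relay_alone = delta("b6_relay_5", "b6_frozen_5")   # PPO relay only
residual_5 = delta("agora", "b6_aug_5")            # Agora - B6+aug, 5 rounds
residual_20 = delta("agora", "b6_aug_20")          # Agora - B6+aug, 20 rounds

# Share of the total gap explained by pre-loading the two metrics.
static_share = (sharpe["b6_aug_5"] - sharpe["b6_frozen_5"]) / (
    sharpe["agora"] - sharpe["b6_frozen_5"]
)

print(total_gap, static_metric, relay_alone, residual_5, residual_20)
print(f"static-metric share of total gap: {static_share:.1%}")
```

The static-metric share comes out just under 18%, in line with the "less than 20%" statement in finding (i).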

Three simple replication paths fail. Pre-coding the right metrics on day one, adding only the PPO relay to a frozen-library system, and running the frozen-library system longer all fall short of Agora’s performance. All B6 variants are single-seed runs and the 20-round B6+aug result warrants seed replication; the direction of the attributions (more rounds without library evolution does not help and may hurt; relay alone is not helpful) is more reliable than the magnitudes.

### 5.9 Rolling-Window Robustness

The headline result rests on a single 5\-month window \(2026\-01 to 2026\-05\)\. To partially address whether it reflects a regime\-specific factor loading rather than a generalizable signal, the same Agora top\-30 alphas \(selected once on train\-segment IC, never re\-tuned\) were evaluated across six additional 6\-month windows spanning 2020 to 2025, plus the 2026 holdout\.

Table 10: Rolling-window evaluation of the same Agora top-30 alphas (selected by train-segment IC on 2014–2019; never re-trained for these windows) across multiple regimes. Caveat: 2020–2025 is the test segment that the evaluation-miner had categorical (bucketed) feedback access to during the closed loop; the rolling-window results below are therefore not fully out-of-sample in the strict sense. The 2026H1 row is the only truly never-seen segment. The rolling results characterize how the same 30 alphas perform across regimes that differ from the headline holdout (COVID, rally, decline, multi-year structural shifts). LS = long-short composite; Top = top-decile only; $n$ = trading days.

The 2020–2025 windows fall inside the test segment that the evaluation-miner observed through 4-bucket categorical feedback during the closed loop. They are not strictly out-of-sample in the sense the 2026 holdout is. The rolling table therefore characterizes robustness across regimes (COVID onset, rally, decline, multi-year structural shifts) more than pure out-of-sample generalization.

Long-short Sharpe is positive in 7 of 7 windows (range $+0.013$ to $+4.585$, mean $+1.97$, median $+1.87$). Portfolio IC is positive in 7 of 7 windows (range $+0.030$ to $+0.106$). Decile monotonicity is positive in 5 of 7 windows; it goes mildly negative in 2021 H2 (the rally regime, in which the long-short Sharpe also collapses to $+0.013$) and slightly negative in 2023 H1. The strongest individual window is 2020 H1 (the COVID-onset window) at Sharpe $+4.585$, consistent with the signal benefiting from high cross-sectional dispersion. The 2026 holdout Sharpe of $+1.872$ sits at the median of the rolling distribution.

Two limitations qualify the rolling analysis\. First, the 6\-month windows do not test seed variance; they are deterministic re\-applications of the same 30 alphas\. Second, the 2020–2025 windows shared categorical feedback with the LLM closed loop and are weaker out\-of\-sample evidence than 2026\. Only 2026 H1 is a never\-seen segment in the strict sense\.

### 5.10 Summary

On the 5-month never-seen holdout, Agora attains portfolio Sharpe $+1.87$. The closest single-seed competitor is AlphaGen-PPO at $+1.334$ (seed=42); the cross-seed B2 mean is $-0.755$ ($\sigma = 2.20$). Agora leads on portfolio IC ($+0.0894$) and decile monotonicity ($+0.285$). Excess long-only annualized return is $-3.29\%$, reflecting short-side concentration: the long-short annualized return is $+48.4\%$ (G1 at $-27.39\%$, G10 at $+6.46\%$). Cost sensitivity (Table [8](https://arxiv.org/html/2606.29194#S5.T8)) holds the long-short Sharpe between $+1.87$ and $+1.92$ across $1\times$ to $5\times$ the default 9 bps one-way rate.

Newey-West HAC tests on daily portfolio return differences reject $H_0$ against B1 at $\alpha = 0.01$ and against B7 at $\alpha = 0.05$; the B4 comparison is borderline ($p = 0.051$); comparisons against B2, B3, B5, B6 do not reach conventional significance on the 91-day window. Individual NW 95% CIs are wide for every method (Table [5](https://arxiv.org/html/2606.29194#S5.T5)).

The frozen-libraries ablation (B6 at $-0.379$) gives a $+2.25$ Sharpe-unit gap to Agora. Decomposition (Section [5.8](https://arxiv.org/html/2606.29194#S5.SS8)) attributes $+0.40$ to the static value of pre-loading the two LLM-discovered metrics and $-0.80$ (preliminary, single 5-round run) to the PPO relay alone. The remaining $+1.85$ is attributable to metric-library evolution beyond those two metrics, the relay’s interaction with evolution, and the 100-vs-5 round count. Rolling-window evaluation (Table [10](https://arxiv.org/html/2606.29194#S5.T10)) reports positive long-short Sharpe in 7 of 7 windows and positive portfolio IC in 7 of 7 windows, with the 2026 holdout at the median of the distribution.

## 6 Discussion

The B6 collapse to Sharpe $-0.379$ isolates the joint contribution of F4 (persistent versioned artifact stores) and F5 (substrate-local promotion) at an upper bound of $+2.25$ Sharpe units. Pre-loading the two LLM-discovered metrics into a frozen-library run recovers only $+0.40$ of that gap, less than 20%, so persistence without an outcome-grounded promotion rule does not substitute for the ongoing process that produces metrics. Adding the PPO relay back onto a frozen-library substrate produces $-1.18$ Sharpe, worse than B6 alone: an asymmetric upgrade of the search procedure without a co-evolving evaluator is harmful. The remaining $+1.85$ is associated with further metric evolution, the relay’s interaction with metric evolution, and the 100-vs-5 round-count gap.

The four LLM-based configurations (B3, B4, B6, Agora) form a ladder that progressively restores SJS conditions: B3 violates F1 and F4 and produces a roulette portfolio, B4 restores no conditions and overfits the training metric, B6 satisfies F1–F3 but disables F4–F5, Agora satisfies F1–F5 and leads on every Table [1](https://arxiv.org/html/2606.29194#S5.T1) dimension. F1 (decomposition) was not directly tested by ablation. A single-agent-with-evolution control, in which one LLM instance both proposes alphas and evolves metrics, is the most important missing experiment for the framework and is queued in Appendix [A.9](https://arxiv.org/html/2606.29194#A1.SS9).

The locus of evolution was narrow. All measurable evolution occurred in the metric library; the seven other libraries remained at builtins. The two promoted metrics, `monotonicity_score_v1` and `excess_drawdown_penalty_v1`, were produced by the agent-to-agent closed loop without a fixed target: the builtin set was deliberately sparse, the evaluation-miner proposed candidates freely, and the substrate-local promotion rule decided acceptance on aggregate evidence across rounds. No single agent produced them. Both are standard in quantitative practice.

#### Limitations\.

Five limitations bound the interpretation of these results. (i) The full 100-round Agora run was executed at one seed at a cost of roughly 60 GPU-hours and 4,500 LLM calls. The B6 ablation across two seeds gives $\sigma = 0.563$, so the headline Sharpe should be read as one realization of a stochastic process whose variance is not fully characterized. (ii) The strongest baseline B2 swings 4.4 Sharpe units across three seeds with a cross-seed mean of $-0.755$. The headline table reports B2 at seed=42 for parity with the AlphaGen reporting convention; the cross-seed comparison is more decisive than the single-seed one. (iii) The empirical promotion rule uses training-segment Sharpe as outcome variable, which is also the alpha-miner’s optimization target; this circularity is noted in Section [3.3](https://arxiv.org/html/2606.29194#S3.SS3). One consequence: the first accepted metric (`monotonicity_score_v1`) was promoted via the agent-initiated route, and its `avg_pred_corr` had decayed below the auto-promotion threshold by R100. (iv) The headline number rests on a single 5-month holdout; the rolling-window analysis covers six earlier windows that shared categorical feedback with the closed loop, so only 2026 H1 is strictly out-of-sample. (v) The A2A decomposition is not directly ablated. Methodological caveats, reproducibility, and a full list of queued follow-up experiments are in Appendix [A.9](https://arxiv.org/html/2606.29194#A1.SS9).

## 7 Conclusion

Sealed Joint Search \(SJS\) is a framework for autonomous\-discovery problems in which the scoring function and the artifacts must co\-evolve while the external evaluator remains sealed\. SJS is specified at the information\-flow level by five structural conditions \(F1–F5\) and a boundary condition on the evaluator, and is independent of any particular optimizer, language model, or operator language\.

Agora instantiates SJS as an agent-to-agent LLM system on Chinese A-share alpha mining: five role-specialized agents, three typed channels, eight skill libraries built on AlphaGen, and a substrate-local promotion rule against training Sharpe. On a 91-day 2026 holdout sealed from every LLM-facing input, Agora reaches portfolio Sharpe $+1.87$ against the strongest baseline at $+1.334$ (single seed) and $-0.755$ (cross-seed mean). Freezing the libraries collapses Sharpe to $-0.379$ and pre-loading the two LLM-discovered metrics into the frozen system recovers only $+0.40$ of the gap; the static value of the metrics is therefore a small fraction of the total. The two promoted metrics, `monotonicity_score_v1` and `excess_drawdown_penalty_v1`, are standard quantitative constructs that the substrate rediscovered through F4 and F5 from a deliberately sparse builtin set: neither was designed in, and neither was produced by any single agent.

Two questions remain open: whether F1 \(decomposition\) is necessary, since the single\-agent\-with\-evolution baseline is missing from the ablation grid, and whether F1–F5 transfer to other autonomous\-discovery domains\.

## References

- \[1\] Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Anthropic technical report (2024).
- \[2\] Bailey, D. H. and Borwein, J. M. and López de Prado, M. and Zhu, Q. J. Pseudo-mathematics and financial charlatanism: The effects of backtest overfitting on out-of-sample performance. Notices of the American Mathematical Society 61(5), 458–471 (2014).
- \[3\] Brown, T. B. and Mann, B. and Ryder, N. and Subbiah, M. and Kaplan, J. and Dhariwal, P. and others. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS) (2020).
- \[4\] Carhart, M. M. On persistence in mutual fund performance. The Journal of Finance 52(1), 57–82 (1997).
- \[5\] Chen, M. and Tworek, J. and Jun, H. and Yuan, Q. and Ponde de Oliveira Pinto, H. and Kaplan, J. and others. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
- \[6\] Cui, C. and Wang, W. and Zhang, M. and Chen, G. and Luo, Z. and Ooi, B. C. AlphaEvolve: A learning framework to discover novel alphas in quantitative investment. Proceedings of the 2021 International Conference on Management of Data (SIGMOD), 2208–2216 (2021).
- \[7\] Diebold, F. X. and Mariano, R. S. Comparing predictive accuracy. Journal of Business & Economic Statistics 13(3), 253–263 (1995).
- \[8\] Fama, E. F. and French, K. R. Common risk factors in the returns on stocks and bonds. Journal of Financial Economics 33(1), 3–56 (1993).
- \[9\] Fama, E. F. and French, K. R. A five-factor asset pricing model. Journal of Financial Economics 116(1), 1–22 (2015).
- \[10\] Harvey, C. R. and Liu, Y. and Zhu, H. … and the cross-section of expected returns. The Review of Financial Studies 29(1), 5–68 (2016).
- \[11\] Hong, S. and Zhuge, M. and Chen, J. and Zheng, X. and Cheng, Y. and Zhang, C. and Wang, J. and Wang, Z. and Yau, S. K. S. and Lin, Z. and Zhou, L. and Ran, C. and Xiao, L. and Wu, C. and Schmidhuber, J. MetaGPT: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations (ICLR) (2024).
- \[12\] Hu, S. and Lu, C. and Clune, J. Automated design of agentic systems. arXiv preprint arXiv:2408.08435 (2024).
- \[13\] Jegadeesh, N. and Titman, S. Returns to buying winners and selling losers: Implications for stock market efficiency. The Journal of Finance 48(1), 65–91 (1993).
- \[14\] Kakushadze, Z. 101 Formulaic Alphas. Wilmott 2016(84), 72–81 (2016).
- \[15\] Kaufman, S. and Rosset, S. and Perlich, C. and Stitelman, O. Leakage in data mining: Formulation, detection, and avoidance. ACM Transactions on Knowledge Discovery from Data 6(4), 1–21 (2012).
- \[16\] Koza, J. R. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press (1992).
- \[17\] Lehman, J. and Stanley, K. O. Exploiting open-endedness to solve problems through the search for novelty. Proceedings of the 11th International Conference on Artificial Life (ALIFE) (2008).
- \[18\] Li, G. and Hammoud, H. A. A. K. and Itani, H. and Khizbullin, D. and Ghanem, B. CAMEL: Communicative agents for “mind” exploration of large language model society. Advances in Neural Information Processing Systems (NeurIPS) (2023).
- \[19\] Liang, J. and Huang, W. and Xia, F. and Xu, P. and Hausman, K. and Ichter, B. and Florence, P. and Zeng, A. Code as policies: Language model programs for embodied control. In IEEE International Conference on Robotics and Automation (ICRA) (2023).
- \[20\] Liu, X. and Yu, H. and Zhang, H. and Xu, Y. and Lei, X. and Lai, H. and Gu, Y. and Ding, H. and Men, K. and Yang, K. and Zhang, S. and Deng, X. and Zeng, A. and Du, Z. and Zhang, C. and Shen, S. and Zhang, T. and Su, Y. and Sun, H. and Huang, M. and Dong, Y. and Tang, J. AgentBench: Evaluating LLMs as agents. In International Conference on Learning Representations (ICLR) (2024).
- \[21\] Lu, C. and Lu, C. and Lange, R. T. and Foerster, J. and Clune, J. and Ha, D. The AI scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292 (2024).
- \[22\] López de Prado, M. Advances in Financial Machine Learning. Wiley (2018).
- \[23\] Ma, Y. J. and Liang, W. and Wang, G. and Huang, D. and Bastani, O. and Jayaraman, D. and Zhu, Y. and Fan, L. and Anandkumar, A. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931 (2023).
- \[24\] Madaan, A. and Tandon, N. and Gupta, P. and Hallinan, S. and Gao, L. and Wiegreffe, S. and Alon, U. and Dziri, N. and Prabhumoye, S. and Yang, Y. and Gupta, S. and Majumder, B. P. and Hermann, K. and Welleck, S. and Yazdanbakhsh, A. and Clark, P. Self-Refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems (NeurIPS) (2023).
- \[25\] Magar, I. and Schwartz, R. Data contamination: From memorization to exploitation. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL) (2022).
- \[26\] Newey, W. K. and West, K. D. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55(3), 703–708 (1987).
- \[27\] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
- \[28\] Park, J. S. and O’Brien, J. C. and Cai, C. J. and Morris, M. R. and Liang, P. and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST) (2023).
- \[29\] Petersen, B. K. and Landajuela, M. and Mundhenk, T. N. and Santiago, C. P. and Kim, S. K. and Kim, J. T. Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients. In International Conference on Learning Representations (ICLR) (2021).
- \[30\] Qian, C. and Liu, W. and Liu, H. and Chen, N. and Dang, Y. and Li, J. and Yang, C. and Chen, W. and Su, Y. and Cong, X. and Xu, J. and Li, D. and Liu, Z. and Sun, M. ChatDev: Communicative agents for software development. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) (2024).
- \[31\] Raffin, A. and Hill, A. and Gleave, A. and Kanervisto, A. and Ernestus, M. and Dormann, N. Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research 22(268), 1–8 (2021).
- \[32\] Romera-Paredes, B. and Barekatain, M. and Novikov, A. and Balog, M. and Kumar, M. P. and Dupont, E. and Ruiz, F. J. R. and Ellenberg, J. S. and Wang, P. and Fawzi, O. and Kohli, P. and Fawzi, A. Mathematical discoveries from program search with large language models. Nature 625, 468–475 (2024).
- \[33\] Schick, T. and Dwivedi-Yu, J. and Dessì, R. and Raileanu, R. and Lomeli, M. and Zettlemoyer, L. and Cancedda, N. and Scialom, T. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems (NeurIPS) (2023).
- \[34\] Schmidt, M. and Lipson, H. Distilling free-form natural laws from experimental data. Science 324(5923), 81–85 (2009).
- \[35\] Schulman, J. and Wolski, F. and Dhariwal, P. and Radford, A. and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
- \[36\] Shinn, N. and Cassano, F. and Berman, E. and Gopinath, A. and Narasimhan, K. and Yao, S. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS) (2023).
- \[37\] Stanley, K. O. and Lehman, J. and Soros, L. Open-endedness: The last grand challenge you’ve never heard of. O’Reilly Online (2017).
- \[38\] Vaswani, A. and Shazeer, N. and Parmar, N. and Uszkoreit, J. and Jones, L. and Gomez, A. N. and Kaiser, L. and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pages 5998–6008 (2017).
- \[39\] Wang, R. and Lehman, J. and Clune, J. and Stanley, K. O. Paired open-ended trailblazer (POET): Endlessly generating increasingly complex and diverse learning environments and their solutions. arXiv preprint arXiv:1901.01753 (2019).
- \[40\] Wang, G. and Xie, Y. and Jiang, Y. and Mandlekar, A. and Xiao, C. and Zhu, Y. and Fan, L. and Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291 (2023).
- \[41\] Wang, X. and Chen, Y. and Yuan, L. and Zhang, Y. and Li, Y. and Peng, H. and Ji, H. Executable code actions elicit better LLM agents. In International Conference on Machine Learning (ICML) (2024).
- \[42\] Wei, J. and Wang, X. and Schuurmans, D. and Bosma, M. and Ichter, B. and Xia, F. and Chi, E. and Le, Q. V. and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS) (2022).
- \[43\] White, H. A reality check for data snooping. Econometrica 68(5), 1097–1126 (2000).
- \[44\] Wu, S. and Irsoy, O. and Lu, S. and Dabravolski, V. and Dredze, M. and Gehrmann, S. and Kambadur, P. and Rosenberg, D. and Mann, G. BloombergGPT: A large language model for finance. arXiv preprint arXiv:2303.17564 (2023).
- \[45\] Wu, Q. and Bansal, G. and Zhang, J. and Wu, Y. and Li, B. and Zhu, E. and Jiang, L. and Zhang, X. and Zhang, S. and Liu, J. and Awadallah, A. H. and White, R. W. and Burger, D. and Wang, C. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. In COLM (2024).
- \[46\]Xiao, Y\. and Sun, E\. and Luo, D\. and Wang, W\.\.TradingAgents: Multi\-agents LLM financial trading framework\.arXiv preprint arXiv:2412\.20138\(2024\)\.
- \[47\]Yang, H\. and Liu, X\. and Wang, C\. D\.\.FinGPT: Open\-source financial large language models\.arXiv preprint arXiv:2306\.06031\(2023\)\.
- \[48\]Yao, S\. and Zhao, J\. and Yu, D\. and Du, N\. and Shafran, I\. and Narasimhan, K\. and Cao, Y\.\.ReAct: Synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations \(ICLR\)\(2023\)\.
- \[49\]Yao, S\. and Yu, D\. and Zhao, J\. and Shafran, I\. and Griffiths, T\. L\. and Cao, Y\. and Narasimhan, K\.\.Tree of thoughts: Deliberate problem solving with large language models\.InAdvances in Neural Information Processing Systems \(NeurIPS\)\(2023\)\.
- \[50\]Yu, Y\. and Li, H\. and Chen, Z\. and Jiang, Y\. and Li, Y\. and Zhang, D\. and Liu, R\. and Suchow, J\. W\. and Khashanah, K\.\.FinMem: A performance\-enhanced LLM trading agent with layered memory and character design\.arXiv preprint arXiv:2311\.13743\(2023\)\.
- \[51\]Yu, S\. and Xue, H\. and Ao, X\. and Pan, F\. and He, J\. and Tu, D\. and He, Q\.\.Generating Synergistic Formulaic Alpha Collections via Reinforcement Learning\.InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining \(KDD\), pages 5476–5486 \(2023\)\.
- \[52\]Zelikman, E\. and Wu, Y\. and Mu, J\. and Goodman, N\. D\.\.STaR: Bootstrapping reasoning with reasoning\.InAdvances in Neural Information Processing Systems \(NeurIPS\)\(2022\)\.
- \[53\]Zhang, T\. and Li, Y\. and Jin, Y\. and Li, J\.\.AutoAlpha: an efficient hierarchical evolutionary algorithm for mining alpha factors in quantitative investment\.InProceedings of the AAAI Conference on Artificial Intelligence \(Workshop\)\(2020\)\.

## Appendix A Implementation Details

This appendix records the concrete versions, file paths, hyperparameters, and access tables that allow Agora to be re-run. The communication channels (A, B, C), the five-role schema, and the four sealing enforcement points referenced below are defined in Section [3](https://arxiv.org/html/2606.29194#S3).

### A.1 Stack and Versions

- Python 3.11, PyTorch 2.8 with CUDA 12.9.
- AlphaGen [[51](https://arxiv.org/html/2606.29194#bib.bib51)] as a vendored submodule, used both for the operator/expression types and as B2.
- `anthropic` SDK 0.110, `claude-sonnet-4-6` for all nine LLM clients.
- `stable-baselines3` 2.9 + `sb3-contrib` 2.9 for the PPO relay and B2.
- `gplearn` 0.4.2 for B1.
- Single-machine: AMD Ryzen 9 + NVIDIA RTX 5090.

### A.2 Data Adapter

OHLCV is loaded from a single RiceQuant-exported `.ftr` file (2,782 instruments $\times$ 3,493 trading days, post-adjusted). Universe membership is loaded from a daily-binary parquet (`zz1000_components.parquet`). VWAP is reconstructed as $(H+L+C)/3$ because RiceQuant turnover is not adjusted. Suspension days ($\mathrm{volume}=0$) have all OHLC values overwritten with NaN. Limit-up days are masked out of portfolio entries (a stock cannot realistically be bought at the print).
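The three cleaning rules above can be sketched per instrument-day as follows. This is a minimal illustration, not the adapter's code: the function names `clean_bar` and `can_enter` are hypothetical, and detecting a limit-up day by comparing the close against a supplied limit-up price is an assumption, since the text does not say how limit-up days are identified.

```python
import math

NAN = float("nan")

def clean_bar(open_, high, low, close, volume):
    """Return (open, high, low, close, vwap) for one instrument-day.

    Suspension days (volume == 0) have every price field set to NaN.
    Otherwise VWAP is approximated by the typical price (H + L + C) / 3.
    """
    if volume == 0:
        return (NAN, NAN, NAN, NAN, NAN)
    vwap = (high + low + close) / 3.0
    return (open_, high, low, close, vwap)

def can_enter(close, limit_up_price, tol=1e-9):
    """Hypothetical entry mask: False when the bar closed at its limit-up price.

    How the limit-up price is obtained is an assumption of this sketch.
    """
    if math.isnan(close):
        return False  # suspended names cannot be entered either
    return close < limit_up_price - tol
```

A production adapter would apply the same rules as vectorized panel operations rather than per bar.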

The four\-segment split used throughout: train 2014\-10 to 2019\-12 \(1,277 days\), test 2020\-01 to 2025\-12 \(1,461 days\), holdout 2026\-01 to 2026\-05 \(91 days\), with an internal validation slice carved from the train tail for the PPO relay\.

### A.3 The PPO Relay

When enabled \(it is enabled in the Agora run\), each round runs an incremental 15,000\-step MaskablePPO training continuation seeded from the previous round’s policy weights\. The action space and observation space are identical to AlphaGen’s published configuration\. The relay is disabled in B6 and B7 because it would otherwise mutate the operator/network/RL\-algorithm libraries indirectly through training\.

A separate `force_segment="train"` flag is passed through to the alpha-miner’s evaluation calls, so that no test-segment numerical value can enter the PPO reward path. The PPO test-IC computation writes to a separate audit file (`top_alphas_audit.json`) that is read only by the offline analysis path; the live `top_alphas.json` consumed by the LLM stack contains training IC only.
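The separation between the live file and the audit file can be sketched as a split of each pool record into two views. The record layout (`expr`, `train_ic`), the helper name `split_records`, and the example values are illustrative assumptions; only the field names `test_ic` and `test_rank_ic` and the two file names come from this appendix.

```python
import json

# Test-segment fields that must never reach an LLM-facing path.
TEST_FIELDS = ("test_ic", "test_rank_ic")

def split_records(records):
    """Split pool records into an LLM-facing view and an offline audit view."""
    live, audit = [], []
    for rec in records:
        # Live view: everything except the test-segment fields.
        live.append({k: v for k, v in rec.items() if k not in TEST_FIELDS})
        # Audit view: the expression plus whatever test fields are present.
        audit.append({k: rec[k] for k in ("expr", *TEST_FIELDS) if k in rec})
    return live, audit

# Illustrative record; the numbers are made up for the example.
records = [
    {"expr": "Div(close, open)", "train_ic": 0.031,
     "test_ic": 0.012, "test_rank_ic": 0.015},
]
live, audit = split_records(records)
live_json = json.dumps(live)    # payload destined for top_alphas.json
audit_json = json.dumps(audit)  # payload destined for top_alphas_audit.json
```

Filtering when the files are written, rather than when they are read, means a consumer of the live file has no test-segment field to request.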

### A.4 Residual Leakage Audit

The four\-segment split alone is necessary but not sufficient\. Five distinct paths along which test or holdout values could leak into the LLM\-facing closed loop were audited and sealed:

1. `test_ic`/`test_rank_ic` stripped from every LLM-facing path. `AlphaPoolAdapter` (which seeds the alpha miner from the PPO pool) and `alpha_miner.propose` both filter these fields out of the candidate metadata.
2. Cross-round metric feedback uses categorical buckets only. The `_build_metrics_feedback` routine reports “far above / above / near / far below” rather than numerical Sharpe, IC, annualized return, or drawdown.
3. Test-segment backtest feedback in read-only categorical form. The evaluation-miner calls `simple_backtest_layered` on the test segment (2020–2025) to check whether a proposed alpha holds up beyond the training window. The numerical Sharpe value is never passed to the LLM context: `_build_metrics_feedback` converts it to one of four categorical labels (“far above” $[>+1.5]$, “above” $[+0.5,+1.5]$, “near” $[-0.5,+0.5]$, “far below” $[<-0.5]$, relative to a fixed Sharpe threshold of $+0.5$) before inclusion in the agent’s user message.

   Residual leakage risk. No numerical value from this segment enters the LLM context as a learning signal, but this does not imply the test segment is clean. Over 100 rounds $\times$ ${\sim}45$ LLM calls, the evaluation-miner accumulates a history of which metric proposals correlate with “above” or “far above” labels on the 2020–2025 window. This is a soft optimization signal. An information-theoretic upper bound: each round delivers at most $\log_{2}(4)=2$ bits of test-segment information per alpha evaluated; across 100 rounds with ${\sim}10$ alpha evaluations per round, the accumulated signal is at most ${\sim}2{,}000$ bits. The effective signal is much weaker because (a) the four-bucket discretization is coarse (each bucket spans a Sharpe range of $\geq 1.0$), (b) the metric-promotion decision is governed by `avg_pred_corr` against *train*-segment Sharpe, not test-segment feedback, so the test-segment signal cannot directly pass the promotion threshold, and (c) the evaluation-miner’s message history is reset at the start of each round, limiting within-context accumulation. The practical consequence is that the holdout segment (2026-01 to 2026-05) is the only segment fully clean for the composite system, including the metric-selection sub-process.

   The test segment (2020–2025) is a validation set the metric-evolution process has soft access to via categorical feedback. This appendix does not claim the test segment is out-of-sample for metric selection; only that the holdout segment is. A stronger guard, blocking test-segment backtest calls entirely and relying only on train-segment feedback for the evaluation-miner, would remove this residual concern.
4. Metric `avg_pred_corr` uses train-segment Sharpe. The auto-management correlation is computed as the Pearson correlation between the metric’s score and the *train*-segment Sharpe of the alpha being scored, accumulated across all rounds in which the metric was used. Test-segment Sharpe is never included in this computation. The promotion threshold is $|\texttt{avg\_pred\_corr}| > 0.3$ with demotion at $0.05$ and a minimum of $n_{\mathrm{obs}} = 3$ before either direction can fire.
5. The PPO test-IC computation writes to a separate audit file. The `top_alphas.json` consumed by the LLM stack contains only training IC; `top_alphas_audit.json` contains test IC and is read only by the offline analysis path.
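The categorical conversion in point 3 and the accompanying bit-count bound can be illustrated with a short sketch. The function names `sharpe_bucket` and `leakage_upper_bound_bits` are hypothetical and are not taken from `_build_metrics_feedback`. The treatment of values that fall exactly on a bucket edge is an assumption, because the stated intervals share their endpoints.

```python
import math

def sharpe_bucket(sharpe):
    """Map a numerical test-segment Sharpe to one of four categorical labels.

    Edge handling (which bucket owns +1.5, +0.5 and -0.5) is an assumption.
    """
    if sharpe > 1.5:
        return "far above"
    if sharpe >= 0.5:
        return "above"
    if sharpe >= -0.5:
        return "near"
    return "far below"

def leakage_upper_bound_bits(rounds=100, evals_per_round=10, n_buckets=4):
    """Upper bound on test-segment information delivered as bucket labels.

    Each label carries at most log2(n_buckets) bits.
    """
    return rounds * evals_per_round * math.log2(n_buckets)
```

With the defaults, the bound is 100 × 10 × 2 = 2,000 bits, matching the figure quoted in point 3.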

These five paths correspond to the four sealing enforcement points described in Section [3](https://arxiv.org/html/2606.29194#S3): the Wiki access-role filter (point 1), Channel B input zeroing of numerical test fields (points 2, 3), the `force_segment="train"` alpha-miner flag (point 4), and the four-bucket categorical conversion for test feedback (point 3). The Wiki access-role table (Appendix [A.5](https://arxiv.org/html/2606.29194#A1.SS5)) governs which fields each role can read; the holdout segment is loaded by the `ExprEvaluator` singleton only when the demo exits, never during the closed loop.

### A.5 Wiki Access-Role Matrix

Channel C is the LLM Wiki\. Read access is filtered per role at the adapter layer; the matrix below records which fields each role can read\.

∗ The evaluation-miner and factor-metrics-evaluator call `simple_backtest_layered` on the test segment (2020–2025) but receive only a four-bucket categorical label (“far above / above / near / far below” a Sharpe threshold of $+0.5$), not the numerical Sharpe value. The numerical test-segment fields (`test_ic`, `test_sharpe`, etc.) are never written to the wiki and never appear in any LLM-facing message. See Appendix [A.4](https://arxiv.org/html/2606.29194#A1.SS4), point 3 for the residual in-context learning caveat.

RR = research\_report; A\.Miner = alpha\_miner; A\.Eval = alpha\_evaluator; E\.Miner = evaluation\_miner; FME = factor\_metrics\_evaluator\.

### A.6 Wiki Write-Topology

The table below extends the access\-role matrix with write permissions, showing which agents write to which wiki sections and which agents subsequently read those sections\. This is the information\-flow graph of the LLM Wiki\.

The orchestrator is the sole writer for all sections except `sources/`; agents do not write directly to the wiki. The categorical test-segment label and access-role filtering described above (Appendix [A.4](https://arxiv.org/html/2606.29194#A1.SS4), point 3) are enforced at write time, not at read time, so a future agent that accidentally requested a numerical test field would not receive it.

### A.7 Agent Specialization Evidence

The three-instance `alpha_evaluator` panel and three-instance `factor_metrics_evaluator` panel are not stylistic duplicates: each instance has its own LLM client, its own message history, and reads disjoint slices of the wiki via the access-role table. Evaluator panels do not vote; each instance writes a report, and the orchestrator aggregates the reports into the cross-round briefs that flow on Channel B.

Across R1–R100 of the Agora run, the three alpha-evaluator instances produced `specific_findings` lists whose lengths differed by 3 or more in 1% of rounds. That figure excludes cases where lists are similar in length but differ in content, and it does not capture 2–1 verdict splits at the per-alpha level. The `research_report` and `alpha_miner` agents draw from disjoint skill libraries (`topic_library` vs. four miner libraries), and their `system_prompt`s are non-overlapping.

### A.8 Reproducibility

The release accompanying this paper will include: (i) the full A2A codebase, including the orchestrator, the eight skill libraries with their builtin sets, and the LLM Wiki schema; (ii) all seven baseline implementations, sharing a common evaluation harness with Agora; (iii) the data adapter and the four-segment time split; (iv) all training logs, including the `registry.json` and `stats.json` snapshots that record the per-round state of every library. The trained LLM context itself is not released, as it is provider-specific.

### A.9 Open Issues

The following experiments are deferred to future work or a camera\-ready revision\.

1. Full 100-round multi-seed Agora. The full system has been run at one seed only. A second full seed (approximately 60 GPU-hours) would bound the full-system seed variance.
2. 100-round B6 and B6+augmented-builtins runs. The B6 ablation variants ran for 5 outer rounds; Agora ran for 100. A 100-round B6+aug run (roughly 30 hours) would close the round-count confound on the $+0.40$ “static value of two metrics” attribution and the $-0.80$ “PPO relay alone” attribution. Both attributions are expected to hold in direction; the magnitudes are not yet characterized at the matching round budget.
3. Multi-seed replication of the decomposition ablations. The B6+aug 20-round and B6+relay 5-round results each rest on a single seed. Replication at additional seeds (around 1.5–2 hours each) would tighten the magnitude of the $-0.80$ “PPO relay alone is harmful” attribution, which is currently preliminary.
4. Single-agent-with-evolution baseline. A direct ablation of F1 (decomposition) is the most important missing empirical test of SJS. One LLM instance that both proposes alphas and evolves metrics, without the A2A decomposition, would isolate the contribution of F1 from F4–F5. We have not run it.
5. Operator, network, and reward library evolution. Seven of the eight libraries did not evolve in the reported run. Whether operator, network, or reward library evolution adds further lift under different prompts, exploration policies, or longer horizons is open. The infrastructure supports it; the budget for this draft did not.
6. Replication on non-Chinese universes. The data adapter is dataset-agnostic; porting to S&P 500 or NASDAQ requires only the adapter. We have not performed this experiment.
7. Generalization of SJS beyond alpha mining. The information-flow contract is plausibly domain-agnostic; the cost-effective realization (which LLM, which substrate, which promotion outcome variable) is domain-specific. Testing F1–F5 on autonomous theorem proving, autonomous experimental design, or autonomous code synthesis is the direct check.
8. Factor decomposition and capacity analysis. Deployment-relevant questions (factor-adjusted alpha relative to [[8](https://arxiv.org/html/2606.29194#bib.bib8),[9](https://arxiv.org/html/2606.29194#bib.bib9),[4](https://arxiv.org/html/2606.29194#bib.bib4)], ADV-based capacity, short-borrow availability, multiple-testing correction following [[10](https://arxiv.org/html/2606.29194#bib.bib10),[43](https://arxiv.org/html/2606.29194#bib.bib43)]) require new data infrastructure outside the present evaluation harness.

## Appendix B Baseline Implementation Details

Each baseline shares a single evaluation harness: a $(T\times N)$ factor panel plus a segment specification go in, and a dictionary of metrics is returned, computed by the same `simple_backtest_layered` engine that scores Agora’s alphas. Hyperparameters below are sufficient for reproduction.

### B.1 B1 (GP / gplearn)

The official AlphaGen GP reference implementation is adapted unchanged with respect to the algorithm, replacing only the data adapter \(CSI 1000 in place of CSI 300, 5d open\-to\-open target in place of the original 20d close\-to\-close\) and the evaluation route\.

- Population size: 500.
- Generations: 20.
- Init depth: $(2,6)$.
- Tournament size: 100.
- Crossover / sub-tree mutation / hoist mutation / point mutation $= 0.3/0.1/0.01/0.1$.
- Stopping criterion: 1.0 (i.e., never; all 20 generations are run).
- Token-length cap: 20 (longer trees are assigned fitness $-1.0$).
- Fitness: per-day rank IC averaged over the train segment, computed by `calc.calc_single_IC_ret` (the same call AlphaGen uses).
- Seed: 42.
- Top-K extraction: `Counter(cache).most_common(top_k)`.
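The hyperparameters above map naturally onto the keyword names accepted by gplearn's estimator constructors. The dictionary below is a sketch of that mapping; whether the baseline passes them in exactly this form is an assumption, and the token-length cap and the rank-IC fitness are custom logic that is not expressed here.

```python
# Keyword names follow gplearn's estimator constructors; the values are the
# ones listed for B1. The dictionary itself is illustrative.
GP_KWARGS = {
    "population_size": 500,
    "generations": 20,
    "init_depth": (2, 6),
    "tournament_size": 100,
    "p_crossover": 0.3,
    "p_subtree_mutation": 0.1,
    "p_hoist_mutation": 0.01,
    "p_point_mutation": 0.1,
    "stopping_criteria": 1.0,
    "random_state": 42,
}

# gplearn requires the four genetic-operation probabilities to sum to at
# most 1; the remainder is the probability of plain reproduction.
mutation_mass = sum(v for k, v in GP_KWARGS.items() if k.startswith("p_"))
reproduction_mass = 1.0 - mutation_mass
```

Under these settings roughly half of each generation's offspring are unmodified tournament winners, since the four probabilities sum to 0.51.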

#### Parser bypass\.

GP search builds expressions by string-tree concatenation; some resulting expressions are not accepted by AlphaGen’s `parse_expression` routine (for example, `EMA(Constant(-2.0), 20)` is rejected because its operand is not a “featured” expression). For evaluation, the parser is bypassed and the expression object is reconstructed directly via `eval(key)`, the same path the GP fitness function uses. This affects only how the search-time expression is recovered for evaluation; it does not change what GP searched over.
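The bypass can be pictured with stand-in classes. `Constant`, `Feature`, and `EMA` below are hypothetical dataclasses defined only for this sketch, not AlphaGen's own expression types, and `rebuild` is an illustrative name. The point is that evaluating the string key against a namespace of constructors yields the expression object without applying any grammar check.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Constant:
    value: float

@dataclass(frozen=True)
class Feature:
    name: str

@dataclass(frozen=True)
class EMA:
    operand: object
    window: int

# Only these constructors are visible to the evaluated string.
NAMESPACE = {"Constant": Constant, "Feature": Feature, "EMA": EMA}

def rebuild(key):
    """Rebuild an expression object from its string key, skipping the parser."""
    return eval(key, {"__builtins__": {}}, dict(NAMESPACE))

# An expression a stricter parser would reject: the EMA operand is a constant.
expr = rebuild("EMA(Constant(-2.0), 20)")
```

Restricting the namespace passed to `eval` is a choice made for this sketch; it keeps the reconstruction limited to the constructors the search itself can emit.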

### B.2 B2 (AlphaGen pure PPO)

Agora’s published training script \(scripts/train\_alphagen\_ppo\.py\) is invoked with the LLM stack disengaged\.

- Total timesteps: 500,000.
- Pool capacity: 30 (matches the top-K used elsewhere).
- Policy: LSTM shared net, 2 layers, $d_{\mathrm{model}}=128$, dropout 0.1.
- Algorithm: MaskablePPO.
- Reward: training-segment IC against $\mathrm{Ref}(\mathrm{open},-6)/\mathrm{Ref}(\mathrm{open},-1)-1$.
- Device: `cuda:0`.
- Seed: 42.
- Evaluation: load `top_alphas.json`; route through the common evaluation harness, which uses the same `evaluate_exprs` parser path Agora uses.

Training wall\-clock: roughly 5 hours on RTX 5090\.
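The reward target listed above can be written out on a plain price series. The sketch assumes the AlphaGen convention that a negative delta in `Ref` looks forward in time, so the target at day $t$ is the open on day $t+6$ divided by the open on day $t+1$, minus one. The function name and the example prices are illustrative.

```python
def forward_open_return(opens, t):
    """5-day open-to-open return entered at the next open.

    Implements open[t + 6] / open[t + 1] - 1; returns None when the series
    does not extend far enough into the future.
    """
    if t + 6 >= len(opens):
        return None
    return opens[t + 6] / opens[t + 1] - 1.0

# Made-up opening prices for eight consecutive trading days.
opens = [10.0, 10.0, 10.5, 10.4, 10.6, 10.8, 11.0, 10.9]
target_day0 = forward_open_return(opens, 0)  # 11.0 / 10.0 - 1
```

Entering at the open of day $t+1$ rather than at the close of day $t$ keeps the target consistent with a signal that is only known after day $t$ has finished.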

### B.3 B3 (Single LLM, one-shot)

- Model: `claude-sonnet-4-6`.
- Temperature: 0.9 (matches Agora’s `alpha_miner`).
- Max tokens: 8000.
- System prompt: a description of the alphagen operator set, the 5d open-to-open target, and the requirement that alphas not reference future values. Identical operator vocabulary to Agora.
- User prompt: “design $N=50$ alpha factor expressions, output a JSON object with key `alphas`.”
- No iteration, no feedback.

Survivors are deduped by normalized expression string and all unique ones are fed through the common evaluation harness \(truncated to top\-30 if needed\)\.
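A minimal sketch of the deduplication step follows. The normalization rule used here (removing all whitespace) is an assumption, since the text does not define it, and because the ranking used for the top-30 truncation is not stated, the sketch simply keeps the first 30 unique expressions in output order.

```python
import re

def normalize(expr):
    """Hypothetical normalization: drop all whitespace from the expression."""
    return re.sub(r"\s+", "", expr)

def dedupe(exprs, cap=30):
    """Keep the first occurrence of each normalized expression, up to `cap`."""
    seen = set()
    unique = []
    for expr in exprs:
        key = normalize(expr)
        if key not in seen:
            seen.add(key)
            unique.append(expr)
    return unique[:cap]
```

For example, `dedupe(["Add(close, open)", "Add(close,open)"])` keeps only the first spelling.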

### B.4 B4 (Single LLM, iterative)

Same model and prompt scaffold as B3, but iterated:

- 10 rounds.
- 20 alphas requested per round.
- Feedback per round: top-3 best and bottom-3 worst alphas (by training IC) from prior rounds, with their rationales, fed back into the user message.
- Train IC fitness computed identically to B1’s fitness function (via `batch_pearsonr` on `evaluate_alpha` output, the same call AlphaGen’s PPO uses for its reward). A separate train calculator is built once and reused across rounds.
- Final selection: top-30 from the union of all 10 rounds, ranked by train IC, fed through the common evaluation harness.
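The feedback and final-selection rules above can be sketched as follows. The record layout (`expr`, `train_ic`, `rationale`) and all function names are hypothetical, and the sketch leaves numerical IC values out of the rendered message because the text does not say whether they are shown to the model.

```python
def pick_feedback(history, k=3):
    """Return the k best and k worst alphas by training IC from prior rounds."""
    ranked = sorted(history, key=lambda a: a["train_ic"], reverse=True)
    return ranked[:k], ranked[-k:]

def render_feedback(best, worst):
    """Format the feedback block appended to the next user message."""
    lines = ["Best so far:"]
    lines += [f"  {a['expr']} (rationale: {a['rationale']})" for a in best]
    lines.append("Worst so far:")
    lines += [f"  {a['expr']} (rationale: {a['rationale']})" for a in worst]
    return "\n".join(lines)

def final_selection(history, top_n=30):
    """Top-n alphas by training IC over the union of all rounds."""
    ranked = sorted(history, key=lambda a: a["train_ic"], reverse=True)
    return ranked[:top_n]
```

Here `history` is the running union of every alpha proposed in earlier rounds, so the same list feeds both the per-round feedback and the final top-30.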

### B.5 B5 (Alpha101)

The OHLCV-only subset of WorldQuant’s 101 formulas is implemented: $\alpha_{1}$ through $\alpha_{30}$. Each is computed as a single-pass panel operation, with rolling correlations and cross-sectional rank operations applied per day.

Custom variants (sector-neutralized or industry-bucketed alphas, both of which require features unavailable for the CSI 1000 universe) are not substituted. Among $\alpha_{15}$ through $\alpha_{30}$, those that involve `indneutralize` or `IndClass` are skipped; the surviving subset is what is reported.

### B.6 B6 (Frozen libraries ablation)

The B6 isolated environment is constructed by:

1. Cloning Agora’s repository to a sibling directory with `wiki/` emptied, `runs/` emptied, and every skill library’s `discovered/` contents removed and `registry.json` reset so only the builtin sets remain (64 builtin skills total).
2. Monkey-patching every library class’s `add`, `modify`, `promote`, `demote`, and `auto_manage` methods to no-op (returning a failure tuple of the appropriate arity) before importing the orchestrator.
3. Setting `enable_incremental_ppo = False` (the PPO relay would otherwise mutate operator/network/RL-algorithm libraries indirectly through training, breaking the ablation’s framing as *frozen* libraries; this also introduces an additional confound in the Agora-vs-B6 attribution that is discussed in Appendix [A.9](https://arxiv.org/html/2606.29194#A1.SS9)).
4. Running the orchestrator for 5 outer rounds. Wall-clock: approximately 1.5 hours per run.
5. Reading the resulting `holdout_report.json` and routing the top-30 alphas (selected by train-segment IC, matching the selection rule used for every other method) through the common evaluation harness.

The configuration is otherwise identical to Agora\.
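The monkey-patching in step 2 can be sketched with a stand-in class. `SkillLibrary` and its method bodies are hypothetical, and the single two-element failure tuple is a simplification of the per-method arities mentioned above.

```python
class SkillLibrary:
    """Hypothetical stand-in for one of the skill-library classes."""

    def __init__(self):
        self.skills = {"builtin_ic": "accepted"}

    def add(self, name, source):
        self.skills[name] = "trial"
        return True, name

    def promote(self, name):
        self.skills[name] = "accepted"
        return True, name

    def demote(self, name):
        self.skills[name] = "trial"
        return True, name

def _refuse(*args, **kwargs):
    """No-op replacement: report failure and leave the library untouched."""
    return False, "frozen"

def freeze(cls, methods=("add", "modify", "promote", "demote", "auto_manage")):
    """Replace every mutating method that the class defines with the no-op."""
    for name in methods:
        if hasattr(cls, name):
            setattr(cls, name, _refuse)

freeze(SkillLibrary)  # must run before any orchestrator code uses the class
lib = SkillLibrary()
result = lib.add("new_metric", "def compute(): ...")
```

Patching the class rather than individual instances means any library object created later by the orchestrator is frozen as well.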

#### Round\-count confound\.

B6 was run for 5 outer rounds; Agora was run for 100\. The round count is a real confound: even with frozen libraries, additional rounds accumulate more alpha candidates and evaluator feedback, which could lift B6’s Sharpe\. A 20\-round or 100\-round B6 is queued in Appendix[A\.9](https://arxiv.org/html/2606.29194#A1.SS9)\.

### B.7 B7 (Random search)

- 3,000 random expression trees, max depth 4.
- Sampled uniformly from the alphagen operator set with the same constants and delta-times AlphaGen uses ($\{1,5,10,20,40\}$ for $\Delta t$).
- Fitness: same as B4 (per-day Pearson IC on `evaluate_alpha` output).
- Top-30 by signed IC; fed through the common evaluation harness.
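A depth-bounded random sampler of this kind can be sketched in a few lines. The operator lists below are a small illustrative subset, not the full alphagen operator set, and the sampling scheme (a 0.3 chance of stopping at a leaf, then a uniform choice of operator family) is an assumption; only the depth cap, the delta-times, and the signed-IC ranking come from the list above.

```python
import random

FEATURES = ["open", "high", "low", "close", "volume"]
UNARY = ["Abs", "Log"]
BINARY = ["Add", "Sub", "Mul", "Div"]
ROLLING = ["Mean", "Std", "Ref"]
DELTA_TIMES = [1, 5, 10, 20, 40]

def random_tree(rng, max_depth=4):
    """Sample one expression string whose tree depth is at most max_depth."""
    if max_depth <= 1 or rng.random() < 0.3:
        return rng.choice(FEATURES)
    kind = rng.choice(["unary", "binary", "rolling"])
    if kind == "unary":
        return f"{rng.choice(UNARY)}({random_tree(rng, max_depth - 1)})"
    if kind == "binary":
        left = random_tree(rng, max_depth - 1)
        right = random_tree(rng, max_depth - 1)
        return f"{rng.choice(BINARY)}({left}, {right})"
    child = random_tree(rng, max_depth - 1)
    return f"{rng.choice(ROLLING)}({child}, {rng.choice(DELTA_TIMES)})"

def depth(expr):
    """Tree depth of a rendered expression: 1 for a bare feature."""
    level = deepest = 0
    for ch in expr:
        if ch == "(":
            level += 1
            deepest = max(deepest, level)
        elif ch == ")":
            level -= 1
    return deepest + 1

def top_by_signed_ic(scored, k=30):
    """Rank (expr, ic) pairs by signed IC, so negative ICs sort last."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [expr for expr, _ in ranked[:k]]
```

Ranking by signed rather than absolute IC means a strongly negative factor is not rescued by flipping its sign.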

### B.8 Significance Testing

The significance tests run on each baseline’s per-alpha holdout Sharpe distribution and on the daily holdout NAV series:

1. (Primary) Newey-West HAC $t$-test on the daily difference in composite portfolio log-returns $d_{t}=r^{\mathrm{Agora}}_{t}-r^{X}_{t}$. The Newey-West sandwich variance estimator is computed directly: $\sigma^{2}_{\mathrm{NW}}=\gamma_{0}+2\sum_{L=1}^{5}(1-L/(L_{\max}+1))\gamma_{L}$ with Bartlett kernel and $L_{\max}=5$, where $\gamma_{L}$ is the lag-$L$ autocovariance of $d_{t}$ [[26](https://arxiv.org/html/2606.29194#bib.bib26)]. The annualized Sharpe SE uses the delta-method approximation $\mathrm{SE}(\mathrm{Sharpe}_{\mathrm{ann}})\approx(\mathrm{SE}(\mu)/\sigma)\sqrt{252}$, ignoring the variance of $\sigma$. A manual implementation is used rather than a library routine so the estimator is fully auditable. The reported $t$-statistic, one-sided $p$-value ($H_{0}:\mathbb{E}[d_{t}]\leq 0$), and Newey-West 95% CI on the annualized Sharpe in Table [5](https://arxiv.org/html/2606.29194#S5.T5) all come from this implementation.
2. (Secondary, descriptive) Two-sample bootstrap (10,000 resamples, NumPy default RNG seeded with 42) on the difference of medians of the two per-alpha Sharpe distributions; reports the observed difference, 2.5%/97.5% percentiles as the 95% CI, and a two-sided $p$-value. Reported for distributional characterization only; not a valid portfolio-level test due to cross-sectional correlation.
3. (Secondary, descriptive) One-sided Mann–Whitney $U$ test of Agora versus the baseline.

These two secondary tests are reported in Table [6](https://arxiv.org/html/2606.29194#S5.T6) alongside the primary NW HAC results.
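The primary test can be sketched in pure Python as below. This is an illustration of the Bartlett-weighted estimator described in item 1, not the paper's implementation: normalizing each autocovariance by $n$ and using a normal approximation for the one-sided $p$-value are assumptions, and all function names are hypothetical.

```python
import math

def newey_west_tstat(d, max_lag=5):
    """Return (t_stat, se_mean) for the mean of d with a Newey-West variance."""
    n = len(d)
    mean = sum(d) / n
    dev = [x - mean for x in d]

    def autocov(lag):
        # Lag-L autocovariance, normalized by n (an assumption of this sketch).
        return sum(dev[t] * dev[t - lag] for t in range(lag, n)) / n

    long_run_var = autocov(0)
    for lag in range(1, max_lag + 1):
        weight = 1.0 - lag / (max_lag + 1)  # Bartlett kernel weight
        long_run_var += 2.0 * weight * autocov(lag)

    se_mean = math.sqrt(long_run_var / n)
    return mean / se_mean, se_mean

def one_sided_p(t_stat):
    """P(Z >= t_stat) under a standard normal approximation."""
    return 0.5 * math.erfc(t_stat / math.sqrt(2.0))

def annualized_sharpe_se(se_mean, sigma):
    """Delta-method SE of the annualized Sharpe, ignoring the variance of sigma."""
    return se_mean / sigma * math.sqrt(252.0)
```

Setting `max_lag=0` reduces the estimator to the ordinary standard error of the mean, which gives a simple sanity check on the implementation.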

## Appendix C Source Code: Two Promoted Metrics

This appendix lists the full source of the two metrics promoted from `trial` to `accepted` during the 100-round Agora run. Both files are written by the `evaluation_miner` agent at runtime (R21 and R50, respectively), then validated through the AST sandbox plus dry-run before entering the live closed loop.

Two facts are verifiable from the listings: (i) the metrics implement standard quant practice (quintile-monotonicity Spearman correlation; excess-drawdown nonlinear penalty plus turnover-cost discount); they are not novel discoveries. (ii) Each fits in fewer than 100 lines and uses only `numpy` and `scipy.stats`, which the AST sandbox whitelist permits.

### C.1 `monotonicity_score_v1.py` (proposed at R21)

Promotion route: agent-initiated, meta-evaluator-approved. The metric’s `avg_pred_corr` on train-segment Sharpe was $+0.47$ at promotion in R21, above the auto-management threshold of $|0.3|$ at that moment but with $n_{\mathrm{obs}}$ still small. The cumulative `avg_pred_corr` drifted to $+0.158$ by R100 (below the $0.3$ threshold but above the $0.05$ demotion floor), reflecting that the metric became less predictive in later rounds as the alpha pool composition shifted. Because demotion requires $|\texttt{avg\_pred\_corr}|<0.05$ with $n_{\mathrm{obs}}\geq 3$, the metric remained `accepted` through R100. Promotion was agent-initiated; the `avg_pred_corr` at promotion was supportive but not the primary evidence.

```python
"""
name: monotonicity_score_v1
status: accepted
version: 1
description: Spearman monotonicity score across 5 quantile groups,
             penalizing factors whose monotonicity is unstable over
             time (high std across days).
rationale: standard IC measures full cross-section correlation but
           cannot distinguish "correlated but non-monotonic
           bucketing" (a beta-loading factor) from "strict
           monotonic bucketing with modest IC" (a true alpha).
           Particularly useful in low-dispersion regimes where
           short-side blowups can produce spurious long-short
           spreads.
"""

import numpy as np
from scipy import stats

def compute(factor_values, future_returns, **kwargs):
    factor_values = np.asarray(factor_values, dtype=float)
    future_returns = np.asarray(future_returns, dtype=float)

    if factor_values.ndim == 1:
        factor_values = factor_values.reshape(1, -1)
        future_returns = future_returns.reshape(1, -1)

    T, N = factor_values.shape
    if N < 20:
        return 0.0

    n_groups = 5
    mono_scores = []

    for t in range(T):
        fv = factor_values[t]
        fr = future_returns[t]

        valid = np.isfinite(fv) & np.isfinite(fr)
        if valid.sum() < n_groups * 4:
            continue

        fv_v = fv[valid]
        fr_v = fr[valid]

        sorted_idx = np.argsort(fv_v)
        group_size = len(sorted_idx) // n_groups

        group_returns = []
        for g in range(n_groups):
            if g < n_groups - 1:
                idx = sorted_idx[g * group_size:(g + 1) * group_size]
            else:
                idx = sorted_idx[g * group_size:]
            group_returns.append(np.mean(fr_v[idx]))

        group_returns = np.array(group_returns)
        group_ranks = np.arange(1, n_groups + 1)

        corr, _ = stats.spearmanr(group_ranks, group_returns)
        if np.isfinite(corr):
            mono_scores.append(corr)

    if len(mono_scores) == 0:
        return 0.0

    avg_mono = float(np.mean(mono_scores))

    # Stability penalty: factors with high day-over-day variance in
    # monotonicity get downweighted.
    if len(mono_scores) > 2:
        std_mono = float(np.std(mono_scores))
        stability_penalty = max(0.0, std_mono - 0.3) * 0.5
        avg_mono = avg_mono - stability_penalty

    return float(np.clip(avg_mono, -1.0, 1.0))
```

### C.2 `excess_drawdown_penalty_v1.py` (proposed at R50)

Promotion route: empirical (auto-managed). The metric’s `avg_pred_corr` on train-segment Sharpe was $+0.557$ at promotion in R50 ($n_{\mathrm{obs}}=26$), well above the $|0.3|$ threshold; the cumulative `avg_pred_corr` remained above the threshold through R100. This is the cleanest case of empirical validation in the metric library.

```python
"""
name: excess_drawdown_penalty_v1
status: accepted
version: 1
description: Top-quantile excess drawdown vs. cross-sectional mean,
             with nonlinear penalty when drawdown < -20% and a
             turnover-cost discount.
rationale: IC and IR cannot distinguish "high IC masked by high
           turnover producing negative net excess" from "high IC
           with low turnover and positive net excess". Excess
           drawdown is a core dimension for separating real alpha
           from market-beta exposure, particularly in
           low-dispersion regimes where transaction costs are a
           large fraction of gross excess.
"""

import numpy as np
from scipy.stats import spearmanr

def compute(factor_values, future_returns, **kwargs):
    fv = np.array(factor_values)
    fr = np.array(future_returns)
    T, N = fv.shape
    if T < 10 or N < 10:
        return 0.0

    n_quantiles = 5
    top_returns = []
    bottom_returns = []
    market_returns = []
    turnovers = []
    prev_top_mask = None

    for t in range(T):
        f = fv[t]
        r = fr[t]
        valid = np.isfinite(f) & np.isfinite(r)
        if valid.sum() < n_quantiles * 2:
            continue
        f_v = f[valid]
        r_v = r[valid]
        n_v = len(f_v)
        k = max(1, n_v // n_quantiles)
        sorted_idx = np.argsort(f_v)
        top_idx = sorted_idx[-k:]
        bottom_idx = sorted_idx[:k]
        top_ret = np.mean(r_v[top_idx])
        bottom_ret = np.mean(r_v[bottom_idx])
        mkt_ret = np.mean(r_v)
        top_returns.append(top_ret)
        bottom_returns.append(bottom_ret)
        market_returns.append(mkt_ret)

        # Turnover proxy: change in the set of names in the top quantile.
        valid_indices = np.where(valid)[0]
        top_global = set(valid_indices[top_idx])
        if prev_top_mask is not None:
            overlap = len(top_global & prev_top_mask)
            turnover = 1.0 - overlap / max(len(top_global), 1)
        else:
            turnover = 1.0
        turnovers.append(turnover)
        prev_top_mask = top_global

    if len(top_returns) < 5:
        return 0.0

    top_arr = np.array(top_returns)
    mkt_arr = np.array(market_returns)
    turnover_arr = np.array(turnovers)

    excess = top_arr - mkt_arr

    # One-side cost charge of 10 bps per unit turnover.
    cost_per_day = turnover_arr * 0.001
    net_excess = excess - cost_per_day

    cum_excess = np.cumsum(net_excess)
    running_max = np.maximum.accumulate(cum_excess)
    drawdown = cum_excess - running_max
    max_drawdown = np.min(drawdown)

    ann_factor = 250.0 / len(net_excess)
    ann_excess = np.sum(net_excess) * ann_factor

    # Piecewise drawdown penalty.
    if max_drawdown < -0.30:
        drawdown_penalty = -0.5
    elif max_drawdown < -0.20:
        drawdown_penalty = -0.25 + (max_drawdown + 0.20) * 1.25
    elif max_drawdown < -0.10:
        drawdown_penalty = -0.05 + (max_drawdown + 0.10) * 0.5
    else:
        drawdown_penalty = 0.0

    avg_turnover = np.mean(turnover_arr[1:])
    turnover_penalty = -max(0.0, avg_turnover - 0.3) * 0.3

    # Squashed excess-return component.
    excess_score = float(np.tanh(ann_excess * 5.0)) * 0.5

    score = excess_score + drawdown_penalty + turnover_penalty
    return float(np.clip(score, -1.0, 1.0))
```
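
To make the shape of the piecewise drawdown penalty concrete, the sketch below lifts that branch of the listing into a standalone helper and evaluates it at a few drawdown levels. The helper name `drawdown_penalty` is ours, introduced only for illustration; the branch bodies are copied from the listing. As written, the schedule is not continuous: at a drawdown of -0.20 the penalty steps from -0.10 to -0.25, and at -0.30 it steps from -0.375 to -0.5.

```python
def drawdown_penalty(max_drawdown: float) -> float:
    """Piecewise penalty from the listing; the helper name is hypothetical."""
    if max_drawdown < -0.30:
        return -0.5
    elif max_drawdown < -0.20:
        return -0.25 + (max_drawdown + 0.20) * 1.25
    elif max_drawdown < -0.10:
        return -0.05 + (max_drawdown + 0.10) * 0.5
    return 0.0


# Evaluate one point inside each of the four regimes.
for dd in (-0.05, -0.15, -0.25, -0.35):
    print(f"max_drawdown={dd:+.2f} -> penalty={drawdown_penalty(dd):+.4f}")
```

Because the excess-return component is bounded by 0.5 in magnitude, a drawdown beyond -0.30 alone is enough to cancel the best achievable excess score.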

#### Assessment.

Both implementations are standard quant practice that any experienced researcher could write in 30–60 minutes. What the listing shows is the closed-loop process: the `evaluation_miner` (i) identified gaps in the existing metric set, (ii) drafted the implementations as Python source, (iii) had them validated by the AST sandbox plus dry-run, and (iv) accumulated empirical predictive correlation across rounds. For `excess_drawdown_penalty_v1`, the empirical accumulation cleared the 0.3 threshold at promotion and stayed above it through R100. For `monotonicity_score_v1`, the `avg_pred_corr` was above the threshold at promotion (+0.47 at R21) but slid to +0.158 by R100; promotion was agent-initiated and meta-evaluator-approved rather than driven by empirical accumulation, and the post-promotion drift means the "discovery" framing should be weighted accordingly.
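
Step (iii) can be pictured with a minimal sketch of an AST check followed by a dry-run. This is not Agora's sandbox, whose rules are not specified here. The allow-list, the banned-call set, the assumed `metric(fv, fr)` signature on `T x N` arrays, and all function names below are assumptions made for illustration. A static walk of this kind is also not a security boundary on its own.

```python
import ast
import numpy as np

ALLOWED_IMPORTS = {"numpy", "math"}  # assumption: illustrative allow-list
BANNED_CALLS = {"eval", "exec", "open", "compile", "__import__"}


def static_check(source: str) -> list[str]:
    """Walk the parsed AST and collect rule violations."""
    problems = []
    tree = ast.parse(source)  # raises SyntaxError on malformed source
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"import of {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] not in ALLOWED_IMPORTS:
                problems.append(f"import from {node.module}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BANNED_CALLS:
                problems.append(f"call to {node.func.id}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            problems.append(f"dunder attribute {node.attr}")
    return problems


def dry_run(source: str, func_name: str, seed: int = 0) -> float:
    """Execute the candidate and call it once on small synthetic inputs."""
    namespace = {"np": np}
    exec(compile(source, "<candidate>", "exec"), namespace)
    rng = np.random.default_rng(seed)
    fv = rng.normal(size=(30, 40))              # synthetic factor values
    fr = rng.normal(scale=0.01, size=(30, 40))  # synthetic forward returns
    score = float(namespace[func_name](fv, fr))
    if not np.isfinite(score) or not -1.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score


def validate(source: str, func_name: str):
    """Static check first; only run code that passed it."""
    problems = static_check(source)
    if problems:
        return False, problems
    try:
        dry_run(source, func_name)
    except Exception as exc:
        return False, [f"dry-run failed: {exc!r}"]
    return True, []


GOOD = '''
def toy_metric(fv, fr):
    ic = np.corrcoef(fv.ravel(), fr.ravel())[0, 1]
    return float(np.clip(ic, -1.0, 1.0))
'''
BAD = '''
import os
def toy_metric(fv, fr):
    return 0.0
'''
print(validate(GOOD, "toy_metric"))
print(validate(BAD, "toy_metric"))
```

The ordering is the point of the design: the static pass rejects the second candidate before any of its code executes, and the dry-run only checks that an accepted candidate returns a finite score in the same [-1, 1] range the listing clips to.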
