Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries
Summary
This paper presents EinsteinArena, an agent-native platform enabling decentralized scientific discovery through open interaction among autonomous AI agents. The platform has already produced 12 new state-of-the-art results, including an improved lower bound for the kissing number problem in dimension 11, demonstrating that collective AI-driven research can emerge from agents sharing insights and building on each other's work.
# Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries
Source: [https://arxiv.org/html/2606.10402](https://arxiv.org/html/2606.10402)
Federico Bianchi¹∗, Yongchan Kwon¹∗, Aneesh Pappu², James Zou¹,² \(¹Together AI, ²Stanford University; ∗Equal contribution\)
\(May 2026\)
###### Abstract
Scientific discovery is often a collective process: researchers share partial results, inspect failed attempts, and build on each other’s ideas over long time horizons\. Recent AI systems have shown that language\-model\-based agents can make meaningful progress on open scientific problems, but most existing systems operate in isolation\. In this paper, we present EinsteinArena, an agent\-native platform for open distributed research and discovery\. EinsteinArena provides agents with a live set of open problems, each with a solid verifier, public leaderboard, and problem\-specific discussion forum where agents can ask questions and share insights\. We focus on mathematical tasks that have garnered substantial research interest, where progress can be measured unambiguously\. As of May 2026, agents on EinsteinArena have discovered 12 new state\-of\-the\-art results better than any previous human or AI solutions\. One notable example is the kissing number problem in dimension 11, where the platform improved the best known lower bound from 593 to 604\. This advance did not come from a single agent or isolated run\. Rather it arose through a sequence of submissions, public discussion, verifier refinement, and subsequent agent\-to\-agent borrowing of ideas\. These results provide evidence that decentralized scientific discovery can emerge from open interaction among autonomous agents in the wild, demonstrating a new paradigm for collective AI\-driven research\.
## 1 Introduction
The history of scientific discovery is a history of collective work\. Individual breakthroughs depend on a substrate of shared knowledge: prior failed attempts that narrowed the search space, partial constructions that pointed in the right direction, and public records that let later researchers avoid known dead ends\. This social infrastructure — seminars, preprints, open repositories, and scientific forums — is what has made complex problems tractable\.
As AI systems take on a larger role in scientific discovery, a natural question is whether they can benefit from similar infrastructure\. Current AI discovery systems are powerful but isolated: each run explores a problem independently and produces results that are seldom incorporated into a shared body of knowledge that other agents can readily reuse\. This mirrors an earlier era of human research: before preprints, before open datasets, before norms that make science cumulative\. The question is not only whether a single agent can improve upon the best known result, but whether a community of agents, operating on shared state and building on one another’s partial discoveries, can make substantially faster progress\.
This pattern is especially visible in mathematics and theoretical computer science, where progress toward a new bound or construction is frequently incremental and distributed\. A candidate may be nearly correct but numerically unstable; a proof sketch may require a different parameterization; a construction may only become valid after multiple rounds of refinement\. Recent work on AI for scientific discovery has shown that language\-model\-based systems can improve known solutions to open problems\. Systems such as AlphaEvolve\[[17](https://arxiv.org/html/2606.10402#bib.bib1)\], Virtual Lab\[[24](https://arxiv.org/html/2606.10402#bib.bib2)\], and TTT\-Discover\[[35](https://arxiv.org/html/2606.10402#bib.bib3)\]indicate that search with modern models can already produce nontrivial progress\. However, these systems are usually organized around isolated runs or tightly controlled pipelines\. They do not expose the social structure that makes human research effective: public traces, shared partial results, and the ability for one solver to continue from where another left off\. Interestingly, a parallel line of work has begun studying AI agents as collective systems rather than isolated solvers\. Moltbook, a Reddit\-style platform exclusively populated by AI agents, showed that large agent populations exhibit emergent social dynamics mirroring human online communities, even without a shared task\.
In this work, we ask whether agents can make progress when they operate on a common platform with shared state, public leaderboard, and problem\-specific discussion\. We also ask whether open traces extend the effective time horizon of search by allowing later agents to inherit promising directions instead of restarting from scratch\.
We present EinsteinArena, a platform designed to study these questions on scientific problems with exact or near\-exact verification\. EinsteinArena enables agents to access shared research artifacts, build on prior solutions, and receive continuous feedback through automated evaluation\. More fundamentally, rather than baking knowledge into a task\-specific harness that vanishes when a run ends, EinsteinArena treats the platform as a persistent shared memory: prior solutions, failed attempts, and partial insights become a substrate that any agent can build on, letting progress accumulate across agents and over time\. Our initial focus is mathematics, where the problem statements are precise, the optimization objective is clear, and verification can often be made deterministic and efficient\. Notably, agents on EinsteinArena improved the lower bound for the kissing number in dimension 11 from 593 to 604, one of the largest improvements since Best’s 1980 breakthrough construction\[[1](https://arxiv.org/html/2606.10402#bib.bib7)\] \(as described in\[[3](https://arxiv.org/html/2606.10402#bib.bib6)\]\)\.
Our main contributions are three\-fold:
1. We present EinsteinArena, an open platform for multiple agents to organically collaborate on scientific problems, with public problem specifications, automatic verification, real\-time leaderboards, and discussion threads\.
2. We document that the platform has already produced new state\-of\-the\-art results on 12 open mathematical problems\.
3. We provide a linguistic analysis of collaborative agent search in the wild, showing how public traces, iterative submission, and shared debugging can produce new state\-of\-the\-art solutions that no single agent found alone\.
## 2 EinsteinArena
### 2\.1 Overview and design principles
EinsteinArena is an open, agent\-native platform where AI agents compete and collaborate on unsolved research problems\. The platform is built around three core components: \(i\) a curated collection of open problems with public verifiers, \(ii\) a live leaderboard that tracks the best known solution for each problem, and \(iii\) a public discussion board where agents can share intermediate findings, document failed approaches, and build upon one another’s discoveries\. Figure[1](https://arxiv.org/html/2606.10402#S2.F1)presents the EinsteinArena web interface\.
Figure 1: The EinsteinArena web interface\. Each problem page includes a problem description, an active leaderboard with agent scores, and a discussion board where agents can share findings, ask questions, and propose hypotheses\. All three components are updated in real time as new submissions are evaluated or new threads are published\.
A central design principle of EinsteinArena is transparency: all core artifacts required for participation are publicly accessible\. Problem statements, verifier source code, leaderboard scores, best submitted solutions, and discussion threads can be accessed by any agent through either the web interface or the API\. In this way, the platform functions as a shared research environment where the current frontier is visible to all participants, enabling agents to inspect, reuse, and extend the best work produced by others\. Instructions for interacting with the platform are also publicly available through a markdown file, `skill.md` \(accessible via[https://einsteinarena\.com/skill\.md](https://einsteinarena.com/skill.md)\), which specifies the API endpoints and submission procedures\. The source code for the EinsteinArena platform, along with the analysis code used in our experiments, is available at[https://github\.com/vinid/einstein\-arena](https://github.com/vinid/einstein-arena)\.
### 2\.2 Problem curation and current frontier
We curate a collection of open mathematical optimization problems from AlphaEvolve\[[17](https://arxiv.org/html/2606.10402#bib.bib1)\]\. We selected problems with objective and computationally efficient evaluation procedures to enable rapid feedback and iterative improvement\. We also prioritized problems with established research interest, including tasks on which previous studies have reported strong AI\-agent results \(e\.g\., the three autocorrelation problems\) as well as tasks for which substantially better human\-derived solutions are known, leaving significant room for improvement \(e\.g\., the prime number theorem problem\)\. \(The `README.md` file of the GitHub repository[https://github\.com/google\-deepmind/alphaevolve\_repository\_of\_problems](https://github.com/google-deepmind/alphaevolve_repository_of_problems)includes a figure illustrating which problems have best known results relative to AI solutions\.\)
Table[1](https://arxiv.org/html/2606.10402#S2.T1)summarizes the problems active on EinsteinArena as of May 2026, together with their current best scores\. For problems on which EinsteinArena participants achieved a new state\-of\-the\-art result, we additionally report the previous best known result and the current best result obtained on the platform\. Since the launch of EinsteinArena on March 19, 2026, agents have discovered new state\-of\-the\-art results for 12 problems, and most of these findings emerged through collaborative effort, with agents iteratively building on prior solutions and feedback from the community\. \(We note that SimpleTES\[[34](https://arxiv.org/html/2606.10402#bib.bib23)\] identified a new state\-of\-the\-art construction with a score of $0.380868$ for the Erdős minimum overlap problem\. This construction is better than the current EinsteinArena result, but this score was achieved after we had already obtained a score of $0.380871$, which is superior to that of TTT\-Discover\[[35](https://arxiv.org/html/2606.10402#bib.bib3)\]\.\) We present representative case studies of these collaborative discoveries in Sections[3](https://arxiv.org/html/2606.10402#S3)and[4](https://arxiv.org/html/2606.10402#S4)\.
Table 1:Problems on EinsteinArena as of May 2026\. Bold entries indicate improved scores where EinsteinArena agents achieved new state\-of\-the\-art results\. The previous best scores are sourced from AlphaEvolve\[[17](https://arxiv.org/html/2606.10402#bib.bib1)\], TTT\-Discover\[[35](https://arxiv.org/html/2606.10402#bib.bib3)\], and references therein\. A detailed description of the problems is provided in Appendix[A](https://arxiv.org/html/2606.10402#A1)\.
### 2\.3 Problem specification and verifier
Each problem on EinsteinArena is specified by four components: a natural\-language description of the mathematical task; a `solutionSchema` that defines the exact JSON structure a valid submission must have; a `scoring` field indicating whether lower or higher scores are better; and a `verifier`, which is executable Python code that maps a submitted solution to a scalar score\.
The verifier is the central artifact\. Many verifiers follow the reference implementations used in prior work such as AlphaEvolve\[[17](https://arxiv.org/html/2606.10402#bib.bib1)\], but we add checks for invalid submissions\. Verifiers are manually audited and updated when agents expose numerical or validity edge cases\. They are also public: agents can download and run them locally without making API calls\. Server\-side evaluation is therefore reproducible, and local runs are intended to be semantically identical to server\-side ones\. Agents do not need to guess the scoring function or submit solutions blindly; they can iterate offline and submit only when they have a credible improvement\. This transparency is a central feature of the platform, and one that is only possible for this type of open problem\.
Problems vary in verifier complexity\. Some verifiers apply a closed\-form formula to the submission directly \(e\.g\., computing the overlap integral for Erdős or the autoconvolution ratio for the autocorrelation problems\)\. Others require heavier computation, such as checking pairwise distance conditions over hundreds of vectors \(e\.g\., kissing number\) or drawing $10^{7}$ samples \(e\.g\., prime number theorem\)\. All verifiers share the same interface: they accept a Python `dict` and return a single `float` representing the optimization variable of interest\.
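To make the `dict`-in, `float`-out interface concrete, here is a minimal sketch of what a verifier for a kissing-style problem could look like. It is not the platform's verifier: the `"vectors"` key, the `"minimize"` value of the `scoring` field, the other record fields, and the particular overlap penalty are all assumptions chosen for illustration.

```python
import math

# Hypothetical problem record. Only the component names (description,
# solutionSchema, scoring, verifier) come from the text; the values are assumed.
PROBLEM = {
    "description": "Place n unit spheres touching a central unit sphere; minimize total overlap.",
    "solutionSchema": {"vectors": "list of n lists of d finite numbers"},
    "scoring": "minimize",
}


def verifier(solution: dict) -> float:
    """Map a submission to a scalar score (assumed penalty: summed pairwise overlap)."""
    vectors = solution.get("vectors")
    if not isinstance(vectors, list) or len(vectors) < 2:
        raise ValueError("expected a list of at least two vectors")
    dim = len(vectors[0])
    centers = []
    for v in vectors:
        # Reject malformed submissions before scoring them.
        if len(v) != dim or not all(
            isinstance(x, (int, float)) and math.isfinite(x) for x in v
        ):
            raise ValueError("vectors must share one dimension and be finite")
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0:
            raise ValueError("zero vector has no direction")
        # A unit sphere touching the central unit sphere has its center at distance 2.
        centers.append([2.0 * x / norm for x in v])
    penalty = 0.0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            # Two unit spheres overlap when their centers are closer than 2.
            penalty += max(0.0, 2.0 - math.dist(centers[i], centers[j]))
    return penalty


valid_score = verifier({"vectors": [[1, 0], [0, 1], [-1, 0], [0, -1]]})
invalid_score = verifier({"vectors": [[1, 0], [1, 0]]})
```

Because a function of this shape has no dependence on the server, an agent could run it locally on candidate solutions and submit only when the score improves.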
### 2\.4 Agent registration, interaction and evaluation pipeline
To participate in EinsteinArena, agents must first register on the platform\. During registration, the server generates a random 32\-byte value called `challenge` and a difficulty parameter $k$\. The agent must then find a value $n$ such that `SHA256(challenge + n)` begins with $k$ leading zero bits\. This proof\-of\-work computation is inexpensive for a single registration while making large\-scale registration attempts computationally expensive, thereby discouraging spam\. Upon successfully completing this registration process, the agent is issued a Bearer token that can be used to authenticate subsequent API requests, including solution submissions and other write operations\.
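The proof-of-work step described above can be sketched as follows. The text does not specify how $n$ is encoded or concatenated with the challenge, so this sketch assumes $n$ is a non-negative integer appended as decimal ASCII bytes; the function names `solve_challenge` and `verify_solution` are hypothetical and not part of the platform API.

```python
import hashlib
import os


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify_solution(challenge: bytes, n: int, k: int) -> bool:
    """Server-side check: one hash, regardless of the difficulty k."""
    # Assumed encoding: n as decimal ASCII appended to the raw challenge bytes.
    digest = hashlib.sha256(challenge + str(n).encode("ascii")).digest()
    return leading_zero_bits(digest) >= k


def solve_challenge(challenge: bytes, k: int) -> int:
    """Client-side brute-force search: about 2**k hashes on average."""
    n = 0
    while not verify_solution(challenge, n, k):
        n += 1
    return n


challenge = os.urandom(32)  # the server would issue this 32-byte value
k = 12                      # small difficulty so the sketch runs quickly
nonce = solve_challenge(challenge, k)
```

The asymmetry is the point of the design: finding `nonce` costs roughly $2^{k}$ hash evaluations, while checking it costs one, so a single registration is cheap and mass registration is not.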
Once registered, agents can list problems, fetch problem specifications, submit solutions, download verifier code, and poll results asynchronously\. Since EinsteinArena does not provide a human\-friendly interface for submissions or other write operations, participation is intended to occur through agents rather than direct human interaction\. This design helps ensure that leaderboard results reflect genuine agent capabilities\. To further encourage broad participation and experimentation, EinsteinArena does not require disclosure or registration of the humans who create or operate these agents\.
As for the evaluation pipeline, all submissions are checked in isolated execution environments \(E2B sandboxes\), where the problem verifier is executed against the submission data\. After evaluation, each result is written back to the database and the leaderboard is updated according to the platform’s acceptance rules\. For problems that require high numerical precision, such as the kissing number, where the difference between a valid and an invalid configuration can be smaller than machine epsilon, verifiers use Python’s `decimal.Decimal` arithmetic at 30–80 significant digits for the overlap loss computation and exact arithmetic for integer\-valued submissions\.
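The need for extended precision can be illustrated with a toy example. The sketch below is not the platform's verifier; it only assumes that coordinates arrive as decimal strings and compares a double-precision overlap test with a `decimal.Decimal` test at 50 significant digits, a value inside the 30–80 range mentioned above.

```python
from decimal import Decimal, localcontext


def overlaps_decimal(a, b, digits=50):
    """True if unit spheres centered at a and b overlap, i.e. |a - b|^2 < 4."""
    with localcontext() as ctx:
        ctx.prec = digits
        # Decimal(str) is exact; only the arithmetic is rounded to `digits`.
        dist_sq = sum((Decimal(x) - Decimal(y)) ** 2 for x, y in zip(a, b))
        return dist_sq < 4


def overlaps_float(a, b):
    """Same test in double precision (about 16 significant digits)."""
    dist_sq = sum((float(x) - float(y)) ** 2 for x, y in zip(a, b))
    return dist_sq < 4.0


# Two centers whose distance is 2 - 1e-30: a real but tiny overlap.
center_a = ("0", "0")
center_b = ("1.999999999999999999999999999999", "0")

float_verdict = overlaps_float(center_a, center_b)      # rounding hides the overlap
decimal_verdict = overlaps_decimal(center_a, center_b)  # overlap is detected
```

In double precision the second coordinate rounds to exactly 2.0, so the violation disappears, which is the failure mode that high-precision or exact arithmetic is meant to rule out.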
### 2\.5 Leaderboard and acceptance rules
The leaderboard shows at most one solution per agent for each problem, corresponding to the agent’s best submission\. A new submission appears on the leaderboard only if it improves the agent’s current best score; rejected or lower\-scoring submissions are not added to the leaderboard and not stored in our database\. When an agent achieves a new personal best, the leaderboard is updated to reflect the improved score\. While only the best\-performing solution is shown publicly, all personal\-best submissions are retained in the database to enable reconstruction of the agent’s progress over time\.
To claim the top position, a submission must pass a stricter acceptance pipeline: it is required to exceed the current best score by a problem\-specific minimum improvement threshold $\delta$\. Because the leaderboard score ranges vary substantially across problems, we carefully select $\delta$ for each problem with two goals in mind: \(i\) keeping the threshold low enough to encourage iterative improvements by agents, and \(ii\) keeping it high enough to prevent leaderboard changes caused solely by non\-significant modifications or floating\-point discrepancies between evaluators\.
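The two acceptance rules can be summarized in a few lines. This is a sketch under assumptions: the `scoring` values `"minimize"` and `"maximize"`, the strict inequality against $\delta$, and both function names are illustrative and may differ from the platform's actual rules.

```python
def is_personal_best(new_score, previous_best, scoring):
    """Leaderboard rule: a submission is listed only if it beats the agent's own best."""
    if previous_best is None:  # first valid submission by this agent
        return True
    if scoring == "minimize":
        return new_score < previous_best
    if scoring == "maximize":
        return new_score > previous_best
    raise ValueError(f"unknown scoring direction: {scoring!r}")


def claims_top_position(new_score, current_best, delta, scoring):
    """Stricter rule: the global best must be beaten by more than delta."""
    if scoring == "minimize":
        return current_best - new_score > delta
    if scoring == "maximize":
        return new_score - current_best > delta
    raise ValueError(f"unknown scoring direction: {scoring!r}")


# A gain of 0.5 clears a threshold of 0.1, but a gain of 0.05 does not.
accepted = claims_top_position(9.5, 10.0, delta=0.1, scoring="minimize")
rejected = claims_top_position(9.95, 10.0, delta=0.1, scoring="minimize")
```

Separating the two checks mirrors the text: small personal improvements still appear on the leaderboard, while the top position changes hands only for improvements larger than evaluator noise.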
### 2\.6 Discussion threads
Every problem has an associated discussion board\. Agents can open threads and post replies, which pass through a Llama\-Guard\-based moderation step before becoming publicly visible\[[9](https://arxiv.org/html/2606.10402#bib.bib24)\]\. The thread structure mirrors how researchers share working notes: an agent can post a construction that is not yet valid but contains a promising direction, explain why a particular coordinate family was explored, or flag a numerical failure mode that others should avoid\.
This record accumulates over time and can be queried through the API\. Unlike a leaderboard, which stores only the current frontier, the discussion board stores the path to the frontier\. These partial contributions have no natural home in a system that only records final scores\. Agents often use the platform to discuss approaches, ask questions, and summarize progress\. Figures[3\(a\)](https://arxiv.org/html/2606.10402#S3.F3.sf1)and[3\(b\)](https://arxiv.org/html/2606.10402#S3.F3.sf2)show agents asking and answering questions, and agents updating each other on solution progress, for the kissing number and second autocorrelation inequality problems, respectively\.
## 3 Case Study I: kissing number in dimension 11
Figure 2: Best known lower bounds for the kissing number in dimension 11\. The record stood at 582 for roughly 40 years before a burst of recent progress\. Note that the canonical citation for Ganzhinov’s result is\[[6](https://arxiv.org/html/2606.10402#bib.bib22)\], even though the result first appeared in the 2022 preprint\.
Having described the platform, we now examine the kissing number problem in depth and present how open multi\-agent search produces progress through large\-scale exploration and iterative refinement\.
##### Problem Overview\.
The kissing number in dimension $d \in \mathbb{N}$ asks for the maximum number of non\-overlapping unit spheres in $\mathbb{R}^{d}$ that can simultaneously touch a central unit sphere\. A detailed description of the problem is provided in Appendix[A\.1](https://arxiv.org/html/2606.10402#A1.SS1)\. Exact kissing numbers are known only for dimensions $d \in \{1,2,3,4,8,24\}$, while for most other dimensions only upper and lower bounds have been established\. We consider the lower bound for $d=11$\. The best known lower bound has progressed from $582$ to $592$ in\[[6](https://arxiv.org/html/2606.10402#bib.bib22)\] \(the arXiv version of\[[6](https://arxiv.org/html/2606.10402#bib.bib22)\] was released in 2022\), then to $593$ in\[[17](https://arxiv.org/html/2606.10402#bib.bib1)\]\. In EinsteinArena, a collaborative effort among AI agents improved this lower bound to $594$, after which we had agents build on their result to further extend it to $604$\. To understand how this improvement emerged, we divide the process into two stages: first, the construction of a valid configuration with $n=594$, and second, the extension from $595$ to $604$\.
##### Constructing the 594\-Sphere Configuration\.
The construction for $n=594$ relies on three main components: a strong initial construction by `alpha_omega_agents`\[[13](https://arxiv.org/html/2606.10402#bib.bib19)\], an optimization procedure that produces a near\-valid construction, and a final post\-processing step to obtain a valid construction\. Both `alpha_omega_agents` and `JSAgent`\[[23](https://arxiv.org/html/2606.10402#bib.bib20)\] iterated on the solution\. One AI agent, `KawaiiCorgi`, observed that the leaderboard score was nonlinear and instead optimized a linearized surrogate obtained via a Taylor expansion\. In particular, the agent constructed a least\-squares objective function of the form
$$\sum_{i<j}\left(c_{i}^{\top}c_{j}-2\right)^{2}$$
and optimized it with the LSQR algorithm\[[19](https://arxiv.org/html/2606.10402#bib.bib40)\], starting from the strong initial construction that was available in EinsteinArena\. This resulted in a smooth quadratic objective that enables efficient optimization\. Empirical experiments done by the agent showed it to be significantly more effective, leading the agent to adopt this approach\. As a result, the loss decreased from approximately $10^{-10}$ to $10^{-50}$, which means the average overlap between two overlapping spheres is much less than $10^{-50}$\.
After LSQR refinement, the agent observed that most inner products $x_{i}^{\top}x_{j}$ are close to simple integer values such as $-2$, $0$, or $1$, though not exactly equal\. This indicates the presence of an underlying discrete structure, so the agent applied a final integer\-snapping post\-processing step to make these values exact, resulting in a provably valid construction\. In effect, the agents first found a numerically near\-valid configuration and then converted it into an exact discrete structure that the verifier could certify\.
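The idea of snapping a near-valid numerical configuration to an exact one can be sketched as follows. The agent's actual procedure is not specified here, so this sketch makes a simplifying assumption: after a suitable rescaling, the coordinates themselves are close to integers. The function name `snap_and_verify` and the tolerance are hypothetical. Validity is then certified with integer arithmetic only: all vectors must have the same squared norm $N$, and every pair must satisfy $2\,x_i^{\top}x_j \le N$, which is equivalent to an angular separation of at least $60^{\circ}$.

```python
from itertools import combinations


def snap_and_verify(vectors, tol=1e-6):
    """Round near-integer coordinates, then certify the result exactly.

    Returns the snapped integer vectors if they form a valid kissing
    configuration, otherwise None.
    """
    snapped = []
    for v in vectors:
        rounded = [round(x) for x in v]
        # Refuse to snap coordinates that are not already close to an integer.
        if any(abs(x - r) > tol for x, r in zip(v, rounded)):
            return None
        snapped.append(rounded)
    norm_sq = sum(x * x for x in snapped[0])
    if norm_sq == 0 or any(sum(x * x for x in v) != norm_sq for v in snapped):
        return None  # the vectors must lie on one common sphere
    for u, v in combinations(snapped, 2):
        # Exact integer test for an angle of at least 60 degrees.
        if 2 * sum(a * b for a, b in zip(u, v)) > norm_sq:
            return None
    return snapped


# Toy input: the 12 vectors (+-1, +-1, 0) and permutations, i.e. the known
# optimal kissing configuration in dimension 3, with small numerical noise.
exact = [
    [s * (k == i) + t * (k == j) for k in range(3)]
    for i, j in combinations(range(3), 2)
    for s in (1, -1)
    for t in (1, -1)
]
noisy = [
    [x + 1e-9 * ((i + k) % 3 - 1) for k, x in enumerate(v)]
    for i, v in enumerate(exact)
]
certified = snap_and_verify(noisy)
```

Once the configuration is expressed in integers, the certificate no longer depends on floating-point tolerances, which is what makes the final construction provably valid rather than numerically plausible.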
##### Extending the Construction to 604 Spheres\.
After establishing the construction for $n=594$, agents extended this approach to larger numbers $n\geq 595$\. The combination of the surrogate loss function and, more critically, the integer\-snapping technique allowed us to push the construction to $n=600$ with relative ease, at which point progress stalled\. To understand this limitation, the agent analyzed all constructions for $594\leq n\leq 600$ and found that they share a common set of $496$ vectors\. These vectors form a highly structured backbone\. This observation reveals a strong underlying geometric structure and suggests that further improvements may be achievable within integral constructions\. Motivated by this, the agent explored extensions in a larger algebraic space, leading to the discovery of a new construction with $n=604$\. The mathematical details of this construction are provided in Appendix[B](https://arxiv.org/html/2606.10402#A2)\.
\(a\) Discussion between Alpha Omega Agents\[[13](https://arxiv.org/html/2606.10402#bib.bib19)\] and Chronos on the kissing number problem\.
\(b\) Discussion between JSAgent\[[23](https://arxiv.org/html/2606.10402#bib.bib20)\] and ClaudeExplorer\[[10](https://arxiv.org/html/2606.10402#bib.bib21)\] on the second autocorrelation problem\.
Figure 3: Discussions on the EinsteinArena platform exemplify how agents ask questions and build off each other’s ideas\.
Figure 4: Conversation topic distribution in the kissing number problem\. Agents discuss a wide variety of topics on the platform, from specific methodological approaches \(e\.g\., structural/lattice decoding and micro\-perturbation refinement\) to announcing best scores and summarizing previous approaches\. Numbers indicate the percentage of all posts in each category, with absolute counts shown in parentheses\.
Figure 5: Solution lineage of the kissing number problem\. Arrows denote parental lineage, which is determined by computing similarity between feature vectors representing solution submissions\. Details on methodologies are provided in Appendix[C](https://arxiv.org/html/2606.10402#A3)\.
##### Collaborative Reasoning and Discussion Patterns\.
Agents on the platform engage in a wide variety of discussions, ranging from methodological analysis to synthesizing and revisiting previous approaches\. Figure[3\(a\)](https://arxiv.org/html/2606.10402#S3.F3.sf1)shows agents asking each other questions about problem\-solving approaches for the kissing number in dimension 11 problem, focused on improving the geometry of the current best solution\. In addition, Figure[4](https://arxiv.org/html/2606.10402#S3.F4)shows the breakdown of discussion topics for the kissing number in dimension 11\. A plurality of conversations focus on problem\-specific reasoning strategies; specifically, agents frequently discuss structural/lattice decoding \(34%\), which refers to interpreting a numerical kissing\-number configuration as a geometric object\. Instead of solely treating the submission as hundreds of floating\-point vectors, these posts look for hidden structure: repeated distance patterns, contact graphs, symmetry, integer\-like coordinates, shells, or resemblance to known lattice constructions\. Other discussion topics broadly reflect communicating and organizing progress, e\.g\., broadcasting discoveries of new solution basins/optima and revisiting whether older approaches may be relevant in light of newer discoveries\. The details of how the taxonomy of topics was created and how topic distribution statistics were calculated are provided in Appendix[D](https://arxiv.org/html/2606.10402#A4)\.
##### Search Dynamics and Solution Lineages\.
Figure[5](https://arxiv.org/html/2606.10402#S3.F5)summarizes the progression of solution submissions to the kissing number problem as a sequence of search regimes\. Early submissions by the `CHRONOS` agent incrementally refine one another by reducing the overlap penalty but remain far from feasibility\. The `Gradient` agent discovers a new basin that is long\-lived \(0\.156 overlap penalty\), with multiple agents including `CHRONOS` contributing small improvements within the same broad geometry\. The major transition is the `alpha_omega_agents` submission that jumps to a new basin with score 0\.0119, breaking the plateau rather than simply polishing it\. Subsequent `CHRONOS`, `alpha_omega_agents`, and `KawaiiCorgi` submissions preserve the shared \(17,088\)\-pair active\-set topology of this new basin while driving the residual violation down by many orders of magnitude\. The final `KawaiiCorgi` submission reduces the penalty to zero; the details of this submission and of the solution for $n=604$ are described earlier in this section\. Details on lineage construction and interpretation are provided in Appendix[C](https://arxiv.org/html/2606.10402#A3)\.
## 4 Case Study II: the second autocorrelation inequality
Our second case study focuses on the second autocorrelation inequality, a critical problem at the intersection of additive combinatorics and harmonic analysis\.
Figure 6: Solution lineage for the second autocorrelation inequality problem\. Arrows denote parental lineage, which is determined by computing similarity between feature vectors representing solution submissions\. Details on methodologies are provided in Appendix[C](https://arxiv.org/html/2606.10402#A3)\.
##### Problem Overview\.
The autocorrelation measures the overlap between a set or function and a shifted copy of itself, often revealing structure that is not apparent from the original object alone\. In particular, autocorrelations can encode the distribution of pairwise differences in a set and are also closely connected to Fourier magnitudes of a function\[[25](https://arxiv.org/html/2606.10402#bib.bib31),[11](https://arxiv.org/html/2606.10402#bib.bib32)\]\. It plays a central role in many areas of mathematics and the sciences\[[17](https://arxiv.org/html/2606.10402#bib.bib1),[2](https://arxiv.org/html/2606.10402#bib.bib30)\], and the second autocorrelation inequality is part of a broader sequence of extremal autocorrelation problems that study the limits of how concentrated or dispersed an autocorrelation can be\[[17](https://arxiv.org/html/2606.10402#bib.bib1)\]\.
The second autocorrelation inequality asks for the optimal constant $C>0$ such that every non\-negative function $f:\mathbb{R}\to\mathbb{R}_{\geq 0}$ satisfies
$$\|f\star f\|_{2}^{2}\leq C\,\|f\star f\|_{1}\,\|f\star f\|_{\infty},$$
where $f\star f$ denotes the autoconvolution of $f$\. Since optimization over all integrable functions is not feasible, we focus on a set of step functions and consider a discretized formulation, as suggested in previous studies of autocorrelation inequalities\[[17](https://arxiv.org/html/2606.10402#bib.bib1),[35](https://arxiv.org/html/2606.10402#bib.bib3)\]\. This discretization transforms the problem into an optimization problem over a finite\-dimensional vector\. Throughout this section, we use ‘interval’ to denote the dimensionality of a vector\. A detailed description is provided in Appendix[A\.4](https://arxiv.org/html/2606.10402#A1.SS4)\. AlphaEvolve showed that $C\geq 0.9610$, and EinsteinArena agents improved the best known lower bound to $0.9626$\.
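Any single admissible $f$ certifies a lower bound on $C$, because the ratio $\|f\star f\|_{2}^{2}/(\|f\star f\|_{1}\|f\star f\|_{\infty})$ can never exceed $C$. The sketch below evaluates this ratio for a step function on equal-width intervals. It assumes the step function is given as a list of non-negative heights, and it uses the fact that the autoconvolution of such a function is piecewise linear, so all three norms can be computed exactly from its values at the knots. It is an illustration, not the platform's verifier, and the function name is hypothetical.

```python
def autoconvolution_ratio(heights):
    """Return ||f*f||_2^2 / (||f*f||_1 * ||f*f||_inf) for a step function f.

    `heights` are the non-negative values of f on consecutive equal-width
    intervals. The interval width cancels in the ratio, so width 1 is used.
    """
    n = len(heights)
    if n == 0 or any(h < 0 for h in heights) or sum(heights) == 0:
        raise ValueError("need a non-negative, non-zero step function")
    # conv[k] = sum of a_i * a_j over i + j = k (direct O(n^2) convolution).
    conv = [0.0] * (2 * n - 1)
    for i, a in enumerate(heights):
        for j, b in enumerate(heights):
            conv[i + j] += a * b
    # f*f is piecewise linear with these values at t = 0, 1, ..., 2n.
    knots = [0.0] + conv + [0.0]
    # Exact integral of a squared linear segment on a unit interval.
    l2_squared = sum(
        (y0 * y0 + y0 * y1 + y1 * y1) / 3.0 for y0, y1 in zip(knots, knots[1:])
    )
    l1 = sum(knots)    # trapezoid rule is exact here; both end values are 0
    linf = max(knots)  # a piecewise-linear function peaks at a knot
    return l2_squared / (l1 * linf)


indicator_ratio = autoconvolution_ratio([1.0])        # triangle autoconvolution
two_bump_ratio = autoconvolution_ratio([1.0, 0.0, 1.0])
```

For the indicator of an interval the autoconvolution is a triangle and the ratio is $2/3$, far below the bounds reported above, which shows how much structure the optimized step functions must carry. The direct convolution costs $O(n^{2})$; at resolutions such as $10^{5}$ intervals an FFT-based convolution would be the natural replacement.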
##### Collaborative Reasoning and Methodological Development\.
Similar to the kissing number problem, agents working on the second autocorrelation inequality built on one another’s solutions and synthesized prior approaches\. In particular, Dinkelbach optimization, which iteratively maximizes $\|f\star f\|_{2}^{2}-\lambda\|f\star f\|_{1}\|f\star f\|_{\infty}$ while updating the hyperparameter $\lambda$ at each step\[[4](https://arxiv.org/html/2606.10402#bib.bib33)\], emerged as a key methodology, with multiple agents reporting improved results \(see Figure[3\(b\)](https://arxiv.org/html/2606.10402#S3.F3.sf2)\)\. In addition, complementary techniques such as simulated annealing contributed to the best\-performing solutions on the platform\[[10](https://arxiv.org/html/2606.10402#bib.bib21),[23](https://arxiv.org/html/2606.10402#bib.bib20)\]\. Rather than arising from a single breakthrough, progress emerged through the collective refinement of these techniques, with agents identifying their limitations, adapting them to new settings, and combining them with complementary ideas\.
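The Dinkelbach scheme replaces the maximization of a ratio $N(x)/D(x)$ with a sequence of subproblems of the form $\max_x N(x)-\lambda D(x)$, where $\lambda$ is reset to the current ratio after each step. The sketch below shows the iteration on a deliberately simple toy problem with a finite candidate set, so that the inner maximization is exact. The agents' inner solvers are not described here, and all names in the code are hypothetical.

```python
def dinkelbach(candidates, numerator, denominator, tol=1e-12, max_iter=100):
    """Maximize numerator(x) / denominator(x) over a finite candidate list.

    Assumes denominator(x) > 0 for every candidate.
    """
    x = candidates[0]
    lam = numerator(x) / denominator(x)
    for _ in range(max_iter):
        # Inner step: maximize the parametric objective N(x) - lam * D(x).
        best = max(candidates, key=lambda c: numerator(c) - lam * denominator(c))
        gap = numerator(best) - lam * denominator(best)
        if gap <= tol:
            break  # no candidate has a ratio above lam, so lam is optimal
        x = best
        lam = numerator(x) / denominator(x)  # outer step: update lambda
    return x, lam


def toy_numerator(x):
    return x + 1.0


def toy_denominator(x):
    return x * x + 1.0


grid = [i / 100 for i in range(301)]
best_x, best_ratio = dinkelbach(grid, toy_numerator, toy_denominator)
```

In the autocorrelation setting, $N$ would be $\|f\star f\|_{2}^{2}$ and $D$ would be $\|f\star f\|_{1}\|f\star f\|_{\infty}$, with the inner maximization carried out approximately by a numerical optimizer rather than by enumeration.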
##### Search Dynamics and Solution Lineages\.
Figure[6](https://arxiv.org/html/2606.10402#S4.F6)summarizes the progression of solution submissions to the second autocorrelation problem as a sequence of increasing discretization resolution \(i\.e\., the number of intervals used to represent the step function\) and local refinement\. The initial family consists of previous SOTA results from AlphaEvolve\[[17](https://arxiv.org/html/2606.10402#bib.bib1)\]and TTT\-Discover\[[35](https://arxiv.org/html/2606.10402#bib.bib3)\], both represented using $5\times 10^{4}$ intervals\. The Together\-AI agent first improved upon these solutions by doubling the resolution to $10^{5}$ intervals\. Subsequent submissions by `OpusMathAgent`, `CHRONOS`, `ClaudeExplorer`, and `JSAgent` refined this high\-performing solution family\. In particular, `JSAgent` achieved the strongest $10^{5}$\-interval solution at $0.962214$\. The final breakthrough came from `ClaudeExplorer`, which increased the resolution to $4\times 10^{5}$ intervals and refined the previous solution to the best score of $0.962643$\. Details on lineage construction and interpretation are provided in Appendix[C](https://arxiv.org/html/2606.10402#A3)\.
## 5 Related Work
##### AI for scientific discovery\.
A growing body of work applies language models and evolutionary search to open research problems\[[14](https://arxiv.org/html/2606.10402#bib.bib14),[17](https://arxiv.org/html/2606.10402#bib.bib1),[35](https://arxiv.org/html/2606.10402#bib.bib3),[24](https://arxiv.org/html/2606.10402#bib.bib2),[12](https://arxiv.org/html/2606.10402#bib.bib13),[32](https://arxiv.org/html/2606.10402#bib.bib11),[16](https://arxiv.org/html/2606.10402#bib.bib10)\]\. AlphaEvolve\[[17](https://arxiv.org/html/2606.10402#bib.bib1)\]uses a Gemini\-based agent to evolve programs that improve known solutions in mathematics and algorithm design, producing new results on problems including matrix multiplication and the kissing number\. TTT\-Discover\[[35](https://arxiv.org/html/2606.10402#bib.bib3)\]extends this by performing reinforcement learning at test time, allowing the model to continue training on a single target problem rather than using a frozen policy\. Virtual Lab\[[24](https://arxiv.org/html/2606.10402#bib.bib2)\]organizes multiple agents into a simulated research group to address biology problems\. These systems share a common structure: a single run, orchestrated by a fixed pipeline, produces a candidate that is evaluated privately\. EinsteinArena differs in that the platform itself is shared state: any agent can observe the current frontier, download prior solutions, and continue where others left off\.
##### Multi\-agent collaboration\.
Many prior multi-agent works focus on homogeneous teams where each agent is a copy of the same underlying base model \[[5](https://arxiv.org/html/2606.10402#bib.bib15),[24](https://arxiv.org/html/2606.10402#bib.bib2),[37](https://arxiv.org/html/2606.10402#bib.bib25)\], require specifying fixed workflow or agent interaction patterns in advance (including fixed role decomposition) \[[7](https://arxiv.org/html/2606.10402#bib.bib29),[24](https://arxiv.org/html/2606.10402#bib.bib2),[28](https://arxiv.org/html/2606.10402#bib.bib9),[27](https://arxiv.org/html/2606.10402#bib.bib12),[15](https://arxiv.org/html/2606.10402#bib.bib16),[20](https://arxiv.org/html/2606.10402#bib.bib8)\], and/or view agents as independent execution units without deliberative dynamics \[[33](https://arxiv.org/html/2606.10402#bib.bib26),[38](https://arxiv.org/html/2606.10402#bib.bib27),[36](https://arxiv.org/html/2606.10402#bib.bib28)\]. These works also typically involve a small number of agents interacting in a closed loop within a single session.
In EinsteinArena, all of these axes are flexible. Specifically, by allowing users to specify their own agents, the platform enables teams of heterogeneous agents powered by varied underlying base models to collaborate, so that model-specific knowledge or capabilities can contribute to problem solving. By letting agents choose when and how they interact with the platform (e.g., some agents may choose to only submit solutions, whereas others may actively engage in discussion), EinsteinArena allows for emergent coordination mechanisms. This removes the need to specify a fixed agent collaboration pattern for a discovery problem, where the optimal collaboration structure is usually unknown a priori. Finally, agents can engage in deliberation with other agents to refine existing problem-solving approaches or generate new ideas, as shown in Figure [3](https://arxiv.org/html/2606.10402#S3.F3).
A few recent works have moved further in this direction. CORAL \[[21](https://arxiv.org/html/2606.10402#bib.bib4)\] coordinates multiple agents through shared persistent memory, but it does so within a single orchestrated run rather than on a shared public platform, and it uses homogeneous agent teams. AgentRxiv \[[22](https://arxiv.org/html/2606.10402#bib.bib17)\] allows independent agents to share research reports through a centralized preprint server, showing possible improvements on benchmarks. EinsteinArena similarly studies collaboration at platform scale with verifiable artifacts: many independent agents, operating asynchronously over days, with a shared record of prior attempts. Platforms like ClawdLab \[[30](https://arxiv.org/html/2606.10402#bib.bib38)\] show that multi-agent scientific collaboration can extend beyond domains with exact verification; EinsteinArena instead grounds coordination in a shared, deterministic verifier.
## 6 Discussion
EinsteinArena shows that agents, given the right infrastructure, can autonomously make progress on open problems\. Effective collaboration, as in human research, requires exposing intermediate artifacts\. Partial constructions, failed attempts, verifier issues, and short explanations can all become useful starting points for later agents\.
Verification is a critical component and requires ongoing maintenance. We frequently found it essential to keep strengthening our verifier to prevent invalid solutions, numerical instabilities, and overflow issues. For example, the verifier for the kissing number problem had to be revised after its launch because the required numerical precision exceeded the limits of a standard double-precision pipeline. We ultimately upgraded it to 80-digit precision using Python’s decimal module. In addition, because agents optimize aggressively against the scoring function, verifiers must be public and reproducible to ensure transparency.
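The precision issue can be illustrated with a minimal sketch. It is not the platform's verifier: the helper name `decimal_distance`, the two-dimensional points, and the perturbation of `1e-30` are hypothetical choices made only to show how a double-precision check can accept a pair of points that an 80-digit check rejects.

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 80  # 80 significant digits, matching the precision quoted above


def decimal_distance(p, q):
    """Euclidean distance between two points given as sequences of Decimal."""
    return sum((a - b) ** 2 for a, b in zip(p, q)).sqrt()


# Hypothetical pair of points whose true distance is 2 - 1e-30 (slightly too close).
eps = Decimal("1e-30")
p = (Decimal(2), Decimal(0))
q = (eps, Decimal(0))

# Same computation in double precision and in 80-digit decimal arithmetic.
float_dist = math.dist([float(c) for c in p], [float(c) for c in q])
exact_dist = decimal_distance(p, q)

print("double precision accepts the pair:", float_dist >= 2.0)
print("80-digit arithmetic accepts the pair:", exact_dist >= 2)
```

In double precision the perturbation is lost when it is subtracted from 2, so the violation is invisible; the decimal computation keeps it.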
Our focus so far is on mathematical problems with objective and relatively easy\-to\-compute verifiers\. This choice lets us study collaboration under controlled evaluation, but it does not yet establish how well the same platform design will transfer to domains such as formal proof, algorithm design, or computational biology\[[8](https://arxiv.org/html/2606.10402#bib.bib39),[31](https://arxiv.org/html/2606.10402#bib.bib37)\]\.
There are also open questions about incentives\. A public leaderboard may improve coordination, but it may also bias agents toward short\-horizon score chasing rather than pursuing directions that are promising but slow to yield verified improvements\. Likewise, open discussion is useful only if the traces are informative enough to reuse — a thread full of failed attempts without explanation is noise rather than signal\. These are empirical questions, and EinsteinArena is intended to make them evident rather than speculative\.
A subtler question concerns the relationship between competition and collaboration\. The platform rewards agents for beating one another’s scores, but the kissing number result required agents to share information that helped competitors\. This tension is not unique to AI systems — it mirrors incentive structures in human research communities — but it may manifest differently when the agents are optimizing explicitly for leaderboard position\.
Even with these limitations, the EinsteinArena results suggest that open multi\-agent work should be studied directly rather than approximated through isolated benchmarks\. The platform produced new mathematical advances within a short period, and the largest gains emerged from interaction between agents rather than from a single unusually strong run\. More broadly, our work demonstrates a vision of the platform\-as\-harness: while prior research on AI agents has focused on creating customized, task\-specific harnesses\[[29](https://arxiv.org/html/2606.10402#bib.bib36),[26](https://arxiv.org/html/2606.10402#bib.bib34),[18](https://arxiv.org/html/2606.10402#bib.bib35)\], we argue that a shared platform is a more flexible substrate that allows multiple heterogeneous agents to collaborate, compete, and build on each other’s work without requiring bespoke scaffolding for each new problem\.
## References
- \[1\]\(1980\)Binary codes with a minimum distance of four \(corresp\.\)\.IEEE Trans\. Inf\. Theory26,pp\. 738–742\.External Links:[Link](https://api.semanticscholar.org/CorpusID:40030299)Cited by:[§1](https://arxiv.org/html/2606.10402#S1.p5.1)\.
- \[2\]C\. Boyer and Z\. K\. Li\(2026\)An improved example for an autoconvolution inequality\.Experimental Mathematics,pp\. 1–7\.Cited by:[§4](https://arxiv.org/html/2606.10402#S4.SS0.SSS0.Px1.p1.1)\.
- \[3\]J\. H\. Conway and N\. J\. A\. Sloane\(1988\)Sphere packings and kissing numbers\.InSphere Packings, Lattices and Groups,pp\. 1–30\.External Links:ISBN 978\-1\-4757\-2016\-7,[Document](https://dx.doi.org/10.1007/978-1-4757-2016-7%5F1),[Link](https://doi.org/10.1007/978-1-4757-2016-7_1)Cited by:[footnote 1](https://arxiv.org/html/2606.10402#footnote1)\.
- \[4\]W\. Dinkelbach\(1967\)On nonlinear fractional programming\.Management science13\(7\),pp\. 492–498\.Cited by:[§4](https://arxiv.org/html/2606.10402#S4.SS0.SSS0.Px2.p1.2)\.
- \[5\]Y\. Du, S\. Li, A\. Torralba, J\. B\. Tenenbaum, and I\. Mordatch\(2024\)Improving factuality and reasoning in language models through multiagent debate\.InForty\-first international conference on machine learning,Cited by:[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px2.p1.1)\.
- \[6\]M\. Ganzhinov\(2025\)Highly symmetric lines\.Linear Algebra and its Applications722,pp\. 12–37\.Cited by:[Figure 2](https://arxiv.org/html/2606.10402#S3.F2),[Figure 2](https://arxiv.org/html/2606.10402#S3.F2.3.2),[§3](https://arxiv.org/html/2606.10402#S3.SS0.SSS0.Px1.p1.12),[footnote 5](https://arxiv.org/html/2606.10402#footnote5)\.
- \[7\]S\. Hong, M\. Zhuge, J\. Chen, X\. Zheng, Y\. Cheng, C\. Zhang, J\. Wang, Z\. Wang, S\. K\. S\. Yau, Z\. Lin, L\. Zhou, C\. Ran, L\. Xiao, C\. Wu, and J\. Schmidhuber\(2024\)MetaGPT: meta programming for a multi\-agent collaborative framework\.External Links:2308\.00352,[Link](https://arxiv.org/abs/2308.00352)Cited by:[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px2.p1.1)\.
- \[8\]T\. Hubert, R\. Mehta, L\. Sartran, M\. Z\. Horváth, G\. Žužić, E\. Wieser, A\. Huang, J\. Schrittwieser, Y\. Schroecker, H\. Masoom,et al\.\(2025\)Olympiad\-level formal mathematical reasoning with reinforcement learning\.Nature\.Cited by:[§6](https://arxiv.org/html/2606.10402#S6.p3.1)\.
- \[9\]H\. Inan, K\. Upasani, J\. Chi, R\. Rungta, K\. Iyer, Y\. Mao, M\. Tontchev, Q\. Hu, B\. Fuller, D\. Testuggine,et al\.\(2023\)Llama guard: llm\-based input\-output safeguard for human\-ai conversations\.arXiv preprint arXiv:2312\.06674\.Cited by:[§2\.6](https://arxiv.org/html/2606.10402#S2.SS6.p1.1)\.
- \[10\]J\. Kang and ClaudeExplorer\(2026\)State\-of\-the\-art solutions for the second autocorrelation inequality\.Note:Einstein ArenaExternal Links:[Link](https://github.com/justinkang221/second-autocorrelation-inequality)Cited by:[3\(b\)](https://arxiv.org/html/2606.10402#S3.F3.sf2),[3\(b\)](https://arxiv.org/html/2606.10402#S3.F3.sf2.5.2),[§4](https://arxiv.org/html/2606.10402#S4.SS0.SSS0.Px2.p1.2)\.
- \[11\]Y\. Katznelson\(2004\)An introduction to harmonic analysis\.Cambridge University Press\.Cited by:[§4](https://arxiv.org/html/2606.10402#S4.SS0.SSS0.Px1.p1.1)\.
- \[12\]R\. T\. Lange, Y\. Imajuku, and E\. Cetin\(2025\)Shinkaevolve: towards open\-ended and sample\-efficient program evolution\.arXiv preprint arXiv:2509\.19349\.Cited by:[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px1.p1.1)\.
- \[13\]W\. Lim\(2026\)Alpha omega agents\.GitHub\.External Links:[Link](https://github.com/quasar17/Alpha_Omega_Agents)Cited by:[3\(a\)](https://arxiv.org/html/2606.10402#S3.F3.sf1),[3\(a\)](https://arxiv.org/html/2606.10402#S3.F3.sf1.5.2),[§3](https://arxiv.org/html/2606.10402#S3.SS0.SSS0.Px2.p1.1)\.
- \[14\]C\. Lu, C\. Lu, R\. T\. Lange, J\. Foerster, J\. Clune, and D\. Ha\(2024\)The ai scientist: towards fully automated open\-ended scientific discovery\.arXiv preprint arXiv:2408\.06292\.Cited by:[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px1.p1.1)\.
- \[15\]A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang,et al\.\(2023\)Self\-refine: iterative refinement with self\-feedback\.Advances in neural information processing systems36,pp\. 46534–46594\.Cited by:[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px2.p1.1)\.
- \[16\]L\. Mitchener, A\. Yiu, B\. Chang, M\. Bourdenx, T\. Nadolski, A\. Sulovari, E\. C\. Landsness, D\. L\. Barabasi, S\. Narayanan, N\. Evans,et al\.\(2025\)Kosmos: an ai scientist for autonomous discovery\.arXiv preprint arXiv:2511\.02824\.Cited by:[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px1.p1.1)\.
- \[17\]A\. Novikov, N\. Vũ, M\. Eisenberger, E\. Dupont, P\. Huang, A\. Z\. Wagner, S\. Shirobokov, B\. Kozlovskii, F\. J\. R\. Ruiz, A\. Mehrabian, M\. P\. Kumar, A\. See, S\. Chaudhuri, G\. Holland, A\. Davies, S\. Nowozin, P\. Kohli, and M\. Balog\(2025\)AlphaEvolve: a coding agent for scientific and algorithmic discovery\.arXiv preprint arXiv:2506\.13131\.Cited by:[§A\.1](https://arxiv.org/html/2606.10402#A1.SS1.p1.6),[§1](https://arxiv.org/html/2606.10402#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.10402#S2.SS2.p1.1),[§2\.3](https://arxiv.org/html/2606.10402#S2.SS3.p2.1),[Table 1](https://arxiv.org/html/2606.10402#S2.T1),[Table 1](https://arxiv.org/html/2606.10402#S2.T1.12.2),[§3](https://arxiv.org/html/2606.10402#S3.SS0.SSS0.Px1.p1.12),[§4](https://arxiv.org/html/2606.10402#S4.SS0.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.10402#S4.SS0.SSS0.Px1.p2.6),[§4](https://arxiv.org/html/2606.10402#S4.SS0.SSS0.Px3.p1.5),[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px1.p1.1)\.
- \[18\]A\. Ospanov, F\. Farnia, and R\. Yousefzadeh\(2026\)APOLLO: automated LLM and lean collaboration for advanced formal reasoning\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[§6](https://arxiv.org/html/2606.10402#S6.p6.1)\.
- \[19\]C\. C\. Paige and M\. A\. Saunders\(1982\)LSQR: an algorithm for sparse linear equations and sparse least squares\.ACM Transactions on Mathematical Software \(TOMS\)8\(1\),pp\. 43–71\.Cited by:[§3](https://arxiv.org/html/2606.10402#S3.SS0.SSS0.Px2.p1.4)\.
- \[20\]C\. Qian, W\. Liu, H\. Liu, N\. Chen, Y\. Dang, J\. Li, C\. Yang, W\. Chen, Y\. Su, X\. Cong, J\. Xu, D\. Li, Z\. Liu, and M\. Sun\(2024\-08\)ChatDev: communicative agents for software development\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 15174–15186\.External Links:[Link](https://aclanthology.org/2024.acl-long.810/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.810)Cited by:[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px2.p1.1)\.
- \[21\]A\. Qu, H\. Zheng, Z\. Zhou, Y\. Yan, Y\. Tang, S\. Y\. Ong, F\. Hong, K\. Zhou, C\. Jiang, M\. Kong,et al\.\(2026\)Coral: towards autonomous multi\-agent evolution for open\-ended discovery\.arXiv preprint arXiv:2604\.01658\.Cited by:[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px2.p3.1)\.
- \[22\]S\. Schmidgall and M\. Moor\(2025\)Agentrxiv: towards collaborative autonomous research\.arXiv preprint arXiv:2503\.18102\.Cited by:[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px2.p3.1)\.
- \[23\]J\. Sung\(2026\)JSAgent: an ai agent for hard mathematical optimization\.GitHub\.External Links:[Link](https://github.com/jmsung/einstein)Cited by:[3\(b\)](https://arxiv.org/html/2606.10402#S3.F3.sf2),[3\(b\)](https://arxiv.org/html/2606.10402#S3.F3.sf2.5.2),[§3](https://arxiv.org/html/2606.10402#S3.SS0.SSS0.Px2.p1.1),[§4](https://arxiv.org/html/2606.10402#S4.SS0.SSS0.Px2.p1.2)\.
- \[24\]K\. Swanson, W\. Wu, N\. L\. Bulaong, J\. E\. Pak, and J\. Y\. Zou\(2025\)The virtual lab of AI agents designs new SARS\-CoV\-2 nanobodies\.Nature646,pp\. 716–723\.External Links:[Document](https://dx.doi.org/10.1038/s41586-025-09442-9)Cited by:[§1](https://arxiv.org/html/2606.10402#S1.p3.1),[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px2.p1.1)\.
- \[25\]T\. Tao and V\. H\. Vu\(2006\)Additive combinatorics\.Vol\.105,Cambridge University Press\.Cited by:[§4](https://arxiv.org/html/2606.10402#S4.SS0.SSS0.Px1.p1.1)\.
- \[26\]E\. Toledo, K\. Hambardzumyan, M\. Josifoski, R\. Hazra, N\. Baldwin, A\. Audran\-Reiss, M\. Kuchnik, D\. Magka, M\. Jiang, A\. M\. Lupidi, A\. Lupu, R\. Raileanu, T\. Shavrina, K\. Niu, J\. Gagnon\-Audet, M\. Shvartsman, S\. Sodhani, A\. H\. Miller, A\. Charnalia, D\. Dunfield, C\. Wu, P\. Stenetorp, N\. Cancedda, J\. N\. Foerster, and Y\. Bachrach\(2026\)AI research agents for machine learning: search, exploration, and generalization in MLE\-bench\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[§6](https://arxiv.org/html/2606.10402#S6.p6.1)\.
- \[27\]K\. Tran, D\. Dao, M\. Nguyen, Q\. Pham, B\. O’Sullivan, and H\. D\. Nguyen\(2025\)Multi\-agent collaboration mechanisms: a survey of llms\.arXiv preprint arXiv:2501\.06322\.Cited by:[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px2.p1.1)\.
- \[28\]J\. Wang, J\. Wang, B\. Athiwaratkun, C\. Zhang, and J\. Y\. Zou\(2025\)Mixture\-of\-agents enhances large language model capabilities\.InInternational Conference on Learning Representations,Cited by:[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px2.p1.1)\.
- \[29\]A\. Wei, T\. Sun, Y\. Seenichamy, H\. Song, A\. Ouyang, A\. Mirhoseini, K\. Wang, and A\. Aiken\(2025\)Astra: a multi\-agent system for gpu kernel performance optimization\.arXiv preprint arXiv:2509\.07506\.Cited by:[§6](https://arxiv.org/html/2606.10402#S6.p6.1)\.
- \[30\]L\. Weidener, M\. Brkić, P\. Lee, M\. Karlsson, K\. Noessler, and P\. Kohlhaas\(2026\)From agent\-only social networks to autonomous scientific research: lessons from openclaw and moltbook, and the architecture of clawdlab and beach\. science\.arXiv preprint arXiv:2602\.19810\.Cited by:[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px2.p3.1)\.
- \[31\]S\. Xu, Q\. Feng, L\. Qiao, H\. Wu, T\. Shen, Y\. Cheng, S\. Zheng, and S\. Sun\(2025\-12\)Benchmarking all\-atom biomolecular structure prediction with FoldBench\.Nature Communications\.External Links:ISSN 2041\-1723,[Link](https://doi.org/10.1038/s41467-025-67127-3),[Document](https://dx.doi.org/10.1038/s41467-025-67127-3)Cited by:[§6](https://arxiv.org/html/2606.10402#S6.p3.1)\.
- \[32\]Y\. Yamada, R\. T\. Lange, C\. Lu, S\. Hu, C\. Lu, J\. Foerster, J\. Clune, and D\. Ha\(2025\)The ai scientist\-v2: workshop\-level automated scientific discovery via agentic tree search\.arXiv preprint arXiv:2504\.08066\.Cited by:[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px1.p1.1)\.
- \[33\]Y\. Yang, H\. Chai, S\. Shao, Y\. Song, S\. Qi, R\. Rui, and W\. Zhang\(2025\)AgentNet: decentralized evolutionary coordination for llm\-based multi\-agent systems\.External Links:2504\.00587,[Link](https://arxiv.org/abs/2504.00587)Cited by:[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px2.p1.1)\.
- \[34\]H\. Ye, H\. Lin, J\. Tang, Y\. Luo, C\. Yang, C\. Su, R\. Thapa, R\. Yang, R\. Liu, Z\. Li,et al\.\(2026\)Evaluation\-driven scaling for scientific discovery\.arXiv preprint arXiv:2604\.19341\.Cited by:[footnote 4](https://arxiv.org/html/2606.10402#footnote4)\.
- \[35\]M\. Yuksekgonul, D\. Koceja, X\. Li, F\. Bianchi, J\. McCaleb, X\. Wang, J\. Kautz, Y\. Choi, J\. Zou, C\. Guestrin, and Y\. Sun\(2026\)Learning to discover at test time\.ICML\.Cited by:[§1](https://arxiv.org/html/2606.10402#S1.p3.1),[Table 1](https://arxiv.org/html/2606.10402#S2.T1),[Table 1](https://arxiv.org/html/2606.10402#S2.T1.12.2),[§4](https://arxiv.org/html/2606.10402#S4.SS0.SSS0.Px1.p2.6),[§4](https://arxiv.org/html/2606.10402#S4.SS0.SSS0.Px3.p1.5),[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px1.p1.1),[footnote 4](https://arxiv.org/html/2606.10402#footnote4)\.
- \[36\]J\. Zhang, J\. Xiang, Z\. Yu, F\. Teng, X\. Chen, J\. Chen, M\. Zhuge, X\. Cheng, S\. Hong, J\. Wang, B\. Zheng, B\. Liu, Y\. Luo, and C\. Wu\(2025\)AFlow: automating agentic workflow generation\.External Links:2410\.10762,[Link](https://arxiv.org/abs/2410.10762)Cited by:[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px2.p1.1)\.
- \[37\]W\. Zhao, M\. Yuksekgonul, S\. Wu, and J\. Zou\(2025\)SiriuS: self\-improving multi\-agent systems via bootstrapped reasoning\.External Links:2502\.04780,[Link](https://arxiv.org/abs/2502.04780)Cited by:[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px2.p1.1)\.
- \[38\]M\. Zhuge, W\. Wang, L\. Kirsch, F\. Faccio, D\. Khizbullin, and J\. Schmidhuber\(2024\)Language agents as optimizable graphs\.External Links:2402\.16823,[Link](https://arxiv.org/abs/2402.16823)Cited by:[§5](https://arxiv.org/html/2606.10402#S5.SS0.SSS0.Px2.p1.1)\.
## Appendix A Detailed Descriptions of Problems
### A.1 Kissing Number ($d=11$)
The kissing number in dimension $d\in\mathbb{N}$ asks for the maximum number of non-overlapping unit spheres in $\mathbb{R}^{d}$ that can simultaneously touch a central unit sphere. This can be formalized as the maximum cardinality of a set of points $\{x_{1},\dots,x_{N}\}\subset 2S^{d-1}:=\{x\in\mathbb{R}^{d}:\|x\|=2\}$ such that
$$\|x_{i}-x_{j}\|\geq 2\quad\text{for all }i\neq j.$$
Since this condition is binary (i.e., a construction is either valid or invalid), we instead use a continuous proxy to assess agents’ submissions, as suggested by AlphaEvolve \[[17](https://arxiv.org/html/2606.10402#bib.bib1)\]. In particular, for $n\in\mathbb{N}$, we compute the score $C(\mathbf{x})$ of a submission $\mathbf{x}=(x_{1},\dots,x_{n})\in(S^{d-1})^{n}$ by
$$C(\mathbf{x})=\sum_{i<j}\max\left(0,2-\left\|\frac{2x_{i}}{\|x_{i}\|}-\frac{2x_{j}}{\|x_{j}\|}\right\|\right).$$
In EinsteinArena, we initially considered $n=594$ and extended it to $n=604$. Smaller values of $C(\mathbf{x})$ correspond to better constructions, so the objective is to minimize $C(\mathbf{x})$. A valid construction $\mathbf{x}^{*}$ satisfies $C(\mathbf{x}^{*})=0$.
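As a rough illustration of the proxy score, the following sketch evaluates $C(\mathbf{x})$ in plain double precision on toy inputs in $d=2$. The function name `kissing_score` and the test configurations are assumptions made for illustration; the actual verifier works at much higher precision and in $d=11$.

```python
import math
from itertools import combinations


def kissing_score(points):
    """Sum over pairs i<j of max(0, 2 - || 2 x_i/||x_i|| - 2 x_j/||x_j|| ||)."""
    scaled = []
    for x in points:
        norm = math.sqrt(sum(c * c for c in x))
        scaled.append([2.0 * c / norm for c in x])  # rescale onto the radius-2 sphere
    return sum(max(0.0, 2.0 - math.dist(a, b)) for a, b in combinations(scaled, 2))


# Toy inputs in d = 2: six and seven equally spaced directions on the circle.
hexagon = [(math.cos(k * math.pi / 3), math.sin(k * math.pi / 3)) for k in range(6)]
heptagon = [(math.cos(2 * k * math.pi / 7), math.sin(2 * k * math.pi / 7)) for k in range(7)]

print(kissing_score(hexagon))   # zero up to floating-point rounding
print(kissing_score(heptagon))  # strictly positive: seven points cannot fit
```

The hexagon is a valid configuration whose neighboring points sit at distance exactly 2, so rounding alone can produce a tiny positive penalty; this is the kind of effect that motivates a high-precision verifier.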
### A.2 Erdős Minimum Overlap
Let $\mathcal{H}=\left\{h:[0,2]\to[0,1]:\int_{0}^{2}h(x)\,dx=1\right\}$, and define the zero extensions of $h\in\mathcal{H}$ by
$$\widetilde{h}(x)=h(x)\mathbf{1}_{[0,2]}(x),\qquad\widetilde{1-h}(x)=(1-h(x))\mathbf{1}_{[0,2]}(x).$$
Then, the problem asks to find $\inf_{h\in\mathcal{H}}C(h)$ where
$$C(h)=\sup_{s\in\mathbb{R}}\int_{\mathbb{R}}\widetilde{h}(x)\,\widetilde{1-h}(x+s)\,dx.$$
In EinsteinArena, we consider its discretized formulation. Let $\mathbf{h}=(h_{0},\ldots,h_{n-1})\in[0,1]^{n}$ be an $n$-dimensional vector satisfying the normalization condition $\sum_{i=0}^{n-1}h_{i}=n/2$ with $dx=2/n$. We define the score of $\mathbf{h}$ by
$$C_{\mathrm{dis}}(\mathbf{h})=dx\times\max_{-(n-1)\leq s\leq n-1}\sum_{\substack{0\leq i<n\\ 0\leq i+s<n}}h_{i}(1-h_{i+s}).$$
The score $C_{\mathrm{dis}}(\mathbf{h})$ is an upper bound of $\inf_{h\in\mathcal{H}}C(h)$, and smaller values of $C_{\mathrm{dis}}(\mathbf{h})$ yield better solutions. Thus, the objective is to minimize $C_{\mathrm{dis}}(\mathbf{h})$.
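The discretized score can be sketched directly from its definition. The function name `min_overlap_score` and the small test vectors are illustrative assumptions; the loop is quadratic in $n$ and is meant to mirror the formula rather than to be efficient.

```python
def min_overlap_score(h):
    """dx * max over shifts s of sum_i h_i * (1 - h_{i+s}), restricted to valid indices."""
    n = len(h)
    dx = 2.0 / n
    best = max(
        sum(h[i] * (1.0 - h[i + s]) for i in range(max(0, -s), min(n, n - s)))
        for s in range(-(n - 1), n)
    )
    return dx * best


# Both vectors satisfy the normalization sum(h) = n / 2 with n = 4.
constant = [0.5, 0.5, 0.5, 0.5]
step = [1.0, 1.0, 0.0, 0.0]

print(min_overlap_score(constant))  # 0.5
print(min_overlap_score(step))      # 1.0
```

The constant vector attains its maximum at shift $s=0$, while the step vector attains it at $s=2$, where the block of ones is aligned with the block of zeros.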
### A.3 First Autocorrelation Inequality
Let $\mathcal{F}$ be the set of all nonnegative integrable functions $f:\mathbb{R}\to\mathbb{R}_{\geq 0}$ whose support is contained in $[-1/4,1/4]$. The problem asks to find the largest constant $C$ satisfying
$$\sup_{-1/2\leq t\leq 1/2}\int_{\mathbb{R}}f(t-x)f(x)\,dx\geq C\left(\int_{-1/4}^{1/4}f(x)\,dx\right)^{2}$$
for all $f\in\mathcal{F}$. In EinsteinArena, we consider its discretized formulation. Let $\mathbf{v}=(v_{0},\ldots,v_{n-1})$ be a nonnegative vector representing a function $f\in\mathcal{F}$ with $dx=\frac{0.5}{n}$. We define the score of $\mathbf{v}$ by
$$C(\mathbf{v})=\frac{\max_{j}(\mathbf{v}*\mathbf{v})_{j}\times dx}{\left(\sum_{i=0}^{n-1}v_{i}\times dx\right)^{2}},$$
where $(\mathbf{v}*\mathbf{v})_{j}:=\sum_{k=\max(0,j-n+1)}^{\min(j,n-1)}v_{k}\,v_{j-k}$. The score $C(\mathbf{v})$ is an upper bound of $C$, and smaller values of $C(\mathbf{v})$ yield better solutions. Thus, the objective is to minimize $C(\mathbf{v})$.
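A minimal sketch of this score, assuming a direct quadratic-time autoconvolution, is given below. The names `autoconvolution` and `first_autocorrelation_score` are hypothetical; a practical implementation at $10^{5}$ intervals would likely use an FFT-based convolution instead.

```python
def autoconvolution(v):
    """Discrete autoconvolution (v * v)_j = sum_k v_k v_{j-k}, for j = 0..2n-2."""
    n = len(v)
    conv = [0.0] * (2 * n - 1)
    for i, a in enumerate(v):
        for j, b in enumerate(v):
            conv[i + j] += a * b
    return conv


def first_autocorrelation_score(v):
    """max_j (v*v)_j * dx divided by (sum_i v_i * dx)^2, with dx = 0.5 / n."""
    dx = 0.5 / len(v)
    return max(autoconvolution(v)) * dx / (sum(v) * dx) ** 2


# The indicator function of [-1/4, 1/4], sampled on n = 4 intervals.
indicator = [1.0, 1.0, 1.0, 1.0]
print(first_autocorrelation_score(indicator))  # 2.0
```

The value 2 for the indicator function is a simple baseline: any useful submission must score below it.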
### A.4 Second Autocorrelation Inequality
Let $\mathcal{F}$ denote the set of all nonnegative integrable functions $f:\mathbb{R}\to\mathbb{R}_{\geq 0}$. For $f\in\mathcal{F}$, define its autoconvolution by $(f\star f)(t):=\int_{\mathbb{R}}f(t-x)f(x)\,dx$. The problem asks to determine the optimal constant
$$C=\sup_{f\geq 0,\;f\star f\neq 0}\frac{\|f\star f\|_{2}^{2}}{\|f\star f\|_{1}\,\|f\star f\|_{\infty}}.$$
In EinsteinArena, we consider a discretized formulation. Let $\mathbf{v}=(v_{0},\ldots,v_{n-1})$ be a nonnegative vector and let $(\mathbf{v}*\mathbf{v})$ denote its discrete autoconvolution $(\mathbf{v}*\mathbf{v})_{j}:=\sum_{k=\max(0,j-n+1)}^{\min(j,n-1)}v_{k}\,v_{j-k}$, as in Appendix [A.3](https://arxiv.org/html/2606.10402#A1.SS3). We define the score of $\mathbf{v}$ by
$$C(\mathbf{v})=\frac{\|\mathbf{v}*\mathbf{v}\|_{2}^{2}}{\|\mathbf{v}*\mathbf{v}\|_{1}\,\|\mathbf{v}*\mathbf{v}\|_{\infty}}.$$
The score $C(\mathbf{v})$ is a lower bound of $C$, and larger values of $C(\mathbf{v})$ yield better solutions. Thus, the objective is to maximize $C(\mathbf{v})$.
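The ratio can be sketched by reading the three norms as plain vector norms of the discrete autoconvolution. This reading is an assumption: the text does not spell out how the norms are discretized, and the platform's verifier may compute them differently (for example, by integrating an interpolant of the autoconvolution), in which case the values below would not match. The function names are hypothetical.

```python
def autoconvolution(v):
    """Discrete autoconvolution (v * v)_j = sum_k v_k v_{j-k}, for j = 0..2n-2."""
    n = len(v)
    conv = [0.0] * (2 * n - 1)
    for i, a in enumerate(v):
        for j, b in enumerate(v):
            conv[i + j] += a * b
    return conv


def second_autocorrelation_score(v):
    """||v*v||_2^2 / (||v*v||_1 * ||v*v||_inf), using plain vector norms (assumption)."""
    conv = autoconvolution(v)
    l2_squared = sum(c * c for c in conv)
    l1 = sum(abs(c) for c in conv)
    linf = max(abs(c) for c in conv)
    return l2_squared / (l1 * linf)


# Two equal steps: v * v = [1, 2, 1], so the ratio is 6 / (4 * 2).
print(second_autocorrelation_score([1.0, 1.0]))  # 0.75
```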
### A.5 Third Autocorrelation Inequality
Let $\mathcal{F}$ be the set of all integrable functions $f:\mathbb{R}\to\mathbb{R}$ whose support is contained in $[-1/4,1/4]$. The problem asks to find the constant
$$C=\inf_{\int_{-1/4}^{1/4}f(x)\,dx\neq 0}\frac{\left|\sup_{-1/2\leq t\leq 1/2}\int_{\mathbb{R}}f(t-x)f(x)\,dx\right|}{\left(\int_{-1/4}^{1/4}f(x)\,dx\right)^{2}}.$$
In EinsteinArena, we consider a discretized formulation, as in the other autocorrelation problems. Let $\mathbf{v}=(v_{0},\ldots,v_{n-1})$ be a real vector representing $f$ on $[-1/4,1/4]$ with $dx=\frac{0.5}{n}$. We define the score by
$$C(\mathbf{v})=\frac{\left|\max_{j}(\mathbf{v}*\mathbf{v})_{j}\times dx\right|}{\left(\sum_{i=0}^{n-1}v_{i}\,dx\right)^{2}}.$$
The score $C(\mathbf{v})$ is an upper bound of $C$, and smaller values of $C(\mathbf{v})$ yield better solutions. Thus, the objective is to minimize $C(\mathbf{v})$.
### A.6 Flat Polynomials
For a coefficient vector $\mathbf{c}=(c_{0},\ldots,c_{69})\in\{-1,+1\}^{70}$, we define
$$g_{\mathbf{c}}(z)=c_{0}z^{69}+c_{1}z^{68}+\cdots+c_{68}z+c_{69},$$
and its flatness score as
$$C(\mathbf{c})=\frac{\max_{|z|=1}|g_{\mathbf{c}}(z)|}{\sqrt{71}}.$$
The problem asks to find $\inf_{\mathbf{c}\in\{-1,+1\}^{70}}C(\mathbf{c})$. In EinsteinArena, we approximate the optimal value $\inf_{\mathbf{c}\in\{-1,+1\}^{70}}C(\mathbf{c})$ by evaluating the maximum over $|z|=1$ on $10^{6}$ equally spaced points on the unit circle, instead of over the entire unit circle. That is, we compute
$$C_{\mathcal{S}}(\mathbf{c})=\frac{\max_{z\in\mathcal{S}}|g_{\mathbf{c}}(z)|}{\sqrt{71}},$$
where $\mathcal{S}=\{e^{i\frac{2\pi j}{10^{6}-1}}:j\in\{0,\dots,10^{6}-1\}\}$. The score $C_{\mathcal{S}}(\mathbf{c})$ is an upper bound of $\inf_{\mathbf{c}\in\{-1,+1\}^{70}}C(\mathbf{c})$, and smaller values of $C_{\mathcal{S}}(\mathbf{c})$ yield better solutions. Thus, the objective is to minimize $C_{\mathcal{S}}(\mathbf{c})$.
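The sampled score $C_{\mathcal{S}}(\mathbf{c})$ can be sketched as follows. The function name `flatness_score` and the `num_points` parameter are assumptions introduced so that the example can run on a coarse grid; the normalization $\sqrt{71}$ and the sample angles $2\pi j/(N-1)$ follow the definitions above.

```python
import cmath
import math


def flatness_score(c, num_points=10**6):
    """max over sampled unit-circle points of |g_c(z)|, divided by sqrt(71)."""
    best = 0.0
    for idx in range(num_points):
        angle = 2.0 * math.pi * idx / (num_points - 1)
        z = cmath.rect(1.0, angle)  # point e^{i * angle} on the unit circle
        acc = 0j
        for coef in c:  # Horner evaluation of c_0 z^69 + ... + c_69
            acc = acc * z + coef
        best = max(best, abs(acc))
    return best / math.sqrt(71)


# All-ones coefficients are far from flat: |g(1)| = 70 at z = 1.
all_ones = [1] * 70
print(flatness_score(all_ones, num_points=1001))
```

A pure-Python loop over $10^{6}$ points is slow; a vectorized or FFT-based evaluation would be the natural choice in practice.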
### A.7 Maximum/Minimum Distance Ratio ($n=16$)
For a list of $16$ points $P=(p_{1},\ldots,p_{16})\in\mathbb{R}^{16\times 2}$, we let
$$d_{\min}(P)=\min_{1\leq i<j\leq 16}\|p_{i}-p_{j}\|_{2},\qquad d_{\max}(P)=\max_{1\leq i<j\leq 16}\|p_{i}-p_{j}\|_{2}.$$
The problem asks to find $\inf_{P\in\mathbb{R}^{16\times 2}}\frac{d_{\max}(P)}{d_{\min}(P)}$. In EinsteinArena, we define the score $R(P)=\left(\frac{d_{\max}(P)}{d_{\min}(P)}\right)^{2}$ and seek a point set $P$ that minimizes $R(P)$.
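A short sketch of the score $R(P)$ follows. The function name `distance_ratio_score` is hypothetical, and the example uses four points rather than sixteen purely to keep the arithmetic easy to check by hand.

```python
import math
from itertools import combinations


def distance_ratio_score(points):
    """Squared ratio of the largest to the smallest pairwise Euclidean distance."""
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    return (max(dists) / min(dists)) ** 2


# Corners of the unit square: d_max = sqrt(2), d_min = 1, so R is 2 up to rounding.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(distance_ratio_score(square))
```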
### A.8 The Prime Number Theorem
Let $F\subset\mathbb{N}$ be a finite set and let $f:F\to\mathbb{R}$ be a finitely supported partial function such that $\sum_{k\in F}\frac{f(k)}{k}=0$ and $\Phi_{f}(x)\leq 1$ for all $x\geq 1$, where $\Phi_{f}(x)=\sum_{k\in F}f(k)\left\lfloor\frac{x}{k}\right\rfloor$. The score functional is
$$S(f)=-\sum_{k\in F}\frac{f(k)\log k}{k}.$$
In EinsteinArena, $|F|\leq 2000$, values are clipped to $[-10,10]$, $f(1)$ is adjusted to enforce the normalization $\sum_{k\in F}\frac{f(k)}{k}=0$, and the constraint $\Phi_{f}(x)\leq 1$ is checked by randomly drawing $10^{7}$ samples over
$$x\in\left[1,\,10\times\max_{k\in F}k\right].$$
Let $\pi(x)$ be the number of primes less than or equal to $x\in\mathbb{N}$. Then, the approximation in EinsteinArena can be seen as a lower bound of $\lim_{x\to\infty}\frac{\pi(x)}{x/\log x}$, and thus the objective is to maximize $S(f)$.
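The following sketch evaluates the constraint and the score on a two-term example, $f(1)=1$ and $f(2)=-2$, for which $\Phi_{f}(x)=\lfloor x\rfloor-2\lfloor x/2\rfloor\in\{0,1\}$ and $S(f)=\log 2$. The function names are hypothetical, and the constraint is checked on a deterministic grid here rather than with the random sampling described above.

```python
import math


def phi(f, x):
    """Phi_f(x) = sum_k f(k) * floor(x / k), with f given as a dict {k: f(k)}."""
    return sum(val * math.floor(x / k) for k, val in f.items())


def normalization(f):
    """sum_k f(k) / k, which must vanish for an admissible f."""
    return sum(val / k for k, val in f.items())


def score(f):
    """S(f) = -sum_k f(k) * log(k) / k."""
    return -sum(val * math.log(k) / k for k, val in f.items())


f = {1: 1.0, 2: -2.0}
x_max = 10 * max(f)  # upper end of the checked range, as in the text
grid = [1 + (x_max - 1) * i / 1999 for i in range(2000)]

print(normalization(f))                   # 0.0
print(all(phi(f, x) <= 1 for x in grid))  # True
print(score(f), math.log(2))              # both equal log 2
```

The resulting score $\log 2\approx 0.693$ is a weak bound; submissions with up to 2000 terms aim to push $S(f)$ closer to 1.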
### A.9 Circles Packing in a Square ($n=26$)
For $i\in\{1,\dots,26\}$, we let $c_{i}:=(x_{i},y_{i})$ be the center of the $i$-th circle in $[0,1]\times[0,1]$ and $r_{i}$ be its radius. The problem asks to find a list of $26$ triples $\{(x_{i},y_{i},r_{i})\in\mathbb{R}^{3}\}_{i=1}^{26}$ that maximizes
$$\sum_{i=1}^{26}r_{i}$$
such that for all $i\in\{1,\dots,26\}$, (i) $r_{i}\leq x_{i}\leq 1-r_{i}$, (ii) $r_{i}\leq y_{i}\leq 1-r_{i}$, (iii) $r_{i}>0$, and (iv) $\|c_{i}-c_{j}\|_{2}\geq r_{i}+r_{j}$ for all $1\leq i<j\leq 26$. That is, the circles must lie inside the unit square and must not overlap with one another.
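Constraints (i) through (iv) translate into a short validity check. The sketch below uses hypothetical function names and exact comparisons without any tolerance, which is an assumption; the example uses four circles instead of 26 so that the result is easy to verify.

```python
import math
from itertools import combinations


def is_valid_packing(circles):
    """Check constraints (i)-(iv) for a list of (x, y, r) triples in the unit square."""
    for x, y, r in circles:
        if r <= 0:  # (iii) positive radius
            return False
        if not (r <= x <= 1 - r and r <= y <= 1 - r):  # (i), (ii) inside the square
            return False
    return all(  # (iv) no two circles overlap
        math.dist((x1, y1), (x2, y2)) >= r1 + r2
        for (x1, y1, r1), (x2, y2, r2) in combinations(circles, 2)
    )


def packing_score(circles):
    """Objective: the sum of the radii."""
    return sum(r for _, _, r in circles)


# Four circles of radius 1/4 in a 2 x 2 grid, each touching its neighbors.
grid = [(0.25, 0.25, 0.25), (0.75, 0.25, 0.25), (0.25, 0.75, 0.25), (0.75, 0.75, 0.25)]
print(is_valid_packing(grid), packing_score(grid))        # True 1.0
print(is_valid_packing(grid + [(0.5, 0.5, 0.25)]))        # False: overlaps the others
```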
### A.10 Circles Packing in a Rectangle ($n=21$)
For $i\in\{1,\dots,21\}$, we let $c_{i}:=(x_{i},y_{i})$ be the center of the $i$-th circle in $\mathbb{R}^{2}$ and $r_{i}$ be its radius. The problem asks to find a list of $21$ triples $\{(x_{i},y_{i},r_{i})\in\mathbb{R}^{3}\}_{i=1}^{21}$ that maximizes
$$\sum_{i=1}^{21}r_{i}$$
such that for all $i\in\{1,\dots,21\}$, (i) $r_{i}>0$, (ii) $\|c_{i}-c_{j}\|_{2}\geq r_{i}+r_{j}$ for all $1\leq i<j\leq 21$, and (iii) $\left(\max_{i}(x_{i}+r_{i})-\min_{i}(x_{i}-r_{i})\right)+\left(\max_{i}(y_{i}+r_{i})-\min_{i}(y_{i}-r_{i})\right)\leq 2$. That is, the circles must not overlap with one another, and the width and height of their bounding rectangle must sum to at most 2.
### A.11 Tammes Problem ($n=50$)
We denote a list of $50$ points on a sphere by $P$, i.e., $P=(p_{1},\dots,p_{50})$ and $p_{i}\in S^{2}:=\{x\in\mathbb{R}^{3}:\|x\|_{2}=1\}$. The problem asks to maximize the minimum pairwise Euclidean distance $\max_{P}d_{\min}(P)$, where
$$d_{\min}(P):=\min_{1\leq i<j\leq 50}\|p_{i}-p_{j}\|_{2}.$$
In EinsteinArena, we seek to construct a lower bound of $\max_{P}d_{\min}(P)$.
### A.12 Edges vs. Triangles

For $0 \leq \rho \leq 1$, let $C(\rho)$ denote the largest quantity such that any graph on $n$ vertices and $(\rho+o(1))\binom{n}{2}$ edges has at least $(C(\rho)-o(1))\binom{n}{3}$ triangles. The problem asks for the value of $C(\rho)$.

In EinsteinArena, we implemented this problem as follows. We consider a matrix $W \in \mathbb{R}_{\geq 0}^{m\times 20}$ such that $\sum_{j=1}^{20} W_{ij}=1$ for all $i\in\{1,\dots,m\}$. For each row $W_i$, we define its edge-density and triangle-density coordinates by

$$\begin{aligned}
\rho(W_i) &= \left(\sum_{j=1}^{20} W_{ij}\right)^{2}-\sum_{j=1}^{20} W_{ij}^{2} = 1-\sum_{j=1}^{20} W_{ij}^{2},\\
\tau(W_i) &= \left(\sum_{j=1}^{20} W_{ij}\right)^{3}-3\left(\sum_{j=1}^{20} W_{ij}\right)\left(\sum_{j=1}^{20} W_{ij}^{2}\right)+2\sum_{j=1}^{20} W_{ij}^{3} = 6\sum_{1\leq j_1<j_2<j_3\leq 20} W_{ij_1}W_{ij_2}W_{ij_3}.
\end{aligned}$$

Let $(x_0,y_0),\ldots,(x_q,y_q)$ be the sorted unique list obtained from

$$(0,0),\quad (1,1),\quad \{(\rho(W_i),\tau(W_i)) : 1\leq i\leq m\},$$

sorted by increasing $x$. For each interval $[x_j,x_{j+1}]$, put $\Delta_j = x_{j+1}-x_j$. Define the verifier segment area $A_j$ by

$$A_j=\begin{cases}
y_j\Delta_j, & y_j>y_{j+1},\\[3pt]
\dfrac{(y_j+y_j+3\Delta_j)\Delta_j}{2}, & y_j\leq y_{j+1}\text{ and }y_j+3\Delta_j\leq y_{j+1},\\[9pt]
\dfrac{(y_j+y_{j+1})w_j}{2}+y_{j+1}(\Delta_j-w_j), & y_j\leq y_{j+1}\text{ and }y_j+3\Delta_j>y_{j+1},
\end{cases}$$

where $w_j=(y_{j+1}-y_j)/3$ in the last case. Let

$$A=\sum_{j=0}^{q-1}A_j,\qquad G=\max_{0\leq j<q}\Delta_j.$$

Then, we define the score $S=-(A+10G)$, and the objective is to maximize this score.
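The scoring rule above can be written out as a short sketch. All function names are hypothetical, row-sum validation is omitted, and ties are handled by deduplicating exactly equal $(x,y)$ pairs and sorting lexicographically, which is an assumption about what "sorted unique list" means.

```python
def densities(row):
    """Edge and triangle densities (rho, tau) of one row, via power sums."""
    p1 = sum(row)
    p2 = sum(w * w for w in row)
    p3 = sum(w ** 3 for w in row)
    return p1 * p1 - p2, p1 ** 3 - 3 * p1 * p2 + 2 * p3

def segment_area(y0, y1, dx):
    """Verifier segment area A_j for one interval of width dx."""
    if y0 > y1:
        return y0 * dx
    if y0 + 3 * dx <= y1:
        return (y0 + y0 + 3 * dx) * dx / 2
    w = (y1 - y0) / 3
    return (y0 + y1) * w / 2 + y1 * (dx - w)

def edges_triangles_score(W):
    """Score S = -(A + 10 G) for a list of rows W (each row has 20 weights)."""
    # Deduplicate exact pairs, then sort by x (ties broken by y: an assumption)
    pts = sorted({(0.0, 0.0), (1.0, 1.0), *(densities(row) for row in W)})
    segments = list(zip(pts, pts[1:]))
    area = sum(segment_area(y0, y1, x1 - x0) for (x0, y0), (x1, y1) in segments)
    gap = max(x1 - x0 for (x0, _), (x1, _) in segments)
    return -(area + 10 * gap)
```

With no rows at all, the only segment runs from $(0,0)$ to $(1,1)$, giving $A=5/6$, $G=1$, and $S=-(5/6+10)$; adding rows that shrink the largest gap improves the score.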
### A.13 Difference Bases

Let $B\subset\mathbb{Z}_{\geq 0}$ be a finite set of non-negative integers. Define the positive difference set

$$D(B)=\{b-b' : b,b'\in B,\ b>b'\}.$$

For such a set $B$, define

$$v(B)=\max\{v\in\mathbb{Z}_{\geq 1} : \{1,\ldots,v\}\subseteq D(B)\},$$

whenever this set is nonempty. The problem asks to minimize

$$\frac{|B|^{2}}{v(B)}$$

over finite sets $B\subset\mathbb{Z}_{\geq 0}$ for which $v(B)$ is defined. In EinsteinArena, a submission is a list $L$ of at most $2000$ non-negative integers. It is converted into the deduplicated set

$$B=\{x : x\in L\}\cup\{0\}.$$

If $|B|>2000$, the score is $+\infty$. Otherwise, define $D(B)$ as above. If $D(B)=\varnothing$ or $1\notin D(B)$, the score is $+\infty$. Otherwise, let

$$v_{\mathrm{EA}}(B)=\max\{v\in\mathbb{Z}_{\geq 1} : \{1,\ldots,v\}\subseteq D(B)\}.$$

The EinsteinArena score is

$$S_{\mathrm{EA}}(B)=\frac{|B|^{2}}{v_{\mathrm{EA}}(B)}.$$

Lower values of $S_{\mathrm{EA}}(B)$ yield better solutions. Thus, the objective is to minimize $S_{\mathrm{EA}}(B)$.
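The scoring steps above translate almost line by line into code. This is a sketch under stated assumptions: the name `difference_basis_score` is hypothetical, and input validation (for example, rejecting negative or non-integer entries) is omitted.

```python
import math

def difference_basis_score(L, max_size=2000):
    """Score |B|^2 / v_EA(B) for a submitted list L; +inf if infeasible."""
    B = set(L) | {0}                      # deduplicate and always include 0
    if len(B) > max_size:
        return math.inf
    D = {b - c for b in B for c in B if b > c}   # positive differences
    if 1 not in D:                        # also covers the case D == empty set
        return math.inf
    v = 1
    while v + 1 in D:                     # largest v with {1, ..., v} inside D
        v += 1
    return len(B) ** 2 / v
```

For instance, `L = [1, 4, 6]` gives $B=\{0,1,4,6\}$, whose differences cover $\{1,\dots,6\}$, so the score is $16/6$.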
### A.14 Heilbronn Problem for Triangles ($n=11$)

We place $n=11$ points on or inside an equilateral triangle of side length 1. For a point set $P=(p_1,\ldots,p_{11})$, the problem asks to maximize the area $C(P)$ of the smallest triangle formed by any triple of the placed points, normalized by the bounding area:

$$C(P)=\frac{\min_{1\leq i<j<k\leq 11}\operatorname{area}(p_i,p_j,p_k)}{\sqrt{3}/4}.$$

Here, $\operatorname{area}(p_i,p_j,p_k)$ denotes the area of the triangle formed by $p_i$, $p_j$, and $p_k$. The bounding equilateral triangle has vertices $A=(0,0)$, $B=(1,0)$, $C=(1/2,\sqrt{3}/2)$ and area $\sqrt{3}/4$, which is used in the denominator. All points of $P$ must lie on or inside this triangle. In EinsteinArena, a construction gives a lower bound on $\max_P C(P)$, so the objective is to maximize $C(P)$.
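A minimal sketch of the normalized smallest-triangle score. The function names, the `n` parameter, and the boundary tolerance `tol` are assumptions for illustration; the platform's handling of points on the boundary is not specified beyond "on or inside".

```python
import itertools
import math

SQRT3 = math.sqrt(3.0)

def inside_triangle(p, tol=1e-12):
    """True if p lies on or inside the triangle (0,0), (1,0), (1/2, sqrt(3)/2)."""
    x, y = p
    return y >= -tol and y <= SQRT3 * x + tol and y <= SQRT3 * (1 - x) + tol

def triangle_area(a, b, c):
    """Area of triangle abc from the 2D cross product."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2

def heilbronn_score(points, n=11):
    """Return C(P), the smallest triangle area divided by sqrt(3)/4, else None."""
    if len(points) != n or not all(inside_triangle(p) for p in points):
        return None
    smallest = min(triangle_area(a, b, c)
                   for a, b, c in itertools.combinations(points, 3))
    return smallest / (SQRT3 / 4)
```

With the three vertices plus the centroid (`n=4`), the smallest triangle is one third of the bounding triangle, so the score is $1/3$.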
### A.15 Thomson Problem ($n=282$)

Let $S^2=\{x\in\mathbb{R}^3 : \|x\|_2=1\}$. For a point configuration $Q=(q_1,\ldots,q_{282})\in(S^2)^{282}$, define the Coulomb energy

$$E(Q)=\sum_{1\leq i<j\leq 282}\frac{1}{\|q_i-q_j\|_2}.$$

The problem asks to find

$$\inf_{Q\in(S^2)^{282}}E(Q).$$

In EinsteinArena, a submission is a list of vectors $P=(p_1,\ldots,p_{282})\in(S^2)^{282}$. Each submitted vector $p_i$ is normalized by $\widetilde{p}_i=\frac{p_i}{\|p_i\|_2}$. The EinsteinArena score is

$$E_{\mathrm{EA}}(P)=\sum_{1\leq i<j\leq 282}\frac{1}{\|\widetilde{p}_i-\widetilde{p}_j\|_2},$$

which can be seen as an upper bound on $\inf_{Q\in(S^2)^{282}}E(Q)$. So, lower values of $E_{\mathrm{EA}}(P)$ yield better solutions, and the objective is to minimize $E_{\mathrm{EA}}(P)$.
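The normalize-then-sum rule can be sketched as follows. The name `thomson_score` is hypothetical, the point count is not enforced (the platform uses 282), and zero vectors or coincident points are not handled and would raise `ZeroDivisionError`.

```python
import itertools
import math

def thomson_score(points):
    """Coulomb energy of the submitted vectors after projecting to the unit sphere."""
    unit = []
    for p in points:
        norm = math.hypot(*p)                  # Euclidean norm of p
        unit.append(tuple(c / norm for c in p))
    # Sum of inverse distances over all unordered pairs
    return sum(1.0 / math.dist(p, q)
               for p, q in itertools.combinations(unit, 2))
```

Two antipodal directions of any length score $1/2$, and the octahedron's six vertices score $12/\sqrt{2}+3/2$.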
## Appendix B Kissing number construction ($n=604$)

### B.1 The common backbone

Let $\mathcal{B}$ denote the common backbone shared by the constructions for $594\leq n\leq 600$. It consists of

$$\mathcal{B}=\{\pm 2e_i : i=0,\dots,7\}\;\cup\;\left\{\sum_{i\in S}\epsilon_i e_i : S\in\mathcal{S},\ \epsilon_i\in\{\pm 1\}\right\}.$$

Here, the collection $\mathcal{S}$ consists of $30$ supports of size $4$, constructed as follows:

- $\mathcal{S}$ contains $6$ supports given by unions of two pairs, $P_i\cup P_j$ for $0\leq i<j\leq 3$, where $P_0=(0,6)$, $P_1=(1,4)$, $P_2=(2,5)$, and $P_3=(3,7)$.
- The remaining $24$ supports are arranged in three layers corresponding to coordinates $8,9,10$. Each layer consists of $8$ supports of the form $T\cup\{k\}$, $k\in\{8,9,10\}$, where $T\subset\{0,\dots,7\}$ is a triple that contains at most one element from each pair $P_0,\dots,P_3$.

Then, each support contributes $2^4=16$ vectors, and together with the $16$ axis vectors, we obtain

$$|\mathcal{B}|=16+30\cdot 2^{4}=496.$$

This backbone exhibits a strong combinatorial and geometric structure and is shared across all constructions up to $n=600$.
### B.2 Extension to $n=604$

To extend beyond $n=600$, we introduce additional vectors that go beyond the integral lattice structure. In the last three coordinates $(8,9,10)$, define

$$u_1=\frac{1}{3}(2,1,2),\qquad u_2=\frac{1}{3}(2,-2,-1),\qquad u_3=\frac{1}{3}(1,2,-2),$$

which form an orthonormal frame. The $604$ construction consists of the backbone $\mathcal{B}\subset D_{11}$ together with $108$ additional vectors:

$$\mathcal{E}_1=\left\{\pm e_a\pm e_b\pm\sqrt{2}\,u_j : (a,b)\in\{P_0,P_1,P_2,P_3\},\ j\in\{1,2,3\}\right\},$$

and

$$\mathcal{E}_2=\left\{\sqrt{2}(\pm u_i\pm u_j) : 1\leq i<j\leq 3\right\}.$$

Here each $u_j$ is embedded in coordinates $(8,9,10)$. Thus,

$$|\mathcal{E}_1|=4\cdot 4\cdot 6=96,\qquad |\mathcal{E}_2|=3\cdot 4=12,$$

and the total number of vectors is

$$|\mathcal{B}|+|\mathcal{E}_1|+|\mathcal{E}_2|=496+96+12=604.$$
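The extension sets are small enough to enumerate. The sketch below builds $\mathcal{E}_1$ and $\mathcal{E}_2$ in $\mathbb{R}^{11}$ and checks their sizes, that every vector has squared norm 4, and that distinct vectors have inner product at most 2 (for vectors of norm 2 this is the 60-degree separation condition). It does not build the backbone, whose triples $T$ are not listed here, so it does not verify the full 604-vector configuration; all function names are hypothetical.

```python
import itertools
import math

PAIRS = [(0, 6), (1, 4), (2, 5), (3, 7)]                      # P_0, ..., P_3
U = [(2/3, 1/3, 2/3), (2/3, -2/3, -1/3), (1/3, 2/3, -2/3)]    # u_1, u_2, u_3

def embed(u, scale):
    """Place scale * u into coordinates (8, 9, 10) of a vector in R^11."""
    v = [0.0] * 11
    for k, c in zip((8, 9, 10), u):
        v[k] = scale * c
    return v

def build_e1():
    vectors = []
    for a, b in PAIRS:
        for u in U:
            for sa, sb, su in itertools.product((1, -1), repeat=3):
                v = embed(u, su * math.sqrt(2))
                v[a], v[b] = float(sa), float(sb)
                vectors.append(tuple(v))
    return vectors

def build_e2():
    vectors = []
    for i, j in itertools.combinations(range(3), 2):
        for si, sj in itertools.product((1, -1), repeat=2):
            combo = [si * x + sj * y for x, y in zip(U[i], U[j])]
            vectors.append(tuple(embed(combo, math.sqrt(2))))
    return vectors

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def max_inner_product(vectors):
    """Largest inner product over all pairs of distinct vectors."""
    return max(dot(v, w) for v, w in itertools.combinations(vectors, 2))
```

Running `len(build_e1())` and `len(build_e2())` reproduces the counts 96 and 12 used in the total above.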
## Appendix C Lineage and Fingerprint Construction
For each submission, we construct a fingerprint consisting of problem\-specific and manually specified features\. Concretely, the fingerprint for the kissing number problem comprises 140 scalar features \(Table[2](https://arxiv.org/html/2606.10402#A3.T2)\), while the fingerprint for the second autocorrelation problem comprises 823 scalar features \(Table[3](https://arxiv.org/html/2606.10402#A3.T3)\)\. These features combine summary features \(robust to small coordinate perturbations and resolution differences\) with fixed\-grid profile features \(preserving coarse shape information lost in histogram\-based summaries\)\. The fingerprints are then used to calculate pairwise similarities between submissions, and lineage edges are introduced between submissions with sufficiently large similarity\. Further details are provided below\.
Table 2: Kissing number ($d=11$) fingerprint features.

Table 3: The second autocorrelation inequality fingerprint features.

### C.1 Shared Lineage Procedure
#### C.1.1 Similarity computation

After constructing the problem-specific per-submission fingerprints $u_i\in\mathbb{R}^p$, we standardize feature columns:

$$z_{ik}=\frac{u_{ik}-\operatorname{median}_j(u_{jk})}{\operatorname{IQR}_j(u_{jk})},$$

where $i$ denotes a particular submission, $j$ indexes across all submissions, and $k$ indexes a specific feature. When the IQR is near zero, the denominator is replaced by the empirical standard deviation plus a small constant, and the standardized values are clipped to prevent any single unstable coordinate from dominating the distance. We use median/IQR rather than mean/SD because several fingerprint coordinates, as well as the pairwise distance distribution below, are heavy-tailed: most submissions are near-duplicates or same-basin refinements while a few are structurally unusual. Robust centering and scaling, together with clipping, prevent these extremes from dominating either the standardized features or the similarity scale that follows.

We then compute Euclidean distances $d(i,j)=\|z_i-z_j\|_2$, from which we calculate similarities by

$$s(i,j)=\exp\!\left(-\frac{d(i,j)}{C_{\mathrm{MED}}}\right),$$

where $C_{\mathrm{MED}}$ is the median distance across all distinct unordered pairs of submissions, i.e., $C_{\mathrm{MED}}=\operatorname{median}_{a<b}d(a,b)$. Dividing by this median normalizes each per-pair distance by the 'typical' pairwise submission distance.
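A standard-library sketch of the standardization and similarity steps. The clipping bound `clip`, the near-zero threshold `eps`, the choice of the inclusive quartile method, and the handling of a zero median distance are all assumptions; the text does not state the values actually used.

```python
import math
import statistics

def robust_standardize(U, clip=5.0, eps=1e-8):
    """Median/IQR standardization of fingerprint rows U, with clipping."""
    p = len(U[0])
    Z = [[0.0] * p for _ in U]
    for k in range(p):
        col = [row[k] for row in U]
        med = statistics.median(col)
        q1, _, q3 = statistics.quantiles(col, n=4, method="inclusive")
        scale = q3 - q1
        if scale < eps:                       # IQR near zero: fall back to SD
            scale = statistics.pstdev(col) + eps
        for i, row in enumerate(U):
            z = (row[k] - med) / scale
            Z[i][k] = max(-clip, min(clip, z))
    return Z

def similarity_matrix(U):
    """s(i, j) = exp(-d(i, j) / C_MED) on robustly standardized fingerprints."""
    Z = robust_standardize(U)
    n = len(Z)
    d = [[math.dist(Z[i], Z[j]) for j in range(n)] for i in range(n)]
    c_med = statistics.median(d[a][b] for a in range(n) for b in range(a + 1, n))
    if c_med == 0:                            # all submissions identical
        return [[1.0] * n for _ in range(n)]
    return [[math.exp(-d[i][j] / c_med) for j in range(n)] for i in range(n)]
```

Because the scale is the median pairwise distance, a pair at exactly the typical distance has similarity $e^{-1}\approx 0.37$, which gives some context for the thresholds used for lineage edges.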
#### C.1.2 Parent and lineage identification

A candidate parent must precede the child in time. A prior submission by the same agent (a same-agent prior submission) is eligible if $s(i,j)\geq 0.42$, and a prior submission by a different agent (a cross-agent prior submission) is eligible if $s(i,j)\geq 0.48$. When at least one candidate clears its threshold, the parent is the prior submission with maximum similarity; otherwise the submission is treated as a structural root. The lower same-agent cutoff allows close within-agent refinements to remain connected even when the invariant fingerprint shifts moderately. The $0.48$ cross-agent cutoff was chosen manually: below this level, cross-agent similarities were generally too weak to support a specific parent claim rather than broad within-problem similarity.
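The parent rule can be sketched as a single pass over earlier submissions. The data layout (dicts with `agent` and `time` keys) and the name `find_parent` are hypothetical, and the sketch assumes the maximum is taken over eligible candidates only.

```python
SAME_AGENT_MIN = 0.42    # threshold for a prior submission by the same agent
CROSS_AGENT_MIN = 0.48   # threshold for a prior submission by another agent

def find_parent(child, submissions, similarity):
    """Return the index of the parent of `child`, or None for a structural root.

    `submissions` is a list of dicts with "agent" and "time" keys, and
    `similarity(i, j)` returns s(i, j). Both are assumptions of this sketch.
    """
    best, best_s = None, -1.0
    for j, cand in enumerate(submissions):
        if cand["time"] >= submissions[child]["time"]:
            continue                          # a parent must precede the child
        same_agent = cand["agent"] == submissions[child]["agent"]
        threshold = SAME_AGENT_MIN if same_agent else CROSS_AGENT_MIN
        s = similarity(child, j)
        if s >= threshold and s > best_s:     # eligible and best so far
            best, best_s = j, s
    return best
```

Under this reading, a cross-agent candidate at 0.47 is passed over in favor of a same-agent candidate at 0.45, because only the latter clears its threshold.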
## Appendix D Conversation Coding and Motif Analysis
##### Conversation motif coding\.
We coded the discussion corpus separately for each problem\. The unit of analysis is a single public forum post or reply\.
We used a two\-stage procedure to construct and apply the motif taxonomy\. In the first stage, we used GPT\-5\.5 to review the public posts for each problem and propose candidate discourse purposes, defined as recurring roles that posts played in the collaborative discussions on the platform\. These included general purposes, such as score reporting, local refinement, structural explanation, verifier or certificate discussion, and synthesis of prior progress, as well as problem\-specific purposes, such as lattice decoding for the kissing number problem or cross\-resolution transfer for second autocorrelation\. For each proposed motif, the LLM generated a motif name, a short definition, and representative keyword patterns\. We consolidated these LLM\-proposed motifs into a fixed codebook by merging categories that captured the same underlying purpose and discarding categories that did not recur across multiple posts or could not be distinguished from a broader category using consistent textual evidence\.
In the second stage, we applied the frozen codebook deterministically to all posts\. Each motif was implemented as a case\-insensitive keyword\-pattern rule\. A post received a motif tag if its text matched the corresponding rule\. Motif tags were multi\-label, so a post could receive more than one tag; for example, a post could both report a new score and describe a refinement method\. For each match, we stored a short evidence snippet around the triggering text for auditability\.
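As an illustration of the second stage, the sketch below applies case-insensitive keyword-pattern rules and keeps an evidence snippet per match. The motif names and regular expressions in `CODEBOOK` are placeholders invented for this example; the actual codebook patterns are not reproduced here.

```python
import re

# Hypothetical codebook: motif name -> keyword pattern (placeholders only)
CODEBOOK = {
    "score_report": r"\bnew (best|score)\b|\bscore of\b",
    "local_refinement": r"\brefin(e|ed|ement|ing)\b|\bpolish(ed|ing)?\b",
    "verifier_discussion": r"\bverifier\b|\bcertificate\b",
}

def tag_post(text, codebook=CODEBOOK, window=30):
    """Return {motif: evidence snippet} for every rule that matches `text`."""
    tags = {}
    for motif, pattern in codebook.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            # Keep a short window of text around the trigger for auditing
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            tags[motif] = text[start:end]
    return tags
```

Since each rule is evaluated independently, a post can receive several tags, matching the multi-label design described above.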
##### Motif tags and primary categories\.
Because a single post can carry several motif tags, for visualization we also assign one primary category per post using a fixed priority order, keeping the primary label focused on the strongest discourse function when a post matches multiple motifs. In Table [4](https://arxiv.org/html/2606.10402#A4.T4), motifs are listed in priority order from highest to lowest.
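Primary-category assignment then reduces to picking the first matching motif in the priority list. The motif names and their order in `PRIORITY` are placeholders; the actual order is the one given in Table 4.

```python
# Hypothetical priority order, highest first (the real order is in Table 4)
PRIORITY = ["exact_feasibility", "new_basin", "local_refinement", "score_report"]

def primary_category(tags, priority=PRIORITY):
    """Return the highest-priority motif present in `tags`, or None."""
    for motif in priority:
        if motif in tags:
            return motif
    return None
```

A post tagged with both `score_report` and `local_refinement` would be shown as `local_refinement` under this placeholder order.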
### D.1 Kissing number ($d=11$) conversation motifs

The kissing number ($d=11$) taxonomy emphasizes the geometric and certificate-oriented structure of the discussion, including local refinement, new-basin discovery, lattice decoding, and exact-feasibility claims (Table [4](https://arxiv.org/html/2606.10402#A4.T4)).
### D.2 Second-Autocorrelation Conversation Motifs
The second\-autocorrelation taxonomy emphasizes the functional\-analytic and numerical\-optimization structure of the discussion, including fractional programming, packet updates, autoconvolution shape, spectral analysis, and cross\-resolution transfer \(Table[4](https://arxiv.org/html/2606.10402#A4.T4)\)\.
Table 4: Conversation motifs used for discourse analysis, grouped by shared and problem-specific categories.