@Montreal_AI: A 0.6B model learned to manage giants. That is the idea behind TRINITY, a new ICLR 2026 paper by Jinglue Xu, Qi Sun, Pe…

X AI KOLs Timeline 05/22/26, 02:26 PM Papers

Summary

TRINITY is a lightweight 0.6B parameter coordinator that learns to orchestrate multiple LLMs by assigning them roles (Thinker, Worker, Verifier) using an evolutionary strategy. It outperforms individual models and existing coordination methods across coding, math, reasoning, and domain knowledge tasks.


Original Article


A 0.6B model learned to manage giants.

That is the idea behind TRINITY, a new ICLR 2026 paper by Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, and Yujin Tang.

The paper is not asking:

“How do we build one model that knows everything?”

It is asking something more interesting:

“How do we build a small intelligence layer that knows who should think, who should act, and who should verify?”

TRINITY is a lightweight coordinator for LLMs.

It does not merge weights. It does not require architectural compatibility. It does not need access to closed-model internals. It does not try to turn the coordinator into the smartest model in the room.

Instead, it orchestrates a pool of strong models at test time, including closed and open models.

At each turn, TRINITY chooses a model and gives it one of three roles:

Thinker: plan and decompose

Worker: solve and execute

Verifier: critique and accept/revise

That may sound simple.

It is not.

Too many multi-agent systems are still prompts plus hope.

TRINITY learns the coordination policy.

A compact ~0.6B language model produces hidden-state representations of the conversation. A tiny head then uses those representations to decide the next model-role pair. The authors optimize this coordinator with an evolutionary strategy, sep-CMA-ES, because the problem is expensive, high-dimensional, and reward-sparse.

The result is not just better routing.

It is learned division of labor.

The paper reports that TRINITY outperforms individual models and existing coordination methods across coding, math, reasoning, and domain knowledge tasks. In its full-power setting, it reaches 86.2% on LiveCodeBench and transfers to held-out benchmarks including AIME, BigCodeBench, MT-Bench, and GPQA-D.

The most important idea here is bigger than the benchmark.

The future of AI may not be a single supermodel.

It may be an organization of models.

A small conductor. A team of specialists. A protocol for planning, execution, and verification. An intelligence layer that learns how to allocate cognition.

This feels like a real shift:

from bigger models to better systems

from raw capability to coordinated capability

from “which model is best?” to “what structure makes many models better together?”

Full credit to the authors: Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang.

Paper: TRINITY: An Evolved LLM Coordinator https://arxiv.org/abs/2512.04695

I’m attaching the first page because the abstract is worth reading closely.

The future of AI may not be monolithic.

It may be coordinated.

#ArtificialIntelligence #LLM #MultiAgentSystems #MachineLearning #EvolutionaryAlgorithms

Trinity: An Evolved LLM Coordinator

Source: https://arxiv.org/html/2512.04695

Jinglue Xu¹, Qi Sun¹,³, Peter Schwendeman², Stefan Nielsen¹, Edoardo Cetin¹, Yujin Tang¹

¹Sakana AI, Japan; ²University of Michigan, USA; ³Institute of Science Tokyo, Japan

Abstract

Combining diverse foundation models is promising, but weight-merging is limited by mismatched architectures and closed APIs. Trinity addresses this with a lightweight coordinator that orchestrates collaboration among large language models (LLMs). The coordinator, comprising a compact language model ($\approx 0.6$B parameters) and a lightweight head ($\approx 10$K parameters), is optimized with an evolutionary strategy for efficient and adaptive delegation. Trinity processes queries over multiple turns, where at each turn the coordinator assigns one of three roles (Thinker, Worker, or Verifier) to a selected LLM, effectively offloading complex skill acquisition from the coordinator itself. Extensive experiments demonstrate that Trinity consistently outperforms individual models and existing methods in various tasks, including coding, math, reasoning, and domain knowledge, while robustly generalizing to out-of-distribution tasks. On established benchmarks, Trinity achieves state-of-the-art performance, including a new record of 86.2% on LiveCodeBench. Theoretical and empirical analyses highlight two key factors driving this success: (1) the coordinator's hidden-state representations provide rich contextualization of inputs, and (2) under high dimensionality and strict budget constraints, the separable Covariance Matrix Adaptation Evolution Strategy algorithm provides substantial advantages over RL, imitation learning, and random search, leveraging potential block-$\varepsilon$-separability.

1 Introduction

A prominent line of work involving large language models (LLMs) aspires to scale in line with empirical scaling laws, targeting gains by enlarging model size, training tokens, and compute (Kaplan et al., 2020; Hoffmann et al., 2022). Yet the extent to which such scaling remains efficient and yields sustained returns is uncertain and often resource intensive. An alternative at the micro level is model merging (Akiba et al., 2025; Wortsman et al., 2022; Yang et al., 2024; Kuroki et al., 2024), which seeks parameter-level integration. However, this approach is frequently impractical due to architectural incompatibilities and the closed-source nature of many high-performing models. In light of these limitations, we adopt a macro-level approach: test-time model composition via coordination, which fuses the complementary strengths of multiple state-of-the-art models from diverse providers without modifying their weights. Leveraging prior data and training investments, this coordination can deliver performance improvements without retraining individual models.

The central challenge for such a coordinator is to acquire a rich contextual understanding of a given query to make an effective decision. We posit that this signal can be efficiently extracted from the internal representation of a compact language model, specifically, its hidden states (Allen-Zhu and Li, 2023). In a self-attention-based transformer model, hidden states encode contextual representations of the input (and, after generation, the output) sequence. Hidden states extracted from inputs alone reflect input context, and those taken post-generation additionally capture the model's produced output and latent reasoning. For output sequences, the penultimate token's hidden state carries rich context. It attends over the entire sequence and guides the prediction of a special token (such as <\think> or the EOS token), ensuring a stable output distribution. This leads to our central hypothesis that contextual representations from a small language model (SLM) contain sufficient semantic signal for a lightweight head to coordinate multiple LLMs effectively, a possibility that remains underexplored in existing works (see Section 5).

Figure 1: Overview and an example of our coordination method. Left: The cyclical coordination architecture. In each turn, the full conversation transcript is passed to a compact coordinator model. A lightweight head selects an LLM and assigns it one of three roles: Thinker (T), Worker (W), or Verifier (V). A message processing module injects a role-specific prompt before the request is sent to the chosen LLM. Right: An example of multi-turn coordination. To solve a complex depreciation problem, Trinity invokes a Thinker (Turn 1) to decompose the task, a Worker (Turn 2) to perform the calculation, and a Verifier (Turn 3) to validate the answer and identify edge cases.

Given these contextual representations, our method, Trinity, employs an SLM (0.6B parameters) with a lightweight head to orchestrate multiple LLMs (both open- and closed-source models) in a multi-turn protocol, with the total number of learnable parameters under 20K. At each turn, Trinity selects an LLM and constructs its input by concatenating the original query with the full transcript of prior turns. To ensure the coordinator remains lightweight and offloads complex skill acquisition, Trinity assigns the selected agent one of three distinct roles: (1) a thinker to devise high-level strategies and decompositions; (2) a worker to perform concrete problem-solving steps; and (3) a verifier to evaluate the current solution's soundness and completeness. The process halts when the verifier is selected and accepts the current response as the final answer, or when a fixed turn budget is exhausted. Figure 1 gives an overview of our method, together with an example of our coordination.

Optimizing this representation-to-coordination mapping is challenging. We observe weak coupling among parameters: each has only a tiny influence on the scalar reward, which makes the per-parameter gradients of traditional methods like REINFORCE low-SNR and therefore ineffective. Training is further constrained by cost, since each step requires running the coordinated agents for inference. We find that a derivative-free Covariance Matrix Adaptation Evolution Strategy (CMA-ES) (Hansen et al., 2003) with diagonal covariance, separable CMA-ES (sep-CMA-ES) (Ros and Hansen, 2008), is effective in this particular regime: high dimensionality, weak parameter correlations, and high per-step cost. We provide theoretical and empirical evidence that, in this extremely budget-tight scenario (1.5k–40k evaluations for a 10k-dimensional problem), sep-CMA-ES significantly outperforms RL and the random search baseline, suggesting strong block-$\varepsilon$-separability (see Definition 1) in the optimization objective.

Across four in-distribution benchmarks, namely Math500 (Lightman et al., 2023), MMLU (Hendrycks et al., 2020), RLPR (Yu et al., 2025), and LiveCodeBench (Jain et al., 2024), Trinity consistently outperforms prior methods, achieving a mean relative error reduction of $21.9\%$ over the second-best approach. It also outperforms all single-model baselines with fair, adjusted output-token budgets. Remarkably, Trinity sets a new state-of-the-art on LiveCodeBench (January to April 2025), achieving a pass@1 of $86.2\pm 0.5\%$. Furthermore, Trinity is able to zero-shot transfer to four unseen tasks, namely AIME (Veeraboina, 2023), BigCodeBench (Zhuo et al., 2024), MT-Bench (Bai et al., 2024), and GPQA-D (Rein et al., 2024), with performance surpassing each of the single models it orchestrates.

Our main contributions are summarized as follows:

• A lightweight and effective coordination mechanism. We show that rich contextual signals from the hidden states of an SLM are sufficient for a tiny head to coordinate multiple diverse LLMs (with the total number of learnable parameters under 20K), a previously underexplored approach to model composition.
• A highly efficient training methodology. We demonstrate theoretically and empirically that under the challenging, budget-constrained conditions of our problem, sep-CMA-ES is a superior optimization choice over RL, imitation learning, and random search.
• State-of-the-art performance and generalization. Trinity sets a new record on LiveCodeBench and outperforms existing methods on a wide range of benchmarks. It also generalizes robustly to unseen tasks and develops emergent, task-aware coordination strategies.

2 Problem formulation

Let $\mathcal{S}$ be the set of interaction states $s$ (the original query together with the full multi-turn conversation so far). An SLM maps each $s$ to a representation state $h(s)\in\mathcal{H}\subset\mathbb{R}^{d}$ (e.g., a penultimate-token hidden vector). A lightweight coordination head with parameters $\theta\in\mathcal{P}\subset\mathbb{R}^{n}$ takes $h(s)$ as input and outputs logits over a finite action set $\mathcal{A}$ of agent–role pairs:

$$f_{\theta}:\ \mathcal{H}\to\mathbb{R}^{|\mathcal{A}|},\qquad \pi_{\theta}(a\mid s)\ \propto\ \exp\!\big(f_{\theta}(h(s))_{a}\big),\ a\in\mathcal{A}.$$

The policy $\pi_{\theta}$ induces a distribution over all multi-turn trajectories $\mathcal{T}$, where a trajectory is $\tau=(s_{0},a_{0},\ldots,s_{T})$ with horizon $T\leq B_{\mathrm{turn}}$, where $B_{\mathrm{turn}}$ denotes a fixed turn budget. A terminal reward $R(\tau)\in\{0,1\}$ is revealed at the end. The optimization objective

$$J(\theta)\ :=\ \mathbb{E}_{\tau\sim\pi_{\theta}}[R(\tau)]$$

is the expected terminal reward of the coordinator $\theta$. In short, the representation space $\mathcal{H}$ provides contextual features, while the coordination space $\mathcal{P}$ parametrizes policies over all trajectories in $\mathcal{T}$. We regard each single, complete, end-to-end run (i.e., sampling of a trajectory $\tau$) as an atomic evaluation, or a Bernoulli call, since the rewards follow the Bernoulli distribution. Since each run involves multiple LLM calls, which is a cost we wish to constrain, we seek $\theta^{\star}\in\arg\max_{\theta\in\mathcal{P}}J(\theta)$ under a tight atomic evaluation budget $B_{\mathrm{env}}$ that counts individual Bernoulli calls of the terminal reward used when estimating $J(\theta)$ (e.g., via replication/averaging).
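The formulation above can be made concrete with a small sketch. It assumes a plain linear head for $f_{\theta}$ and a toy one-turn environment; `toy_rollout` and the dimensions are hypothetical stand-ins, not the paper's setup. The point is only to show the softmax policy over actions and how $J(\theta)$ is estimated by averaging Bernoulli rewards, with each rollout counting as one atomic evaluation.

```python
import numpy as np

def policy(theta, h):
    """Softmax policy pi_theta(a | s) over agent-role actions, given h = h(s)."""
    logits = theta @ h                 # f_theta(h(s)); a linear map is assumed here
    logits = logits - logits.max()     # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def estimate_J(theta, rollout, m, rng):
    """Average m Bernoulli terminal rewards; each rollout is one atomic evaluation."""
    return sum(rollout(theta, rng) for _ in range(m)) / m

def toy_rollout(theta, rng):
    """Hypothetical one-turn environment: reward 1 iff the sampled action is action 0."""
    h = np.ones(theta.shape[1])
    action = rng.choice(theta.shape[0], p=policy(theta, h))
    return int(action == 0)

rng = np.random.default_rng(0)
theta = np.zeros((4, 8))               # 4 actions, 8-dimensional representation
theta[0] = 0.5                         # bias the policy towards action 0
probs = policy(theta, np.ones(8))
j_hat = estimate_J(theta, toy_rollout, m=200, rng=rng)
print(probs, j_hat)
```

Estimating $J(\theta)$ with `m` replications spends `m` units of the budget $B_{\mathrm{env}}$, which is why the replication count appears explicitly in the later analysis.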

3 Trinity

To address the problem outlined in Section 2, we propose Trinity, a lightweight and adaptive framework for coordinating multiple diverse LLMs (Figure 1, left). At its core, our approach introduces a coordinator, optimized via sep-CMA-ES, that learns to orchestrate a pool of external LLMs and assign them distinct roles throughout a multi-turn reasoning process.

3.1 Efficient parametrization

To efficiently derive the representation and coordination space, the coordinator employs a highly efficient parametrization scheme, as illustrated in Figure 2. We use a pre-trained SLM as a backbone and introduce two distinct sets of trainable parameters.

First, we append a lightweight head directly after the coordinator SLM's final hidden layer. To coordinate $L$ agents, this head projects a hidden state $h\in\mathbb{R}^{d}$ to an output of size $L+3$, which provides two sets of logits: $L$ logits for selecting an LLM and three logits for assigning its role. This head defines the fundamental structure of the coordination space. Second, inspired by recent work in efficient fine-tuning (Sun et al., 2025), we adapt a small set of the backbone's layers using a singular value fine-tuning approach. For a selected subset of the coordinator SLM's weight matrices, we perform a singular value decomposition and only learn the singular value scales, keeping the orthogonal matrices fixed. This parameterization scheme is highly efficient, keeping the total number of learnable parameters below 20K, orders of magnitude smaller than typical fine-tuning, while still yielding representational benefits (Figure 5).
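A minimal NumPy sketch of the two trainable parts described above. The hidden size `d = 1024` is an assumption about Qwen3-0.6B, the head is assumed to be a bias-free linear map, and the greedy `argmax` selection and the toy matrix `W` are illustrative only. Under those assumptions a head for seven models has $(7+3)\times 1024 = 10240$ weights, in line with the roughly 10K parameters quoted for the head.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 1024, 7                        # hidden size (assumed) and pool size (Section 4.1)

# Linear head: d -> L + 3 logits (L for the LLM choice, 3 for the role).
W_head = rng.normal(scale=0.02, size=(L + 3, d))
h = rng.normal(size=d)                # stand-in for a hidden state
logits = W_head @ h
llm_logits, role_logits = logits[:L], logits[L:]
llm_idx = int(np.argmax(llm_logits))  # greedy choice, for illustration only
role = ["Thinker", "Worker", "Verifier"][int(np.argmax(role_logits))]
n_head_params = W_head.size           # 10 * 1024 = 10240

# Singular value fine-tuning: W = U diag(s) V^T with U and V frozen; only a
# per-singular-value scale vector z is learned.
W = rng.normal(size=(64, 48))         # toy weight matrix, not a real SLM layer
U, s, Vt = np.linalg.svd(W, full_matrices=False)
z = np.ones_like(s)                   # learnable scales, initialised to the identity
W_adapted = (U * (s * z)) @ Vt        # equals W while z is all ones
print(n_head_params, llm_idx, role, z.size)
```

Learning only `z` adds one parameter per singular value of each adapted matrix, which is how the total can stay under 20K even though the backbone has about 0.6B weights.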

Crucially, our method only relies on the head's logit outputs, and the coordinator's generated text is discarded because the job of prompting is delegated to the LLMs in the pool (Section 3.2). Rather than waiting for a full generation, this allows the coordinator to take hidden states corresponding to an earlier token instead of the penultimate one to make a quick decision. This combination of extreme parameter efficiency and the potential for rapid inference makes training the entire Trinity system with evolutionary strategies uniquely feasible (Section 3.3), avoiding the significant data and computational overhead of imitation learning or RL.

Figure 2: Parametrization of the Trinity coordinator. A lightweight head (see Appendix A.4) operates in parallel to the base model's LM head. It takes the hidden state $h$ corresponding to the penultimate output token as its sole input. This head $f_{\theta}$ is responsible for coordination decisions, producing two sets of logits, one to select an LLM from the pool of $L$ models, and another to assign one of three roles. We also fine-tune the singular value scales of the parameter matrices in the SLM's layers, indicated by the red diagonal lines. In the figure, the hidden state at the position marked by "<Head Input>" is the input to the lightweight head. Note that the semantic correspondence of the decoded message "<BOS>…" to the hidden state is for illustrative purposes only, as the lightweight head operates on the internal hidden state from that position, not the final decoded text.

3.2 Tri-role coordination

Next, we discuss the set of multi-agent interaction patterns available to the coordinator, which are the remaining constructs that define the coordination space.

A key principle of our approach is that the coordinator itself need not be as capable as the underlying agents; its primary function is to leverage and orchestrate diverse LLMs. Coordination proceeds over at most $K$ turns for a given user query $Q$. Let the transcript after $k-1$ turns be $\mathcal{C}_{k-1}=\big(Q,O_{1},\ldots,O_{k-1}\big)$. At turn $k$, the coordinator selects an agent (i.e., an LLM) $A_{k}$ from the pool $\mathcal{M}$ and a role $R_{k}\in\{\text{Thinker (T)},\text{Worker (W)},\text{Verifier (V)}\}$. The coordinator then prepares a brief, role-specific prompt based on $\mathcal{C}_{k-1}$, queries $A_{k}$ to obtain a message $M_{k}$, and lightly post-processes $M_{k}$ into $O_{k}$, which is appended to the transcript for the next turn.

In Trinity, we define three roles, namely Thinker, Worker, and Verifier, each of which enforces a distinct contract between the coordinator and the selected LLM:

• Thinker strategizes. The thinker analyzes the current state and returns meta-level guidance, including high-level plans, decompositions, or critiques of partial solutions. Formally, it may propose a plan over subgoals, which the coordinator condenses into $O_{k}$ to steer subsequent turns; it can also specify the role of the next agent along with the plan.
• Worker executes. The worker acts directly on the task to make concrete progress toward a final solution. Given $\mathcal{C}_{k-1}$, it produces actionable content (e.g., a derivation, code snippet, or numerical result). The coordinator extracts the key information and stores it as $O_{k}$.
• Verifier evaluates. The verifier checks whether the accumulated solution in $\mathcal{C}_{k-1}$ is correct, complete, and responsive to $Q$. It outputs a judgment $u_{k}\in\{\texttt{ACCEPT},\texttt{REVISE}\}$ and an optional diagnosis $\delta_{k}$. The coordinator records $(u_{k},\delta_{k})$ as $O_{k}$ and, if $u_{k}=\texttt{ACCEPT}$, signals termination.

The termination time is $\tau=\min\{\,k\leq K: R_{k}=\mathrm{V}\ \text{and}\ u_{k}=\texttt{ACCEPT}\,\}$, with $\tau=K$ if no acceptance occurs. The final answer returned to the user is $O_{\tau}$. This rule provides a simple, verifiable stopping condition while allowing the coordinator to allocate the compute budget adaptively across planning, execution, and quality control. See Figure 1 (right) for an example.
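The turn loop and stopping rule can be sketched as follows. The function names, the scripted role schedule, and the stub `fake_llm` are hypothetical: in Trinity the `(agent, role)` choice comes from the learned head and the outputs come from real LLM calls with role-specific prompts. The sketch only illustrates how the transcript grows and when the loop stops.

```python
from typing import Callable, List, Tuple

Role = str  # "T" (Thinker), "W" (Worker) or "V" (Verifier)

def coordinate(query: str,
               select: Callable[[List[str]], Tuple[str, Role]],
               call_llm: Callable[[str, Role, List[str]], str],
               max_turns: int = 5) -> Tuple[List[str], int]:
    """Run the multi-turn loop; return (transcript, termination turn tau)."""
    transcript = [query]                       # C_0 = (Q)
    for k in range(1, max_turns + 1):
        agent, role = select(transcript)       # coordinator picks (A_k, R_k)
        output = call_llm(agent, role, transcript)
        transcript.append(output)              # C_k = C_{k-1} + (O_k)
        if role == "V" and output.startswith("ACCEPT"):
            return transcript, k               # verifier accepted: stop early
    return transcript, max_turns               # budget exhausted: tau = K

# Hypothetical stubs standing in for the learned coordinator and real LLM calls.
def scripted_select(transcript):
    schedule = [("model-a", "T"), ("model-b", "W"), ("model-c", "V")]
    return schedule[(len(transcript) - 1) % len(schedule)]

def fake_llm(agent, role, transcript):
    return {"T": "plan: split into two steps", "W": "answer: 42", "V": "ACCEPT"}[role]

transcript, tau = coordinate("What is 6 * 7?", scripted_select, fake_llm)
print(tau, transcript)
```

With this scripted Thinker, Worker, Verifier schedule the loop stops at turn 3, mirroring the three-turn example in Figure 1 (right).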

3.3 Learning with an evolutionary strategy

To determine a suitable training algorithm, we examine the structure of our problem objective. By varying the head architecture (see Appendix A.4), we observe that the head block-diagonal-10 retains a large fraction of the performance despite its tiny parameter count (see Section 4.7). These observations reveal that the optimization problem defined in Section 2, when embodied in our representation and coordination space, exhibits strong block-$\varepsilon$-separability (Definition 1). This geometry strongly favors diagonal methods: most of the informative signal is concentrated within blocks, while inter-block interference remains negligible. Conversely, this geometry undermines the REINFORCE baseline (as shown in Section 4.8): noisy global returns swamp weak inter-block signals, yielding ill-conditioned gradients, poor credit assignment, and unstable learning.

We therefore adopt sep-CMA-ES, a black-box evolutionary strategy that iteratively improves a central "parent" policy by sampling a population of perturbed parameter vectors, evaluating each candidate to obtain a fitness score, and recombining candidates via fitness-weighted averaging to form the next parent. Unlike full CMA-ES, sep-CMA-ES maintains only a diagonal covariance matrix, making the algorithm especially well suited to block-diagonal landscapes.
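The sample, evaluate, and recombine cycle described above can be illustrated with a deliberately simplified diagonal-covariance evolution strategy. This is not the full sep-CMA-ES update: evolution paths and cumulative step-size adaptation are omitted, the variance learning rate `c` is an ad hoc assumption, and the objective is a toy deterministic function rather than a noisy Bernoulli reward averaged over replications.

```python
import numpy as np

def diagonal_es(fitness, n, iterations=60, seed=0):
    """Maximise `fitness` with a diagonal-covariance ES (simplified sketch)."""
    rng = np.random.default_rng(seed)
    lam = int(np.ceil(4 + 3 * np.log(n)))        # population size rule from Section 3.3
    mu = lam // 2                                # number of selected candidates
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                 # positive, decreasing recombination weights
    mean, sigma = np.zeros(n), 0.5 * np.ones(n)  # parent and per-coordinate std
    c = 0.2                                      # ad hoc variance learning rate (assumption)
    for _ in range(iterations):
        z = rng.standard_normal((lam, n))
        candidates = mean + sigma * z            # sample perturbed parameter vectors
        scores = np.array([fitness(x) for x in candidates])
        best = np.argsort(-scores)[:mu]          # rank by fitness, keep the top mu
        steps = candidates[best] - mean
        mean = mean + w @ steps                  # weighted recombination into the next parent
        var = (1 - c) * sigma**2 + c * (w @ steps**2)
        sigma = np.sqrt(var)                     # diagonal covariance only: O(n) memory
    return mean

target = np.linspace(-1.0, 1.0, 20)

def sphere(x):
    """Toy objective with its maximum at `target`."""
    return -np.sum((x - target) ** 2)

solution = diagonal_es(sphere, n=20)
print(sphere(np.zeros(20)), sphere(solution))
```

Keeping one variance per coordinate means memory and update cost grow linearly in the dimension, rather than quadratically as with a full covariance matrix, which matters for a head with about 10K parameters.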

In Appendix A.1, we provide a theoretical analysis tailored to our specific problem regime: a coordination head with about 10K parameters, tight evaluation budgets, binary terminal rewards, and weak but nonzero cross-block couplings. In the following, we present a brief summary.

Let $n$ be the head dimension, $\lambda=\lceil 4+3\ln n\rceil$ be the CMA-ES population size, and $m_{\mathrm{CMA}}$ / $m_{\mathrm{RS}}$ be the replication counts (number of evaluations per candidate). Let $T$ denote the optimization iteration count. Then, for the small-$T$ regime, Proposition 1 shows that sep-CMA-ES's improvement grows roughly linearly with the number of iterations, while random search (RS) grows only with the logarithm of how many candidates it can test. Thus, for modest $T$, sep-CMA-ES outperforms RS. In the specific regime of our study ($n\approx 10000$, $\lambda\approx 32$, $m_{\mathrm{CMA}}=16$, $m_{\mathrm{RS}}=32$), budget matching yields about $16T$ RS candidates; the gain ratio behaves like $\tfrac{T}{\ln(16T)}\cdot\eta^{2}$, where $\eta$ is a reliability factor between $0$ and $1$, usually close to one. This ratio is greater than one even for small $T$.
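The numbers in this regime can be checked with a few lines of arithmetic. The sketch below evaluates the population-size rule for $n=10000$, reproduces the $16T$ budget match ($\lambda\cdot m_{\mathrm{CMA}}$ evaluations per iteration divided by $m_{\mathrm{RS}}$), and tabulates the quoted gain ratio. Setting $\eta=1$ is an assumption made here for illustration; the text only says $\eta$ is usually close to one.

```python
import math

n = 10_000                                   # head dimension
lam = math.ceil(4 + 3 * math.log(n))         # CMA-ES population size
m_cma, m_rs = 16, 32                         # replications per candidate

evals_per_iteration = lam * m_cma            # atomic evaluations per sep-CMA-ES iteration
rs_candidates_per_iteration = evals_per_iteration // m_rs

def gain_ratio(T, eta=1.0):
    """Gain ratio T / ln(16 T) * eta^2 from the summary of Proposition 1."""
    return T / math.log(rs_candidates_per_iteration * T) * eta ** 2

ratios = {T: round(gain_ratio(T), 3) for T in (1, 2, 5, 10, 50)}
first_T_above_one = next(T for T in range(1, 1000) if gain_ratio(T) > 1)
print(lam, evals_per_iteration, rs_candidates_per_iteration, ratios, first_T_above_one)
```

With $\eta=1$ the ratio first exceeds one at $T=5$ and keeps growing afterwards, since $T/\ln(16T)$ is increasing for every $T\geq 1$.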

Proposition 2 states that after about $n$ iterations of calibration, sep-CMA-ES enters a steady regime where each step reduces the remaining error by a fraction of order $1/n$, with a rate constant close to $\bar{\kappa}_{\mu,\lambda}$, where the constant $\bar{\kappa}_{\mu,\lambda}=\Theta(1)$ denotes the CMA recombination efficiency. By contrast, RS continues to gain only logarithmically even with repeated rounds. Hence, as $T$ increases, sep-CMA-ES becomes better and the gap compared to RS grows wider.

4 Experiments

We demonstrate the effectiveness of Trinity through three key dimensions. First, we directly compare it against both multi- and single-agent baselines in controlled settings. We also show that Trinity establishes state-of-the-art performance on the LiveCodeBench task. We then evaluate its generalization capabilities across a diverse set of unseen tasks. Finally, we present analytical results, including ablations and the contextual information encoded in the extracted hidden states, and compare our evolution-based approach against RL, imitation learning, and RS.

4.1 Experimental setup

Coordinator and agents. We use Qwen3-0.6B (Yang et al., 2025) as the coordinator's SLM, paired with a single linear layer of 10K parameters as the simple but effective head, and select the second-to-last layer of the 0.6B model for singular value fine-tuning. Table 6 reports the parameter counts of the various head architectures and the number of parameters updated during singular value fine-tuning. Our model pool contains seven models from both open-source communities and closed-source API providers. These are three top-tier closed-source models currently available (GPT-5 (OpenAI, 2025), Gemini-2.5-pro (Comanici et al., 2025), and Claude-4-Sonnet (Anthropic, 2025)), and four well-known open-source models (Gemma-3-27B-It (Team et al., 2025), DeepSeek-R1-Distill-Qwen-32B (Guo et al., 2025), Qwen-3-32B (direct), and Qwen-3-32B (reasoning)). Our LLM and training task selection principle is detailed in Appendix A.6.

Tasks and protocols. We train and evaluate Trinity across four diverse tasks: MATH500, MMLU, RLPR, and LiveCodeBench. For each task, we train on the designated training set and assess performance on the corresponding test set, utilizing official splits where available. For LiveCodeBench specifically, we use the V1 release (400 samples) for training and conduct evaluation on the newly introduced questions in the V6 release (175 samples). To ensure consistency between open and closed models and facilitate training, we set the default maximum generated tokens to 4096 for each LLM, with minimal reasoning effort. We also set the maximum number of coordination turns to five. For assessing generalization capabilities, we further evaluate our approach on four challenging held-out tasks (AIME2025, BigCodeBench, MT-Bench, and GPQA-D), spanning diverse domains and problem types.

Baselines. We compare Trinity against several categories of baselines. For multi-agent routing methods, we compare against state-of-the-art approaches including MasRouter (Yue et al., 2025), RouterDC (Chen et al., 2024), Smoothie (Guha et al., 2024), MoA (Wang et al., 2024), and random agent selection. We also evaluate individual LLMs in our pool (GPT-5, Gemini-2.5-pro, Claude-4-Sonnet) at both 4K and 20K (marked as 5x CTX) inference tokens to assess performance under accumulated inference budget, and single-agent self-reflection over five turns (5x SR). In addition, we include a majority-voting baseline at 5 samples and a baseline with an LLM as the coordinator (see Appendix A.7.3). Detailed experimental settings are provided in Appendix A.7.1.

Figure 3: Trinity outperforms single- and multi-model baselines across four benchmarks. Our approach (boldface on the x-axis) achieves the highest performance across four tasks, surpassing the baseline methods. In Math500, MMLU and LiveCodeBench, our performance is close to "Per-Question-Best", representing an upper bound achieved by taking the union of all correct answers from the single LLMs.

4.2 In-distribution evaluation

As shown in Figure 3, Trinity consistently outperforms existing multi-agent methods across all four benchmarks, demonstrating its superior ability to harness the strengths of a diverse LLM pool. While some baseline methods achieve moderate performance on individual tasks, such as MoA's strong results on Math500 (0.83) and RLPR (0.38), they fail to maintain consistency across tasks, as evidenced by MoA's relatively weaker performance on LiveCodeBench (0.39). This inconsistency highlights the difficulty of effectively coordinating diverse agents. Notably, some collaboration approaches even degrade performance below random baselines, as seen with RouterDC's RLPR score of 0.28 compared to random selection's 0.32, further emphasizing the challenge.

In contrast, Trinity achieves robustly high performance across the board, including a remarkable 0.61 pass@1 score on LiveCodeBench v6, substantially surpassing all competing methods. We also achieve an 11.76% relative error reduction on MATH500 compared to the second-best method (Gemini Pro 2.5 with 5x CTX). These results suggest that, while diverse agent capabilities offer significant potential, effective collaboration requires sophisticated mechanisms for optimal organizational decisions, which simple or heuristic-based routing approaches cannot easily achieve.

Compared with single-model baselines, Trinity outperforms every individual model in the pool, even when they are enhanced with either extended inference budget (5x CTX) or self-reflection settings (5x SR). The 5× inference budget matches our maximum turn setting of five, ensuring that comparisons are fair, and in some cases even favorable to the baselines. A closer look at Figure 3 reveals distinct strengths and limitations for each model. For example, Gemini excels on RLPR and MATH500 but shows moderate performance on LiveCodeBench, where GPT-5 dominates. Remarkably, Trinity achieves the best performance across all tasks, demonstrating its ability to dynamically leverage each model's strengths and compose them effectively for different challenges. To further contextualize Trinity's capability, we also include an upper bound ("Per-Question-Best") representing the performance achieved by taking the union of all correct answers from the seven LLMs in the pool. Our method approaches this limit closely on three of four tasks, demonstrating its ability to harness the collective capabilities of the model ensemble. Trinity also exhibits upper-tier token efficiency compared to other methods, especially coordination methods (see Appendix A.7.4 for a detailed comparison).
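The "Per-Question-Best" upper bound is simple to compute from per-question correctness flags. The sketch below uses made-up flags for three hypothetical models on five questions, purely to show the union construction; none of the numbers come from the paper.

```python
# Hypothetical per-question correctness for three models on five questions.
correct = {
    "model-a": [1, 0, 1, 0, 0],
    "model-b": [0, 1, 1, 0, 0],
    "model-c": [1, 0, 0, 1, 0],
}

def accuracy(flags):
    """Fraction of questions answered correctly."""
    return sum(flags) / len(flags)

# A question counts as solved if any model in the pool solved it (union of correct answers).
per_question_best = [int(any(col)) for col in zip(*correct.values())]

single_best = max(accuracy(flags) for flags in correct.values())
upper_bound = accuracy(per_question_best)
print(single_best, upper_bound)
```

In this toy case no single model exceeds 0.4, yet the union reaches 0.8. That gap between the best single model and the union is the headroom a coordinator can try to capture.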

4.3 Zero-shot transfer to unseen tasks

Table 1: Performance Across Hold-Out Tasks

Figure 4: LiveCodeBench Results. Top: Trinity achieves state-of-the-art. Bottom: Trinity benefits from increasing maximum turns budgets. This suggests that Trinity does more than simply select the best agent for a task.

To assess Trinity's generalization capability, we tested its zero-shot performance on four held-out benchmarks. As summarized in Table 1, Trinity achieves the highest average score (54.21) and outperforms every individual baseline on each of the four tasks. It secures top performance on AIME (50.00), MT-Bench (9.60) and GPQA-D (76.82), and ties for first on BigCodeBench (35.80). This result highlights a key advantage of our approach. While individual models exhibit specialization strengths and weaknesses (e.g., Gemini Pro 2.5 and GPT-5 perform better on reasoning tasks compared to coding benchmarks, and Claude-4-Sonnet shows relatively balanced performance), Trinity delivers consistent results across all domains. It effectively synthesizes the capabilities of the entire pool to achieve emergent performance that surpasses any single constituent model. Surprisingly, we find that Qwen3-32B reasoning mode underperforms direct mode on BigCodeBench; this is mostly due to its verbose reasoning, which causes formatting failures in certain test cases.

4.4 Unleashing full power

Due to hardware constraints in serving open-source models, we limited the maximum output length for all LLMs in the pool for fair comparisons in the previous experiments. For the LiveCodeBench task, the coordinator’s LLM selection narrows down to the three closed-source models after training. This allows us to remove the output length constraint and observe the full power of Trinity on LiveCodeBench. Note that we simply remove the constraint and do not retrain Trinity.

In Figure 4 (top), Trinity demonstrates substantial improvements over the constituent models and achieves state-of-the-art performance with a pass@1 score of 0.862 on LiveCodeBench V6, which consists of newly released questions spanning January to April 2025. This represents a significant improvement over leading baselines: GPT-5 (0.838), Gemini 2.5 Pro (0.672), and Claude-4-Sonnet (0.465). In addition, Figure 4 (bottom) shows that Trinity benefits from increasing the maximum number of collaboration turns, improving from 0.823 to 0.863 as turns increase from 2 to 6. This pattern makes intuitive sense and implies that Trinity’s capability stems from complex coordination and goes beyond naive routing.

4.5 Ablation studies

We conduct ablation studies to verify the effectiveness of our design choices, with results summarized in Table 2. First, removing the singular value fine-tuning consistently lowers scores, confirming the benefit of adapting the coordinator model’s internal representation directly, which allows it to generate more effective signals for the head. Next, gradually removing the tri-role selection (first the thinker role, then the entire tri-role selection) proves detrimental to complex reasoning, causing substantial degradation on MATH500 (-6.0 points) and RLPR (-4.57 points). Additionally, switching to the final token, which often corresponds to a semantically sparse EOS token, causes a severe performance collapse, particularly on LiveCodeBench (a drop of more than 10 points). Finally, when we remove agent selection and instead send all queries to a single fixed agent while retaining only role selection, performance is significantly undermined. Together, these findings underscore the necessity of the full Trinity design.

Table 2: Ablation study results. We compare the performance on in-distribution tasks when we (1) remove the singular value fine-tuning in the SLM; (2) remove the thinker-role selection; (3) remove the tri-role selection; (4) use the last instead of the penultimate token; and (5) remove agent selection but keep role selection. Trinity achieves the best overall performance.

4.6 Separability in representation space

The success of our lightweight coordinator depends on a well-structured representation space where hidden states are separable by task. We verify this by extracting hidden states from the coordinator during in-distribution runs and analyzing their linear separability using a suite of methods, including a linear classifier (SVM) and dimensionality reduction (t-SNE).

Figure 5: Task type separability in extracted hidden states. Both are based on penultimate-token hidden states processed by the SLM on the input sequence, and the labels are from the task metadata.

Appendix A.3 reports the full results, and Figure 5 presents two key analyses of the representation space. A linear SVM achieves perfect classification, far above chance level (0.25 for four classes), indicating near-perfect linear separability. The t-SNE visualization likewise exhibits clear, well-separated clusters, corroborating strong non-linear separability. This high degree of separability is a key factor that enables our lightweight, linear head to make effective coordination decisions with extreme parameter efficiency. Additional experiments in Appendix A.3 also indicate a positive correlation between the separability in the representation space and the coordinator’s performance.

4.7 Separability in problem objective

Changing the head architecture (see Appendix A.4) not only alters coordinator performance, but also reveals structural properties of the problem objective, namely the mapping from hidden states to agent/role choices that maximizes downstream task reward. Table 3 shows that linear is the most reliable choice overall across LiveCodeBench, RLPR, MATH500, and MMLU, with sparse edging it out by a negligible margin on MMLU only. The block-diagonal-10 head paired with an argmax output conversion is intentionally designed to maximize independence among the ten logits (one block per agent/role), thereby suppressing inter-logit correlations. In terms of parameter count, this head uses only $d_h$ weights (about $10\times$ fewer than the linear head’s $d_h n_a$; e.g., $1{,}024$ vs. $10{,}240$ when $d_h = 1024$, $n_a = 10$) and still retains competitive mid-tier performance. Importantly, argmax further increases independence by removing the softmax simplex constraint. With argmax, decisions depend only on the largest logit, so perturbations to non-maximal blocks neither reduce nor redistribute probability mass, which reduces cross-block interference in both inference and fitness attribution. This result suggests strong block-$\varepsilon$ separability (Definition 1) as a property of the coordination objective, in addition to the geometric separability of hidden states studied in Section 4.6.

Table 3: Results by varying heads and output conversion. By default, the output conversion is softmax normalization. For block-diagonal-10, the output conversion is argmax.
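To make the parameter counts and the argmax argument in Section 4.7 concrete, here is a minimal NumPy sketch. The exact head layout is described in Appendix A.4, which is not reproduced here, so the slicing scheme below (each logit reads one contiguous slice of the hidden state) is an assumption, and the random weights and hidden state are stand-ins rather than trained values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_h, n_a = 1024, 10                    # sizes quoted in the text
rng = np.random.default_rng(0)
h = rng.standard_normal(d_h)           # stand-in for a penultimate-token hidden state

# Dense linear head: every logit reads the full hidden state (d_h * n_a weights).
W_linear = 0.01 * rng.standard_normal((n_a, d_h))
logits_linear = W_linear @ h

# Block-diagonal head (assumed layout): logit k reads only slice k (d_h weights).
w_block = 0.01 * rng.standard_normal(d_h)
slices = np.array_split(np.arange(d_h), n_a)
logits_block = np.array([w_block[idx] @ h[idx] for idx in slices])

n_params_linear = W_linear.size        # 10240
n_params_block = w_block.size          # 1024

# Lowering a non-maximal logit leaves the argmax decision unchanged,
# while the softmax distribution is redistributed.
choice = int(np.argmax(logits_block))
other = (choice + 1) % n_a
perturbed = logits_block.copy()
perturbed[other] -= 1.0
same_choice = int(np.argmax(perturbed)) == choice
prob_shift = float(np.abs(softmax(perturbed) - softmax(logits_block)).max())
```

The last lines illustrate the independence property: under argmax, a change confined to a non-winning block has no effect on the selected agent/role, whereas under softmax the same change shifts probability mass across all entries.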

4.8 Sep-CMA-ES vs random search vs REINFORCE vs Supervised Fine-tuning

To empirically demonstrate the advantages of sep-CMA-ES for our setting (Section 2), we compare it against REINFORCE (Williams, 1992), supervised fine-tuning (SFT), and random search (RS) with fitness averaging, which is appropriate for binary rewards (see Appendix A.5). Table 4 shows that sep-CMA-ES outperforms the other algorithms for training the coordinator, consistent with our theory (Section 3.3, Appendix A.1). REINFORCE exhibits jagged, high-variance learning curves with weak overall progress, which is expected under terminal (binary) rewards and weak parameter correlation.

Table 4: Comparison of sep-CMA-ES with REINFORCE, SFT, and RS. We compare the performance on in-distribution tasks for four learning algorithms under comparable budgets.

Figure 6: LLM selection distribution evolves as the coordinator learning progresses. Left: Distribution evolution from sep-CMA-ES. Right: Distribution evolution from REINFORCE.

As shown in Figure 6, sep-CMA-ES adapts to a meaningful agent selection distribution that favors high-performing LLMs. By contrast, REINFORCE maintains an almost uniform selection pattern, indicating ineffective policy improvement. Although not shown in the figure, RS often collapses to unipolar choices, over-selecting a single agent or role and thereby significantly limiting the diversity of agents and roles, which degrades performance. While SFT achieves competitive gains, it does not scale to multi-turn coordination due to the prohibitive cost of label generation (see Appendix A.2).
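For readers unfamiliar with evolution strategies under binary rewards, the following is a deliberately simplified sketch of a diagonal-covariance ES with fitness averaging. It is not the authors’ implementation and not full sep-CMA-ES: it omits the evolution paths and step-size adaptation that sep-CMA-ES uses, and the objective `expected_reward`, the problem size, and all hyperparameters are hypothetical choices made only for illustration.

```python
import numpy as np

def expected_reward(theta):
    # Toy stand-in for J(theta): success probability peaks at theta = 0.
    return float(np.exp(-np.sum(theta ** 2) / (2 * theta.size)))

def averaged_fitness(theta, m_rep, rng):
    # Fitness averaging: mean of m_rep binary (Bernoulli) rewards.
    return rng.binomial(m_rep, expected_reward(theta)) / m_rep

def diagonal_es(theta0, iters=60, sigma=0.5, m_rep=32, c_cov=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = theta0.size
    lam = int(np.ceil(4 + 3 * np.log(n)))            # population size
    mu = lam // 2                                    # number of parents
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                     # positive, decreasing weights
    mean, var = theta0.astype(float).copy(), np.ones(n)
    for _ in range(iters):
        z = rng.standard_normal((lam, n))
        candidates = mean + sigma * np.sqrt(var) * z
        fitness = np.array([averaged_fitness(c, m_rep, rng) for c in candidates])
        best = np.argsort(-fitness)[:mu]             # rank-based selection (maximize)
        mean = mean + sigma * np.sqrt(var) * (w @ z[best])
        # Per-coordinate (diagonal) variance update from the selected samples.
        var = (1 - c_cov) * var + c_cov * var * (w @ (z[best] ** 2))
    return mean, var

theta0 = np.full(10, 2.0)
theta_final, var_final = diagonal_es(theta0)
print(expected_reward(theta0), expected_reward(theta_final))
```

The update only uses the ranking of averaged binary rewards, never their gradients, which is the property that makes this family of methods a natural fit for terminal, sparse rewards.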

5 Related Works

We use model fusion to refer to methods that combine multiple models into a more capable system. Prior work divides into two complementary levels: micro-level fusion in parameter space, where weights of parent models are merged into a child model, and macro-level fusion in data-flow space, where activations or outputs are passed across fixed models or model components.

Micro-level. Early micro-level approaches rely on static recipes such as weight averaging or task-balanced interpolation to integrate multiple model capabilities across domains with minimal computation (Goddard et al., 2024). More recent work has introduced optimization-based methods for model merging: for example, an evolutionary framework that searches over “merging recipes”, demonstrating that learned strategies can outperform hand-designed ones and yield stronger generalization (Akiba et al., 2025). However, because micro-level model fusion is performed in parameter space, these methods face the core limitation that they require access to model weights and impose compatibility requirements (Yadav et al., 2023; Yu et al., 2024). This confines their applicability to open-source checkpoints, while excluding the closed-source models that currently define the frontier of performance. Consequently, micro-level fusion cannot incorporate the strongest available models, motivating the exploration of data-space approaches that treat models as black boxes.

Macro-level. Model fusion in the data-flow space can itself be performed at multiple degrees. Earlier works have allowed propagation of tensors through layers taken from different models (Bansal et al., 2021) or sequential processing of individual tokens by different models (Muqeeth et al., 2024). Our work relates most closely to a broader view of macro-level model fusion in which methods create stronger singular models by scaffolding or routing between multiple agents. In particular, multi-agent scaffolding techniques like Mixture of Agents (MoA) (Wang et al., 2024) and Multi-Agent Debate (MAD) (Liang et al., 2023) form networks of agents which can extract capabilities from each individual model. Routing methods, such as Smoothie (Guha et al., 2024) or RouterDC (Chen et al., 2024), aim to choose the best model or model response for a given question. Similarly, MasRouter (Yue et al., 2025) combines both by routing agents and human-designed scaffolds to form an adaptive multi-agent model per question. These methods rely on expensive multi-model inference or static, human-designed collaboration patterns. In contrast, Trinity introduces a lightweight, learned coordinator that adaptively assigns dynamic roles to LLMs, using the contextual representation generated by an SLM.

6 Conclusions

In this work, we introduce Trinity, a framework demonstrating that a lightweight coordinator can orchestrate diverse LLMs to achieve state-of-the-art performance. Trinity leverages a tri-role protocol and is trained with a highly efficient evolutionary strategy; our results suggest that a promising path forward lies in engineering collaborative AI ecosystems rather than scaling monolithic models. A key limitation, however, is the gap between abstract reasoning and grounded execution, as the system can devise plans involving tools but cannot yet act on them. Future work will therefore focus on integrating a more heterogeneous pool of agents, including code interpreters and APIs, to bridge this gap and create a more general and capable problem-solving system.

Acknowledgements

We thank Koshi Eguchi and Kou Misaki for the infrastructure support, and the entire Sakana AI R&D team for their valuable comments and suggestions.

Author Contributions

Jinglue Xu designed and curated the training datasets, conducted both theoretical and empirical analyses, and led the subsequent algorithmic and implementation development. Qi Sun proposed, named, and implemented the role selection algorithm, and led the training and evaluation experiments. Stefan Nielsen designed training and evaluation configurations, implemented baselines, and contributed to shaping the trinity roles. Edoardo Cetin contributed to shaping early algorithmic designs and advised the project. Yujin Tang initiated and led the project, implemented the initial algorithm, and conducted the first experiments. All authors contributed to the experimental design and paper writing.

Ethics statement. Our approach focuses on collaboration between agents to achieve better performance on existing benchmarks. As this work involves only computational improvements to established evaluation tasks without involving human subjects, sensitive data, or potential misuse applications, we identify no ethical concerns.

Reproducibility statement. To ensure full reproducibility of our results, we provide comprehensive resources in the supplementary material, including source code and trained model weights. We also detail all model and task selection decisions within the paper. All base models and datasets used in this work are publicly available.

References

T. Akiba, M. Shing, Y. Tang, Q. Sun, and D. Ha (2025). Evolutionary optimization of model merging recipes. Nature Machine Intelligence 7(2), pp. 195–204.
Z. Allen-Zhu and Y. Li (2023). Physics of language models: part 3.1, knowledge storage and extraction. arXiv preprint arXiv:2309.14316.
Anthropic (2025). Claude sonnet 4. https://www.anthropic.com/claude/sonnet. Accessed: 2025-08-29.
G. Bai, J. Liu, X. Bu, Y. He, J. Liu, Z. Zhou, Z. Lin, W. Su, T. Ge, B. Zheng, et al. (2024). Mt-bench-101: a fine-grained benchmark for evaluating large language models in multi-turn dialogues. arXiv preprint arXiv:2402.14762.
Y. Bansal, P. Nakkiran, and B. Barak (2021). Revisiting model stitching to compare neural representations. Advances in Neural Information Processing Systems 34, pp. 225–236.
S. Chen, W. Jiang, B. Lin, J. Kwok, and Y. Zhang (2024). Routerdc: query-based router by dual contrastive learning for assembling large language models. Advances in Neural Information Processing Systems 37, pp. 66305–66328.
G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. (2025). Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
X. Glorot and Y. Bengio (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
C. Goddard, S. Siriwardhana, M. Ehghaghi, L. Meyers, V. Karpukhin, B. Benedict, M. McQuade, and J. Solawetz (2024). Arcee’s MergeKit: a toolkit for merging large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, F. Dernoncourt, D. Preoţiuc-Pietro, and A. Shimorina (Eds.), Miami, Florida, US, pp. 477–485.
N. Guha, M. Chen, T. Chow, I. Khare, and C. Re (2024). Smoothie: label free language model routing. Advances in Neural Information Processing Systems 37, pp. 127645–127672.
D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025). Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
N. Hansen, S. D. Müller, and P. Koumoutsakos (2003). Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (cma-es). Evolutionary Computation 11(1), pp. 1–18.
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2020). Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. (2022). Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica (2024). Livecodebench: holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.
J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
D. P. Kingma and J. Ba (2017). Adam: a method for stochastic optimization. arXiv:1412.6980.
S. Kuroki, T. Nakamura, T. Akiba, and Y. Tang (2024). Agent skill acquisition for large language models via cycleqd. arXiv preprint arXiv:2410.14735.
T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang, Y. Yang, S. Shi, and Z. Tu (2023). Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118.
H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023). Let’s verify step by step. arXiv preprint arXiv:2305.20050.
M. Muqeeth, H. Liu, Y. Liu, and C. Raffel (2024). Learning to route among specialized experts for zero-shot generalization. arXiv preprint arXiv:2402.05859.
OpenAI (2025). Introducing gpt-5. https://openai.com/index/introducing-gpt-5/. Accessed: 2025-08-29.
D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2024). Gpqa: a graduate-level google-proof q&a benchmark. In First Conference on Language Modeling.
R. Ros and N. Hansen (2008). A simple modification in cma-es achieving linear time and space complexity. In International Conference on Parallel Problem Solving from Nature, pp. 296–305.
Q. Sun, E. Cetin, and Y. Tang (2025). Transformer-squared: self-adaptive LLMs. In The Thirteenth International Conference on Learning Representations.
G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, et al. (2025). Gemma 3 technical report. arXiv preprint arXiv:2503.19786.
H. Veeraboina (2023).
J. Wang, J. Wang, B. Athiwaratkun, C. Zhang, and J. Zou (2024). Mixture-of-agents enhances large language model capabilities. arXiv preprint arXiv:2406.04692.
R. J. Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3), pp. 229–256.
M. Wortsman, G. Ilharco, S. Y. Gadre, R. Roelofs, R. Gontijo-Lopes, A. S. Morcos, H. Namkoong, A. Farhadi, Y. Carmon, S. Kornblith, et al. (2022). Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pp. 23965–23998.
P. Yadav, D. Tam, L. Choshen, C. A. Raffel, and M. Bansal (2023). Ties-merging: resolving interference when merging models. Advances in Neural Information Processing Systems 36, pp. 7093–7115.
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.
E. Yang, L. Shen, G. Guo, X. Wang, X. Cao, J. Zhang, and D. Tao (2024). Model merging in llms, mllms, and beyond: methods, theories, applications and opportunities. arXiv preprint arXiv:2408.07666.
L. Yu, B. Yu, H. Yu, F. Huang, and Y. Li (2024). Language models are super mario: absorbing abilities from homologous models as a free lunch. In Forty-first International Conference on Machine Learning.
T. Yu, B. Ji, S. Wang, S. Yao, Z. Wang, G. Cui, L. Yuan, N. Ding, Y. Yao, Z. Liu, et al. (2025). RLPR: extrapolating rlvr to general domains without verifiers. arXiv preprint arXiv:2506.18254.
Y. Yue, G. Zhang, B. Liu, G. Wan, K. Wang, D. Cheng, and Y. Qi (2025). Masrouter: learning to route llms for multi-agent systems. arXiv preprint arXiv:2502.11133.
T. Y. Zhuo, M. C. Vu, J. Chim, H. Hu, W. Yu, R. Widyasari, I. N. B. Yusuf, H. Zhan, J. He, I. Paul, et al. (2024). BigCodeBench: benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877.

Appendix A

A.1 Theoretical analysis of sep-CMA-ES

In this section, we compare sep-CMA-ES with random search (RS) for maximizing $J$ over $\mathcal{P}$ under binary rewards and strict budgets. All analyses are carried out in a covariance-normalized chart and mapped back through the current diagonal $D_t$, fixing the metric mismatch between selection (in whitened coordinates tied to $\mathcal{P}$) and stepping (in the original coordinates of $\mathcal{P}$). A Hessian-based block-$\varepsilon$ separability condition on $g := -J$ controls inter-block couplings after a positive diagonal scaling and links to the algorithm’s dynamic scaling via a diagonal-comparability assumption. Concentration bounds translate replication $m$ into rank-quality attenuation without moment swapping. We also instantiate the specific case in our study ($n \approx 10000$, $m_{\mathrm{RS}} = 32$, $m_{\mathrm{CMA}} = 16$), directly tying the representation $h(s) \in \mathcal{H}$ and the coordinator parameters $\theta \in \mathcal{P}$ to the observed efficiency of sep-CMA-ES.

A.1.1 Definitions and Assumptions

Notations.

We optimize $J(\theta) = \mathbb{E}[R(\tau)]$ over the coordination space $\mathcal{P} \subset \mathbb{R}^{n}$ induced by the head $f_{\theta} : \mathcal{H} \to \mathbb{R}^{|\mathcal{A}|}$ acting on representation states $h(s) \in \mathcal{H} \subset \mathbb{R}^{d}$. Set $g(\theta) := -J(\theta)$ and $H(\theta) := \nabla^{2} g(\theta)$. We analyze contraction toward the origin (w.l.o.g. by re-centering $\mathcal{P}$) in a compact domain $\mathcal{D} \subset \mathcal{P}$. sep-CMA-ES maintains mean (iterate) $m_t \in \mathcal{P}$ with radius $r_t := \|m_t\|$, step-size $\sigma_t > 0$, and diagonal scaling

$$D_t = \mathrm{diag}(\sqrt{s_{1,t}}, \ldots, \sqrt{s_{n,t}}) \succ 0, \qquad y = m_t + \sigma_t D_t z, \quad z \sim \mathcal{N}(0, I_n).$$

Whitened chart: $x = D_t^{-1}(y - m_t) = \sigma_t z$ (isotropic sampling). Direction: $u_t := m_t / \|m_t\|$, projection $Z_{\parallel} := \langle u_t, z \rangle \sim \mathcal{N}(0, 1)$. Population size $\lambda = \lceil 4 + 3 \ln n \rceil\ (\geq 2)$, $\mu$ parents with weights $(w_j)_{j=1}^{\mu}$; $z_{j:\lambda}$ are order statistics. RS uses a fixed replication count per candidate; in this appendix we set $m_{\mathrm{RS}} := 32$. The atomic budget $B_{\mathrm{env}}$ counts Bernoulli calls.

Blocks, scaling, and operators.

Let $\{B_1, \dots, B_M\}$ partition $\{1, \dots, n\}$ (coordinate blocks in $\mathcal{P}$). For any matrix $M$, $\operatorname{off}(M)$ zeroes its diagonal; $\operatorname{off_{inter}}(M)$ zeroes diagonal and within-block entries. For diagonal $D$, let $s_{\max}(D)$, $s_{\min}(D)$ be its largest/smallest diagonal square-roots and define

$$\kappa_D := \frac{s_{\max}(D)^{2}}{s_{\min}(D)^{2}}, \qquad \kappa_D(t) := \frac{s_{\max}(D_t)^{2}}{s_{\min}(D_t)^{2}}.$$

Definition 1 (Hessian-based block-$\varepsilon$ separability in $\mathcal{P}$)

There exists a structural diagonal scaling $S = \mathrm{diag}(s_1, \dots, s_n) \succ 0$ such that the scaled Hessian $H_S(\theta) := S^{1/2} H(\theta) S^{1/2}$ is uniformly nearly block-diagonal on $\mathcal{D}$. With $D(\theta) := \mathrm{diag}(H_S(\theta))$, one of the following dimensionless bounds holds with a common $\varepsilon_H \in [0, 1)$:

$$\sup_{\theta \in \mathcal{D}} \big\| D(\theta)^{-1/2}\, \operatorname{off_{inter}}\!\big(H_S(\theta)\big)\, D(\theta)^{-1/2} \big\|_2 \leq \varepsilon_H, \tag{B1}$$

$$\sup_{\theta \in \mathcal{D}} \max_{\substack{i \in B_p,\ j \in B_q \\ p \neq q}} \frac{|[H_S(\theta)]_{ij}|}{\sqrt{[H_S(\theta)]_{ii} [H_S(\theta)]_{jj}}} \leq \varepsilon_H, \tag{B2}$$

$$\sup_{\theta \in \mathcal{D}} \max_{i \in B_p} \frac{\sum_{j \in B_q,\ q \neq p} |[H_S(\theta)]_{ij}|}{[H_S(\theta)]_{ii}} \leq \varepsilon_H\ (< 1). \tag{B3}$$

Within-block structure is unrestricted; $0 < \mu_i \leq [H_S(\theta)]_{ii} \leq L_i < \infty$ on $\mathcal{D}$.

Assumption 1 (Diagonal comparability)

There exist constants $c_{\mathrm{cmp}}, C_{\mathrm{cmp}} > 0$ such that for all $t, i$,

$$c_{\mathrm{cmp}} \leq \frac{s_i}{s_{i,t}} \leq C_{\mathrm{cmp}}, \qquad \text{equivalently } C_{\mathrm{cmp}} / c_{\mathrm{cmp}} = O\!\big(\sup_t \kappa_D(t)\big).$$

This links the structural scaling $S$ in Definition 1 to the algorithm’s dynamic scaling $D_t$.

Definition 2 (Metric–alignment factor)

For any unit $u$ and diagonal $D \succ 0$,

$$\chi(u, D) := \frac{(u^{\top} D u)^{2}}{u^{\top} D^{2} u} \in \Big[\frac{1}{\kappa_D},\, 1\Big],$$

the squared correlation between the ranking score $\langle u, z \rangle$ (whitened) and the progress score $\langle D u, z \rangle$ (original metric on $\mathcal{P}$).
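As a quick numerical sanity check of Definition 2, the sketch below evaluates the metric–alignment factor for a random diagonal scaling and compares it with a Monte Carlo estimate of the squared correlation between the two scores. The dimension, the range of the diagonal entries, and the sample size are arbitrary illustrative choices, and $\kappa_D$ is computed under the assumption that $s_{\max}(D)$ and $s_{\min}(D)$ denote the largest and smallest diagonal entries of $D$.

```python
import numpy as np

def kappa(d):
    # Condition number of the diagonal scaling; d holds the diagonal entries of D.
    return float((d.max() / d.min()) ** 2)

def chi(u, d):
    # (u^T D u)^2 / (u^T D^2 u) for a unit vector u and D = diag(d).
    return float((u @ (d * u)) ** 2 / (u @ (d ** 2 * u)))

rng = np.random.default_rng(0)
n = 50
d = rng.uniform(0.5, 2.0, size=n)
u = rng.standard_normal(n)
u /= np.linalg.norm(u)

chi_value = chi(u, d)
lower_bound = 1.0 / kappa(d)

# Monte Carlo estimate of corr(<u, z>, <D u, z>)^2 with z ~ N(0, I_n).
z = rng.standard_normal((20_000, n))
ranking_score = z @ u
progress_score = z @ (d * u)
corr_sq = float(np.corrcoef(ranking_score, progress_score)[0, 1] ** 2)
print(lower_bound, chi_value, corr_sq)
```

The upper bound $\chi \leq 1$ follows from the Cauchy–Schwarz inequality, and $\chi = 1$ is attained when $D$ is a multiple of the identity, i.e. when selection and stepping use the same metric.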

Assumption 2 (Local linear score with curvature remainder)

There exist $\gamma > 0$, $L_{\mathrm{curv}} \geq 0$, and a step-size window such that along the trajectory (in whitened coordinates)

$$J(m_t + \sigma D_t z) = \tfrac{1}{2} + \gamma\, \sigma\, \langle u_t, z \rangle + \xi_t(z), \qquad |\xi_t(z)| \leq (L_{\mathrm{curv}} + c_H \varepsilon_H)\, \sigma^{2} \|z\|^{2},$$

with constant $c_H > 0$.

Definition 3 (Rank attenuation under replication)

Let $N = \lfloor B_{\mathrm{env}} / m_{\mathrm{RS}} \rfloor$ be the total number of RS candidates evaluated under the budget and $Z_{\parallel}^{\star} := \min_{1 \leq k \leq N} \langle u_t, z^{(k)} \rangle$. With $x_{-} := \min\{x, 0\}$,

$$\tilde{\rho}_{\mathrm{RS}}^{2} := \frac{\mathbb{E}\!\left[\big(Z_{\parallel}^{\mathrm{sel}}\big)_{-}^{2}\right]}{\mathbb{E}\!\left[\big(Z_{\parallel}^{\star}\big)_{-}^{2}\right]} \in [0, 1], \qquad \tilde{\rho}_{\mathrm{CMA}}^{2} := \frac{\mathbb{E}\!\left[\left\langle u_t, \sum_{j=1}^{\mu} w_j z_{j:\lambda}^{(\widehat{q}_{m_{\mathrm{CMA}}})} \right\rangle^{2}\right]}{\mathbb{E}\!\left[\left\langle u_t, \sum_{j=1}^{\mu} w_j z_{j:\lambda}^{(-Z_{\parallel})} \right\rangle^{2}\right]} \in [0, 1].$$

Assumption 3 (Metric–alignment comparability)

There exists $C_{\chi} \geq 1$ such that for relevant $t$,

$$\frac{1}{C_{\chi}} \leq \frac{\chi(u_t, D_t) / \kappa_D(t)}{\chi(u_0, D_0) / \kappa_D(0)} \leq C_{\chi}.$$

Thus the alignment efficiency $\chi(u_t, D_t) / \kappa_D(t)$ stays within a bounded factor of its initial value.

A.1.2 sep-CMA-ES vs random search with fitness averaging

Rank noise and attenuation.

Consider two candidates $z_1, z_2$ in the same batch with linear score gap $\Delta := \gamma \sigma |\langle u_t, z_1 - z_2 \rangle|$. Averaging $m$ Bernoulli draws per candidate yields the misorder bound

$$\Pr\!\Big[\widehat{f}(m_t + \sigma D_t z_1) \leq \widehat{f}(m_t + \sigma D_t z_2)\ \text{but}\ J(m_t + \sigma D_t z_1) > J(m_t + \sigma D_t z_2)\Big] \leq C e^{-c\, m\, \Delta^{2}} + \Pr(\mathsf{curv} > \tfrac{\Delta}{2}),$$

where the curvature event $\{\mathsf{curv} > \Delta/2\}$ is due to $\xi_t$ and admits the tail

$$\Pr(\mathsf{curv} > \tfrac{\Delta}{2}) \leq C' \exp\!\Big(-c' \frac{\Delta}{\sigma^{2} \varepsilon_H}\Big) + O(\varepsilon_H).$$

Hence, for $\sigma$ in a local monotonicity window (Assumption 2) the signal-to-curvature ratio is of order $1/\varepsilon_H$, giving an exponential suppression of curvature-induced flips. To scale this pairwise guarantee to batch selection, restrict attention to the $O(\log N)$ (RS) or $O(\log \lambda)$ (CMA) most competitive order statistics: by extreme-value theory, the typical spacing between the winner and the next competitors is $\Theta(1/\sqrt{\ln N})$, and union-bounding only within this top cluster yields

$$\tilde{\rho}_{\mathrm{RS}}^{2} \geq 1 - C_1 N \log N \cdot p_{\mathrm{flip}}(m_{\mathrm{RS}}) - C_2 \varepsilon_H, \qquad \tilde{\rho}_{\mathrm{CMA}}^{2} \geq 1 - C_1 \lambda \log \lambda \cdot p_{\mathrm{flip}}(m_{\mathrm{CMA}}) - C_2 \varepsilon_H,$$

with $p_{\mathrm{flip}}(m) \lesssim e^{-c m \gamma^{2} \sigma^{2}} + e^{-c'/\varepsilon_H} + O(\varepsilon_H)$. In particular, choosing

$$m \geq \frac{1}{c\, \gamma^{2} \sigma^{2}} \Big(\ln N + \ln \ln N + \ln \tfrac{1}{\delta}\Big) \quad \text{or} \quad m \geq \frac{1}{c\, \gamma^{2} \sigma^{2}} \Big(\ln \lambda + \ln \ln \lambda + \ln \tfrac{1}{\delta}\Big)$$

ensures $\tilde{\rho}^{2} \geq 1 - \delta - O(\varepsilon_H)$ for RS or CMA respectively. This gives a direct budget–replication trade-off inside $\mathcal{P}$.
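The exponential term in the misorder bound can be illustrated with a small simulation. The sketch below ignores the curvature remainder and uses Hoeffding’s inequality, which gives the concrete constants $C = 1$ and $c = 1/2$ for the difference of two averages of $m$ Bernoulli draws; the success probabilities `p_good` and `p_bad` are hypothetical values chosen only for illustration.

```python
import numpy as np

def misorder_rate(p_good, p_bad, m, trials, rng):
    # Fraction of trials in which the truly better candidate fails to score
    # strictly higher after averaging m Bernoulli rewards per candidate.
    good = rng.binomial(m, p_good, size=trials) / m
    bad = rng.binomial(m, p_bad, size=trials) / m
    return float(np.mean(good <= bad))

rng = np.random.default_rng(0)
p_good, p_bad = 0.60, 0.40           # hypothetical true success probabilities
delta = p_good - p_bad               # true score gap

results = {}
for m in (4, 16, 64):
    empirical = misorder_rate(p_good, p_bad, m, trials=20_000, rng=rng)
    hoeffding = float(np.exp(-m * delta ** 2 / 2))   # bound exp(-m * delta^2 / 2)
    results[m] = (empirical, hoeffding)
    print(m, empirical, hoeffding)
```

Increasing the replication count lowers the misordering rate, at the cost of fewer distinct candidates per unit budget, which is the budget–replication trade-off discussed above.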

Budget-normalized single-round RS gain.

Let $Z_{\parallel}^{\star} = \min_{1 \leq k \leq N} \langle u_0, z^{(k)} \rangle$ and $v_N^{2} := \mathbb{E}[(-Z_{\parallel}^{\star})^{2}] = 2 \ln N + O(\ln \ln N)$. Define the high-probability event controlling batch norms

$$E_N := \Big\{ \max_{1 \leq k \leq N} \|z^{(k)}\|^{2} \leq n + 2\sqrt{n t} + 2t \Big\}, \qquad t = \ln(N / c_0),$$

so that $\Pr(E_N) \geq 1 - c_0$ by a Laurent–Massart tail plus a union bound. On $E_N$, the oracle step along $D_0 z^{\mathrm{sel}}$ (with $z^{\mathrm{sel}}$ the noisy-rank-selected candidate) is

$$\sigma^{\star} = \Big(-\frac{\langle m_0, D_0 z^{\mathrm{sel}} \rangle}{\|D_0 z^{\mathrm{sel}}\|^{2}}\Big) \vee 0, \qquad r_0^{2} - \|m_0 + \sigma^{\star} D_0 z^{\mathrm{sel}}\|^{2} = r_0^{2}\, \frac{\big(\langle u_0, D_0 z^{\mathrm{sel}} \rangle\big)_{-}^{2}}{\|D_0 z^{\mathrm{sel}}\|^{2}}.$$

Because selection is driven by $\langle u_0, z \rangle$ in the whitened chart and geometric progress depends on $\langle D_0 u_0, z \rangle$, the squared correlation factor $\chi(u_0, D_0) = \frac{(u_0^{\top} D_0 u_0)^{2}}{u_0^{\top} D_0^{2} u_0} \in [1/\kappa_D, 1]$ appears multiplicatively in the numerator’s expectation, while the denominator is controlled by $\kappa_D = s_{\max}(D_0)^{2} / s_{\min}(D_0)^{2}$ on $E_N$. After integrating out the event complement (which contributes $O\big(\sqrt{c_0\, \mathbb{E}[(Z_{\parallel}^{\mathrm{sel}})_{-}^{4}]}\big) = O((\ln N)^{1/2} N^{-1/2})$), we obtain

$$\frac{r_0^{2} - \mathbb{E}[\min_{\sigma \geq 0} \|m_0 + \sigma D_0 z^{\mathrm{sel}}\|^{2}]}{r_0^{2}} \geq \chi(u_0, D_0) \cdot \frac{(1 - \delta_N)\, \tilde{\rho}_{\mathrm{RS}}^{\,2}\, v_N^{2}}{\kappa_D \big(n + 2\sqrt{n \ln(N/c_0)} + 2 \ln(N/c_0)\big)} - C \varepsilon_H, \tag{1}$$

for a universal $C > 0$ and $\delta_N = O((\ln N)^{1/2} N^{-1/2})$. A fixed $\sigma$ within the local monotonicity window loses only a universal constant factor.
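The closed-form oracle step and the resulting reduction in squared radius can be checked numerically. In the sketch below, the mean `m0`, the diagonal scaling `D0`, and the selected sample `z_sel` are random placeholders (no selection is actually performed); the check only confirms the algebraic identity for minimizing $\|m_0 + \sigma D_0 z^{\mathrm{sel}}\|^{2}$ over $\sigma \geq 0$.

```python
import numpy as np

def oracle_step(m0, d):
    # Best non-negative step size along d for minimizing ||m0 + s * d||^2.
    return max(-(m0 @ d) / (d @ d), 0.0)

rng = np.random.default_rng(1)
n = 6
m0 = rng.standard_normal(n)                  # placeholder mean
D0 = np.diag(rng.uniform(0.5, 2.0, size=n))  # placeholder diagonal scaling
z_sel = rng.standard_normal(n)               # placeholder "selected" sample

d = D0 @ z_sel
s_star = oracle_step(m0, d)
r0_sq = float(m0 @ m0)
u0 = m0 / np.sqrt(r0_sq)

# Left-hand side: direct reduction in squared radius after the oracle step.
gain_direct = r0_sq - float(np.sum((m0 + s_star * d) ** 2))
# Right-hand side: r0^2 * (<u0, d>)_-^2 / ||d||^2, with x_- = min(x, 0).
gain_formula = r0_sq * min(float(u0 @ d), 0.0) ** 2 / float(d @ d)
print(s_star, gain_direct, gain_formula)
```

When the direction points away from the origin, the negative part vanishes, the oracle step is zero, and both sides equal zero, so the gain is never negative.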

Per-iteration CMA gain and geometric regime.

Let

αμ,λ:=𝔼[⟨ut,∑j=1μwjzj:λ⟩],βμ,λ:=𝔼[‖∑j=1μwjzj:λ‖2],κμ,λ:=αμ,λ2βμ,λ=Θ(1/n),\alpha_{\mu,\lambda}:=\mathbb{E}\!\left[\Big\langle u_{t},\sum_{j=1}^{\mu}w_{j}z_{j:\lambda}\Big\rangle\right],\quad\beta_{\mu,\lambda}:=\mathbb{E}\!\left[\Big\|\sum_{j=1}^{\mu}w_{j}z_{j:\lambda}\Big\|^{2}\right],\quad\kappa_{\mu,\lambda}:=\frac{\alpha_{\mu,\lambda}^{2}}{\beta_{\mu,\lambda}}=\Theta(1/n),andκ¯μ,λ:=nκμ,λ=Θ(1)\bar{\kappa}_{\mu,\lambda}:=n\,\kappa_{\mu,\lambda}=\Theta(1). The oracle scalar step alongDt∑j=1μwjzj:λD_{t}\sum_{j=1}^{\mu}w_{j}z_{j:\lambda}yields

𝔼[rt2−rt+12]rt2≥χ(ut,Dt)⋅1κD(t)κμ,λρ~CMA2−CεH.\frac{\mathbb{E}[r_{t}^{2}-r_{t+1}^{2}]}{r_{t}^{2}}\ \geq\ \chi(u_{t},D_{t})\cdot\frac{1}{\kappa_{D}(t)}\,\kappa_{\mu,\lambda}\,\tilde{\rho}_{\mathrm{CMA}}^{\,2}\ -\ C\varepsilon_{H}.(2)The factorρ~CMA\tilde{\rho}_{\mathrm{CMA}}absorbs all rank noise effects (including sign inversions of the recombination direction);χ(ut,Dt)/κD(t)\chi(u_{t},D_{t})/\kappa_{D}(t)quantifies directional metric mismatch; and theO(εH)O(\varepsilon_{H})term accounts for inter-block perturbations. Under a standard diagonal learning rateccov=Θ(1/n)c_{\mathrm{cov}}=\Theta(1/n), block-εH\varepsilon_{H}separability and diagonal comparability imply that afterT=Θ(n)T=\Theta(n)iterationsDtD_{t}enters anO(εH)O(\varepsilon_{H})-neighborhood of a stationary point, with

$$\mathbb{E}[r_{t+1}^{2}\mid r_{t}]\ \leq\ \Big(1-\frac{\bar{\kappa}_{\mu,\lambda}}{n}\,\tilde{\rho}_{\mathrm{CMA}}^{\,2}\,(1-O(\varepsilon_{H}))\Big)\,r_{t}^{2}, \tag{3}$$

so the method achieves geometric decay at rate $\Omega(1/n)$ per iteration once stabilized.

Head-to-head ratio and multi-round RS.

Under a common atomic budget $B_{\mathrm{env}}$, CMA uses $m_{\mathrm{CMA}}\lambda$ evaluations per iteration, so $T=\lfloor B_{\mathrm{env}}/(m_{\mathrm{CMA}}\lambda)\rfloor$, while RS evaluates $N=\lfloor B_{\mathrm{env}}/m_{\mathrm{RS}}\rfloor$ candidates. Combining equation 1 and equation 2, and invoking Assumption 3 to cancel $\chi/\kappa_{D}$ up to a constant, gives

$$\frac{\text{CMA gain}}{\text{RS gain}}\ \gtrsim\ \frac{\bar{\kappa}_{\mu,\lambda}}{2}\cdot\frac{B_{\mathrm{env}}}{m_{\mathrm{CMA}}\lambda}\cdot\frac{n+2\sqrt{n\,\ln N}+2\ln N}{v_{N}^{2}}\cdot\frac{\tilde{\rho}_{\mathrm{CMA}}^{\,2}}{\tilde{\rho}_{\mathrm{RS}}^{\,2}}\ -\ C\varepsilon_{H},\qquad v_{N}^{2}\sim 2\ln N. \tag{4}$$

If RS expends its budget across $T$ rounds with fresh batches $N_{t}$ and fixed (or monotone) $\sigma$ within the window, then gains add roughly as $\sum_{t}\Theta((\ln N_{t})/n)$, which is at most $\Theta((\ln B_{\mathrm{env}})/n)$ for balanced $N_{t}$ and therefore still logarithmic in budget, whereas CMA accumulates *linearly* across iterations (until stabilization), explaining the systematic advantage in budget-tight regimes.

Trinity-scale instantiation.

For $n\approx 10000$, $\lambda=\lceil 4+3\ln n\rceil=\lceil 4+3\ln 10000\rceil=32$. With $m_{\mathrm{CMA}}=16$ and $m_{\mathrm{RS}}=32$, budget matching across $T$ CMA iterations yields $N=\big\lfloor(m_{\mathrm{CMA}}\lambda/m_{\mathrm{RS}})\,T\big\rfloor=\big\lfloor(16\cdot 32/32)\,T\big\rfloor\approx\lfloor 16\,T\rfloor$. This gives $v_{N}^{2}\approx 2\ln N$. Replication ensures $\tilde{\rho}_{\mathrm{CMA}}^{2}\approx 1$ (up to $O(\varepsilon_{H})$). Plugging these into equation 4 shows that with the same $B_{\mathrm{env}}$, CMA's gain dominates for modest $T$ (a few to a few dozen iterations), consistent with empirical results where the head acts on $h(s)\in\mathcal{H}$ and updates $\theta\in\mathcal{P}$ under strict budgets.
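As a quick numerical check of this instantiation, the sketch below recomputes $\lambda$, the budget-matched $N$, and the budget-dependent factor $T/\ln(\max\{e,N\})$ that appears in Proposition 1. The function names are illustrative and not taken from the paper; the constants $\bar{\kappa}_{\mu,\lambda}$, $\eta$, and $C$ are left out because the text does not give numerical values for them.

```python
import math


def population_size(n: int) -> int:
    """Population size lambda = ceil(4 + 3 ln n), as written in the text."""
    return math.ceil(4 + 3 * math.log(n))


def matched_rs_candidates(T: int, m_cma: int, lam: int, m_rs: int) -> int:
    """N = floor((m_CMA * lambda / m_RS) * T): RS candidates under the same budget."""
    return (m_cma * lam * T) // m_rs


def leading_ratio_factor(T: int, m_cma: int, lam: int, m_rs: int) -> float:
    """T / ln(max(e, N)): the factor that grows with the number of CMA iterations."""
    n_candidates = matched_rs_candidates(T, m_cma, lam, m_rs)
    return T / math.log(max(math.e, n_candidates))


lam = population_size(10_000)
for T in (2, 10, 60):
    N = matched_rs_candidates(T, 16, lam, 32)
    print(f"T={T:2d}  N={N:4d}  T/ln(N)={leading_ratio_factor(T, 16, lam, 32):.2f}")
```

The factor grows almost linearly in $T$ because the logarithm in the denominator changes slowly, which is the budget-tight advantage described above.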

Proposition 1

Fix $T\in[2,60]$ and let the CMA budget be $B_{\mathrm{env}}=m_{\mathrm{CMA}}\lambda T$. If the replication schedule ensures $\tilde{\rho}_{\mathrm{CMA}}/\tilde{\rho}_{\mathrm{RS}}\geq\eta\in(0,1]$ and the metric-alignment efficiency stays comparable across iterations (Assumption 3), then, up to an $O(\varepsilon_{H})$ term,

$$\frac{\text{CMA gain in }J}{\text{RS gain in }J}\ \gtrsim\ \frac{\bar{\kappa}_{\mu,\lambda}}{2}\cdot\frac{T}{\ln\!\big(\max\{e,\lfloor(m_{\mathrm{CMA}}\lambda/m_{\mathrm{RS}})\,T\rfloor\}\big)}\cdot\eta^{2}\ -\ \frac{C}{\ln\!\big(\max\{e,\lfloor(m_{\mathrm{CMA}}\lambda/m_{\mathrm{RS}})\,T\rfloor\}\big)}.$$

The inequality holds for oracle step-sizes and, up to a universal constant factor, for fixed step-sizes within the local monotonicity window (Assumption 2).

Proof of Proposition 1. Set $N=\lfloor(m_{\mathrm{CMA}}\lambda/m_{\mathrm{RS}})T\rfloor$ so both methods consume the same budget. Use equation 1 with $v_{N}^{2}=2\ln N+O(\ln\ln N)$ and $\delta_{N}=O((\ln N)^{1/2}N^{-1/2})$ to bound the RS improvement. Sum equation 2 over $t=0,\dots,T-1$ to get a CMA improvement of at least $\sum_{t}(\chi(u_{t},D_{t})/\kappa_{D}(t))\,\kappa_{\mu,\lambda}\,\tilde{\rho}_{\mathrm{CMA}}^{2}-CT\varepsilon_{H}$. Apply Assumption 3 to replace iteration-wise factors by a constant multiple; the metric terms cancel in the ratio. Substitute $\kappa_{\mu,\lambda}=\bar{\kappa}_{\mu,\lambda}/n$ and compare $n$ to $v_{N}^{2}\sim 2\ln N$ to obtain the bound with $(\tilde{\rho}_{\mathrm{CMA}}/\tilde{\rho}_{\mathrm{RS}})^{2}$.

Proposition 2

Under Definition 1, Assumptions 1, 2, and 3, and a replication schedule with $\tilde{\rho}_{\mathrm{CMA}}^{2}=1-O(\varepsilon_{H})$, sep-CMA-ES achieves, after a $\Theta(n)$ transient, the per-iteration contraction

$$\frac{\bar{\kappa}_{\mu,\lambda}}{n}\,(1-O(\varepsilon_{H})),$$

i.e., $\mathbb{E}[r_{T}^{2}]\lesssim\exp(-c^{\prime}T/n)\,r_{0}^{2}$ for some $c^{\prime}>0$ depending on $\bar{\kappa}_{\mu,\lambda}$ and the residual $O(\varepsilon_{H})$. Restricting to diagonal covariances incurs only an $O(\varepsilon_{H})$ multiplicative loss relative to the block-diagonal optimum.

Proof of Proposition 2. (i) *Scale stabilization:* With $c_{\mathrm{cov}}=\Theta(1/n)$ and block-$\varepsilon_{H}$ separability plus diagonal comparability, standard CMA drift shows $D_{t}$ reaches an $O(\varepsilon_{H})$-neighborhood of a stationary point in $T_{0}=\Theta(n)$ steps; then $\kappa_{D}(t)=\Theta(1)$ and typical $\chi(u_{t},D_{t})=\Theta(1)$. (ii) *Uniform per-iteration gain:* Insert these bounds into equation 2 to get $\mathbb{E}[r_{t+1}^{2}\mid r_{t}]\leq\big(1-\bar{\kappa}_{\mu,\lambda}\tilde{\rho}_{\mathrm{CMA}}^{2}/n\,(1-O(\varepsilon_{H}))\big)\,r_{t}^{2}$; iterate to obtain geometric decay with rate $\Omega(1/n)$. (iii) *Closeness to the independent-block ideal:* Since $H_{S}(\theta)$ is $O(\varepsilon_{H})$-close (operator norm) to block-diagonal on $\mathcal{D}$, the population-optimal full-covariance CMA differs from its block-diagonal part by $O(\varepsilon_{H})$, so using only diagonals loses $O(\varepsilon_{H})$ in the contraction constant. (iv) *Rank reliability:* Replication with $m_{\mathrm{CMA}}\gtrsim(\gamma^{2}\sigma^{2})^{-1}\log\lambda$ keeps $\tilde{\rho}_{\mathrm{CMA}}^{2}=1-O(\varepsilon_{H})$.

A.2 Supervised fine-tuning

A.2.1 Experiment details

In this section, we describe our setup and results for experiments with a widely used imitation learning method, SFT. Concretely, we use a direct single-step state–action formulation where each training example consists of a state and a discrete action corresponding to the choice of a single LLM from the pool. SFT trains on these observed state–action pairs to imitate an oracle policy. Given a state, the model is optimized to predict the oracle’s action via maximum-likelihood estimation. In our setting, the state is the coordinator’s hidden-state representation of the input, and the action is the index of the selected LLM.

Dataset. We first extract the labels from our per-question-best oracle results. Specifically, each label is generated by first identifying, for each seed independently, which LLM achieved the highest reward on that question. When multiple LLMs tie at the maximum reward, we uniformly sample one from the tied set. We then aggregate these per-seed selections across all seeds via majority voting. The LLM selected most frequently across seeds becomes the final label for that question. In cases where multiple LLMs receive equal votes, we uniformly sample from the tied candidates to ensure unbiased label assignment. This approach yields a realistic per-trial performance estimate while maintaining label diversity across the model pool. Table 5 shows the resulting agent label distribution over different tasks.
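The labeling rule above can be sketched as follows. This is a minimal illustration, assuming rewards are stored as one dictionary per seed that maps an agent identifier to its reward on a single question; the function names and the data layout are hypothetical and not taken from the paper's code.

```python
import random
from collections import Counter


def pick_uniform_among_max(scores, rng):
    """Return a key with the maximum value, sampling uniformly among ties."""
    best = max(scores.values())
    tied = sorted(key for key, value in scores.items() if value == best)
    return rng.choice(tied)


def majority_label(per_seed_rewards, rng):
    """Per-seed best agent (random tie-break), then majority vote across seeds."""
    votes = Counter(pick_uniform_among_max(rewards, rng) for rewards in per_seed_rewards)
    # Ties in the vote count are also broken uniformly at random.
    return pick_uniform_among_max(votes, rng)


rng = random.Random(0)
rewards_for_one_question = [
    {"A0": 1.0, "A1": 0.0, "A2": 0.0},  # seed 1
    {"A0": 1.0, "A1": 0.0, "A2": 0.0},  # seed 2
    {"A0": 0.0, "A1": 1.0, "A2": 0.0},  # seed 3
]
label = majority_label(rewards_for_one_question, rng)
```

In this toy input, agent `A0` wins two of three seeds, so it becomes the label regardless of the random state.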

Table 5: Agent label distribution by task. Percentage and count of datapoints where each agent was selected as best via majority vote across seeds.

Training. We optimize the coordinator using Adam (Kingma and Ba, 2017) with the SLM frozen, training only the linear head. After experimenting with various learning rates and batch sizes, we found that a learning rate of $1\times 10^{-6}$ and a batch size of 64 yield the best coordinator performance. The trained coordinator achieves scores of 0.592, 0.786, 0.906, and 0.360 on LiveCodeBench, MATH500, MMLU, and RLPR, respectively. Figure 7 shows the learned agent selection distribution, illustrating which agents the coordinator preferentially selects for each task type.

Figure 7: Agent selection distribution by task. Percentage of datapoints where each agent was selected by the trained coordinator.

A.2.2 Cost in label generation

The cost profiles of SFT and label-free training methods, such as sep-CMA-ES, REINFORCE, and RS, differ substantially. For SFT, the dominant cost lies in label generation. Labels can be produced at reasonable cost for a direct mapping from representation space to single-step agent selection, but quickly become intractable in multi-turn settings. In our direct-mapping setting, generating labels requires running 3 seeds on 7k datapoints across 7 agents, resulting in $3\times 7\text{k}\times 7=147\text{k}$ LLM queries.

For multi-turn coordination, the label complexity grows exponentially. Under our experimental configuration with up to 5 turns and 7 candidate agents per turn, the number of required LLM queries for agent selection alone scales by a factor of $7^{4}\approx 2.4\times 10^{3}$ relative to the single-step setting. Moreover, in multi-turn settings the role selection (among 3 roles) is also relevant at each of the 5 turns, introducing an additional factor of $3^{5}=243\approx 2.4\times 10^{2}$. In total, this yields a multiplicative factor of $7^{4}\cdot 3^{5}=583{,}443\approx 5.8\times 10^{5}$, inflating the cost to an enormous $1.5\times 10^{5}\times 5.8\times 10^{5}\approx 8.7\times 10^{10}$ LLM queries. By contrast, label-free training methods such as sep-CMA-ES require no explicit label generation and instead optimize the coordinator directly based on task rewards.
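The query counts above can be reproduced with a few lines of arithmetic. The variable names below are illustrative; the constants are the ones stated in the text. Note that the exact product is about $8.6\times 10^{10}$, and the quoted $8.7\times 10^{10}$ follows from multiplying the two rounded factors.

```python
seeds, datapoints, agents, roles, max_turns = 3, 7_000, 7, 3, 5

# Single-step labels: every agent is run on every datapoint for every seed.
single_step_queries = seeds * datapoints * agents

# Multi-turn blow-up relative to the single-step setting, as described in the text.
agent_factor = agents ** (max_turns - 1)  # 7^4
role_factor = roles ** max_turns          # 3^5
multi_turn_factor = agent_factor * role_factor
multi_turn_queries = single_step_queries * multi_turn_factor

print(single_step_queries)          # 147000
print(agent_factor, role_factor)    # 2401 243
print(multi_turn_factor)            # 583443
print(f"{multi_turn_queries:.2e}")  # 8.58e+10
```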

In summary, while SFT can provide performance gains for a direct representation-to-agent mapping, its prohibitive label-generation cost makes it unsuitable for training multi-turn coordinators, limiting its scalability.

A.3 Full analysis of separability in representation space

This section examines how well the extracted hidden states and the coordinator’s output logits separate relevant classes. For hidden states, greater separability implies that the SLM’s representations encode richer context, providing a stronger signal for the lightweight head to make task-aware decisions.

First, we examine separability along three complementary axes: (i) Notion of separability: linear vs. non-linear; (ii) Label source: task-type labels (from metadata; input-side) vs. agent/role selection labels (from the head's logits; decision-side); (iii) Feature space: raw SLM hidden states (representation space) vs. the coordinator head's output logits (coordination space).

For each cross-combination, we use standard dimensionality-reduction visualizations (PCA/LDA for linear structure; t-SNE/UMAP for non-linear structure) and report classification accuracy using linear and RBF SVMs as quantitative proxies for linear and non-linear separability, respectively. Features are standardized; visualizations are used qualitatively, and SVM accuracies provide the quantitative assessment. Figures 8–13 summarize the results.

Figure 8: PCA analysis. All four plots demonstrate clear clustering patterns.

Figure 9: LDA analysis. The Fisher's ratios indicate that the between-class scatter is approximately two to three times greater than the within-class scatter.

Figure 10: UMAP analysis. The clustering patterns indicate strong non-linear separability.

Figure 11: t-SNE analysis. The analysis demonstrates particularly strong separability of task types in the hidden states extracted from the SLM.

Figure 12: SVM analysis on hidden states extracted from the SLM. Classification accuracies: Linear SVM (task type) = 1.000, RBF SVM (task type) = 1.000, Linear SVM (agent selection) = 0.713, RBF SVM (agent selection) = 0.776.

Figure 13: SVM analysis on output logits. Classification accuracies: Linear SVM (task type) = 0.945, RBF SVM (task type) = 0.955, Linear SVM (agent selection) = 0.786, RBF SVM (agent selection) = 0.783.

From Figures 8–11, both linear (PCA/LDA) and non-linear (UMAP/t-SNE) views reveal clear structure. LDA's reported Fisher ratios (between/within scatter $\approx$ 2–3$\times$) corroborate that much of the variance aligns with task-discriminative directions, while PCA shows separation already in the top components, suggesting a substantial linearly aligned subspace.

The SVM results (Figures 12–13) are especially revealing: in the *representation space*, task-type classification is near-perfect for both linear and RBF kernels, implying that penultimate-token hidden states encode task semantics in a nearly linearly separable manifold even after standardization and class-balance controls. In the *coordination space* (head logits), task-type accuracy decreases while agent/role-selection accuracy increases (notably for the linear SVM, which aligns with the linear head; see Appendix A.4), indicating that the head compresses and reorients input semantics toward low-dimensional, decision-aligned axes. This redistribution is consistent with a policy that projects context onto agent-specific logit directions, yielding simpler, more linearly separable boundaries for agent selection.

Next, we investigate how representation-space separability relates to coordinator performance. We train linear SVMs to predict agent selections from the hidden states extracted from the SLM, using the coordinator's agent selection as labels. Across the four datasets, LiveCodeBench, MATH500, MMLU, and RLPR, the classification accuracies are 0.844, 0.764, 0.679, and 0.544, respectively.

This ranking aligns with our experimental findings: Sections 4.2 and 4.4 demonstrate that Trinity shows stronger performance advantages over baseline methods on LiveCodeBench and MATH500 compared to MMLU and RLPR. While these agent selection labels reflect the coordinator's learned behavior rather than ground truth assignments, the correlation between classification accuracy and relative performance gains suggests that tasks exhibiting greater separability in the representation space may be more amenable to effective coordination.

To directly examine the relationship between the coordinator's performance and the intrinsic separability of a task's datapoints in the representation space, we conduct a controlled experiment using synthetic datasets. Directly controlling separability in real task distributions is impractical, as interventions such as injecting noise into hidden states may introduce confounding factors beyond separability changes (e.g., pushing samples out of distribution or altering semantic structure). Therefore, we generate synthetic datasets that replicate the structure of the coordinator's representation space (1024 dimensions, 7 agent classes, 4 task-type clusters) while systematically varying separability levels. We control separability by scaling the distances between class centers while keeping the within-class covariance fixed, which produces datasets whose measured separability index (between-class variance / total variance) varies.

Figure 14: Separability index vs. head classification accuracy. Trained on synthetic datasets with systematically varied separability, the linear head exhibits a strong positive correlation between separability index and test classification accuracy.

We train the exact same head used in our experiments, linear, on these synthetic datasets. Figure 14 reveals a strong positive correlation between the separability index and the head's classification accuracy, with test classification accuracy increasing steadily as the separability index rises. This controlled experiment indicates that higher intrinsic separability for a task in the representation space enables better head performance on that task, independent of task-specific confounds.

A.4 Head architecture design

We describe four heads that map the SLM's hidden state $\mathbf{h}\in\mathbb{R}^{d_{h}}$ to agent and role selection logits $\mathbf{z}\in\mathbb{R}^{n_{a}}$, which are subsequently turned into probabilities with a softmax or into a hard selection with an argmax.

Linear head refers to the most direct linear mapping (no bias term) from hidden states to logits. It computes

$$\mathbf{z}=\mathbf{W}\mathbf{h},\qquad\mathbf{W}\in\mathbb{R}^{n_{a}\times d_{h}}. \tag{5}$$

This head has exactly $d_{h}n_{a}$ trainable parameters and serves as a strong baseline. It allows unrestricted linear combinations of hidden dimensions to express agent and role preferences while remaining simple and fast to train.

Low-rank head refers to a factorized bottleneck with a nonlinearity that replaces a single dense map by two smaller projections. We use

$$\mathbf{u}=\mathrm{ELU}(\mathbf{U}\mathbf{h}),\qquad\mathrm{ELU}(x)=\begin{cases}x,&x\geq 0\\ \alpha(e^{x}-1),&x<0\end{cases},\quad\alpha=0.1, \tag{6}$$

$$\mathbf{z}=\mathbf{V}\mathbf{u}\cdot\sigma, \tag{7}$$

with $\mathbf{U}\in\mathbb{R}^{r\times d_{h}}$, $\mathbf{V}\in\mathbb{R}^{n_{a}\times r}$, and a fixed non-trainable scale $\sigma\in\mathbb{R}$. In this work we *fix* the bottleneck to $r=14$. This choice can result in *more* parameters than a strictly compressed low-rank setting, but it intentionally adds depth and nonlinearity so the head can capture non-linear patterns at reduced per-projection cost versus a single wide mapping. We initialize with Xavier-uniform (Glorot and Bengio, 2010) using adaptive gains:

$$\mathbf{U}\sim\mathcal{U}\!\left(-\sqrt{\tfrac{6}{d_{h}+r}},\sqrt{\tfrac{6}{d_{h}+r}}\right),\qquad\mathbf{V}\sim\mathcal{U}\!\left(-\sqrt{\tfrac{18}{r+n_{a}}},\sqrt{\tfrac{18}{r+n_{a}}}\right). \tag{8}$$

Sparse head refers to a learnable dimension-selection mechanism that gates hidden features before a linear projection. The logits are

$$\mathbf{z}=\mathbf{W}\,(\mathbf{h}\odot\bm{\alpha}),\qquad\mathbf{W}\in\mathbb{R}^{n_{a}\times d_{h}}, \tag{9}$$

where $\bm{\alpha}\in\mathbb{R}^{d_{h}}$ is a data-agnostic, learnable selection vector. The target number of active dimensions is $k=\max\!\bigl(1,\lfloor d_{h}\cdot(1-\mathrm{sigmoid}(\rho))\rfloor\bigr)$ with a learnable sparsity logit $\rho$. During training we form a differentiable top-$k$ mask by sampling Gumbel noise and sharpening with a temperature $\tau\in[1.0,\,20.0]$:

$$\tilde{\mathbf{s}}=(\mathbf{s}+\bm{\epsilon})/\tau,\quad\bm{\epsilon}\sim\mathrm{Gumbel}(0,1), \tag{10}$$

$$\bm{\alpha}_{\text{soft}}=\mathrm{TopK}_{\text{soft}}(\tilde{\mathbf{s}},k),\qquad\bm{\alpha}=\frac{\bm{\alpha}_{\text{soft}}\cdot k}{\sum_{i=1}^{d_{h}}\alpha_{\text{soft},i}}, \tag{11}$$

where $\mathbf{s}\in\mathbb{R}^{d_{h}}$ denotes the learnable importance scores. At inference, we use a hard top-$k$ binary mask $\bm{\alpha}=\mathrm{TopK}_{\text{hard}}(\mathbf{s},k)$. This head has $d_{h}n_{a}+d_{h}+2$ parameters (projection weights, importance scores, temperature, and sparsity logit) and offers both regularization and interpretability by exposing which hidden dimensions drive agent and role selections.

Block-diagonal head refers to structuring the projection matrix with disjoint blocks that couple only subsets of hidden dimensions to subsets of agents or roles:
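To make the inference-time gating concrete, the sketch below computes the target number of active dimensions $k$ from the sparsity logit, builds a hard top-$k$ binary mask from importance scores, and applies the rescaling in equation 11 so the soft mask sums to $k$. The soft top-$k$ operator itself is not specified in the text, so it is not implemented here; function names and the tie-breaking rule (lower index first) are assumptions made for illustration.

```python
import math


def target_active_dims(d_h: int, rho: float) -> int:
    """k = max(1, floor(d_h * (1 - sigmoid(rho))))."""
    sigmoid = 1.0 / (1.0 + math.exp(-rho))
    return max(1, math.floor(d_h * (1.0 - sigmoid)))


def hard_topk_mask(scores, k):
    """Binary mask with ones at the k largest scores (ties: lower index first)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]


def renormalize_soft_mask(alpha_soft, k):
    """Rescale a non-negative soft mask so that its entries sum to k."""
    total = sum(alpha_soft)
    return [a * k / total for a in alpha_soft]


k = target_active_dims(1024, rho=0.0)              # sigmoid(0) = 0.5
mask = hard_topk_mask([0.1, 0.9, 0.5, 0.3], k=2)   # keeps indices 1 and 2
alpha = renormalize_soft_mask([0.5, 0.25, 0.25], k=2)
```

A larger sparsity logit drives the sigmoid toward 1 and therefore shrinks $k$, with the `max(1, ...)` guard keeping at least one dimension active.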

$$\mathbf{W}=\begin{bmatrix}\mathbf{W}_{1}&\mathbf{0}&\cdots&\mathbf{0}\\ \mathbf{0}&\mathbf{W}_{2}&\cdots&\mathbf{0}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{0}&\mathbf{0}&\cdots&\mathbf{W}_{B}\end{bmatrix},\quad\mathbf{z}=\begin{bmatrix}\mathbf{W}_{1}\mathbf{h}_{1}\\ \mathbf{W}_{2}\mathbf{h}_{2}\\ \vdots\\ \mathbf{W}_{B}\mathbf{h}_{B}\end{bmatrix}, \tag{12}$$

with $\mathbf{h}=[\mathbf{h}_{1};\ldots;\mathbf{h}_{B}]$, $\mathbf{W}_{i}\in\mathbb{R}^{a_{i}\times h_{i}}$. We use two concrete variants. Block-diagonal-2 sets $B=2$ and partitions both hidden and agent/role dimensions proportionally, e.g.,

$$a_{i}=\min\!\Bigl(\Bigl\lceil\frac{n_{a}}{2}\Bigr\rceil,\,n_{a}-\sum_{j<i}a_{j}\Bigr),\qquad h_{i}=\begin{cases}\bigl\lfloor\tfrac{a_{i}d_{h}}{n_{a}}\bigr\rfloor,&i<2\\[2.0pt] d_{h}-\sum_{j<2}h_{j},&i=2\end{cases}.$$

Block-diagonal-10 denotes the high-independence case corresponding to our setting with $n_{a}=10$ logits. It creates one block per agent/role ($B=10$, $a_{i}=1$) and distributes hidden dimensions as evenly as possible, yielding

$$z_{j}=\mathbf{w}_{j}^{\top}\mathbf{h}_{j},\qquad h_{j}=\begin{cases}\left\lfloor\tfrac{d_{h}}{10}\right\rfloor+1,&j\leq(d_{h}\bmod 10)\\[2.0pt] \left\lfloor\tfrac{d_{h}}{10}\right\rfloor,&\text{otherwise}\end{cases}.$$

Block-diagonal-2 blocks a moderate amount of parameter correlation, whereas block-diagonal-10 maximizes independence across the ten logits.

Table 6: Parameter size distribution in training. The size is calculated based on the SLM Qwen3-0.6B. SVF refers to singular value fine-tuning.

Table 6 compares the parameter counts of the different head architectures alongside the parameters trained in singular value fine-tuning. Block-diagonal-10 achieves an exact $10\times$ reduction in head parameters relative to linear ($1{,}024$ vs. $10{,}240$ parameters for $d_{h}=1024$, $n_{a}=10$). In contrast, low-rank replaces the single $d_{h}\times n_{a}$ projection with two matrices $\mathbf{U}\in\mathbb{R}^{r\times d_{h}}$ and $\mathbf{V}\in\mathbb{R}^{n_{a}\times r}$ (with $r=14$) and an ELU nonlinearity, increasing the head size to $20{,}680$ parameters. This is roughly a $2\times$ increase over linear, trading parameter efficiency for additional depth and non-linearity in the mapping from hidden states to logits.

A.5 Experimentation with learning algorithms
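The head sizes can be recomputed directly from the shapes given in equations 5–12. The sketch below assumes bias-free matrices exactly as written, with $d_{h}=1024$ and $n_{a}=10$; the function names are illustrative. Under these shapes, $r=14$ gives $14\cdot 1024+10\cdot 14=14{,}476$ low-rank parameters, while the quoted $20{,}680$ corresponds to $r=20$; the sketch reports both and does not attempt to resolve which bottleneck width the table uses.

```python
import math


def linear_params(d_h, n_a):
    return d_h * n_a


def low_rank_params(d_h, n_a, r):
    # U is r x d_h and V is n_a x r; the scale sigma is fixed and not trained.
    return r * d_h + n_a * r


def sparse_params(d_h, n_a):
    # Projection weights + importance scores + temperature + sparsity logit.
    return d_h * n_a + d_h + 2


def block_diag_2_params(d_h, n_a):
    a1 = math.ceil(n_a / 2)
    a2 = min(math.ceil(n_a / 2), n_a - a1)
    h1 = (a1 * d_h) // n_a
    h2 = d_h - h1
    return a1 * h1 + a2 * h2


def block_diag_10_params(d_h):
    # One 1 x h_j block per logit; the h_j partition all d_h hidden dimensions.
    sizes = [d_h // 10 + (1 if j <= d_h % 10 else 0) for j in range(1, 11)]
    return sum(sizes)


d_h, n_a = 1024, 10
print("linear           ", linear_params(d_h, n_a))
print("low-rank (r=14)  ", low_rank_params(d_h, n_a, 14))
print("low-rank (r=20)  ", low_rank_params(d_h, n_a, 20))
print("sparse           ", sparse_params(d_h, n_a))
print("block-diagonal-2 ", block_diag_2_params(d_h, n_a))
print("block-diagonal-10", block_diag_10_params(d_h))
```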

We also compare our learning strategy with the REINFORCE algorithm and RS with fitness averaging. To ensure the total evaluation budgets were equivalent, we configured the baselines as follows. For REINFORCE, we used a batch size equal to the per-iteration evaluation size of sep-CMA-ES and ran for 60 iterations. For RS, we performed 32 trials for each sampled parameter vector, continuing until the total number of trials matched the evaluation count of sep-CMA-ES.

For RS, we warm-start the search by calibrating the sampling range using the high-performing weights obtained via sep-CMA-ES. Specifically, we sample uniformly from $[-0.5,0.5]$, a band that slightly exceeds the observed extrema of those weights. For each sampled parameter vector, we run 32 independent trials and compare the average reward.

A.6 Dataset–Agent Subset Selection
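A minimal sketch of random search with fitness averaging, under stated assumptions: `evaluate` is a hypothetical stand-in for one noisy coordinator rollout that returns a scalar reward, and `noisy_sphere` is a toy objective used only so the example runs. Neither comes from the paper; only the sampling band and the 32 trials per candidate are taken from the text.

```python
import random


def random_search(evaluate, dim, n_candidates, trials_per_candidate=32,
                  low=-0.5, high=0.5, seed=0):
    """Sample candidates uniformly in [low, high]^dim and keep the best mean reward."""
    rng = random.Random(seed)
    best_theta, best_score = None, float("-inf")
    for _ in range(n_candidates):
        theta = [rng.uniform(low, high) for _ in range(dim)]
        # Average repeated noisy evaluations before comparing candidates.
        rewards = [evaluate(theta, rng) for _ in range(trials_per_candidate)]
        mean_reward = sum(rewards) / trials_per_candidate
        if mean_reward > best_score:
            best_theta, best_score = theta, mean_reward
    return best_theta, best_score


def noisy_sphere(theta, rng):
    """Toy reward: negative squared norm plus Gaussian noise (illustration only)."""
    return -sum(x * x for x in theta) + rng.gauss(0.0, 0.1)


theta, score = random_search(noisy_sphere, dim=5, n_candidates=20)
```

Averaging over 32 trials reduces the reward noise of each candidate before comparison, at the price of evaluating 32 times fewer candidates for a fixed budget.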

To construct a pool of complementary agents and a curriculum of datasets that together amplify coordination gains, we cast selection as a joint subset selection over datasets and agents. Our formulation and procedure adhere to two principles: (i) evaluate gains in the error space to capture practical improvements across varying accuracy regimes; (ii) enforce complementarity, not merely strength, so the coordinator can exploit heterogeneous capabilities.

Objective: maximize relative error reduction under joint constraints

Let $\mathcal{M}=\{M_{1},\dots,M_{X}\}$ be candidate agents and $\mathcal{D}=\{D_{1},\dots,D_{Y}\}$ be candidate datasets. Let $E(D_{y},M_{x})\in[0,1]$ denote the observed accuracy of agent $M_{x}$ on dataset $D_{y}$ under a fixed inference protocol (without coordination, with identical output-token budget and prompting). For any dataset subset $C\subseteq\mathcal{D}$ and agent subset $\mathcal{M}^{\prime}\subseteq\mathcal{M}$, define

$$Z_{C,\mathcal{M}^{\prime}}=\frac{1}{|C|}\sum_{D_{y}\in C}\max_{M_{x}\in\mathcal{M}^{\prime}}E(D_{y},M_{x}),\qquad S^{*}_{C,\mathcal{M}^{\prime}}=\max_{M_{x}\in\mathcal{M}^{\prime}}\frac{1}{|C|}\sum_{D_{y}\in C}E(D_{y},M_{x}). \tag{13}$$

Here, $Z_{C,\mathcal{M}^{\prime}}$ denotes the *combination* performance, i.e., the best-per-dataset accuracy obtained by coordinating each $D_{y}\in C$ to its highest-performing agent in $\mathcal{M}^{\prime}$, and thus ignores potential synergistic interactions among agents within a dataset. While $Z_{C,\mathcal{M}^{\prime}}$ may not fully reflect end-to-end coordinated performance, it serves as a tractable proxy that is typically positively correlated with it. In contrast, $S^{*}_{C,\mathcal{M}^{\prime}}$ is the *best single-agent* baseline on the same $C$, obtained by fixing one agent in $\mathcal{M}^{\prime}$ for all datasets. We then optimize the *relative error reduction* (RER):

$$\mathrm{RER}(C,\mathcal{M}^{\prime})=\frac{Z_{C,\mathcal{M}^{\prime}}-S^{*}_{C,\mathcal{M}^{\prime}}}{1-S^{*}_{C,\mathcal{M}^{\prime}}}. \tag{14}$$

This criterion rewards settings where no single agent dominates across all datasets and where specialization materially lowers error. Our joint selection problem is

$$(C^{*},\mathcal{M}^{*})\in\arg\max_{C\subseteq\mathcal{D},\,\mathcal{M}^{\prime}\subseteq\mathcal{M}}\ \mathrm{RER}(C,\mathcal{M}^{\prime}). \tag{15}$$
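As a worked example of the RER criterion, the sketch below evaluates $Z$, $S^{*}$, and RER on a toy accuracy table with two datasets and two agents that specialize in opposite directions. The accuracy values, dataset names, and agent names are made up for illustration, and the nested-dictionary layout `E[dataset][agent]` is an assumption.

```python
def combination_score(E, C, M):
    """Z: average over datasets in C of the best accuracy among agents in M."""
    return sum(max(E[d][m] for m in M) for d in C) / len(C)


def best_single_score(E, C, M):
    """S*: the best average accuracy over C achieved by one fixed agent in M."""
    return max(sum(E[d][m] for d in C) / len(C) for m in M)


def relative_error_reduction(E, C, M):
    """RER = (Z - S*) / (1 - S*); undefined when S* equals 1."""
    z = combination_score(E, C, M)
    s = best_single_score(E, C, M)
    return (z - s) / (1.0 - s)


# Toy accuracies (hypothetical): each agent is strong on a different dataset.
E = {
    "code": {"m1": 0.9, "m2": 0.5},
    "math": {"m1": 0.5, "m2": 0.9},
}
z = combination_score(E, ["code", "math"], ["m1", "m2"])   # 0.9
s = best_single_score(E, ["code", "math"], ["m1", "m2"])   # 0.7
r = relative_error_reduction(E, ["code", "math"], ["m1", "m2"])
```

Here per-dataset routing removes two thirds of the best single agent's error (0.3 down to 0.1), so the RER is about 0.67; with a single agent in the pool, $Z=S^{*}$ and the RER is zero.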

Joint Dataset–Agent Subset Selection

For each dataset $D_{y}$ and a chosen model subset $\mathcal{M}^{\prime}\subseteq\mathcal{M}$, the *best model for the individual dataset* (doubly constrained) is

$$M^{*}_{y,\mathcal{M}^{\prime}}=\arg\max_{M_{x}\in\mathcal{M}^{\prime}}E(D_{y},M_{x}). \tag{20}$$

Given subsets $C\subseteq\mathcal{D}$ with $|C|\leq Y$ and $\mathcal{M}^{\prime}\subseteq\mathcal{M}$ with $|\mathcal{M}^{\prime}|\leq X$, the *joint combination strategy performance* averages the per-dataset best-in-subset performance:

$$Z_{C,\mathcal{M}^{\prime}}=\frac{1}{|C|}\sum_{D_{y}\in C}E(D_{y},M^{*}_{y,\mathcal{M}^{\prime}}). \tag{21}$$

In contrast, the *single-model performance on the dataset combination* fixes one model $M_{x}\in\mathcal{M}^{\prime}$ for all datasets in $C$:

$$S_{x,C}=\frac{1}{|C|}\sum_{D_{y}\in C}E(D_{y},M_{x}). \tag{22}$$

The *best single model for the joint combination* is therefore

$$M^{*}_{C,\mathcal{M}^{\prime}}=\arg\max_{M_{x}\in\mathcal{M}^{\prime}}S_{x,C}, \tag{23}$$

with corresponding performance

$$S^{*}_{C,\mathcal{M}^{\prime}}=S_{M^{*}_{C,\mathcal{M}^{\prime}},C}=\max_{M_{x}\in\mathcal{M}^{\prime}}S_{x,C}. \tag{24}$$

Problem. Find the optimal subsets $(C^{*},\mathcal{M}^{*})$ that maximize the relative error reduction:

$$(C^{*},\mathcal{M}^{*})=\arg\max_{\begin{subarray}{c}C\subseteq\mathcal{D},\,|C|\leq Y\\ \mathcal{M}^{\prime}\subseteq\mathcal{M},\,|\mathcal{M}^{\prime}|\leq X\end{subarray}}\frac{Z_{C,\mathcal{M}^{\prime}}-S^{*}_{C,\mathcal{M}^{\prime}}}{1-S^{*}_{C,\mathcal{M}^{\prime}}} \tag{25}$$

Equivalently:

$$(C^{*},\mathcal{M}^{*})=\arg\max_{\begin{subarray}{c}C\subseteq\mathcal{D},\,|C|\leq Y\\ \mathcal{M}^{\prime}\subseteq\mathcal{M},\,|\mathcal{M}^{\prime}|\leq X\end{subarray}}\frac{(1-S^{*}_{C,\mathcal{M}^{\prime}})-(1-Z_{C,\mathcal{M}^{\prime}})}{1-S^{*}_{C,\mathcal{M}^{\prime}}} \tag{26}$$

Expanded form.

$$\max_{\begin{subarray}{c}C\subseteq\mathcal{D}\\ \mathcal{M}^{\prime}\subseteq\mathcal{M}\end{subarray}}\frac{\frac{1}{|C|}\sum_{D_{y}\in C}\max_{M_{x}\in\mathcal{M}^{\prime}}E(D_{y},M_{x})-\max_{M_{x}\in\mathcal{M}^{\prime}}\frac{1}{|C|}\sum_{D_{y}\in C}E(D_{y},M_{x})}{1-\max_{M_{x}\in\mathcal{M}^{\prime}}\frac{1}{|C|}\sum_{D_{y}\in C}E(D_{y},M_{x})} \tag{27}$$

Candidate filtering via a top-5% performance frontier

The joint search space for the problem is combinatorial. We begin by computing the performance matrix, estimating $E(D_{y},M_{x})$ for all pairs $(D_{y},M_{x})$ under the standardized protocol. Next, we apply a quantile filter at the top 5%: let $\tau$ denote the 95th percentile of $\{E(D_{y},M_{x})\}$ across all pairs and define

$$\mathcal{K}_{95}=\{(D_{y},M_{x}):E(D_{y},M_{x})\geq\tau\}. \tag{16}$$

This top-5% filtering concentrates the subsequent selection on strong, demonstrably effective pairings while preserving diversity across tasks.

Joint selection via exhaustive enumeration (and when heuristics are needed)

Although joint subset selection over datasets and agents is exponential in general, our experimental regime admitted an *exact* solution. The candidate sets were sufficiently small to permit *exhaustive enumeration* under our evaluation budget. Concretely, we enumerate all pairs $(C,\mathcal{M}^{\prime})$ satisfying the coverage constraint, compute $Z_{C,\mathcal{M}^{\prime}}$, $S^{*}_{C,\mathcal{M}^{\prime}}$, and $\mathrm{RER}(C,\mathcal{M}^{\prime})$ for each, and select $(C^{*},\mathcal{M}^{*})$ maximizing RER. When two candidates exhibit statistically indistinguishable RER, we break ties by prioritizing diversity in task and agent types. For example, we favor a balanced mixture of reasoning agents and agents with direct inference capabilities.

Exhaustive enumeration scales poorly as $|\mathcal{D}|+|\mathcal{M}|$ grows; beyond moderate sizes, even after frontier pruning, the search can become prohibitive. In such regimes, the same objective can be pursued with budget-aware heuristics (e.g., greedy seeding followed by annealed or beam-style refinement) while retaining the coverage constraint and the complementarity-based tie-breaking.
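The enumeration can be sketched as a brute-force loop over all non-empty dataset and agent subsets. This is an illustration under stated assumptions: the coverage constraint and the diversity-based tie-breaking are not specified in enough detail to reproduce, so the sketch omits them and simply keeps the first subset pair with the highest RER; the accuracy table and all names are hypothetical.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a sorted tuple."""
    items = sorted(items)
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


def best_joint_subset(E):
    """Exhaustively maximize RER over dataset subsets C and agent subsets M."""
    datasets = list(E)
    agents = {m for row in E.values() for m in row}
    best = None
    for C in nonempty_subsets(datasets):
        for M in nonempty_subsets(agents):
            z = sum(max(E[d][m] for m in M) for d in C) / len(C)
            s = max(sum(E[d][m] for d in C) / len(C) for m in M)
            if s >= 1.0:
                continue  # RER is undefined when the best single agent is perfect
            score = (z - s) / (1.0 - s)
            if best is None or score > best[0]:
                best = (score, C, M)
    return best


# Toy accuracy table (hypothetical values).
E = {
    "code": {"m1": 0.9, "m2": 0.5},
    "math": {"m1": 0.5, "m2": 0.9},
}
score, C_star, M_star = best_joint_subset(E)
```

The loop visits $(2^{|\mathcal{D}|}-1)(2^{|\mathcal{M}|}-1)$ subset pairs, which is why the text falls back to heuristics beyond moderate sizes.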

A.7 Experimental Details

A.7.1 Baseline Setup

• Individual Agent: We compare against the strongest individual models in our agent pool (GPT-5, Gemini-2.5-pro, and Claude-4-Sonnet), evaluated at both 4K and 20K (5×) maximum token limits to account for the accumulated context at each hop in our multi-turn framework.
• Random Agent Selection: A simple baseline where an agent is selected randomly at each turn during the multi-turn collaboration process, providing a lower bound for structured agent coordination. The maximum number of turns is 5, the same as in the Trinity setting.
• Self-Reflection: An extended version of standard reflection where a single agent produces an initial answer and then reflects on its own output over five turns, representing iterative self-improvement without collaboration with others.
• MasRouter: A recent method trained using the same dataset as our approach, with model selection based on best validation loss. The training follows the recommended settings and employs cost regularization as detailed in the original paper, using the MMRL dataset with 256 samples, validating every 5 epochs, and selecting the best checkpoint after observing sufficient evidence of overfitting.
• RouterDC: A routing method trained on 500 samples from the MMRL dataset to match the conditions specified in the original paper. Each sample is repeated 5 times to collect average performance across all workers for a given question, with this average performance incorporated as part of the training label.
• Smoothie: Applied as a test-time method to questions and outputs from each agent, evaluated under both dependent strategies (selecting one agent per individual question) and independent strategies (selecting one single agent for the entire test set).
• Mixture of Agents (MoA): Implemented as a test-time scaffold with a single MoA layer and a single aggregator layer, totaling 8 model calls per question, where a random model is chosen to serve as the final aggregator.
• Per Question Best: A theoretical upper bound representing the optimal performance achievable by correctly selecting the best-performing worker model for each individual question, providing the argmax baseline for comparison.

A.7.2 Agent Distribution across Tasks

Figure 15: Agent distribution over tasks. A0: GPT-5, A1: Claude-Sonnet-4-20250514, A2: Gemini-2.5-pro, A3: DeepSeek-R1-Distill-Qwen-32B, A4: Gemma-3-27b-It, A5: Qwen3-32B (reasoning), A6: Qwen/Qwen3-32B (direct). Trinity demonstrates a strongly task-aware agent selection strategy.

A.7.3 Additional baseline results

Parallel Sampling. We report additional baselines using majority voting over 5 samples per question. While Trinity is designed to handle a broad spectrum of tasks, majority voting with 5 samples is only applicable to settings with a small, discrete set of candidate outputs, such as multiple-choice benchmarks. Table 7 summarizes the resulting performance on MMLU.

Table 7: Majority@5 baseline results for MMLU. For each question, the answer is chosen by majority voting over 5 parallel inquiries.

LLM as Coordinator. We also evaluated an approach where an LLM is directly prompted to select the model and role at each turn. Gemini Pro 2.5 was chosen as the coordinator given its superior performance among agents. However, this prompting-based method underperforms Trinity's trained coordinator (64.14 vs. 70.44 average score). We observe that the LLM struggles to comprehend and manage the properties of all 7 agents, resulting in inconsistent and suboptimal selections. This suggests that prompting a closed-source LLM is insufficient for capturing agents' inherent characteristics, which a coordinator acquires through training. See Table 8 for detailed results.

Table 8: Comparison between Trinity and LLM as Coordinator

A.7.4 Token usage tables on in-distribution tasks

Table 9: Average output token number of coordination methods

Table 10: Average output token number of each model in 5× Self-Reflection

Table 11: Average output token number of each model in 5× Context

Table 12: Average output token number of each model in Default Context (4096)