Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent

Hugging Face Daily Papers

Summary

Introduces Agents-A1, a 35B Mixture-of-Experts agentic model that achieves trillion-parameter-level performance through long-horizon trajectory scaling and a three-stage training approach including SFT, domain-level teachers, and multi-teacher distillation. The model outperforms or matches much larger models on long-horizon agent benchmarks.

We introduce Agents-A1, a 35B Mixture-of-Experts Agentic Model that reaches trillion-parameter-level performance by scaling the agent horizon. We investigate agent-horizon scaling from two perspectives: scaling long-horizon trajectories and scaling heterogeneous agent abilities. To support this goal, we build a long-horizon knowledge-action infrastructure that connects external knowledge, actions, observations, and verifier outcomes, producing agentic trajectories with an average length of 45K tokens. Based on this, we train Agents-A1 with a three-stage recipe. First, we perform full-domain supervised fine-tuning to align the base model with broad agentic behaviors. Second, we train domain-level teacher models to capture specialized expertise in each domain. Third, we propose multi-teacher domain-routed on-policy distillation with salient vocabulary alignment to improve knowledge transfer efficiency across different domains, unifying six heterogeneous domains into one deployable student model. Agents-A1 achieves strong and broad performance on long-horizon agent benchmarks. Compared with 1T-parameter models such as Kimi-K2.6 and DeepSeek-V4-pro, Agents-A1 achieves leading results on SEAL-0 (56.4), IFBench (80.6), HiPhO (46.4), FrontierScience-Olympiad (79.0), and MolBench-Bind (56.8), and remains highly competitive on SciCode (44.3), HLE (47.6), and BrowseComp (75.5). We hope this work provides the community with a practical path for scaling the horizon using a 35B agent that can reach or match the performance of 1T models on long-horizon tasks.
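
To make the trajectory description concrete, here is a minimal Python sketch of a record that ties together external knowledge, actions, observations, and a verifier outcome, plus a helper that averages trajectory length in tokens. Every name here (`Step`, `Trajectory`, `mean_length`, the field names) is hypothetical, and keeping only verifier-passed trajectories when averaging is an assumption. The abstract does not describe the actual schema or filtering rule.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One turn of a hypothetical agent trajectory (all field names are assumptions)."""
    knowledge: str    # external knowledge retrieved for this step
    action: str       # tool call or message emitted by the agent
    observation: str  # environment response to the action
    n_tokens: int     # token count of this step, measured by some tokenizer


@dataclass
class Trajectory:
    domain: str
    steps: List[Step] = field(default_factory=list)
    verifier_passed: Optional[bool] = None  # None means no verifier ran

    def total_tokens(self) -> int:
        return sum(step.n_tokens for step in self.steps)


def mean_length(trajectories: List[Trajectory]) -> float:
    """Average token length over trajectories the verifier accepted.

    Filtering on the verifier outcome is an assumption made for this sketch.
    """
    kept = [t for t in trajectories if t.verifier_passed]
    if not kept:
        return 0.0
    return sum(t.total_tokens() for t in kept) / len(kept)
```

Under this layout, the reported 45K-token average would correspond to `mean_length` over the full training set, with `n_tokens` supplied by whatever tokenizer the authors used.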

Cached at: 06/30/26, 03:33 AM


Source: https://huggingface.co/papers/2606.30616 Published on Jun 29

#1 Paper of the day

Abstract

Agents-A1, a 35B Mixture-of-Experts Agentic Model, achieves trillion-parameter-level performance through long-horizon trajectory scaling and heterogeneous agent ability scaling via a three-stage training approach involving supervised fine-tuning, domain-level teacher models, and multi-teacher distillation.

We introduce Agents-A1, a 35B Mixture-of-Experts Agentic Model that reaches trillion-parameter-level performance by scaling the agent horizon. We investigate agent-horizon scaling from two perspectives: scaling long-horizon trajectories and scaling heterogeneous agent abilities. To support this goal, we build a long-horizon knowledge-action infrastructure that connects external knowledge, actions, observations, and verifier outcomes, producing agentic trajectories with an average length of 45K tokens. Based on this, we train Agents-A1 with a three-stage recipe. First, we perform full-domain supervised fine-tuning to align the base model with broad agentic behaviors. Second, we train domain-level teacher models to capture specialized expertise in each domain. Third, we propose multi-teacher domain-routed on-policy distillation with salient vocabulary alignment to improve knowledge transfer efficiency across different domains, unifying six heterogeneous domains into one deployable student model. Agents-A1 achieves strong and broad performance on long-horizon agent benchmarks. Compared with 1T-parameter models such as Kimi-K2.6 and DeepSeek-V4-pro, Agents-A1 achieves leading results on SEAL-0 (56.4), IFBench (80.6), HiPhO (46.4), FrontierScience-Olympiad (79.0), and MolBench-Bind (56.8), and remains highly competitive on SciCode (44.3), HLE (47.6), and BrowseComp (75.5). We hope this work provides the community with a practical path for scaling the horizon using a 35B agent that can reach or match the performance of 1T models on long-horizon tasks.
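
The third stage can be illustrated with a toy Python sketch. It rests on assumptions the abstract does not confirm: domain routing is a dictionary lookup from a domain label to a teacher, "salient vocabulary" means the teacher's top-k tokens at each position, and the loss is a reverse KL divergence evaluated on a sequence the student sampled itself. All function names and signatures are hypothetical, and models are stand-in callables that map a token prefix to logits.

```python
import math
from typing import Callable, Dict, List, Sequence

# A "model" here is any callable from a token prefix to next-token logits.
Model = Callable[[List[int]], Sequence[float]]


def log_softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def salient_ids(teacher_logp: Sequence[float], k: int) -> List[int]:
    """Top-k token ids under the teacher (an assumed saliency rule)."""
    order = sorted(range(len(teacher_logp)),
                   key=lambda i: teacher_logp[i], reverse=True)
    return order[:k]


def reverse_kl(student_logits: Sequence[float],
               teacher_logits: Sequence[float], k: int) -> float:
    """KL(student || teacher), restricted and renormalized to the salient ids."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    ids = salient_ids(t, k)
    # log_softmax over a subset of log-probs renormalizes within that subset.
    s_sub = log_softmax([s[i] for i in ids])
    t_sub = log_softmax([t[i] for i in ids])
    return sum(math.exp(ps) * (ps - pt) for ps, pt in zip(s_sub, t_sub))


def routed_distill_loss(domain: str, sampled_tokens: List[int],
                        student: Model, teachers: Dict[str, Model],
                        k: int = 2) -> float:
    """Mean per-position loss on a student-sampled sequence (on-policy)."""
    if not sampled_tokens:
        raise ValueError("sampled_tokens must be non-empty")
    teacher = teachers[domain]  # domain routing: one teacher per domain
    losses = []
    for pos in range(len(sampled_tokens)):
        prefix = sampled_tokens[:pos]
        losses.append(reverse_kl(student(prefix), teacher(prefix), k))
    return sum(losses) / len(losses)
```

In this sketch the loss is zero when the student already matches the routed teacher on the salient tokens and positive otherwise. A real implementation would operate on batched tensors and backpropagate through the student logits.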


Get this paper in your agent:

hf papers read 2606.30616

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### InternScience/Agents-A1 Text Generation • 35B • Updated 42 minutes ago • 55 • 18


Similar Articles

InternScience/Agents-A1 · Hugging Face

Reddit r/LocalLLaMA

Agents-A1 is a 35B Mixture-of-Experts agentic model from InternScience that achieves competitive performance against frontier-scale systems like GPT-5.5 and DeepSeek-V4-pro using long-horizon trajectory scaling and multi-teacher multi-domain distillation.

TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

Hugging Face Daily Papers

TMAS introduces a multi-agent framework that enhances large language model reasoning by scaling test-time compute through structured collaboration and hierarchical memory systems. The approach uses specialized agents, cross-trajectory information flow, and hybrid reward reinforcement learning to improve iterative scaling and stability on challenging reasoning benchmarks.

@dair_ai: Outstanding paper on long-horizon agents. (bookmark it) Similar to humans, how do you make agents persist on a difficul…

X AI KOLs Following

AutoLab is a new benchmark evaluating 17 frontier models on 36 expert-curated long-horizon tasks (system optimization, model development, CUDA kernels, puzzles), finding that persistence—not initial attempt quality—is the dominant predictor of success. Claude-opus-4.6 led all categories, while most other models terminated prematurely or exhausted budgets with minimal progress.