Scaling Agents via Continual Pre-training

Papers with Code Trending Papers

Summary

Proposes Agentic Continual Pre-training to build agentic foundation models, achieving state-of-the-art results on 10 benchmarks with AgentFounder-30B, including 39.9% on BrowseComp-en and 43.3% on BrowseComp-zh.

Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the training pipeline for deep research agents to build powerful agentic foundation models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.

Cached at: 06/01/26, 01:01 PM


Source: https://huggingface.co/papers/2509.13310 Published on Sep 16, 2025

#1 Paper of the day

Abstract

AgentFounder, a deep research agent model incorporating Agentic Continual Pre-training, achieves state-of-the-art performance in agentic tasks while maintaining strong tool-use ability.

Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the training pipeline for deep research agents to build powerful agentic foundation models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.
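The abstract reports a Pass@1 score but does not describe how it was computed. As a rough illustration only, the sketch below uses the widely used unbiased pass@k estimator (per task, n sampled attempts of which c are correct), averaged over tasks. The function names and the toy counts are assumptions made for this example and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of attempts sampled for the task
    c: number of those attempts judged correct
    k: attempt budget being scored (k=1 gives Pass@1)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n attempts
    # contains at least one correct attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over tasks; each item is (n, c) for one task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical per-task counts (n attempts, c correct), not real data.
    toy_results = [(1, 1), (1, 0), (4, 2), (4, 0)]
    print(f"Pass@1 = {mean_pass_at_k(toy_results, k=1):.3f}")
```

With k = 1 the estimator reduces to c / n per task, so when each task is attempted once, Pass@1 is simply the fraction of tasks solved on the first try.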



