@omarsar0:
Summary
This position paper argues that agentic AI systems—incorporating memory, reasoning, tool use, self-improvement, and alignment—are a more foreseeable route to AGI than simply scaling monolithic models, and it formalizes these components as separable axes with distinct bottlenecks.
Interesting position paper on agentic AI as a foreseeable pathway to AGI. (bookmark it) There has been strong debate on whether a larger single model gets us there or a multi-agent system. The authors argue that agentic AI systems, not bigger foundation models on their own, are the most foreseeable route to AGI. Formalizes what "agentic" actually contributes beyond the base model: memory, reasoning, tool use, self-improvement, alignment. Each is a separable axis with its own bottlenecks (long-horizon coherence, credit assignment, safety auditing). They argue that none of these bottlenecks is solved by another order of magnitude of pretraining compute. Paper: https://arxiv.org/abs/2605.12966 Learn to build effective AI agents in our academy: https://academy.dair.ai
Position: Agentic AI System Is a Foreseeable Pathway to AGI
Source: https://arxiv.org/html/2605.12966
Abstract
Is monolithic scaling the only path to AGI? This paper challenges the dogma that purely scaling a single model is sufficient to achieve Artificial General Intelligence. Instead, we identify Agentic AI as a necessary paradigm for mastering the complex, heterogeneous distribution of real-world tasks. Through rigorous theoretical derivations, we contrast the optimization constraints of monolithic learners against the efficiency of Agentic systems, progressing from simple routing mechanisms to general Directed Acyclic Graph (DAG) topologies. We demonstrate that Agentic AI achieves exponentially superior generalization and sample efficiency. Finally, we discuss the connection to Mixture-of-Experts, reinterpret the instability of current multi-agent frameworks, and call for greater research focus on Agentic AI.
Machine Learning, ICML
1 Introduction
The No Free Lunch Theorem (Wolpert and Macready, 1997) dictates that no universal intelligence can perform perfectly on every conceivable task. Consequently, given the inductive nature of real-world problems, the objective is to achieve AGI within the context of the human world. But how is AGI defined in this sense? Historically, machine intelligence has been subject to numerous interpretations (Gudwin, 2000; Horst, 2002). Legg and Hutter, after surveying various perspectives, define it as an agent's ability to "achieve goals in a wide range of environments," which aligns with most definitions (Legg and Hutter, 2007). Furthermore, Chollet posits that "the intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty" (Chollet, 2019). In essence, within the scope of our physical existence, AGI necessitates optimal performance across a near-infinite spectrum of human-relevant tasks.
Reed et al. (2022) state that "…such an agent (which is generally capable on a large number of tasks) can be obtained through scaling data, compute and model parameters, continually broadening the training distribution while maintaining performance…"
Despite relentless scaling of data and computation, no single monolithic model commands ubiquitous dominance across all benchmarks (Jimenez et al., 2024; Mialon et al., 2024; Patil et al., 2025; Phan et al., 2025), and the elusive quality of true AGI has notably failed to emerge despite the saturation of high scores. While scaling pushes performance boundaries, it yields diminishing returns at prohibitive costs (Kaplan et al., 2020; Hoffmann et al., 2022; Pearce and Song, 2024; Porian et al., 2024), resulting in narrow proficiency peaks rather than superiority across the full spectrum of real-world tasks. This limitation stems from strong biases introduced by specific optimization objectives and training data (Battaglia et al., 2018), a problem that is exacerbated when synthetic data is employed (Dohmatob et al., 2024).
The term Agentic AI is formally proposed as a paradigm marked by multi-agent collaboration, dynamic task decomposition, and coordinated autonomy (Sapkota et al., 2026). Moving from isolated to coordinated models, Agentic AI goes beyond monolithic scaling to address the orchestration of multi-agent systems. Platforms like Manus AI (Manus, 2024) and coding assistants such as Codex (OpenAI, 2024) and Claude Code (Anthropic, 2024) have already offered a preliminary demonstration of the power of Agentic AI. However, most AI research centers on monolithic models, and there is still no concrete theoretical proof that Agentic AI is overall superior to the monolithic approach.
Figure 1: Agentic AI expands the range of usable tasks and improves performance compared to monolithic models. While monolithic models exhibit narrow performance peaks only on specific tasks they are trained for, Agentic AI demonstrates multi-peak performance across a broader spectrum. This expands usable capabilities, approaching and even surpassing the altitude and breadth of human intelligence.

In this work, we present a series of demonstrations and theoretical derivations to substantiate the claim that Agentic AI is the foreseeable cross-level move towards AGI. This capability arises from its ability to adaptively decompose tasks into correlated atomic ones and orchestrate specific agents with distinct biases, thereby aligning with real-world structures and pushing Pareto optimality. The remainder of the paper is organized as follows: In Section 2, we establish the theoretical foundations necessary for our proof by reviewing constraints from learning theory. In Section 3, we demonstrate the inability of monolithic models to achieve multi-peak performance and derive the advantage of routing-based Agentic AI. We then extend this analysis in Section 4 to general Agentic AI represented as directed acyclic graphs (DAGs) of agents. We also list some alternative views in Section 5 and reinterpret them after conveying the main idea of the paper. Finally, in Section 7, we conclude by positioning Agentic AI as the inevitable successor to monolithic scaling on the path to AGI.
2 Theoretical Foundations
2.1 Structured Real-World Distribution
The No Free Lunch Theorem asserts that, without prior assumptions on the data distribution, no learning algorithm outperforms any other on average. However, real-world tasks are not uniform noise; they obey specific physical and semantic constraints. To rigorously analyze the advantage of Agentic AI, we formalize the data-generating process not merely as a statistical mixture, but as a collection of functions supported on low-dimensional manifolds.
Definition 2.1 (Structured Real-World Distribution).
Let the input space be $\mathcal{X}\subseteq\mathbb{R}^{D}$ and the output space be $\mathcal{Y}\subseteq\mathbb{R}$. We define the Structured Real-World Distribution $\mathcal{D}_{\text{real}}$ as a measure on $\mathcal{X}\times\mathcal{Y}$ generated by a latent task variable $z\in\{1,\dots,K\}$ with prior probabilities $\alpha_{k}=P(z=k)$. The joint distribution is defined by the tuple $(\mathcal{M},\mathcal{F},\bm{\alpha})$, characterized by the following structural properties:

1. Union of Manifolds: the support of the marginal distribution $P(x)$ is a union of $K$ distinct, compact Riemannian manifolds $\{\mathcal{M}_{k}\}_{k=1}^{K}$, where each $\mathcal{M}_{k}\subset\mathbb{R}^{D}$ has an intrinsic dimension $d_{k}\ll D$:
$$\text{supp}(P(x))\subseteq\bigcup_{k=1}^{K}\mathcal{M}_{k} \tag{1}$$

2. Local Functional Consistency: for each task $k$, there exists a distinct labeling function $f_{k}:\mathcal{M}_{k}\to\mathcal{Y}$ such that the conditional distribution $P(y|x,z=k)$ is concentrated around $f_{k}(x)$ with noise $\xi$:
$$y=f_{k}(\text{Proj}_{\mathcal{M}_{k}}(x))+\xi,\quad\text{where }x\in\mathcal{M}_{k} \tag{2}$$

3. Task Divergence: the optimal functions are heterogeneous, meaning for any pair $j\neq k$, the functional distance implies distinct optimization landscapes:
$$\inf_{\theta\in\Theta}\mathbb{E}_{x\sim\mathcal{M}_{k}}[\ell(h_{\theta}(x),f_{k}(x))]\neq\inf_{\theta\in\Theta}\mathbb{E}_{x\sim\mathcal{M}_{j}}[\ell(h_{\theta}(x),f_{j}(x))] \tag{3}$$

Consequently, the density of the structured distribution is given by:
$$\mathcal{D}_{\text{real}}(x,y)=\sum_{k=1}^{K}\alpha_{k}\cdot\mathbb{I}_{\mathcal{M}_{k}}(x)\cdot P(y|f_{k}(x)) \tag{4}$$
This definition elevates the premise from a simple probabilistic mixture to a piecewise-smooth manifold learning problem.
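To make the definition concrete, here is a minimal sampler for such a structured distribution. The two manifolds, labeling functions, and noise level are hypothetical toy choices of ours, not taken from the paper; each draw follows the generative process of Definition 2.1 (sample the latent task $z$, then a point on that task's low-dimensional manifold).

```python
import random

# Toy structured distribution: K = 2 one-dimensional manifolds embedded in
# D = 3, each with its own labeling function f_k (hypothetical choices).
random.seed(0)
K, alphas = 2, (0.5, 0.5)
embed = {0: lambda t: (t, 0.0, 0.0),       # M_0: the x1-axis
         1: lambda t: (0.0, t, t)}         # M_1: a line in the x2-x3 plane
f = {0: lambda t: 2.0 * t, 1: lambda t: t ** 2}   # distinct labeling functions

def sample_d_real():
    z = random.choices(range(K), weights=alphas)[0]   # latent task variable z
    t = random.uniform(-1.0, 1.0)                     # intrinsic coordinate
    x = embed[z](t)                                   # point on manifold M_z
    y = f[z](t) + random.gauss(0.0, 0.01)             # label with noise xi
    return x, y, z

x, y, z = sample_d_real()
print(z, x, round(y, 3))
```

Note that the marginal support is a union of 1-D curves inside $\mathbb{R}^3$, exactly the "union of manifolds" picture: most of the ambient space carries no data at all.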
2.2 Theorems on Generalization Bounds
The Curse of Dimensionality (Bellman et al., 1957) creates volumetric sparsity as the dimension $D$ increases. This is illustrated by the vanishing ratio of a hypersphere's volume to its enclosing hypercube's:
$$\lim_{D\to\infty}\frac{V_{\text{sphere}}(r,D)}{V_{\text{cube}}(r,D)}=0$$
Consequently, high-dimensional data concentrates in the domain's "corners". This increases the average distance between nearest neighbors, rendering local density estimation intractable.

Due to the volumetric sparsity discussed above, covering the domain $\Omega$ sufficiently to ensure small $\|x-x'\|_{2}$ requires a sample size $N$ that grows exponentially with $D$. This limitation is formally quantified by the minimax lower bound.
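The vanishing sphere-to-cube volume ratio is easy to verify numerically; a minimal sketch (the function name is ours):

```python
import math

def sphere_to_cube_volume_ratio(D: int, r: float = 1.0) -> float:
    """Ratio of a D-ball's volume to the volume of its enclosing hypercube."""
    v_sphere = math.pi ** (D / 2) / math.gamma(D / 2 + 1) * r ** D
    v_cube = (2 * r) ** D
    return v_sphere / v_cube

# The ratio collapses toward zero as the dimension grows.
for D in (2, 10, 50):
    print(D, sphere_to_cube_volume_ratio(D))
```

Already at $D=50$ the ball occupies less than $10^{-20}$ of the cube: almost all of the volume sits in the corners.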
Proposition 2.2 (Minimax Lower Bound on Compact Domains (Stone, 1982)).
Let $\mathcal{F}_{L}(\Omega)$ be the class of $L$-Lipschitz functions restricted to a compact subset $\Omega\subset\mathbb{R}^{D}$. Under the standard non-parametric regression model, the minimax risk for any estimator $\hat{f}_{N}$ based on $N$ samples satisfies:
$$\inf_{\hat{f}_{N}}\sup_{f\in\mathcal{F}_{L}}\mathbb{E}\left[\int_{\Omega}|\hat{f}_{N}(x)-f(x)|\,dP(x)\right]\geq C\cdot N^{-\frac{1}{2+D}} \tag{5}$$
where $P(x)$ is the marginal distribution of inputs supported on $\Omega$, and $C>0$ is a constant independent of $N$.

The term $N^{-\frac{1}{2+D}}$ reflects the curse: to maintain a fixed error level, $N$ must scale exponentially with $D$, mirroring the geometric expansion of the volume.
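Inverting the lower bound makes the sample cost tangible: to push the guaranteed error below $\epsilon$, one needs $N\gtrsim (C/\epsilon)^{2+D}$. A quick numerical sketch, with the constant $C$ set to 1 as an assumption:

```python
def samples_for_error(eps: float, D: int, C: float = 1.0) -> float:
    """Smallest N (up to constants) for which C * N**(-1/(2 + D)) <= eps."""
    return (C / eps) ** (2 + D)

# The required sample size explodes with the ambient dimension D.
for D in (1, 10, 100):
    print(D, f"{samples_for_error(0.1, D):.1e}")
```

For a 10% error target, $D=1$ needs on the order of $10^{3}$ samples, $D=10$ about $10^{12}$, and $D=100$ an astronomically infeasible $10^{102}$.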
Recent theoretical advancements provide a rigorous foundation for understanding the efficiency of Transformer-based architectures. While Yun et al. (2020) established that Transformers are universal approximators capable of implementing precise contextual mappings, Jiang and Li (2024) advanced this further by deriving explicit Jackson-type approximation rates. They proved that the generalization error is intrinsically governed by the spectral decay properties of the target function's temporal coupling, represented by the singular value decay rate $\alpha$ of the attention mechanism.

By linking these spectral properties to the model's capacity, we can express the approximation error $\mathcal{E}$ as a function of the parameter count $P$ and the task's intrinsic dimension $d$. Under the standard architectural assumption that parameters scale quadratically with the hidden dimension ($P\propto m_{h}^{2}$) (Hoffmann et al., 2022) and the spectral-theoretic observation that the decay rate $\alpha$ scales inversely with dimension ($\alpha\propto 1/d$), the approximation error follows a dimensionality-dependent power law:
$$\mathcal{E}(P)\approx C\cdot P^{-\frac{\kappa}{d}} \tag{6}$$
where $C$ is a task-dependent constant and $\kappa$ represents the regularity (smoothness) of the target function.
2.3 Multi-Class Learning
Since Agentic AI may involve routing problems (specifically, choosing a proper agent for a given input), we introduce some multi-class learning theory. Let $\mathcal{X}$ be the instance space and $\mathcal{Y}=\{1,\dots,K\}$ be the label space with $K$ classes. We consider a hypothesis class $\mathcal{H}\subseteq\{h:\mathcal{X}\to\mathcal{Y}\}$.
The Natarajan dimension (Natarajan, 1989) is the generalization of the VC dimension (Vapnik and Chervonenkis, 1971) to multiclass classification problems (where the number of labels $K>2$).
A set $S=\{x_{1},\dots,x_{m}\}\subseteq\mathcal{X}$ is Natarajan-shattered by $\mathcal{H}$ if there exist two "witness" functions $f_{0},f_{1}:S\to\mathcal{Y}$ such that $f_{0}(x_{i})\neq f_{1}(x_{i})$ for all $i$, and for any binary vector $\mathbf{b}\in\{0,1\}^{m}$ there exists $h\in\mathcal{H}$ such that:
$$h(x_{i})=\begin{cases}f_{0}(x_{i})&\text{if }b_{i}=0\\ f_{1}(x_{i})&\text{if }b_{i}=1\end{cases}$$
The Natarajan dimension $d_{N}(\mathcal{H})$ is the maximum size of such a shattered set.
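For intuition, the shattering condition can be verified exhaustively on tiny hypothesis classes. A brute-force sketch (the encoding of hypotheses as label tuples and the function name are ours; the search is exponential in $m$, so this is illustration only):

```python
from itertools import product

def natarajan_shatters(hypotheses, m, labels):
    """Check whether the m points {0..m-1} are Natarajan-shattered by
    `hypotheses`, each encoded as a tuple of labels (h(0), ..., h(m-1))."""
    hs = set(hypotheses)
    # Try every pair of witness functions f0, f1 with f0(i) != f1(i) for all i.
    for f0 in product(labels, repeat=m):
        for f1 in product(labels, repeat=m):
            if any(a == b for a, b in zip(f0, f1)):
                continue
            # Every 0/1 pattern b must be realized by some hypothesis.
            if all(
                tuple(f1[i] if b else f0[i] for i, b in enumerate(bits)) in hs
                for bits in product((0, 1), repeat=m)
            ):
                return True
    return False

# All 9 functions from 2 points to 3 labels shatter a set of size 2 ...
full = list(product((0, 1, 2), repeat=2))
print(natarajan_shatters(full, 2, (0, 1, 2)))      # True
# ... while a single constant hypothesis shatters nothing.
print(natarajan_shatters([(0, 0)], 2, (0, 1, 2)))  # False
```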
Jin (2023) gave upper bounds on the Natarajan dimension $d_{N}(\mathcal{H})$ for tree-based and neural network function classes, as follows.
Theorem 2.3 (Natarajan Dimension Upper Bound for Tree-based Classifiers (Jin, 2023)).
Consider multi-class classification problems with $d$ classes and inputs in $\mathbb{R}^{p}$. Let $\Pi_{L,d}^{dtree}$ be the class of decision trees of depth $L$, and let $\Pi_{L,T,d}^{forest}$ be the class of random forests consisting of $T$ such decision trees. The Natarajan dimensions for these classes are upper bounded by:
$$d_{N}(\Pi_{L,d}^{dtree})=\mathcal{O}(L2^{L}\log(pd)), \tag{7}$$
$$d_{N}(\Pi_{L,T,d}^{forest})=\mathcal{O}(LT2^{L}\log(pd)). \tag{8}$$
Theorem 2.4 (Natarajan Dimension Upper Bound for Neural Network Classifiers (Jin, 2023)).
Let $\Pi_{p,S}^{\sigma}$ denote the class of feed-forward neural networks with a fixed structure $S$ and at most $p$ parameters for $d$-class classification. If the activation functions are restricted to binary or linear sets (denoted $\Pi_{p,S}^{binary}$), or if the activation functions additionally include ReLU (denoted $\Pi_{p,S}^{ReLU}$), then the Natarajan dimension in both cases is upper bounded by:
$$d_{N}(\Pi_{p,S}^{binary})=d_{N}(\Pi_{p,S}^{ReLU})=\mathcal{O}(d\cdot p^{2}). \tag{9}$$
With the Natarajan dimension of a hypothesis class established, the relationship between model complexity and generalization performance can be characterized as follows.
Theorem 2.5 (Generalization Error Bounds for Multiclass ERM (Daniely et al., 2011)).
For every hypothesis class $\mathcal{H}$ with a finite label set $\mathcal{Y}$, given a sample size $m$ and confidence parameter $\delta$:
$$\epsilon_{\mathcal{H}}(m,\delta)\leq\epsilon_{ERM}(m,\delta)\leq O\left(\sqrt{\frac{d_{N}(\mathcal{H})\ln(|\mathcal{Y}|)+\ln(\frac{1}{\delta})}{m}}\right) \tag{10}$$
where $\epsilon_{\mathcal{H}}(m,\delta)$ denotes the minimax (PAC) error achievable by the optimal learning algorithm, and $\epsilon_{ERM}(m,\delta)$ denotes the uniform ERM error, representing the worst-case guarantee for any Empirical Risk Minimizer.
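Plugging the Natarajan bound of Theorem 2.4 into Theorem 2.5 gives a back-of-the-envelope feel for the bound. A sketch that treats every hidden constant as 1 (our assumption, so only the shape of the bound is meaningful, not its absolute value):

```python
import math

def erm_error_bound(d_N: float, K: int, m: int, delta: float = 0.05) -> float:
    """Shape of the Theorem 2.5 ERM bound, hidden constant taken as 1."""
    return math.sqrt((d_N * math.log(K) + math.log(1 / delta)) / m)

# A hypothetical neural router with p parameters over K classes:
# d_N = O(K * p**2) by Theorem 2.4.
p, K = 1000, 16
d_N = K * p ** 2
for m in (10 ** 7, 10 ** 9):
    print(m, round(erm_error_bound(d_N, K, m), 3))
```

Raising the sample size by a factor of 100 shrinks the bound by exactly a factor of 10, reflecting the $\sqrt{1/m}$ decay.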
3 Why and How Much the Monolithic Learner Falls Behind
In this section, we first provide a formal justification for the negative transfer phenomenon in monolithic models when facing heterogeneous tasks. We frame the Average Trap explicitly as the penalty for ignoring the modular structural bias of $\mathcal{D}_{\text{real}}$. Effectively, the monolithic model attempts to compress a modular reality into a dense parameter space, resulting in optimization conflicts. Then, we model a naive Routing-based Agentic AI and demonstrate that even a merely routing-based Agentic AI can beat the monolithic model exponentially in both sample and parameter complexity.
3.1 The Monolithic Dilemma
Let the parameter space be $\Theta\subseteq\mathbb{R}^{d}$. The goal of a monolithic model is to minimize the weighted average risk:
$$\theta^{*}_{\mathrm{mono}}=\mathop{\arg\min}_{\theta\in\Theta}\mathcal{L}_{\mathrm{total}}(\theta)=\mathop{\arg\min}_{\theta\in\Theta}\sum_{k=1}^{K}\alpha_{k}\mathcal{L}_{k}(\theta) \tag{11}$$
In contrast, a specialist model for task $k$ seeks the task-specific optimum $\theta^{*}_{k}=\mathop{\arg\min}_{\theta\in\Theta}\mathcal{L}_{k}(\theta)$.
Assumption 3.1 (Regularity under Ideal Task Sharding).
Assuming the tasks are perfectly sharded such that each $\mathcal{D}_{k}$ represents a distinct, internally consistent function, the loss function $\mathcal{L}_{k}(\theta)$ is well-behaved. Specifically, we assume $\mathcal{L}_{k}(\theta)$ is twice continuously differentiable ($C^{2}$). Furthermore, in the local neighborhood of its optimal parameter $\theta^{*}_{k}$, $\mathcal{L}_{k}(\theta)$ is strictly convex, implying that its Hessian matrix $H_{k}(\theta)=\nabla^{2}\mathcal{L}_{k}(\theta)$ is positive definite (PD), i.e., $v^{\top}H_{k}v>0$ for all $v\neq 0$.

Assumption 3.2 (Lipschitz Continuous Hessian).
For each task $k$, the loss function $\mathcal{L}_{k}$ is twice differentiable and has a $\rho_{k}$-Lipschitz continuous Hessian, i.e., $\|\nabla^{2}\mathcal{L}_{k}(\theta)-\nabla^{2}\mathcal{L}_{k}(\theta')\|\leq\rho_{k}\|\theta-\theta'\|$.
We now state the proposition, which provides a lower bound on the monolithic risk. Rather than simple degradation, it demonstrates the inevitability of a suboptimal compromise: to accommodate the conflicting gradients of heterogeneous tasks, the monolithic model is forced to sacrifice peak proficiency in specialized domains, resulting in a flattened and averaged performance profile.
Proposition 3.3 (The Average Trap).
Let $\mathcal{L}_{\text{total}}(\theta^{*}_{\text{mono}})$ be the converged risk of the monolithic model. Under Assumption 3.1, if the tasks are heterogeneous such that their optimal parameters do not coincide (i.e., $\exists i,j:\theta^{*}_{i}\neq\theta^{*}_{j}$), a strictly positive lower bound $\epsilon>0$ exists:
$$\mathcal{L}_{\mathrm{total}}(\theta^{*}_{\mathrm{mono}})\approx\sum_{k=1}^{K}\alpha_{k}\mathcal{L}_{k}(\theta^{*}_{k})+\underbrace{\sum_{k=1}^{K}\frac{\alpha_{k}}{2}\|\theta^{*}_{\mathrm{mono}}-\theta^{*}_{k}\|_{H_{k}}^{2}}_{\epsilon} \tag{12}$$
where $\|v\|_{H_{k}}^{2}=v^{\top}H_{k}v$ denotes the squared Mahalanobis distance induced by the task curvature.
See Appendix A.1 for the proof. Thus, we formally prove the inevitability of the "Generalist's Penalty": as the diversity of tasks increases, a monolithic model must trade away its expert-level acuity to maintain stability, resulting in a representation that is broadly usable but universally distant from the optimum.
Figure 2: A demonstration of the Average Trap. The monolithic optimum is pulled towards the sharp task, illustrating the curvature-induced bias described in Proposition 3.3.
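The effect in Figure 2 is reproducible with two 1-D quadratic tasks, for which the weighted optimum has a closed form. The curvatures, priors, and task optima below are our own illustrative numbers:

```python
# Two 1-D quadratic tasks: L_k(t) = h_k/2 * (t - t_k)**2, equal priors.
alphas = (0.5, 0.5)
h = (100.0, 1.0)        # task 1 is "sharp" (high curvature), task 2 is flat
t_opt = (0.0, 1.0)      # conflicting task optima

# Monolithic optimum of the weighted average risk (closed form for quadratics):
# a curvature-weighted mean of the task optima.
t_mono = sum(a * hk * tk for a, hk, tk in zip(alphas, h, t_opt)) \
       / sum(a * hk for a, hk in zip(alphas, h))

# Excess risk epsilon of Eq. (12): sum_k alpha_k/2 * h_k * (t_mono - t_k)**2.
eps = sum(a / 2 * hk * (t_mono - tk) ** 2
          for a, hk, tk in zip(alphas, h, t_opt))

print(round(t_mono, 4))  # pulled almost onto the sharp task's optimum 0.0
print(round(eps, 4))     # strictly positive: the Average Trap
```

The optimum lands nearly on the sharp task's minimizer, yet the excess risk $\epsilon$ stays strictly positive: no single parameter serves both tasks.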
3.2 A Merely Routing-based Agentic AI Dominates
The limitations of the monolithic learner, as proven in Proposition 3.3, stem from its attempt to approximate a global function over the complex union $\bigcup_{k}\mathcal{M}_{k}$. It is forced to smooth over the discontinuities between disjoint manifolds, expending capacity on the empty ambient space.
Now we formalize a naive Routing-based Agentic AI (denoted $M_{\text{R-Agentic}}$), which, in contrast, bypasses the Average Trap by explicitly aligning its architecture with the topological structure of $\mathcal{D}_{\text{real}}$. Instead of solving for a compromised global optimum, the system exploits the geometric decomposability of the task mixture. We formalize the routed agentic hypothesis by assuming the target function $f$ can be factorized through a routing mechanism $\pi$ and a set of local maps:
$$f_{\text{R-Agentic}}(x)=\sum_{k=1}^{K}\mathbb{I}[\pi(x)=k]\cdot f_{k}(\phi_{k}(x)) \tag{13}$$
where $\pi:\mathcal{X}\to\{1,\dots,K\}$ acts as a geometric router identifying the active manifold, and $\phi_{k}:\mathcal{M}_{k}\to\mathbb{R}^{d_{k}}$ represents the local coordinate chart (or projection) that maps the high-dimensional input onto the low-dimensional intrinsic manifold of task $k$.
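Eq. (13) can be read as ordinary code: a hard router picks the expert, a chart reduces dimension, and the expert acts on intrinsic coordinates. A minimal sketch with hypothetical toy manifolds, charts, and experts (all names and choices are ours):

```python
from typing import Callable, Dict

Expert = Callable[[float], float]

def make_r_agentic(router: Callable[[tuple], int],
                   experts: Dict[int, Expert],
                   charts: Dict[int, Callable[[tuple], float]]) -> Callable:
    """Hard-routed hypothesis of Eq. (13): f(x) = f_k(phi_k(x)), k = pi(x)."""
    def f_r_agentic(x: tuple) -> float:
        k = router(x)                    # pi(x): identify the active manifold
        return experts[k](charts[k](x))  # f_k(phi_k(x))
    return f_r_agentic

# Toy 2-D input space: task 0 lives near the x-axis, task 1 near the y-axis.
router = lambda x: 0 if abs(x[0]) >= abs(x[1]) else 1
charts = {0: lambda x: x[0], 1: lambda x: x[1]}     # 1-D coordinate charts
experts = {0: lambda u: 2 * u, 1: lambda u: u ** 2}  # specialist sub-functions

f = make_r_agentic(router, experts, charts)
print(f((3.0, 1.0)))   # routed to expert 0: 2 * 3.0 = 6.0
print(f((0.5, 4.0)))   # routed to expert 1: 4.0 ** 2 = 16.0
```

Each expert only ever sees a one-dimensional coordinate, which is exactly the dimensionality reduction the next subsection quantifies.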
In this analysis, we focus on the over-parameterized regime ($P\to\infty$), assuming the model possesses sufficient capacity to fully interpolate the finite training set. Under this assumption, the generalization error is no longer bottlenecked by model expressivity, but is strictly governed by the sample complexity relative to the intrinsic geometry of the data.
Monolithic Baseline
Consider a monolithic learner $M_{\text{mono}}$ that attempts to approximate $f$ directly in the joint space $\mathbb{R}^{D}$. In the absence of structural assumptions, the model must populate the entire $D$-dimensional domain. Given a fixed training budget of $N$ samples, the generalization error $\mathcal{E}_{\text{mono}}$ follows the standard convergence rate for Lipschitz functions in high-dimensional spaces by Proposition 2.2:
$$\mathcal{E}_{\text{mono}}(N)\approx\mathcal{O}\left(N^{-\frac{1}{D}}\right) \tag{14}$$
This relationship highlights that the error convergence is bottlenecked by the total dimension $D$. As the task complexity ($D$) increases linearly, the sample size required to maintain a constant error rate grows exponentially ($N\propto\epsilon^{-D}$).
Routing-based Agentic Decomposition
In the Routing-based Agentic AI framework, the problem is explicitly decomposed into $K$ distinct sub-tasks. Each agent $A_{k}$ is responsible for learning a sub-function $f_{k}:\mathbb{R}^{d_{k}}\to\mathbb{R}$. Assuming the aggregation (or routing) function $\pi$ is fixed or introduces negligible error, the system's complexity is determined by the complexity of its sub-components.

Assuming the training budget $N$ is distributed among the agents (e.g., $N/K$ samples per agent), the total error bound is dominated by the sub-task with the highest dimensionality. Let $d_{\max}=\max_{k}(d_{k})$ and let $L_{k}$ be the Lipschitz constant for $f_{k}$. The error upper bound for Routing-based Agentic AI is given by:
$$\mathcal{E}_{\text{R-Agentic}}(N)=\sum_{k=1}^{K}\mathcal{E}_{k}\approx\sum_{k=1}^{K}L_{k}\cdot\mathcal{O}\left(\left(\frac{N}{K}\right)^{-\frac{1}{d_{k}}}\right) \tag{15}$$
$$\leq\sum_{k=1}^{K}L_{k}\cdot\mathcal{O}\left(\left(\frac{N}{K}\right)^{-\frac{1}{d_{k}}}\right)+\mathcal{E}_{\text{routing}} \tag{16}$$
Assuming ideal routing and considering the dominance of the most complex sub-task, we obtain:
$$\mathcal{E}_{\text{R-Agentic}}(N)\approx\mathcal{O}\left(K\cdot N^{-\frac{1}{d_{\max}}}\right) \tag{17}$$
Since $d_{\max}\ll D$, the exponent $-1/d_{\max}$ is significantly larger in magnitude (more negative) than $-1/D$, implying a substantially faster decay of error.
We further quantify the advantage of the Routing-based Agentic AI by comparing the ratio of the expected errors. Neglecting constant factors, we derive the following relation:
$$\frac{\mathcal{E}_{\text{R-Agentic}}(N)}{\mathcal{E}_{\text{mono}}(N)}\approx\frac{K\cdot N^{-\frac{1}{d_{\max}}}}{N^{-\frac{1}{D}}}=K\cdot N^{\left(\frac{1}{D}-\frac{1}{d_{\max}}\right)} \tag{18}$$
Since $d_{\max}\ll D$, the exponent $\left(\frac{1}{D}-\frac{1}{d_{\max}}\right)$ is strictly negative, indicating that the error of Routing-based Agentic AI vanishes exponentially faster relative to the monolithic error as $N$ grows.
The implication of the negative exponent is profound when interpreted through the lens of sample complexity. Specifically, to achieve a target error rate $\epsilon$, the monolithic model requires $N_{\text{mono}}\propto\epsilon^{-D}$ samples, whereas the Routing-based Agentic AI requires only $N_{\text{R-Agentic}}\propto K^{d_{\max}}\epsilon^{-d_{\max}}$. The ratio of data requirements is:
$$\frac{N_{\text{R-Agentic}}}{N_{\text{mono}}}\propto K^{d_{\max}}\epsilon^{D-d_{\max}} \tag{19}$$
Since $\epsilon$ is typically small ($\epsilon\ll 1$) and the dimensionality gap $D-d_{\max}$ is substantial, the term $\epsilon^{D-d_{\max}}$ asymptotically dominates the ratio. Although the pre-factor $K^{d_{\max}}$ introduces a polynomial overhead corresponding to the number of agents, it is negligible compared to the exponential reduction driven by the dimensionality reduction as $\epsilon\to 0$.
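The ratio in Eq. (18) can be evaluated directly. A sketch with illustrative values $D=100$, $d_{\max}=8$, $K=10$ (our own choices, all constants dropped):

```python
# Error rates of Eq. (14) vs Eq. (17), constants dropped.
D, d_max, K = 100, 8, 10

def err_mono(N: float) -> float:
    return N ** (-1 / D)

def err_agentic(N: float) -> float:
    return K * N ** (-1 / d_max)

# Ratio of Eq. (18): K * N**(1/D - 1/d_max), strictly decreasing in N.
for N in (10 ** 6, 10 ** 9, 10 ** 12):
    print(N, round(err_agentic(N) / err_mono(N), 3))
```

Note the role of the $K$ prefactor: at these settings the agentic system only pulls ahead once $N$ passes roughly $10^{8.7}$, after which the ratio decays as $N^{-0.115}$ without bound.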
This aligns with empirical evidence that specialized agents are significantly more data-efficient (Hu et al., 2022). By decomposing the high-dimensional manifold into lower-dimensional ones, the Routing-based Agentic AI effectively circumvents the Curse of Dimensionality, transforming an overwhelming learning problem into a set of solvable ones.
Based on the scaling law $\mathcal{E}(P)\propto P^{-\frac{\kappa}{d}}$ established in Eq. (6), parameter efficiency follows the same dimensionality-dependent power law as sample efficiency. Consequently, the analysis above applies symmetrically to model size: the monolithic learner's error decay is stifled by the ambient dimension ($\mathcal{O}(P^{-\frac{\kappa}{D}})$), whereas the Routing-based Agentic AI benefits from a faster rate governed by the lower intrinsic dimension ($\mathcal{O}(P^{-\frac{\kappa}{d_{\max}}})$).
The Routing Regret
We now analyze the omitted $\mathcal{E}_{\text{routing}}$ and explain why it can ideally be neglected in inequality (16). We define the Routing Regret, denoted $\mathcal{E}_{\text{routing}}$, as the expected performance deficit caused by selecting a sub-optimal expert. Formally, let $k^{*}(x)$ be the index of the optimal expert for input $x$, and $\pi(x)$ be the expert selected by the router. The routing error can be decomposed into the probability of error and the severity of the mismatch:
$$\mathcal{E}_{\text{routing}}=\mathbb{E}_{x\sim\mathcal{D}_{\text{real}}}\bigg[\underbrace{\mathbb{I}(\pi(x)\neq k^{*}(x))}_{\epsilon_{\pi}:\text{ Routing Error Rate}}\cdot\underbrace{\left(L(A_{\pi(x)}(x))-L(A_{k^{*}(x)}(x))\right)}_{\Delta(x):\text{ Mismatch Penalty}}\bigg] \tag{20}$$
To derive a tractable bound, we analyze the two components of this expectation: the Routing Error Rate ($\epsilon_{\pi}$) and the Mismatch Penalty ($\Delta$).
The router essentially solves a $K$-way classification problem, mapping the input space $\mathcal{X}$ to the set of agent indices $\mathcal{Y}=\{1,\dots,K\}$. The hardness of this task is governed by the complexity of the hypothesis class $\mathcal{H}_{\text{router}}$ employed by the router.
We quantify the Routing Error Rate $\epsilon_{\pi}$ as the generalization error of the router. Invoking the theorems from Section 2.3, for the most common routers trained on $N_{\text{router}}$ samples, we obtain how the bounds scale with the number of agents $K$:
$$\epsilon_{\pi}\propto\begin{cases}\tilde{\mathcal{O}}\left(\frac{\log K}{\sqrt{N_{\text{router}}}}\right),&\text{if }\pi\text{ is a Tree-based Router}\\ \sqrt{\frac{K}{N_{\text{router}}}},&\text{if }\pi\text{ is a Neural Router}\end{cases} \tag{21}$$
The routing error rate of both kinds of routers increases as $K$ increases, though for the tree-based one the polylogarithmic dependence allows a stronger guarantee of scalability.
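The two scaling regimes of Eq. (21) are easy to compare numerically; a sketch with the proportionality constants set to 1 (our assumption):

```python
import math

# Scaling shapes of Eq. (21), proportionality constants taken as 1.
def eps_tree(K: int, N_router: int) -> float:
    return math.log(K) / math.sqrt(N_router)

def eps_neural(K: int, N_router: int) -> float:
    return math.sqrt(K / N_router)

N_router = 10 ** 6
for K in (4, 64, 1024):
    print(K, round(eps_tree(K, N_router), 5), round(eps_neural(K, N_router), 5))
```

Growing $K$ from 4 to 1024 multiplies the tree-based rate by only $\log 1024/\log 4 = 5$, but the neural rate by $\sqrt{256}=16$, which is the scalability gap the text describes.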
The severity of a routing error depends on the orthogonality of the experts. We define the maximum mismatch penalty as $\Delta_{\text{max}}=\sup_{x,j\neq k^{*}}|L(A_{j}(x))-L(A_{k^{*}}(x))|$.

We verify the intuition that the cost of mismatch $\Delta_{\text{max}}$ scales with the granularity of specialization $K$. We model this relationship using subspace information loss.
Assumption 3.4 (Manifold Alignment with Orthogonal Subspaces).
Following Definition 2.1, we assume each manifold $\mathcal{M}_{k}$ is contained within a feature subspace $S_{k}\subset\mathbb{R}^{D}$. These subspaces form an orthogonal decomposition of the feature space, such that $\mathbb{R}^{D}=\bigoplus_{k=1}^{K}S_{k}$ and $S_{j}\perp S_{k}$ for $j\neq k$.
Lemma 3.5.
Let $x\sim\mathcal{D}_{j}$ be an input belonging to task $j$. Ideally, the information required to solve $x$ is contained in $S_{j}$. However, if misrouted to expert $A_{k}$ ($k\neq j$), the expert processes the projection $P_{k}x$. The information preservation ratio $\rho$ is given by the cosine similarity between the required subspace and the expert's subspace:
$$\rho(j,k)=\frac{\|P_{k}x\|^{2}}{\|x\|^{2}} \tag{22}$$
Under Assumption 3.4, if $j\neq k$, then $S_{j}\perp S_{k}$, implying $P_{k}x\approx 0$. In a relaxed setting with partial overlap, as $K$ increases, the subspaces become increasingly disjoint. We model the residual information as inversely proportional to $K$:
$$\mathbb{E}[\|P_{k}x\|^{2}]\propto\frac{1}{K-1}\|x\|^{2}\quad(\text{for }j\neq k) \tag{23}$$
Let the loss function $L$ be $\lambda$-Lipschitz continuous. The mismatch penalty is bounded by the distance in the feature space caused by the projection loss:
$$\Delta(x)\leq\lambda\|x-P_{k}x\|=\lambda\|x\|\left(1-\frac{\|P_{k}x\|}{\|x\|}\right) \tag{24}$$
Substituting the expected information preservation ratio from Lemma 3.5, and defining the maximum potential loss on the domain as $L_{\text{max}}\triangleq\lambda\,\mathbb{E}[\|x\|]$ (representing the loss scaling with input magnitude), we obtain:
$$\Delta_{\text{max}}(K)\approx L_{\text{max}}\left(1-\sqrt{\frac{1}{K-1}}\right)\sim L_{\text{max}}\left(1-\frac{1}{\sqrt{K}}\right) \tag{25}$$
asymptotically as $K\to\infty$, using the Taylor expansion $(1-\epsilon)^{-1/2}\approx 1+\epsilon/2$.
Finally, combining the routing error rate and the mismatch penalty, we derive an upper bound for the Routing Regret:
$$\mathcal{E}_{\text{routing}}\leq\begin{cases}C_{\text{tree}}L_{\text{max}}\left(1-\frac{1}{\sqrt{K}}\right)\sqrt{\frac{\text{poly}(\log K)}{N_{\text{router}}}},&\text{if Tree-based Router}\\ C_{\text{NN}}L_{\text{max}}\left(1-\frac{1}{\sqrt{K}}\right)\sqrt{\frac{K}{N_{\text{router}}}},&\text{if Neural Router}\end{cases} \tag{26}$$
Joint Bound and Optimal Granularity
The preceding analysis factored out routing error for clarity. We now present a joint bound that unifies specialization gain and routing cost. Substituting the routing error $\epsilon_{\pi}$ and mismatch $\Delta$ into the agentic bound (16):
$$\mathcal{E}_{\mathrm{R\text{-}Agentic}}(K,N)\leq\underbrace{\frac{KC_{\exp}}{N^{1/d_{\max}}}}_{\text{decreases with }K}+\underbrace{\Delta_{\max}(K)\cdot\epsilon_{\pi}(K)}_{\text{increases with }K} \tag{27}$$
This yields a U-shaped error profile in $K$: too few agents ($K\to 1$) provide insufficient specialization, while too many ($K\to\infty$) cause routing overhead to dominate, with an optimal $K^{*}$ in between. Expanding for specific router types:
For tree-based routers:
$$\mathcal{E}\leq\frac{KC}{N^{1/d_{\max}}}+C_{\mathrm{tree}}L_{\max}\left(1-\frac{1}{\sqrt{K}}\right)\sqrt{\frac{\mathrm{poly}(\log K)}{N_{\mathrm{router}}}} \tag{28}$$
The modularity cost grows polylogarithmically, so specialization dominates for large $K$.
For neural routers:
$$\mathcal{E}\leq\frac{KC}{N^{1/d_{\max}}}+C_{\mathrm{NN}}L_{\max}\left(1-\frac{1}{\sqrt{K}}\right)\sqrt{\frac{K}{N_{\mathrm{router}}}} \tag{29}$$
The cost rises as $\sqrt{K}$, restricting $K^{*}$ unless $N_{\mathrm{router}}\propto K$. In both cases, the specialization gain ($N^{-1/d_{\max}}$ vs. $N^{-1/D}$) dominates the polynomial routing cost for sufficiently large $N$, since $d_{\max}\ll D$.
Consequently, for a fixed data budget, the optimal number of agents $K^{*}$ solves $\frac{\partial\mathcal{E}_{\mathrm{total}}}{\partial K}=0$. System designers face a dichotomy: use tree-based routing to maximize scalability ($K\gg 1$), or use neural routing to handle complex, non-axis-aligned task boundaries at the cost of a smaller feasible agent pool.
In a data-scarce regime, the tree-based router is superior. Its error scales as $\mathcal{O}(\sqrt{\log K})$, allowing massive scaling of $K$ even with limited routing data; here, the routing regret is negligible. In a data-rich regime, if the sample size is sufficient ($N\gg K$), the $\mathcal{O}(\sqrt{K})$ penalty of neural routers is suppressed by the large denominator. Neural routing then becomes preferable despite its higher sample complexity, as it avoids the inductive bias of trees and can capture complex, non-hierarchical expert boundaries.
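The interplay of the two terms can be illustrated numerically. In the sketch below, all constants and the schedule $d_{\max}(K)=\max(D/K,2)$, which makes sub-task intrinsic dimension shrink as the task is split more finely, are hypothetical choices for illustration, not values derived in the paper; it evaluates the bounds (28) and (29) over $K$ and locates the empirical minimizer $K^{*}$ for each router type.

```python
import numpy as np

# Toy evaluation of the joint bounds (28) and (29). All constants and the
# schedule d_max(K) = max(D/K, 2) are hypothetical illustrative choices.
C, C_tree, C_nn, L_max = 1.0, 0.5, 0.5, 2.0
N, N_router, D = 10_000, 500, 64

K = np.arange(1, 301, dtype=float)
d_max = np.maximum(D / K, 2.0)                  # finer split -> lower intrinsic dim
specialization = K * C / N ** (1.0 / d_max)     # first term of Eq. (27)
route = L_max * (1.0 - 1.0 / np.sqrt(K))
cost_tree = C_tree * route * np.sqrt(np.log(K + 1) / N_router)  # Eq. (28), poly(log K) ~ log K
cost_nn = C_nn * route * np.sqrt(K / N_router)                  # Eq. (29)

tot_tree = specialization + cost_tree
tot_nn = specialization + cost_nn
k_star_tree = K[np.argmin(tot_tree)]            # empirical K* for each router
k_star_nn = K[np.argmin(tot_nn)]
print(k_star_tree, k_star_nn)
```

In this toy setting both totals trace the U-shape predicted by (27): the error falls while specialization accumulates, then rises once the routing cost and the linear term dominate.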
We now establish that the transition from monolithic to routing-based Agentic AI is not merely an architectural preference, but a geometric necessity for mastering the heterogeneous, high-dimensional tasks that constitute real-world task distributions.
4 A Closer Look at General Agentic AI
Through the analysis in Section 3, we have theoretically established that decomposing a monolithic problem into specialized sub-tasks aligns with the real-world task distribution and yields exponential gains in efficiency and effectiveness. However, that analysis modeled the system as static routing between expert agents. In real-world Agentic AI, agents rarely operate in isolation; they function as interconnected nodes that dynamically propagate information.
To rigorously analyze the generalization bounds, we first establish a formal mathematical definition of Agentic AI. Unlike monolithic models, which approximate a target function $F:\mathcal{X}\to\mathcal{Y}$ via a single dense parameterization, Agentic AI is defined as a structured composition of specialized operators.
Definition 4.1 (Agentic AI as a Topological Compositional DAG of Agents).
Let $\mathcal{X}$ be the global input space and $\mathcal{Y}$ the global output space. An Agentic AI system is defined as a tuple $\Psi=(\mathcal{G},\mathcal{F},\Lambda)$, where:
1. $\mathcal{G}=(\mathcal{V},\mathcal{E})$ is a Directed Acyclic Graph (DAG) with $K=|\mathcal{V}|$ nodes, representing the flow of information. The node set $\mathcal{V}$ is topologically sorted.
2. $\mathcal{F}=\{f_{1},\dots,f_{K}\}$ is a set of heterogeneous, learnable mappings (agents). Each agent $v_{i}$ implements a local function $f_{i}:\mathcal{H}_{in}^{(i)}\times\Theta_{i}\to\mathcal{H}_{out}^{(i)}$, where $\Theta_{i}$ represents the agent's specific parameters and $\mathcal{H}$ represents the latent manifold of intermediate representations.
3. $\Lambda$ is a composition operator that maps the outputs of parent nodes to the input of a child node. For any agent $v_{i}$, the input state $s_{i}$ is constructed from the parent set $Pa(i)=\{v_{j}\mid(v_{j},v_{i})\in\mathcal{E}\}$:
$$x_{i}=f_{i}\left(\Lambda\left(\{x_{j}\}_{j\in Pa(i)}\right);\,\theta_{i}\right) \quad (30)$$
The global system behavior is not a static function; rather, it emerges as the topological flow from the source nodes (initialized by $\mathcal{X}$) to the sink nodes (projected to $\mathcal{Y}$), respecting the partial order of $\mathcal{G}$.
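Definition 4.1 can be sketched directly in code. The following minimal executor runs a DAG of local functions in topological order, with source nodes seeded by the global input; the agent names, the list-concatenating composition operator standing in for $\Lambda$, and the four-node topology are illustrative assumptions, not components specified by the paper.

```python
# Minimal executable sketch of Definition 4.1: an Agentic AI system as a
# DAG of local functions f_i executed in topological order. The agents and
# topology below are illustrative, not prescribed by the paper.
from typing import Callable, Dict, List

class AgenticDAG:
    def __init__(self) -> None:
        self.agents: Dict[str, Callable] = {}     # F = {f_1, ..., f_K}
        self.parents: Dict[str, List[str]] = {}   # Pa(i), derived from edge set E

    def add_agent(self, name: str, fn: Callable, parents=()) -> None:
        self.agents[name] = fn
        self.parents[name] = list(parents)

    def run(self, x):
        # Kahn-style topological execution respecting the partial order of G.
        outputs, pending = {}, dict(self.parents)
        while pending:
            ready = [n for n, ps in pending.items() if all(p in outputs for p in ps)]
            for n in ready:
                # Composition operator: gather parent outputs; source nodes
                # see the global input x instead.
                inp = [outputs[p] for p in pending[n]] or [x]
                outputs[n] = self.agents[n](inp)
                del pending[n]
        return outputs

dag = AgenticDAG()
dag.add_agent("retrieve", lambda inp: inp[0] + ["fact"])
dag.add_agent("reason", lambda inp: inp[0] + ["step"], parents=["retrieve"])
dag.add_agent("verify", lambda inp: inp[0] + ["check"], parents=["retrieve"])
dag.add_agent("answer", lambda inp: inp[0] + inp[1], parents=["reason", "verify"])
result = dag.run(["query"])
```

Note that the global behavior is never written down as a single function: it emerges from executing the nodes in an order consistent with the DAG, exactly as the definition states.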
For each node $v_{i}$, let $S_{i}\in\{0,1\}$ be a Bernoulli random variable indicating the success of the specific task assigned to agent $i$. The execution of $v_{i}$ depends on the latent states or outputs $h_{Pa(i)}$ of its parent set $Pa(i)=\{v_{j}\mid(v_{j},v_{i})\in\mathcal{E}\}$. Assuming the Markov property on the graph, the joint probability of a successful execution trajectory is:
$$P(S_{1},\dots,S_{K})=\prod_{i=1}^{K}P(S_{i}\mid h_{Pa(i)}) \quad (31)$$
We then transform this multiplicative success probability into an additive loss function using the negative log-likelihood. The loss for Agentic AI, $\mathcal{L}_{\text{Agentic}}$, is defined as:
$$\mathcal{L}_{\text{Agentic}}(\bm{\theta})=-\log\left(\prod_{i=1}^{K}P(S_{i}=1\mid h_{Pa(i)})\right)=\sum_{i=1}^{K}\underbrace{-\log P(S_{i}=1\mid h_{Pa(i)})}_{\ell_{i}(\theta_{i})} \quad (32)$$
where $\ell_{i}$ represents the local loss contribution of agent $i$.
To derive the local loss $\ell_{i}$ explicitly, we instantiate the abstract local function $f_{i}$ as a stochastic generator parameterized by a policy. Specifically, the execution of $f_{i}$ corresponds to sampling an action $a_{i}$ (which constitutes the output $x_{i}$) from a policy $\pi_{\theta_{i}}(a_{i}\mid s_{i})$ conditioned on the input state $s_{i}$. Consequently, the local loss $\ell_{i}$ relates to the agent's policy via the expectation over actions:
$$\ell_{i}(\theta_{i},s_{i})=-\log\left(\int_{\mathcal{A}}\rho(s_{i},a_{i})\,\pi_{\theta_{i}}(a_{i}\mid s_{i})\,da_{i}\right)$$
where $\rho(s_{i},a_{i})\in[0,1]$ is the conditional success probability of taking action $a_{i}$ in state $s_{i}$.
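A one-line numeric check of the factorization: under (31), the negative log-likelihood of a successful trajectory decomposes exactly into the per-agent local losses of (32). The agent names and success probabilities below are made up for illustration.

```python
import math

# Numeric check that Eq. (32) is the log of Eq. (31): the trajectory NLL
# equals the sum of per-agent local losses. Probabilities are illustrative.
p_success = {"retrieve": 0.95, "reason": 0.80, "verify": 0.90}

loss_joint = -math.log(math.prod(p_success.values()))      # -log of Eq. (31)
loss_sum = sum(-math.log(p) for p in p_success.values())   # sum of l_i, Eq. (32)
print(loss_joint, loss_sum)
```

The additive form is what makes per-agent credit assignment possible: each $\ell_{i}$ can be attributed to, and optimized for, a single agent.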
To understand how the Agentic AI generalizes, we must quantify how a local perturbation at a specific agent propagates through complex topologies to affect the loss.
We define the Direct Adjacency Jacobian Matrix $\mathbf{J}\in\mathbb{R}^{K\times K}$. The entry $J_{ji}$ captures the local sensitivity of agent $j$ to its direct parent, agent $i$:
$$J_{ji}=\frac{\partial x_{j}}{\partial x_{i}}=\begin{cases}\frac{\partial f_{j}}{\partial x_{i}}&\text{if }(i,j)\in\mathcal{E}\\ 0&\text{otherwise}\end{cases} \quad (33)$$
From this, we can derive the topological weight of a specific agent in the DAG.
Lemma 4.2 (Topological Weight).
Let $\mathcal{L}$ be the Agentic AI loss function and $\omega_{u}=\left\|\frac{d\mathcal{L}}{dx_{u}}\right\|$ be the scalar Topological Weight representing the total sensitivity of the loss to agent $u$. The weight $\omega_{u}$ is determined by aggregating the gradient flow along all paths connecting $u$ to the sink agents:
$$\omega_{u}=\left\|\sum_{v\in\text{Sinks}}\frac{\partial\mathcal{L}}{\partial x_{v}}\sum_{\rho\in\text{Paths}(u\to v)}\left(\prod_{(a,b)\in\rho}J_{ba}\right)\right\| \quad (34)$$
See Appendix A.2 for the proof. Given specific agent weights, we analyze the Agentic AI generalization error, $\mathcal{E}_{\text{Agentic}}$. Consistent with Section 3, we assume local errors decay via a power law governed by the intrinsic dimension $d_{u}$. Using a first-order Taylor expansion around the optimal agent outputs, $\mathcal{E}_{\text{Agentic}}$ is approximated as the weighted superposition of local errors:
$$\mathcal{E}_{\text{Agentic}}\approx\sum_{u=1}^{K}\omega_{u}\cdot\mathcal{E}_{u}\approx\sum_{u=1}^{K}\omega_{u}\cdot\mathcal{O}\left(\left(\frac{N}{K}\right)^{-\frac{1}{d_{u}}}\right) \quad (35)$$
$$\approx C(G)\cdot\left(\frac{N}{K}\right)^{-\frac{1}{d_{\text{eff}}}} \quad (36)$$
where $d_{\text{eff}}$ is the effective intrinsic dimension of the task and $C(G)$ is the Topology Factor determined by the topology of the Agentic AI system.
To disentangle the impact of the Topology Factor from the intrinsic difficulty of specific sub-tasks, we assume that the complex global task is divided into sub-tasks of comparable intrinsic difficulty: formally, $d_{u}\approx d_{\mathrm{eff}}$ for every sub-task $u$. The convergence-rate term then becomes uniform across all agents, allowing us to factor the complexity term out of the summation in Equation (35) and isolate the Topology Factor. The Topology Factor $C(G)$ can then be formally defined as the sum of Topological Weights:
$$C(G)\equiv\sum_{u=1}^{K}\omega_{u}=\sum_{u=1}^{K}\left\|\sum_{v\in\text{Sinks}}\frac{\partial\mathcal{L}}{\partial x_{v}}\left(\sum_{\rho\in\text{Path}_{u\to v}}\prod_{e\in\rho}J_{e}\right)\right\| \quad (37)$$
This definition allows us to analyze the stability of different agentic orchestrations by evaluating how $C(G)$ scales with DAG complexity, and confirms that while $d_{\text{eff}}$ governs the rate of convergence, $C(G)$ determines the magnitude of the error. Agentic AI succeeds when the topology minimizes $C(G)$ while maximizing the dimensionality gap.
Theorem 4.3 (Agentic AI Convergence Superiority).
As the scale of resources (dataset size $N$ or parameter budget $P$) increases, the generalization error of Agentic AI decays exponentially faster than that of the monolithic model, provided the topology satisfies spectral stability (i.e., it is well designed, with $C(G)<\infty$).
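The theorem can be illustrated with hypothetical numbers: even when the Topology Factor exceeds one, the faster exponent $1/d_{\text{eff}}$ eventually dominates the monolithic ambient rate $1/D$. All constants below are illustrative assumptions, not values from the paper.

```python
# Hypothetical illustration of Theorem 4.3: an agentic system paying a
# topology factor C(G) > 1 but converging at rate 1/d_eff overtakes a
# monolithic model converging at the ambient rate 1/D once N is large.
C_G, K, d_eff, D = 5.0, 8, 4, 64

def agentic_err(N: float) -> float:
    return C_G * (N / K) ** (-1.0 / d_eff)   # Eq. (36)

def mono_err(N: float) -> float:
    return N ** (-1.0 / D)                   # ambient-dimension rate

print(agentic_err(1e3), mono_err(1e3))       # monolithic can lead early
print(agentic_err(1e12), mono_err(1e12))     # agentic dominates at scale
```

The crossover point depends on the constants, but its existence depends only on $d_{\text{eff}}<D$, which is the dimensionality-gap condition of the theorem.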
Beyond this graph-level analysis, we can further decompose global instability into single-edge contributions to analyze individual connections.
Lemma 4.4 (Topological Edge Weight).
Consider a specific edge $e^{*}=(u,v)$ connecting a parent agent $u$ to a child agent $v$. The Topological Edge Weight $\mathcal{W}(e^{*})$ represents the total gradient flux passing through this edge, linking the accumulated history of the parent to the future criticality of the child. It is formally defined as:
$$\mathcal{W}(e^{*})=\underbrace{\left(1+\sum_{k\in\mathcal{P}(u)}\left\|\sum_{\rho\in\mathrm{Path}(k\to u)}\prod_{e\in\rho}J_{e}\right\|\right)}_{\mathrm{Upstream\ History}}\cdot\underbrace{\|J_{e^{*}}\|}_{\mathrm{Local\ Valve}}\cdot\underbrace{\left\|\sum_{z\in\mathrm{Sinks}}\frac{\partial\mathcal{L}}{\partial x_{z}}\sum_{\gamma\in\mathrm{Path}(v\to z)}\prod_{e^{\prime}\in\gamma}J_{e^{\prime}}\right\|}_{\mathrm{Downstream\ Future}} \quad (38)$$
where $\mathcal{P}(u)$ is the set of predecessors of agent $u$ and $\mathrm{Sinks}$ denotes the set of final output agents.
See Appendix A.3 for the proof. To minimize the global error $C(G)$, we must orchestrate the necessary agents carefully and seek combinations of small edge weights. This equation reveals a fundamental design principle for which edges to build at runtime, and provides a reference for analyzing each edge post hoc. Specifically: (1) after long chains (high Upstream History), edges must be contractive ($\|J_{e^{*}}\|<1$) to filter accumulated noise, e.g., critic or judge edges; (2) before critical decisions (high Downstream Future sensitivity), edges should satisfy $\|J_{e^{*}}\|\ll 1$, e.g., voting or verification edges that collapse multiple paths into a stable signal.
Consequently, we find that optimal edges function as adaptive valves. An ideal edge is not a passive pipe but an active variational filter that suppresses the noise accumulated from the upstream before it propagates to critical downstream tasks.
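Lemma 4.4 can be made concrete on a toy chain. The sketch below uses scalar states, made-up edge Jacobians, and a single sink with unit loss gradient, and computes $\mathcal{W}(e^{*})$ for each edge via the influence matrix $\mathbf{M}=(\mathbf{I}-\mathbf{J})^{-1}$ from Appendix A.2; all numbers are illustrative assumptions.

```python
import numpy as np

# Toy instance of Eq. (38) on a 4-agent chain 0 -> 1 -> 2 -> 3 with scalar
# states; edge Jacobians and the unit sink gradient are made-up numbers.
# Path sums come from the influence matrix M = (I - J)^{-1}, which for a
# nilpotent J equals the Neumann series of Appendix A.2.
K = 4
J = np.zeros((K, K))
J[1, 0], J[2, 1], J[3, 2] = 1.5, 0.6, 1.2     # J[child, parent]
M = np.linalg.inv(np.eye(K) - J)

def edge_weight(u, v, sink=K - 1, g_sink=1.0):
    upstream = 1.0 + sum(abs(M[u, k]) for k in range(K) if k != u)  # Upstream History
    local = abs(J[v, u])                                            # Local Valve
    downstream = g_sink * abs(M[sink, v])                           # Downstream Future
    return upstream * local * downstream

weights = {(u, v): edge_weight(u, v) for (u, v) in [(0, 1), (1, 2), (2, 3)]}
print(weights)
```

The final edge, sitting after the longest chain, carries the largest weight; making it contractive (e.g., setting `J[3, 2] = 0.5` instead of `1.2`) is exactly the critic-edge prescription of the design principle above.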
In summary, this section not only extends the superiority result from routing-based Agentic AI to general Agentic AI by deriving its topological properties, but also analyzes the impact of specific agents and edges, providing a foundation for Agentic AI design and for explaining its successes and failures.
5 Alternative Views
Monolithic scaling is enough for AGI
DeepMind has stated that “…an agent capable on a large number of tasks and able to be adapted with little extra data to succeed at an even larger number of tasks can be obtained by scaling data, compute and parameters…” (Reed et al., 2022). Agüera y Arcas and Norvig (2023) went further, claiming that “…the most important parts of it (AGI) have already been achieved by the current generation of advanced AI large language models such as ChatGPT, Bard, LLaMA and Claude.” However, very few researchers firmly claim that AGI has arrived, or are fully satisfied with any single monolithic model, even just in coding, let alone across all real-world tasks. We do not dispute that scaling is effective, but both neural scaling laws and empirical experiments have demonstrated diminishing marginal improvements, an inevitable bottleneck (Kaplan et al., 2020; Hoffmann et al., 2022). More attention should be directed to Agentic AI to break this bottleneck.
Agentic AI is conceptually similar to Mixture-of-Experts
Mixture-of-Experts (MoE) (Shazeer et al., 2017) and Agentic AI share a common design principle: both route inputs to specialized sub-networks rather than processing everything through a single monolithic model. Both architectures leverage the insight that task heterogeneity is better handled by specialized components than by a universal compromise, and the empirical success of sparse MoEs (Fedus et al., 2022; Lepikhin et al., 2021) validates this core premise of our theory, namely that routing to specialized sub-networks improves performance even when tasks share a common backbone. In our theoretical framework, MoE corresponds to the routing regime of Section 3.2, where $C(G)\approx\sum L_{u}$ and the system is inherently stable.
However, Agentic AI generalizes beyond MoE in three fundamental aspects. First, in scope: MoE employs fixed expert sub-networks with learned gating within a single forward pass (Fedus et al., 2022; Lepikhin et al., 2021), whereas Agentic AI deploys autonomous agents with independent parameters capable of multi-step reasoning (Sapkota et al., 2026). Second, in topology: MoE implements single-layer routing (router → expert), while Agentic AI extends to arbitrary DAG compositions, as formalized in Section 4. Third, in routing mechanism: MoE relies on differentiable gating trained end-to-end, whereas agentic routing accommodates iterative refinement, external tool use, and dynamic knowledge retrieval (Sapkota et al., 2026; Anthropic, 2025). While MoE and routing-based Agentic AI share a common design principle, Agentic AI extends to richer topological structures with greater expressivity.
Multi-Agent systems often fail
Empirical evidence indicates that increasing agent quantity often introduces organizational entropy rather than performance gains. Complexity frequently hinders reliability, with failures stemming primarily from system design issues, inter-agent misalignment, and task verification difficulties (Pan et al., 2025). Furthermore, current LLMs struggle with coordination tasks requiring Theory of Mind compared to RL methods (Agashe et al., 2025), necessitating dedicated automated methods to diagnose these persistent failures (Zhang et al., 2025).
Recent works exploring flaws in LLM-based multi-agent systems (LaMAS) align with our derivation, attributing failures to Topological Weights and Edge Weights. For instance, misaligned agents introduce toxic topological properties, causing massive downstream variance and hallucination. The necessity of topological awareness is exemplified by the performance surge achieved with well-designed topologies (Anthropic, 2025). Agentic AI demands dedicated topological design; most current frameworks are merely static pipeline decompositions based on human priors, masquerading as true Agentic AI.
6 Call to Action
Prioritize Agentic AI for accessible AGI research
We urge researchers and institutions, especially those with limited resources, to prioritize Agentic AI, which offers a viable alternative to the prohibitive costs of monolithic scaling and yields exponential gains in both sample and parameter efficiency. This paradigm allows for state-of-the-art generalization without the need for brute-force computation. Since the efficiency advantage grows exponentially with the dimensionality gap between the ambient space and task-intrinsic manifolds, a well-designed agentic system of moderately sized specialists can match or exceed monolithic performance at a fraction of the cost, broadening access to AGI research beyond resource-rich laboratories.
Do not only fine-tune weights; invent better multi-agent evolution methods for applicable Agentic AI
The community must expand its focus from simply fine-tuning individual agents to a broader and more diverse optimization of the agentic system. Research should look beyond specific weight adjustments and explore various enhancements, such as mitigating organizational entropy, designing graph, tree or forest evolution methods, and ensuring spectral stability. The goal is to move from static pipelines to the evolution of topologically stable multi-agent ecosystems. In particular, automated methods for discovering optimal DAG topologies, routing mechanisms that scale gracefully with agent count, and topology-aware evaluation protocols that attribute failures to specific graph components are all pressing open problems.
7 Conclusion
This paper challenges the dogma of monolithic scaling, identifying Agentic AI as the superior pathway to AGI. By formalizing real-world task distributions as unions of low-dimensional manifolds, we prove that monolithic models are trapped in an irreducible compromise, the Average Trap, where conflicting optimization landscapes force a penalty that accumulates with task diversity. In contrast, even a merely routing-based Agentic AI achieves exponentially superior sample and parameter efficiency by aligning its architecture with the intrinsic manifold structure, where each agent operates on a low-dimensional sub-manifold ($d_{k}\ll D$) rather than the full ambient space. We further extend this analysis to general Agentic AI formalized as DAG topologies, introducing the Topology Factor $C(G)$ and the Edge Weight decomposition $\mathcal{W}(e^{*})$ as principled tools for analyzing and designing multi-agent systems. Importantly, we show that the agentic advantage degrades gracefully under partial task overlap and that an optimal agent granularity $K^{*}$ exists, balancing specialization gains against routing costs. We also clarify the relationship between Agentic AI and Mixture-of-Experts, and argue that current multi-agent failures stem from poor topological design rather than fundamental flaws. Ultimately, we conclude that achieving AGI requires shifting from brute-force scaling to the precise optimization of stable, well-designed Agentic AI ecosystems.
Acknowledgements
This work was supported by National Natural Science Foundation of China (62322603) and Shanghai Municipal Special Program for Basic Research on General AI Foundation Models (Grant No. 2025SHZDZX025D08).
References
- S. Agashe, Y. Fan, A. Reyna, and X. E. Wang (2025). LLM-Coordination: Evaluating and analyzing multi-agent coordination abilities in large language models. In Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, pp. 8038–8057.
- B. Agüera y Arcas and P. Norvig (2023). Artificial general intelligence is already here. Noema Magazine.
- Anthropic (2024). Claude Code. https://claude.com/product/claude-code
- Anthropic (2025). Building a multi-agent research system. https://www.anthropic.com/engineering/multi-agent-research-system
- P. Battaglia, J. B. C. Hamrick, V. Bapst, A. Sanchez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, C. Gulcehre, F. Song, A. Ballard, J. Gilmer, G. E. Dahl, A. Vaswani, K. Allen, C. Nash, V. J. Langston, C. Dyer, N. Heess, D. Wierstra, P. Kohli, M. Botvinick, O. Vinyals, Y. Li, and R. Pascanu (2018). Relational inductive biases, deep learning, and graph networks. arXiv preprint.
- R. Bellman (1957). Dynamic Programming. Rand Corporation research study, Princeton University Press.
- F. Chollet (2019). On the measure of intelligence. arXiv:1911.01547.
- A. Daniely, S. Sabato, S. Ben-David, and S. Shalev-Shwartz (2011). Multiclass learnability and the ERM principle. In Proceedings of the 24th Annual Conference on Learning Theory, PMLR Vol. 19, Budapest, Hungary, pp. 207–232.
- Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems (NIPS), Vol. 27, pp. 2933–2941.
- E. Dohmatob, Y. Feng, P. Yang, F. Charton, and J. Kempe (2024). A tale of tails: Model collapse as a change of scaling laws. In Proceedings of the 41st International Conference on Machine Learning (ICML).
- W. Fedus, B. Zoph, and N. Shazeer (2022). Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23(120), pp. 1–39.
- R. R. Gudwin (2000). Evaluating intelligence: A computational semiotics perspective. In 2000 IEEE International Conference on Systems, Man and Cybernetics, Vol. 3, pp. 2080–2085.
- J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. Rae, and L. Sifre (2022). An empirical analysis of compute-optimal large language model training. In Advances in Neural Information Processing Systems, Vol. 35, pp. 30016–30030.
- J. Horst (2002). A native intelligence metric for artificial systems. In Proceedings of the Performance Metrics for Intelligent Systems (PerMIS) Workshop.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022). LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
- H. Jiang and Q. Li (2024). Approximation rate of the Transformer architecture for sequence modeling. In Advances in Neural Information Processing Systems, Vol. 37, pp. 68926–68955.
- C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan (2024). SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations.
- Y. Jin (2023). Upper bounds on the Natarajan dimensions of some function classes. In 2023 IEEE International Symposium on Information Theory (ISIT), pp. 1020–1025.
- J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020). Scaling laws for neural language models. arXiv:2001.08361.
- N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2017). On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations (ICLR).
- P. Langley (2000). Crafting papers on machine learning. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Stanford, CA, pp. 1207–1216.
- S. Legg and M. Hutter (2007). Universal intelligence: A definition of machine intelligence. arXiv:0712.3329.
- B. Liu, X. Liu, X. Jin, P. Stone, and Q. Liu (2021). Conflict-averse gradient descent for multi-task learning. In Advances in Neural Information Processing Systems, Vol. 34.
- Manus (2024). Manus: The general purpose AI agent. https://manus.im/
- G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom (2024). GAIA: A benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations.
- B. K. Natarajan (1989). On learning sets and functions. Machine Learning 4(1), pp. 67–97.
- OpenAI (2024). OpenAI Codex. https://chatgpt.com/codex
- M. Z. Pan, M. Cemri, L. A. Agrawal, S. Yang, B. Chopra, R. Tiwari, K. Keutzer, A. Parameswaran, K. Ramchandran, D. Klein, J. E. Gonzalez, M. Zaharia, and I. Stoica (2025). Why do multi-agent systems fail? In ICLR 2025 Workshop on Building Trust in Language Models and Applications.
- S. G. Patil, H. Mao, C. Cheng-Jie Ji, F. Yan, V. Suresh, I. Stoica, and J. E. Gonzalez (2025). The Berkeley Function Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning.
- T. Pearce and J. Song (2024). Reconciling Kaplan and Chinchilla scaling laws. Transactions on Machine Learning Research.
- L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. (2025). Humanity's Last Exam. arXiv:2501.14249.
- T. Porian, M. Wortsman, J. Jitsev, L. Schmidt, and Y. Carmon (2024). Resolving discrepancies in compute-optimal scaling of language models. In Advances in Neural Information Processing Systems, Vol. 37, pp. 100535–100570.
- S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-maron, M. Giménez, Y. Sulsky, J. Kay, J. T. Springenberg, T. Eccles, J. Bruce, A. Razavi, A. Edwards, N. Heess, Y. Chen, R. Hadsell, O. Vinyals, M. Bordbar, and N. de Freitas (2022). A generalist agent. Transactions on Machine Learning Research.
- R. Sapkota, K. I. Roumeliotis, and M. Karkee (2026). AI agents vs. agentic AI: A conceptual taxonomy, applications and challenges. Information Fusion 126, 103599.
- N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR).
- C. J. Stone (1982). Optimal global rates of convergence for nonparametric regression. The Annals of Statistics 10(4), pp. 1040–1053.
- V. N. Vapnik and A. Ya. Chervonenkis (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications 16(2), pp. 264–280.
- Z. Wang (2021). Mitigating negative transfer for better generalization and efficiency in transfer learning. Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA.
- D. H. Wolpert and W. G. Macready (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1(1), pp. 67–82.
- T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn (2020). Gradient surgery for multi-task learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS 2020), Red Hook, NY, USA.
- C. Yun, S. Bhojanapalli, A. S. Rawat, S. Reddi, and S. Kumar (2020). Are Transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations.
- S. Zhang, M. Yin, J. Zhang, J. Liu, Z. Han, J. Zhang, B. Li, C. Wang, H. Wang, Y. Chen, and Q. Wu (2025). Which agent causes task failures and when? On automated failure attribution of LLM multi-agent systems. In Forty-second International Conference on Machine Learning.
- Z. Zhang, J. Shen, C. Cao, G. Dai, S. Zhou, Q. Zhang, S. Zhang, and E. Shutova (2024). Proactive gradient conflict mitigation in multi-task learning: A sparse training perspective. arXiv:2411.18615.
Appendix A Proofs for Theorems
A.1 Proof of Proposition 3.3
Proof.
Since $\theta^{*}_{\text{mono}}$ is a minimizer of the total convex loss $\mathcal{L}_{\text{total}}$, it satisfies the first-order optimality condition:
$$\nabla\mathcal{L}_{\text{total}}(\theta^{*}_{\text{mono}})=\sum_{k=1}^{K}\alpha_{k}\nabla\mathcal{L}_{k}(\theta^{*}_{\text{mono}})=0 \quad (39)$$
This implies that the weighted gradients sum to zero. Unless all $\theta^{*}_{k}$ are identical, for any specific task $k$, $\nabla\mathcal{L}_{k}(\theta^{*}_{\text{mono}})\neq 0$. Geometrically, $\theta^{*}_{\text{mono}}$ lies within the convex hull of the individual optima $\{\theta^{*}_{k}\}_{k=1}^{K}$ but coincides with none of them. Thus, the monolithic solution is a compromise: a Pareto-stationary point where gradients cancel destructively.
We expand the loss $\mathcal{L}_{k}(\theta)$ for each task around its specific optimum $\theta^{*}_{k}$. By Assumption 3.1, $\nabla\mathcal{L}_{k}(\theta^{*}_{k})=0$. Using the Lagrangian form of the Taylor expansion, we have:
$$\mathcal{L}_{k}(\theta^{*}_{\text{mono}})=\mathcal{L}_{k}(\theta^{*}_{k})+\frac{1}{2}(\theta^{*}_{\text{mono}}-\theta^{*}_{k})^{\top}H_{k}(\theta^{*}_{\text{mono}}-\theta^{*}_{k})+R_{3}(\theta^{*}_{\text{mono}},\theta^{*}_{k}) \quad (40)$$
where $R_{3}$ is the third-order remainder term. Under the assumption of a $\rho$-Lipschitz Hessian (Assumption 3.2), this remainder is bounded by $|R_{3}|\leq\frac{\rho}{6}\|\theta^{*}_{\text{mono}}-\theta^{*}_{k}\|^{3}$.
Substituting this back into the total risk objective:
$$\mathcal{L}_{\text{total}}(\theta^{*}_{\text{mono}})=\sum_{k=1}^{K}\alpha_{k}\left[\mathcal{L}_{k}(\theta^{*}_{k})+\frac{1}{2}\|\theta^{*}_{\text{mono}}-\theta^{*}_{k}\|_{H_{k}}^{2}+R_{3}^{(k)}\right] \quad (41)$$
$$=\underbrace{\sum_{k=1}^{K}\alpha_{k}\mathcal{L}_{k}(\theta^{*}_{k})}_{\mathcal{L}_{\text{ideal}}}+\sum_{k=1}^{K}\frac{\alpha_{k}}{2}\|\theta^{*}_{\text{mono}}-\theta^{*}_{k}\|_{H_{k}}^{2}+\underbrace{\sum_{k=1}^{K}\alpha_{k}|R_{3}^{(k)}|}_{\text{Higher-order error}} \quad (42)$$
To guarantee that the interference cost $\epsilon$ is significant, the quadratic term must dominate the third-order error. Since $\|\cdot\|_{H_{k}}^{2}$ scales with $\Delta^{2}$ while the remainder scales with $\Delta^{3}$, in a local neighborhood around the optima where the task divergence is bounded, the positive curvature (guaranteed by positive definite $H_{k}$) strictly dominates the higher-order variations. Thus, we derive the lower bound:
$$\mathcal{L}_{\text{total}}(\theta^{*}_{\text{mono}})\gtrsim\mathcal{L}_{\text{ideal}}+\sum_{k=1}^{K}\frac{\alpha_{k}}{2}\|\theta^{*}_{\text{mono}}-\theta^{*}_{k}\|_{H_{k}}^{2} \quad (43)$$
This confirms that $\epsilon>0$ holds as long as the conflicting gradients force $\theta^{*}_{\text{mono}}$ away from the individual optima, creating an irreducible quadratic penalty. ∎
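A one-dimensional toy instance makes the quadratic penalty tangible. With two quadratic task losses $\mathcal{L}_k(\theta)=\frac{h_k}{2}(\theta-\theta_k^{*})^2$ (the curvatures, optima, and weights below are illustrative numbers, not from the paper), the minimizer of the weighted total satisfies the stationarity condition (39) yet pays the irreducible cost of (43):

```python
import numpy as np

# 1-D toy of Proposition 3.3 (Average Trap): two quadratic task losses with
# distinct optima and Hessians. All numbers are illustrative assumptions.
h = np.array([2.0, 1.0])        # Hessians H_k
opt = np.array([-1.0, 3.0])     # task optima theta_k*
alpha = np.array([0.5, 0.5])    # task weights

# Closed-form minimizer of the weighted total loss.
theta_mono = (alpha * h * opt).sum() / (alpha * h).sum()

grad = (alpha * h * (theta_mono - opt)).sum()               # Eq. (39): vanishes
penalty = (alpha / 2 * h * (theta_mono - opt) ** 2).sum()   # quadratic term of Eq. (43)
print(theta_mono, grad, penalty)
```

The gradient at `theta_mono` is zero while each per-task gradient is not, and the penalty is strictly positive: the compromise point minimizes the sum without reaching any individual optimum.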
A.2 Proof of Lemma 4.2
Proof.
Since agents are topologically sorted, $\mathbf{J}$ is strictly lower triangular and nilpotent.
By the multivariate chain rule, the total derivative $\frac{dx_{j}}{dx_{i}}$ captures the cumulative effect of agent $i$ on agent $j$ through all possible paths:
$$\frac{dx_{j}}{dx_{i}} = \underbrace{\frac{\partial x_{j}}{\partial x_{i}}}_{\text{direct edge}} + \sum_{k \in \mathrm{Pa}(j),\, k \neq i} \underbrace{\frac{\partial x_{j}}{\partial x_{k}}}_{\text{direct step}} \cdot \underbrace{\frac{dx_{k}}{dx_{i}}}_{\text{recursive path}} = \sum_{k} \underbrace{\frac{\partial x_{j}}{\partial x_{k}}}_{J_{jk}} \underbrace{\frac{dx_{k}}{dx_{i}}}_{M_{ki}} \tag{44}$$
Define the influence matrix $\mathbf{M} \in \mathbb{R}^{K \times K}$ with $M_{ji} = \frac{dx_{j}}{dx_{i}}$. The recursion can be written in matrix form:
$$\mathbf{M} = \mathbf{J}\mathbf{M} + \mathbf{I}$$
where $\mathbf{I}$ encodes the self-influence $\mathbf{I}_{ii} = \frac{dx_{i}}{dx_{i}} = 1$ and is $0$ elsewhere. Rearranging for $\mathbf{M}$:
$$(\mathbf{I} - \mathbf{J})\mathbf{M} = \mathbf{I} \implies \mathbf{M} = (\mathbf{I} - \mathbf{J})^{-1}$$
Since $\mathbf{J}$ is nilpotent (by the acyclicity of the DAG), the inverse expands as a finite Neumann series:
$$\mathbf{M} = \sum_{k=0}^{K-1}\mathbf{J}^{k} = \mathbf{I} + \mathbf{J} + \mathbf{J}^{2} + \dots$$
Physically, $\mathbf{J}^{k}$ represents influence propagation along paths of length exactly $k$, so $\mathbf{M}$ automatically aggregates all parallel and serial paths. From this we can derive the topological weight of a specific agent in the DAG.
Let $\mathbf{g} = \left[\left\|\frac{\partial\mathcal{L}}{\partial x_{1}}\right\|, \dots, \left\|\frac{\partial\mathcal{L}}{\partial x_{K}}\right\|\right]^{\top}$ be the gradient of the loss with respect to the agent outputs (typically non-zero only for sink agents). The total sensitivity of $\mathcal{L}$ to a specific agent $u$ is the $u$-th component of:
$$\bm{\omega} = \mathbf{M}^{\top}\mathbf{g}$$
The scalar topological weight $\omega_{u}$ for agent $u$ is explicitly:
$$\omega_{u} = \left\|\frac{d\mathcal{L}}{dx_{u}}\right\| = \left\|\sum_{v \in \mathrm{Sinks}} \frac{\partial\mathcal{L}}{\partial x_{v}} \sum_{\rho \in \mathrm{Paths}(u \to v)} \underbrace{\prod_{(a,b) \in \rho} J_{ba}}_{\text{weight of path } \rho}\right\| \tag{45}$$
∎
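The identity $\mathbf{M} = (\mathbf{I} - \mathbf{J})^{-1} = \sum_{k} \mathbf{J}^{k}$ and the resulting topological weights are easy to verify on a toy DAG. The sketch below uses a hypothetical 4-agent chain with scalar outputs and illustrative Jacobian magnitudes (none of these numbers come from the paper).

```python
import numpy as np

# Illustrative 4-agent DAG, topologically sorted, scalar outputs.
# J[j, i] = d x_j / d x_i for a direct edge i -> j; strictly lower triangular.
K = 4
J = np.zeros((K, K))
J[1, 0] = 0.5   # edge 0 -> 1
J[2, 0] = 0.2   # edge 0 -> 2
J[2, 1] = 0.8   # edge 1 -> 2
J[3, 2] = 1.5   # edge 2 -> 3 (agent 3 is the only sink)

# Influence matrix M = (I - J)^{-1}, and its finite Neumann expansion
# M = I + J + J^2 + ... (finite because J is nilpotent).
M = np.linalg.inv(np.eye(K) - J)
M_neumann = sum(np.linalg.matrix_power(J, k) for k in range(K))
assert np.allclose(M, M_neumann)

# M[3, 0] aggregates both paths from agent 0 to the sink:
# 0 -> 1 -> 2 -> 3 (0.5 * 0.8 * 1.5 = 0.6) plus 0 -> 2 -> 3 (0.2 * 1.5 = 0.3).
print(M[3, 0])  # 0.9

# Topological weights omega = M^T g, with the loss gradient only at the sink.
g = np.zeros(K)
g[3] = 1.0
omega = M.T @ g  # omega_u = sensitivity of the loss to agent u's output
```

The path-sum interpretation of Eq. (45) falls out directly: each entry of $\mathbf{M}$ is the sum over all directed paths of the product of edge Jacobians along the path.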
A.3 Proof of Lemma 4.4
Proof.
The weight is derived by tracing the full back-propagation path of the loss gradient through the specific edge $e^{*}$. The total influence is the product of the signal magnitude reaching the parent $u$ and the distribution of that signal to all upstream ancestors.
First, we isolate the incoming gradient signal from the child $v$. By the chain rule, the gradient at $u$ contributed strictly by $v$ is $\frac{\partial\mathcal{L}}{\partial x_{v}} J_{e^{*}}$. The magnitude of this local flux is:
$$\|\mathrm{Flux}_{v \to u}\| \leq \|J_{e^{*}}\| \cdot \left\|\frac{\partial\mathcal{L}}{\partial x_{v}}\right\| = \|J_{e^{*}}\| \cdot \omega_{v} \tag{46}$$
Second, we account for the propagation of this flux upstream. The signal distributes to $u$ itself (identity gain) and to every predecessor $k \in \mathcal{P}(u)$. The cumulative amplification factor is the sum of path gains:
$$\text{Upstream Mass} = \|I\| + \sum_{k \in \mathcal{P}(u)} \left\|\sum_{\rho \in \mathrm{Path}(k \to u)} \prod_{e \in \rho} J_{e}\right\| \tag{47}$$
Third, we expand the downstream sensitivity $\omega_{v}$. The total gradient $\frac{\partial\mathcal{L}}{\partial x_{v}}$ aggregates the error signals back-propagated from all reachable sink nodes $z$. Expanding the influence matrix entry $M_{zv} = \frac{dx_{z}}{dx_{v}}$ as a sum over all paths $\gamma$:
$$\omega_{v} = \left\|\sum_{z \in \mathrm{Sinks}} \frac{\partial\mathcal{L}}{\partial x_{z}} \frac{dx_{z}}{dx_{v}}\right\| = \left\|\sum_{z \in \mathrm{Sinks}} \frac{\partial\mathcal{L}}{\partial x_{z}} \sum_{\gamma \in \mathrm{Path}(v \to z)} \prod_{e' \in \gamma} J_{e'}\right\| \tag{48}$$
Multiplying these three components (the Upstream Mass, the local valve $J_{e^{*}}$, and the expanded downstream sensitivity) yields the complete Topological Edge Weight. ∎
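The three-factor structure of the edge weight can be made concrete on a toy DAG. The sketch below uses a hypothetical 4-agent graph with scalar outputs and illustrative Jacobian magnitudes (all numbers are assumptions for illustration, not values from the paper), and computes the three factors for one edge.

```python
import numpy as np

# Illustrative 4-agent scalar DAG; J[j, i] = d x_j / d x_i for edge i -> j.
K = 4
J = np.zeros((K, K))
J[1, 0], J[2, 0], J[2, 1], J[3, 2] = 0.5, 0.2, 0.8, 1.5
M = np.linalg.inv(np.eye(K) - J)  # influence matrix (I - J)^{-1}

u, v = 1, 2                       # the edge e* = (u -> v) under scrutiny
g = np.zeros(K)
g[3] = 1.0                        # loss gradient lives only at sink agent 3

# Factor 1, Eq. (47): upstream mass = identity gain + path gains from every
# ancestor k to u. Row u of M already sums all path products k -> u, with
# M[u, u] = 1 as the identity term.
upstream_mass = M[u].sum()        # 1 (identity) + 0.5 (path 0 -> 1) = 1.5

# Factor 2: the local valve, the Jacobian magnitude of the edge itself.
local_valve = abs(J[v, u])        # 0.8

# Factor 3, Eq. (48): downstream sensitivity omega_v = ||dL/dx_v||.
omega_v = (M.T @ g)[v]            # path 2 -> 3 gives 1.5

edge_weight = upstream_mass * local_valve * omega_v
print(edge_weight)  # 1.5 * 0.8 * 1.5 = 1.8
```

In the scalar case the norms in Eqs. (46)-(48) reduce to absolute values, so the product of the three factors is exact rather than an upper bound.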
Appendix B Some Relevant Remarks
B.1 Remark for Section 3.1
B.2 Remark for Assumption 3.1
B.3 Remark for Section 3.2
B.4 Remark for Assumption 3.4