SCALE: Scalable Cross-Attention Learning with Extrapolation for Agentic Workflow Scheduling

arXiv cs.LG Papers

Summary

This paper proposes SCALE, a deep reinforcement learning scheduler for agentic LLM workflow DAGs that generalizes to unseen cluster sizes using cross-attention and structured representation regularization, reducing response time without retraining.

arXiv:2606.06820v1 Announce Type: new Abstract: Agentic Large Language Model (LLM) systems decompose complex tasks into workflow Directed Acyclic Graphs (DAGs) whose primitives must be scheduled on heterogeneous clusters. Existing deep reinforcement learning (DRL) schedulers are tied to a fixed cluster size and require retraining whenever the number of servers changes. We propose SCALE (Scalable Cross-Attention Learning with Extrapolation), a DRL scheduler that generalizes to unseen cluster scales without fine-tuning. SCALE employs a cross-attention pointer network where task features query against server features, so the architecture accepts any number of servers by construction. We observe, however, that permutation-invariant architecture alone does not guarantee good performance at new scales - the attention feature undergoes distribution shift as the server count grows. To counter this, we introduce Structured Representation Regularization (SRR): a decorrelation loss combined with a KL penalty toward the standard normal, which keeps feature statistics stable regardless of input size. Trained on 16 nodes and tested directly on 32 and 48 nodes, SCALE reduces average response time by 8.9% at N=48 relative to the same architecture without SRR, confirming that explicit regularization is necessary to close the scale-generalization gap.

# SCALE: Scalable Cross-Attention Learning with Extrapolation for Agentic Workflow Scheduling
Source: [https://arxiv.org/html/2606.06820](https://arxiv.org/html/2606.06820)
Jierui Lan, Zixuan Liang, Aiji Liang (liangaiji@mail.bnu.edu.cn), Jinxi He

Faculty of Arts and Sciences, Beijing Normal University, Zhuhai 519087, China

###### Abstract

Agentic Large Language Model (LLM) systems decompose complex tasks into workflow Directed Acyclic Graphs (DAGs) whose primitives must be scheduled on heterogeneous clusters. Existing deep reinforcement learning (DRL) schedulers are tied to a fixed cluster size and require retraining whenever the number of servers changes. We propose SCALE (Scalable Cross-Attention Learning with Extrapolation), a DRL scheduler that generalizes to unseen cluster scales without fine-tuning. SCALE employs a cross-attention pointer network where task features query against server features, so the architecture accepts any number of servers by construction. We observe, however, that permutation-invariant architecture alone does not guarantee good performance at new scales: the attention feature undergoes distribution shift as the server count grows. To counter this, we introduce Structured Representation Regularization (SRR): a decorrelation loss combined with a KL penalty toward the standard normal, which keeps feature statistics stable regardless of input size. Trained on 16 nodes and tested directly on 32 and 48 nodes, SCALE reduces average response time by 8.9% at $N=48$ relative to the same architecture without SRR, confirming that explicit regularization is necessary to close the scale-generalization gap.

###### Keywords

Workflow Scheduling, Cross-attention, Scalable Reinforcement Learning

Journal: Computer Networks

## 1 Introduction

Agentic Large Language Model (LLM) systems do more than answer questions. They perceive environments, make plans, invoke external tools, and self-correct when execution goes wrong [plaat2025agentic](https://arxiv.org/html/2606.06820#bib.bib1). A planner inside these systems breaks high-level goals into Directed Acyclic Graphs (DAGs) of primitive operations [kim2023llm](https://arxiv.org/html/2606.06820#bib.bib2), where each node is a tool call and each edge a data dependency. How these primitives get mapped onto compute hardware directly determines response latency and cluster utilization. Unlike single-model inference serving, the cluster here is heterogeneous, its available capacity shifts at runtime, and individual primitives differ greatly in memory and compute needs.

Cloud-based task scheduling has been studied extensively [armbrust2010view](https://arxiv.org/html/2606.06820#bib.bib3), yet the bulk of that literature addresses MapReduce jobs [dean2008mapreduce](https://arxiv.org/html/2606.06820#bib.bib4) or microservice chains [zhang2021sinan](https://arxiv.org/html/2606.06820#bib.bib5), workloads whose structure differs substantially from agentic DAGs. Several recent analyses [chaudhry2025murakkab](https://arxiv.org/html/2606.06820#bib.bib6), [shen2025batch](https://arxiv.org/html/2606.06820#bib.bib7) observe that conventional schedulers cannot capture DAG-level dependencies together with per-primitive heterogeneity in compute and memory. Agentic tasks compound the difficulty: execution paths are decided on-the-fly and arrivals are stochastic, rendering static resource reservation largely useless [cheng2024slice](https://arxiv.org/html/2606.06820#bib.bib8). A further obstacle is that current DRL-based schedulers embed the cluster size into their network architecture, so any change in the number of servers forces a full retraining cycle [ma2024efficient](https://arxiv.org/html/2606.06820#bib.bib9). Designing a scheduler for this setting raises two concrete difficulties.

Representing a heterogeneous, variable-size state. The cluster state mixes task-level attributes (compute demand, memory footprint) with server-level attributes (capacity, current load), and the server count $N$ may differ between training and deployment. Standard MLP policy networks [schulman2017proximal](https://arxiv.org/html/2606.06820#bib.bib10) accept fixed-dimension inputs, so they break outright when $N$ changes. Graph Neural Networks (GNNs) can encode task dependencies [liu2024ga](https://arxiv.org/html/2606.06820#bib.bib11), but merging task and server features into one graph adds substantial computational overhead.

![Refer to caption](https://arxiv.org/html/2606.06820v1/x1.png)

Figure 1: System architecture for agentic workflow scheduling. Workflow DAGs arrive dynamically and are collected in a Ready Primitive Buffer. A two-level scheduler (heuristic longest-path selection followed by an RL agent) assigns primitives to a heterogeneous server cluster, minimizing average response time and communication delay.

![Refer to caption](https://arxiv.org/html/2606.06820v1/x2.png)

Figure 2: Architecture of the SCALE algorithm with SRR. A 2-layer cross-attention backbone integrates primitive features (queries) with server features (key-value pairs). The resulting representation feeds a Pointer Actor (softmax over server scores) and a Value Critic, both trained via PPO.

Generalizing across cluster scales. We want to train on a small cluster ($N=16$) and deploy on larger ones ($N=32$, $48$, or beyond) without retraining. In practice, a policy learned at one scale degrades at another because the latent features undergo distribution shift. In attention-based architectures [vaswani2017attention](https://arxiv.org/html/2606.06820#bib.bib12), the variance of the aggregated representation grows with the key-set size, which causes feature dimensions to become redundant and correlated. The policy then becomes brittle once the input set exceeds what was seen during training.

We tackle both issues with Scalable Cross-Attention Learning with Extrapolation (SCALE). Figures [1](https://arxiv.org/html/2606.06820#S1.F1) and [2](https://arxiv.org/html/2606.06820#S1.F2) show the system overview and algorithm architecture, respectively. The core mechanism is cross-attention: the current task's features serve as the query while all server states form the key-value set, allowing the model to accept any number of servers without modification. On top of this, a pointer network [vinyals2015pointer](https://arxiv.org/html/2606.06820#bib.bib13) computes selection scores through query-key dot products, yielding an output whose dimension equals the current $N$. The main technical contribution is structured representation regularization (SRR): a decorrelation penalty on the off-diagonal covariance of the attention feature, paired with a KL term that pulls each dimension's marginal toward $\mathcal{N}(0,1)$. SRR prevents the feature statistics from drifting as $N$ grows, so a model trained at $N=16$ can run at $N=32$ or $48$ without any fine-tuning.
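To make the two SRR ingredients concrete, the following is a minimal sketch, not the paper's exact loss. It assumes the decorrelation penalty is the sum of squared off-diagonal covariance entries over a batch of attention features, and that the KL term is the closed-form divergence between a moment-matched Gaussian per dimension and $\mathcal{N}(0,1)$. The function name `srr_penalties` and the `eps` floor are hypothetical.

```python
import math

def srr_penalties(F, eps=1e-6):
    """F: batch of attention features as a list of rows, shape (batch, d).

    Returns (decorrelation, kl). Both forms are assumptions for illustration.
    """
    n, d = len(F), len(F[0])
    mean = [sum(row[k] for row in F) / n for k in range(d)]
    X = [[row[k] - mean[k] for k in range(d)] for row in F]
    # Batch covariance matrix of the centered features.
    cov = [[sum(x[a] * x[b] for x in X) / n for b in range(d)] for a in range(d)]
    # Decorrelation: penalize squared off-diagonal covariance entries.
    decor = sum(cov[a][b] ** 2 for a in range(d) for b in range(d) if a != b)
    # KL( N(mean_k, var_k) || N(0, 1) ), summed over feature dimensions.
    kl = 0.0
    for k in range(d):
        var = max(cov[k][k], eps)
        kl += 0.5 * (var + mean[k] ** 2 - 1.0 - math.log(var))
    return decor, kl

# Uncorrelated, zero-mean, unit-variance features incur no penalty.
print(srr_penalties([[1, 1], [-1, -1], [1, -1], [-1, 1]]))  # (0.0, 0.0)
```

Under these assumptions, features whose dimensions move together or whose variance drifts away from one produce a strictly positive penalty, which is the behaviour the paragraph above attributes to SRR.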

We evaluate SCALE on a heterogeneous simulated cluster, training at $N=16$ and testing at $N=32$ and $48$ with no fine-tuning. We compare with several DRL baselines and ablate SRR's contribution to generalization.

Our contributions are:

1. A formalization of agentic workflow scheduling as an MDP over dynamically arriving DAGs on a heterogeneous cluster, jointly capturing task dependencies, resource heterogeneity, and communication costs.
2. The SCALE algorithm, a cross-attention pointer network augmented with SRR, which achieves zero-shot generalization across cluster sizes.
3. Experimental evidence that SCALE, trained at $N=16$, retains competitive response time at $N=32$ and $48$, whereas the same architecture without SRR degrades noticeably.

The rest of this paper proceeds as follows. Section [2](https://arxiv.org/html/2606.06820#S2) surveys related work. Section [3](https://arxiv.org/html/2606.06820#S3) formalizes the system model. Section [4](https://arxiv.org/html/2606.06820#S4) describes SCALE. Section [5](https://arxiv.org/html/2606.06820#S5) presents experiments. Section [6](https://arxiv.org/html/2606.06820#S6) discusses limitations and Section [7](https://arxiv.org/html/2606.06820#S7) concludes.

## 2 Related Work

Agentic LLM Systems. An agentic LLM system iteratively decomposes goals, invokes tools, and revises its plan based on intermediate results, going well beyond single-turn generation. AutoGPT [significantgravitas2023autogpt](https://arxiv.org/html/2606.06820#bib.bib14) demonstrated recursive task decomposition driven entirely by an LLM. LangChain [chase2022langchain](https://arxiv.org/html/2606.06820#bib.bib15) introduced reusable abstractions for tool calling, memory management, and chain-of-thought orchestration. MetaGPT [hong2024metagpt](https://arxiv.org/html/2606.06820#bib.bib16) assigned distinct roles to collaborating agents, emulating structured development processes.

From a scheduling standpoint, the critical observation is that these workflows are DAGs. LangGraph [langchain2024langgraph](https://arxiv.org/html/2606.06820#bib.bib17) and AutoGen [wu2023autogen](https://arxiv.org/html/2606.06820#bib.bib18) represent agent interactions explicitly as graphs with conditional branches and parallel paths. Nodes are tool calls or reasoning steps; edges are data dependencies. Different tool calls (code execution, API queries, database lookups) exhibit vastly different latencies and resource footprints, and the number of concurrently active agents fluctuates rapidly. Existing orchestration frameworks, however, assume a fixed compute pool and do not adjust scheduling to a changing cluster. Our method instead treats the server set as variable-size input and learns a policy that adapts to arbitrary cluster configurations.

Workflow Scheduling. DAG scheduling on distributed clusters has been studied for decades. The classical list heuristics HEFT and CPOP [1999a](https://arxiv.org/html/2606.06820#bib.bib19) run in $O(v^{2}p)$ for $v$ tasks and $p$ processors and yield good makespans, but they require the processor set to be fixed and known ahead of time. Neither can react to nodes joining or leaving at runtime.

DRL introduced learnable scheduling policies. DeepRM [mao2016deeprm](https://arxiv.org/html/2606.06820#bib.bib20) used a fully connected policy network to bin-pack jobs onto machines. Decima [mao2019decima](https://arxiv.org/html/2606.06820#bib.bib21) went further by encoding DAG structure with a GNN, outperforming hand-tuned heuristics on dataflow workloads. Both, however, are trained for a specific cluster size: their policy networks have a fixed output dimension and cannot be reused when $N$ changes. Multi-objective variants share this limitation. Our work targets elastic clusters where nodes may be added or removed at runtime due to auto-scaling, preemption, or failure, demanding a scheduler that generalizes across sizes without retraining.

Scalable RL for Variable-size Problems. The fundamental difficulty in applying RL to scheduling is that both the task graph and the server set vary in size across instances. Fixed-dimensional MLP policies fail whenever the problem size changes. Two lines of work address this.

One line relies on permutation-invariant encoders. GCNs [kipf2017gcn](https://arxiv.org/html/2606.06820#bib.bib22) handle graph-structured inputs and have been incorporated into scheduling policies. For unordered sets such as available machines, Deep Sets [zaheer2017deep](https://arxiv.org/html/2606.06820#bib.bib23) and Set Transformers [lee2019set](https://arxiv.org/html/2606.06820#bib.bib24) offer architectures whose output is independent of input ordering.

The other line exploits attention for combinatorial optimization. The Attention Model [kool2019attention](https://arxiv.org/html/2606.06820#bib.bib25) solved vehicle routing and TSP instances of varying size with a single trained model; similar ideas appear in job-shop scheduling [park2021jobshop](https://arxiv.org/html/2606.06820#bib.bib26). Self-attention handles variable cardinality because the softmax normalizes over whatever set is present.

Training on small instances and deploying on larger ones (scale generalization) has been explored through curriculum learning and size-invariant architectural biases [bengio2021ml4co](https://arxiv.org/html/2606.06820#bib.bib27), but remains open. We combine cross-attention between a single task query and a variable-size server key set with explicit feature regularization to achieve zero-shot transfer across cluster sizes.

## 3 Problem Formulation

We consider online scheduling of agentic workflows on a heterogeneous computing cluster. Workflows arrive over time, each composed of interdependent primitives. The scheduler assigns ready primitives to servers in real time, aiming to minimize average response time subject to resource constraints and dependency ordering.

### 3.1 Workflow Model

Let $\mathcal{W}=\{\mathcal{W}_{1},\mathcal{W}_{2},\dots\}$ denote the set of workflows that arrive during the system's operation. Each workflow $\mathcal{W}_{k}$ is represented as a directed acyclic graph (DAG) $\mathcal{G}_{k}=(\mathcal{V}_{k},\mathcal{E}_{k})$, where:

1. $\mathcal{V}_{k}$ is the set of primitives (atomic execution units). Each primitive $v_{i}\in\mathcal{V}_{k}$ is characterized by a tuple $(c_{i},m_{i},d_{i})$:
   - (a) $c_{i}>0$: computational demand (in GFLOPS),
   - (b) $m_{i}>0$: memory requirement (in GB),
   - (c) $d_{i}\geq 0$: output data volume (in GB) to be transmitted to successor tasks.
2. $\mathcal{E}_{k}\subseteq\mathcal{V}_{k}\times\mathcal{V}_{k}$ is the set of directed edges representing data dependencies. An edge $(v_{i},v_{j})\in\mathcal{E}_{k}$ indicates that $v_{j}$ depends on the completion of $v_{i}$.

The predecessor set of a node $v_{j}$ is $\Gamma^{-}(v_{j})=\{v_{i}\in\mathcal{V}_{k}:(v_{i},v_{j})\in\mathcal{E}_{k}\}$. Node $v_{j}$ becomes ready when all its predecessors have finished execution, i.e., $\Gamma^{-}(v_{j})\subseteq\mathcal{D}(t)$, where $\mathcal{D}(t)$ is the set of tasks completed by time $t$. The ready set at time $t$ is

$$\mathcal{R}(t)=\left\{v_{j}\in\bigcup_{k}\mathcal{V}_{k}\setminus\mathcal{D}(t)\;:\;\Gamma^{-}(v_{j})\subseteq\mathcal{D}(t)\right\},$$

which constitutes the only schedulable tasks at that moment. The composition of $\mathcal{R}(t)$ changes dynamically as tasks complete and new workflows arrive. Workflows arrive according to a Poisson process; the arrival rate may vary with system load.
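The ready-set definition translates almost directly into a set comprehension. The sketch below is illustrative only: the dictionary layout `preds` (primitive to predecessor set) and the four-node example workflow are assumptions, and it follows the formula literally, so a simulator would additionally need to exclude primitives that are already running.

```python
def ready_set(preds, done):
    """Primitives that are not completed and whose predecessors are all completed.

    preds: dict mapping each primitive to the set of its direct predecessors.
    done:  set of completed primitives, playing the role of D(t).
    """
    return {v for v, p in preds.items() if v not in done and p <= done}

# Hypothetical workflow: a -> b, a -> c, and both b and c -> d.
preds = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}

print(sorted(ready_set(preds, done=set())))            # ['a']
print(sorted(ready_set(preds, done={"a"})))            # ['b', 'c']
print(sorted(ready_set(preds, done={"a", "b"})))       # ['c']
print(sorted(ready_set(preds, done={"a", "b", "c"})))  # ['d']
```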

### 3.2 Cluster Model

The computing cluster consists of a set of servers $\mathcal{H}=\{h_{1},\dots,h_{N}\}$, where $N$ can change over time due to auto-scaling, preemption, or failures. Each server $h_{j}$ is characterized by its resource capacities $(C_{j},M_{j})$:

1. $C_{j}>0$: computing capacity (in GFLOPS),
2. $M_{j}>0$: memory capacity (in GB).

Servers are heterogeneous; capacities may differ by orders of magnitude.

Each server executes at most one primitive at a time. A server $h_{j}$ that is idle and satisfies the memory constraint $m_{i}\leq M_{j}$ may accept the assignment of a task $v_{i}$. Once assigned, the execution time of $v_{i}$ on $h_{j}$ is

$$\tau_{ij}=\frac{c_{i}}{C_{j}}.$$

If a dependency edge $(v_{i},v_{j})\in\mathcal{E}_{k}$ has its predecessor $v_{i}$ assigned to server $h_{p}$ and its successor $v_{j}$ assigned to a different server $h_{q}$ ($p\neq q$), the cross-server data transfer incurs a communication cost

$$\delta_{ij}=\frac{d_{i}}{B}+L,$$

where $B$ is the cluster network bandwidth (GB/s) and $L$ is the base network latency (s). If $p=q$, then $\delta_{ij}=0$. This communication cost does not directly affect the completion time of a task; instead, it serves as a reward-shaping signal to encourage locality (see Section [4.1.3](https://arxiv.org/html/2606.06820#S4.SS1.SSS3)).
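As a quick numerical illustration of the two cost formulas, the sketch below evaluates them for made-up values. The function names and all numbers are hypothetical and are not taken from the paper's experimental setup.

```python
def execution_time(c_i, C_j):
    """tau_ij = c_i / C_j: compute demand divided by server capacity."""
    return c_i / C_j

def communication_cost(d_i, p, q, B, L):
    """delta_ij = d_i / B + L across servers, and 0 when both tasks share a server."""
    if p == q:
        return 0.0
    return d_i / B + L

# Hypothetical numbers: demand 200 on a server with capacity 50.
print(execution_time(200.0, 50.0))                              # 4.0
# 5 GB of output over a 10 GB/s link with 10 ms base latency.
print(round(communication_cost(5.0, p=1, q=2, B=10.0, L=0.01), 2))  # 0.51
# Same server: no transfer cost.
print(communication_cost(5.0, p=1, q=1, B=10.0, L=0.01))        # 0.0
```

Because the communication term only enters the reward, a simulator following this model would use `execution_time` to advance task progress and `communication_cost` only when scoring a placement.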

### 3.3 Objective and Constraints

Let $f_{i}$ denote the actual completion time of task $v_{i}$ (including queuing and execution delays). Let $a_{k(i)}$ be the arrival time of the workflow $\mathcal{W}_{k(i)}$ to which $v_{i}$ belongs. The response time of $v_{i}$ is $T_{i}=f_{i}-a_{k(i)}$. The scheduler aims to find a scheduling policy $\pi$ to minimize the average response time over all completed tasks:

$$\min\;\bar{T}=\frac{1}{|\mathcal{P}|}\sum_{v_{i}\in\mathcal{P}}\left(f_{i}-a_{k(i)}\right),$$

where $\mathcal{P}$ is the set of all completed tasks during the operating period.
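A small worked example may help here, since the response time of each task is measured from the arrival of its workflow rather than from the moment the task became ready. The data layout and the numbers below are hypothetical.

```python
def mean_response_time(finish, workflow_of, arrival):
    """Average of f_i - a_k(i) over all completed tasks.

    finish:      dict task -> completion time f_i
    workflow_of: dict task -> id of the workflow the task belongs to
    arrival:     dict workflow id -> arrival time a_k
    """
    times = [finish[v] - arrival[workflow_of[v]] for v in finish]
    return sum(times) / len(times)

# Two workflows: W1 arrives at t=0, W2 at t=2.
finish = {"v1": 3.0, "v2": 5.0, "v3": 6.0}
workflow_of = {"v1": "W1", "v2": "W1", "v3": "W2"}
arrival = {"W1": 0.0, "W2": 2.0}

# Response times are 3, 5 and 4, so the mean is 4.
print(mean_response_time(finish, workflow_of, arrival))  # 4.0
```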

The schedule must satisfy the following constraints at all times:

1. Readiness: A task can only be assigned when it is in the ready set $\mathcal{R}(t)$.
2. Server idleness: A server can execute at most one task at a time; it must be idle at the moment of assignment.
3. Memory capacity: The assigned task's memory requirement must not exceed the server's memory capacity: $m_{i}\leq M_{j}$.
4. Non-preemption: Once a task starts on a server, it runs to completion without interruption.

However, this formulation is incomplete without specifying how the policy $\pi$ is structured. We defer the full MDP definition (state, action, and reward) to Section [4.1](https://arxiv.org/html/2606.06820#S4.SS1), after introducing the RL framework.

### 3.4 Challenges and MDP Formulation

Compared to classical static DAG scheduling, this problem introduces additional difficulties:

1. Online arrival: Workflows arrive stochastically; the scheduler has no knowledge of future tasks.
2. DAG dependencies: Readiness depends on predecessor completions, and cross-server placement incurs communication costs, so greedy assignment is generally suboptimal.
3. Variable cluster size: $N$ may change dynamically through auto-scaling, making it necessary for the scheduler to work at scales it was never trained on.
4. Large action space: The number of feasible task-server pairings grows combinatorially with the ready set and the server count.

We model the problem as a Markov Decision Process (MDP). The state encodes the current task (chosen by a deterministic first-level heuristic) together with all server statuses. The action picks a server for that task. The reward encourages short execution times and data locality, aligning with the response-time objective. The full MDP specification appears in Section [4](https://arxiv.org/html/2606.06820#S4).

## 4 Algorithm

### 4.1 Reinforcement Learning Formulation

We define the MDP $(\mathcal{S},\mathcal{A},r,\mathcal{T},\gamma)$ as follows. Letting the agent jointly decide over all ready tasks and all servers would produce a Cartesian-product action space that grows too quickly. We therefore split scheduling into two levels. The first level is deterministic: it selects a single task $v^{*}$ from $\mathcal{R}(t)$ by longest hop-count path (in edges) to the terminal node of its workflow, prioritizing critical-path tasks. No parameters are learned at this level. The second level is the RL agent: given $v^{*}$ and the full cluster state, it chooses which server to assign $v^{*}$ to. This decomposition reduces each decision step to a single server selection.
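The first-level heuristic can be sketched as a memoized longest-path computation over the DAG. This is a minimal illustration, assuming a successor-list representation `succs` in which every node appears as a key, measuring distance to any sink of the workflow, and breaking ties by task id; the tie-breaking rule and all names are assumptions rather than details from the paper.

```python
def hops_to_sink(succs):
    """Longest path length, counted in edges, from each node to a sink."""
    memo = {}

    def depth(v):
        if v not in memo:
            children = succs.get(v, ())
            memo[v] = 1 + max(map(depth, children)) if children else 0
        return memo[v]

    for v in succs:
        depth(v)
    return memo

def select_task(ready, succs):
    """Pick the ready task with the longest remaining hop count."""
    hops = hops_to_sink(succs)
    # sorted() makes the tie-break deterministic (smallest id wins).
    return max(sorted(ready), key=lambda v: hops.get(v, 0))

# Hypothetical workflow: a -> b -> c -> t and x -> t.
succs = {"a": ["b"], "b": ["c"], "c": ["t"], "x": ["t"], "t": []}
print(hops_to_sink(succs)["a"])        # 3
print(select_task({"b", "x"}, succs))  # b
```

Here `b` is chosen over `x` because two more edges lie between `b` and the sink, so delaying it would delay the whole chain.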

The formal optimization objective is

$$
\begin{aligned}
\min_{\pi:\,\mathcal{S}\to\mathcal{A}}\quad & \bar{T}=\frac{1}{|\mathcal{P}|}\sum_{v_{i}\in\mathcal{P}}\bigl(f_{i}-a_{k(i)}\bigr), && \text{(1a)}\\
\mathrm{s.t.}\quad & a_{t}=\pi(\boldsymbol{s}_{t}),\quad\forall\,t, && \text{(1b)}\\
& v^{*}\in\mathcal{R}(t), && \text{(1c)}\\
& h_{a_{t}}\in\mathcal{I}(t), && \text{(1d)}\\
& m_{v^{*}}\leq M_{a_{t}}, && \text{(1e)}
\end{aligned}
$$

where $\boldsymbol{s}_{t}\in\mathcal{S}$ is the system state at decision step $t$, $a_{t}\in\mathcal{A}$ is the server index selected by the policy, $v^{*}$ is the task chosen by the first-level scheduler, and $\mathcal{I}(t)\subseteq\mathcal{H}$ is the set of idle servers at time $t$. $\mathcal{P}$ denotes the set of all completed tasks during the operating period and $\bar{T}$ is the mean response time. The constraint $a_{t}=\pi(\boldsymbol{s}_{t})$ makes explicit that every server assignment is determined by the same stationary policy acting on the current state, so optimizing $\pi$ over the state space $\mathcal{S}$ is the sole degree of freedom for minimizing $\bar{T}$.

#### 4.1.1 State $\mathcal{S}$

At each decision step, the agent's state $\boldsymbol{s}_{t}$ is formed by concatenating the features of the current task $v^{*}$ with the real-time states of all $N$ servers. The task feature vector $\boldsymbol{\phi}_{t}\in\mathbb{R}^{5}$ contains five scalars: normalized computational demand $c_{v^{*}}/C_{\max}$, normalized memory requirement $m_{v^{*}}/M_{\max}$, normalized number of successors $|\Gamma^{+}(v^{*})|/5$, the completion ratio of the parent workflow, and the output data volume $d_{v^{*}}$ (unnormalized). The server feature vector $\boldsymbol{\psi}_{t}^{(j)}\in\mathbb{R}^{5}$ for server $h_{j}$ contains its normalized computing capacity $C_{j}/C_{\max}$, normalized memory $M_{j}/M_{\max}$, remaining execution time of the current task, instantaneous power consumption, and the estimated communication delay from the server hosting a predecessor of $v^{*}$ to $h_{j}$ if $v^{*}$ were assigned there. The state vector is

$$\boldsymbol{s}_{t}=\bigl[\boldsymbol{\phi}_{t};\;\boldsymbol{\psi}_{t}^{(1)},\;\ldots,\;\boldsymbol{\psi}_{t}^{(N)}\bigr]\in\mathbb{R}^{5+5N},$$

whose dimensionality scales linearly with the cluster size $N$. Note that this state vector is only a conceptual description; in our cross-attention architecture, $\boldsymbol{\phi}_{t}$ and $\boldsymbol{\psi}_{t}^{(j)}$ are embedded separately.
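The feature layout above can be sketched as plain lists, which also makes the $5+5N$ dimensionality easy to check. The function and argument names are hypothetical, and the example values are arbitrary.

```python
def task_features(c, m, n_succ, completion_ratio, d, C_max, M_max):
    """Five task scalars; the successor count is divided by 5 as in the text."""
    return [c / C_max, m / M_max, n_succ / 5, completion_ratio, d]

def server_features(C, M, remaining_time, power, comm_delay, C_max, M_max):
    """Five server scalars in the order listed in the text."""
    return [C / C_max, M / M_max, remaining_time, power, comm_delay]

def build_state(phi, psis):
    """Conceptual flat state [phi; psi_1, ..., psi_N] of length 5 + 5N."""
    return phi + [x for psi in psis for x in psi]

phi = task_features(c=40.0, m=2.0, n_succ=2, completion_ratio=0.5, d=1.0,
                    C_max=100.0, M_max=16.0)
for N in (16, 32, 48):
    psis = [server_features(50.0, 8.0, 0.0, 0.0, 0.0, 100.0, 16.0) for _ in range(N)]
    print(N, len(build_state(phi, psis)))  # lengths 85, 165, 245
```

The flat vector changes length with `N`, which is why the architecture consumes `phi` and the list of `psi` vectors separately instead.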

#### 4.1.2 Action $\mathcal{A}$

At each step, the agent selects an action $a_{t}$ from $\{1,\ldots,N\}$, indicating the assignment of $v^{*}$ to server $h_{a_{t}}$. If $h_{a_{t}}$ is currently idle and satisfies $m_{v^{*}}\leq M_{a_{t}}$, the assignment takes effect immediately and the simulation clock does not advance; the scheduler then selects the next task from the ready set. Otherwise, the action is invalid, and the simulation clock advances by $\Delta t$, during which progress updates for executing tasks, readiness releases upon task completions, and Poisson arrivals of new workflows are processed.
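The valid/invalid branching can be summarized in a few lines. This sketch only captures the validity test and the clock rule; the bookkeeping that happens during $\Delta t$ (task progress, readiness releases, new arrivals) is left out, and the dictionary-based server record is an assumption.

```python
def apply_action(task_mem, server, clock, dt):
    """Return (valid, new_clock) for assigning a task to the chosen server.

    server: dict with keys 'idle' (bool) and 'M' (memory capacity in GB).
    """
    valid = server["idle"] and task_mem <= server["M"]
    if valid:
        server["idle"] = False   # assignment takes effect immediately
        return True, clock       # clock does not advance on a valid action
    return False, clock + dt     # invalid action: simulator advances by dt

server = {"idle": True, "M": 8.0}
print(apply_action(4.0, server, clock=10.0, dt=0.5))  # (True, 10.0)
# The same server is now busy, so a second assignment is invalid.
print(apply_action(4.0, server, clock=10.0, dt=0.5))  # (False, 10.5)
```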

#### 4.1.3 Reward $R$

The immediate reward for a valid assignment is

$$r_{t}=w_{\tau}\,e^{-\tau_{v^{*},a_{t}}/s_{\tau}}-\alpha\,\frac{\delta_{v^{*},a_{t}}}{\delta_{\max}}-\beta\,\frac{e_{v^{*},a_{t}}}{e_{\max}},\tag{2}$$

where $\tau_{v^{*},a_{t}}=c_{v^{*}}/C_{a_{t}}$ is the estimated execution time, $\delta_{v^{*},a_{t}}$ is the communication cost from the server hosting one of $v^{*}$'s already-assigned predecessors to $h_{a_{t}}$ (when $v^{*}$ has multiple predecessors, we use the first predecessor whose assignment is completed; if $v^{*}$ has no predecessors, $\delta_{v^{*},a_{t}}=0$), and $e_{v^{*},a_{t}}$ is the estimated energy consumption. The normalization baselines are $\delta_{\max}$ and $e_{\max}$. We set $w_{\tau}=1.0$, $s_{\tau}=1.0$, $\alpha=0.1$, and $\beta=0$ (energy optimization is not active in the current experiments). Invalid actions receive $r_{t}=-0.001$ as a mild penalty to discourage their abuse. The exponential form bounds the time reward within $(0,1]$, giving stronger marginal incentive for faster assignments while keeping the reward magnitude stable across tasks with different compute demands. The learning objective is $\max\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r_{t}]$ with $\gamma=0.99$. Because the dominant term in ([2](https://arxiv.org/html/2606.06820#S4.E2)) decreases monotonically with execution time, maximizing cumulative reward is aligned with minimizing $\bar{T}$ in ([1](https://arxiv.org/html/2606.06820#S4.E1)), though not strictly equivalent, since the communication and energy penalties also encourage locality and efficiency.

One subtlety: the simulation clock advances only on invalid actions, so policies with higher invalid-action rates accumulate more workflow arrivals within the same simulated duration, inflating their completed-task count. We therefore use $\bar{T}$ as the evaluation metric, which is unaffected by this artifact.
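Equation (2) with the stated weights is straightforward to write down. In the sketch below the defaults for `w_tau`, `s_tau`, `alpha`, `beta` and the invalid-action penalty come from the text, while the defaults for `delta_max` and `e_max` are placeholders, since this section does not give their values.

```python
import math

def reward(valid, tau, delta, energy=0.0, *, w_tau=1.0, s_tau=1.0,
           alpha=0.1, beta=0.0, delta_max=1.0, e_max=1.0):
    """Immediate reward of Eq. (2); delta_max and e_max defaults are placeholders."""
    if not valid:
        return -0.001  # mild penalty for invalid actions
    time_term = w_tau * math.exp(-tau / s_tau)   # lies in (0, 1] when w_tau = 1
    comm_term = alpha * delta / delta_max        # locality penalty
    energy_term = beta * energy / e_max          # inactive when beta = 0
    return time_term - comm_term - energy_term

print(reward(True, tau=0.0, delta=0.0))   # 1.0
print(reward(False, tau=0.0, delta=0.0))  # -0.001
# A faster server earns a larger reward than a slower one for the same task.
print(reward(True, tau=0.5, delta=0.0) > reward(True, tau=2.0, delta=0.0))  # True
```

With these weights the time term dominates: moving from a co-located placement to the worst-case transfer costs at most `alpha = 0.1`, whereas the time term spans the whole interval $(0,1]$.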

### 4.2 Cross-Attention Pointer Network

Given this MDP, the agent must pick a server at each step. We adopt a cross-attention pointer network: task features form the query, server features form the key-value set. After $L$ cross-attention layers, the task representation encodes information about the full cluster, and pointer-style dot-product scores yield the selection distribution over servers.

#### 4.2.1 Embedding

We parse $\boldsymbol{s}_{t}$ into the task feature $\boldsymbol{\phi}\in\mathbb{R}^{d_{0}}$ ($d_{0}=5$) and the server feature matrix $\boldsymbol{\Psi}=[\boldsymbol{\psi}^{(1)};\ldots;\boldsymbol{\psi}^{(N)}]\in\mathbb{R}^{N\times d_{0}}$. Two separate two-layer fully connected networks with ReLU activations map these into a shared $d$-dimensional latent space: $\boldsymbol{q}^{(0)}=g_{p}(\boldsymbol{\phi})\in\mathbb{R}^{d}$ for the task and $\boldsymbol{z}_{j}=g_{s}(\boldsymbol{\psi}^{(j)})\in\mathbb{R}^{d}$ for each server. The two networks $g_{p}:\mathbb{R}^{d_{0}}\to\mathbb{R}^{d}$ and $g_{s}:\mathbb{R}^{d_{0}}\to\mathbb{R}^{d}$ have independent parameters.

#### 4.2.2 Cross-Attention Backbone

The task embedding $\boldsymbol{q}^{(0)}$ passes through $L$ successive cross-attention blocks. Each block has two sublayers (multi-head cross-attention followed by a feedforward network), both with residual connections and layer normalization. In layer $\ell$, the task embedding is the query and all server embeddings are key-value pairs, processed in parallel across $H$ heads. For head $h$ ($h=1,\ldots,H$), separate projection matrices map query and key-values into a $d_{h}=d/H$-dimensional subspace. The scaled dot-product attention weights are

$$\alpha_{\ell j}^{(h)}=\frac{\exp\!\bigl((\boldsymbol{W}_{Qh}^{(\ell)}\boldsymbol{q}^{(\ell-1)})^{\top}(\boldsymbol{W}_{Kh}^{(\ell)}\boldsymbol{z}_{j})\,/\,\sqrt{d_{h}}\bigr)}{\sum_{j^{\prime}=1}^{N}\exp\!\bigl((\boldsymbol{W}_{Qh}^{(\ell)}\boldsymbol{q}^{(\ell-1)})^{\top}(\boldsymbol{W}_{Kh}^{(\ell)}\boldsymbol{z}_{j^{\prime}})\,/\,\sqrt{d_{h}}\bigr)}.\tag{3}$$

Each head aggregates value vectors $\boldsymbol{W}_{Vh}^{(\ell)}\boldsymbol{z}_{j}$ weighted by these coefficients. The concatenated head outputs pass through a projection $\boldsymbol{W}_{O}^{(\ell)}$, then a residual connection and layer normalization yield the intermediate representation $\hat{\boldsymbol{q}}^{(\ell)}$. A feedforward sublayer (two linear layers with hidden dimension $4d$ and GELU activation, again with residual connection and layer normalization) produces the layer output $\boldsymbol{q}^{(\ell)}$. After $L$ layers,

$$\boldsymbol{f}=\boldsymbol{q}^{(L)}\in\mathbb{R}^{d},$$

is called the attention feature. It compresses the contextual information of all $N$ servers into a fixed-length vector whose dimension $d$ is independent of the cluster size $N$.
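The attention step of Eq. (3) can be sketched in plain Python to show why the output length does not depend on $N$. This is a simplified illustration with random, untrained weights: it covers only the multi-head attention and concatenation, and omits the output projection, residual connections, layer normalization and the feedforward sublayer. The helper names and the small sizes `d = 8`, `H = 2` are assumptions.

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_head(q, Z, W_q, W_k, W_v):
    """One head: a single task query attends over N server embeddings."""
    d_h = len(W_q)
    qh = matvec(W_q, q)
    keys = [matvec(W_k, z) for z in Z]
    vals = [matvec(W_v, z) for z in Z]
    scores = [sum(a * b for a, b in zip(qh, k)) / math.sqrt(d_h) for k in keys]
    alpha = softmax(scores)  # attention weights over the N servers
    out = [sum(a * v[i] for a, v in zip(alpha, vals)) for i in range(d_h)]
    return out, alpha

def multi_head(q, Z, heads):
    """Concatenate head outputs (later sublayers are omitted in this sketch)."""
    out = []
    for W_q, W_k, W_v in heads:
        o, _ = cross_attention_head(q, Z, W_q, W_k, W_v)
        out.extend(o)
    return out

def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0 / math.sqrt(cols)) for _ in range(cols)]
            for _ in range(rows)]

rng = random.Random(0)
d, H = 8, 2
heads = [tuple(rand_matrix(rng, d // H, d) for _ in range(3)) for _ in range(H)]
q = [rng.gauss(0, 1) for _ in range(d)]
for N in (16, 32, 48):
    Z = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
    print(N, len(multi_head(q, Z, heads)))  # the output length is always 8
```

Since the softmax-weighted sum runs over whatever servers are present, reordering the rows of `Z` changes the result only by floating-point rounding.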

#### 4.2.3 Actor and Critic

The Actor uses $\boldsymbol{f}$ as query and the server embeddings $\{\boldsymbol{z}_{j}\}$ as keys, producing selection scores via pointer-style dot products:

$$p_{j}=(\boldsymbol{W}_{q}\,\boldsymbol{f})^{\top}(\boldsymbol{W}_{k}\,\boldsymbol{z}_{j}),\quad\pi_{\theta}(a_{t}=j\mid\boldsymbol{s}_{t})=\frac{e^{p_{j}}}{\sum_{j^{\prime}=1}^{N}e^{p_{j^{\prime}}}},\tag{4}$$

where $\boldsymbol{W}_{q},\boldsymbol{W}_{k}\in\mathbb{R}^{d\times d}$ are learnable projections. The Actor's output dimension equals the current cluster size $N$ and adapts at test time. The Critic takes $\boldsymbol{f}$ alone and estimates $V_{\theta}(\boldsymbol{s}_{t})$ through a three-layer MLP (hidden dimension 64, ReLU). It deliberately avoids aggregating over $\{\boldsymbol{z}_{j}\}$ so as not to introduce an implicit dependency on $N$. The full network is trained end-to-end with PPO.
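The pointer head of Eq. (4) is a dot product followed by a softmax, so its output length tracks the number of servers passed in. The sketch below uses identity projections and a two-dimensional toy feature purely for illustration; the function name and the example values are hypothetical.

```python
import math

def pointer_policy(f, Z, W_q, W_k):
    """Eq. (4): p_j = (W_q f)^T (W_k z_j), then a softmax over the N servers."""
    qf = [sum(w * x for w, x in zip(row, f)) for row in W_q]
    scores = []
    for z in Z:
        kz = [sum(w * x for w, x in zip(row, z)) for row in W_k]
        scores.append(sum(a * b for a, b in zip(qf, kz)))
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

identity = [[1.0, 0.0], [0.0, 1.0]]
f = [1.0, 0.0]
Z = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]  # three servers, so three probabilities

probs = pointer_policy(f, Z, identity, identity)
print(len(probs))                # 3
print(probs.index(max(probs)))   # 2
```

Adding a fourth row to `Z` yields a four-way distribution without touching `W_q` or `W_k`, which is the property the Actor relies on at test time.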

#### 4.2.4 Permutation Invariance and Scale Adaptability

No positional encoding is used, so the ordering of server embeddings $\{\boldsymbol{z}_{j}\}$ does not affect $\boldsymbol{f}$; the network is permutation-invariant over the server set. The Actor's output dimension equals $N$ automatically. These two properties allow the model to accept any cluster size at test time. Permutation invariance, however, is necessary but not sufficient for generalization. When $N'\neq N_{\mathrm{train}}$, we observe that the distribution of $\boldsymbol{f}$ shifts systematically and decision quality degrades. Controlling the feature statistics is therefore required on top of the architectural design.
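The invariance claim can be checked numerically. The sketch below uses a deliberately simplified single-head attention pooling (no residual, layer norm, or feedforward sublayer) with random weights, so it is an assumption-laden stand-in for the full block rather than the authors' network. It verifies that shuffling the servers leaves the pooled feature unchanged, and that pointer scores are reordered by the same permutation.

```python
import numpy as np

def attention_pool(q, Z, W_Q, W_K, W_V):
    # Single-head cross-attention: softmax over servers, then a weighted sum of values.
    logits = (Z @ W_K.T) @ (W_Q @ q) / np.sqrt(W_Q.shape[0])
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()
    return alpha @ (Z @ W_V.T)

rng = np.random.default_rng(1)
d, N = 16, 32
W_Q, W_K, W_V = (rng.normal(0.0, d ** -0.5, (d, d)) for _ in range(3))
q = rng.normal(size=d)
Z = rng.normal(size=(N, d))
perm = rng.permutation(N)

f = attention_pool(q, Z, W_Q, W_K, W_V)
f_perm = attention_pool(q, Z[perm], W_Q, W_K, W_V)
invariant = np.allclose(f, f_perm)        # the feature ignores server ordering

# Stand-in pointer scores with identity projections (an assumption for brevity).
scores = Z @ f
scores_perm = Z[perm] @ f_perm
equivariant = np.allclose(scores[perm], scores_perm)
print(invariant, equivariant)
```

In this simplified setting the feature is invariant and the scores are equivariant, so the selected physical server does not depend on how the server list is ordered.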

Algorithm 1: Cross-Attention Pointer Network + SRR Training

- 1: Input: Environment $\mathcal{E}$ ($N$ servers), network parameters $\theta$, PPO hyperparameters $(\gamma,\lambda_{\mathrm{GAE}},\varepsilon,c_{1},c_{2})$, SRR weights $(\lambda_{1},\lambda_{2})$, warmup start epoch $E_{s}$, warmup duration $E_{w}$, total training epochs $E$, collection steps per epoch $T$, update epochs $K$, mini-batch size $B$
- 2: Output: Trained policy parameters $\theta^{*}$
- 3: Randomly initialize $\theta$; initialize optimizer and cosine annealing learning rate scheduler
- 4: for $e=1$ to $E$ do
  - 5: Clear experience buffer $\mathcal{D}\leftarrow\emptyset$
  - 6: for $t=1$ to $T$ do
    - 7: Obtain state $\boldsymbol{s}_{t}$ from $\mathcal{E}$; parse into $(\boldsymbol{\phi},\,\boldsymbol{\Psi})$
    - 8: $\boldsymbol{q}^{(0)}\leftarrow g_{p}(\boldsymbol{\phi})$, $\boldsymbol{z}_{j}\leftarrow g_{s}(\boldsymbol{\psi}^{(j)})$, $j=1,\ldots,N$
    - 9: for $\ell=1$ to $L$ do
      - 10: $\boldsymbol{q}^{(\ell)}\leftarrow\mathrm{CrossAttnBlock}(\boldsymbol{q}^{(\ell-1)},\{\boldsymbol{z}_{j}\})$
    - 11: end for
    - 12: $\boldsymbol{f}\leftarrow\boldsymbol{q}^{(L)}$
    - 13: $p_{j}\leftarrow(\boldsymbol{W}_{q}\boldsymbol{f})^{\top}(\boldsymbol{W}_{k}\boldsymbol{z}_{j})$, $\pi_{\theta}(j\mid\boldsymbol{s}_{t})\leftarrow e^{p_{j}}/\sum_{j'}e^{p_{j'}}$
    - 14: Sample $a_{t}$ from $\pi_{\theta}$; compute $\log\pi_{\theta}(a_{t}\mid\boldsymbol{s}_{t})$ and $V_{\theta}(\boldsymbol{s}_{t})\leftarrow\mathrm{MLP}(\boldsymbol{f})$
    - 15: Execute $a_{t}$ in $\mathcal{E}$; receive $r_{t}$, $\boldsymbol{s}_{t+1}$
    - 16: $\mathcal{D}\leftarrow\mathcal{D}\cup\{(\boldsymbol{s}_{t},a_{t},r_{t},\log\pi_{\theta}(a_{t}\mid\boldsymbol{s}_{t}),V_{\theta}(\boldsymbol{s}_{t}))\}$
  - 17: end for
  - 18: Compute advantages $\hat{A}_{t}$ and returns $R_{t}$ using GAE, $\forall\,(\cdot)\in\mathcal{D}$
  - 19: $w\leftarrow\min\bigl(1,\,\max(0,\,(e-E_{s})/E_{w})\bigr)$
  - 20: for $k=1$ to $K$ do
    - 21: for each mini-batch $\mathcal{B}\subset\mathcal{D}$, $\lvert\mathcal{B}\rvert=B$ do
      - 22: Recompute forward pass to obtain $\log\pi_{\theta}'$, entropy $H$, $V_{\theta}'$, $\boldsymbol{f}$
      - 23: $\rho\leftarrow\exp(\log\pi_{\theta}'-\log\pi_{\theta_{\mathrm{old}}})$
      - 24: $\mathcal{L}_{\mathrm{clip}}\leftarrow-\frac{1}{\lvert\mathcal{B}\rvert}\sum\min\bigl(\rho\,\hat{A},\,\mathrm{clip}(\rho,\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}\bigr)$
      - 25: $\mathcal{L}_{\mathrm{vf}}\leftarrow\frac{1}{\lvert\mathcal{B}\rvert}\sum(R-V_{\theta}')^{2}$
      - 26: $\mathcal{L}_{\pi}\leftarrow\mathcal{L}_{\mathrm{clip}}+c_{1}\,\mathcal{L}_{\mathrm{vf}}-c_{2}\,H$
      - 27: Compute $\hat{\boldsymbol{C}}$ from $\boldsymbol{f}$ in $\mathcal{B}$; $\mathcal{L}_{c}\leftarrow\sum_{i\neq j}\lvert(\hat{\boldsymbol{C}})_{ij}\rvert$
      - 28: Compute per-dimension $\hat{\mu}_{k}$, $\hat{\sigma}_{k}^{2}$; $\mathcal{L}_{d}\leftarrow\frac{1}{2d}\sum_{k}(\hat{\mu}_{k}^{2}+\hat{\sigma}_{k}^{2}-\ln\hat{\sigma}_{k}^{2}-1)$
      - 29: $\mathcal{L}\leftarrow\mathcal{L}_{\pi}+w\,(\lambda_{1}\,\mathcal{L}_{c}+\lambda_{2}\,\mathcal{L}_{d})$ (see ([5](https://arxiv.org/html/2606.06820#S4.E5)))
      - 30: $\theta\leftarrow\theta-\eta\,\nabla_{\theta}\mathcal{L}$ (update after gradient clipping)
    - 31: end for
  - 32: end for
  - 33: Update learning rate scheduler
- 34: end for
- 35: return $\theta^{*}\leftarrow\theta$

### 4.3 Structured Representation Regularization

Cross-attention with a pointer network provides permutation invariance and variable output size, but does nothing to prevent the attention feature $\boldsymbol{f}$ from drifting when the cluster grows. This drift manifests in two ways. When dimensions of $\boldsymbol{f}$ are correlated, effective information concentrates along a few principal directions; a change in $N'$ shifts those directions and the Actor loses its ability to discriminate among servers. Separately, the softmax denominator $\sum_{j'}e^{p_{j'}}$ in ([4](https://arxiv.org/html/2606.06820#S4.E4)) grows with $N'$, flattening selection probabilities and blurring the distinction between good and bad servers.

We introduce Structured Representation Regularization (SRR) to counter both effects. SRR imposes two constraints on the batch statistics of $\boldsymbol{f}$ during training. A decorrelation loss $\mathcal{L}_{c}$ penalizes off-diagonal entries of the empirical covariance matrix, pushing feature dimensions toward independence so that information distributes across all $d$ dimensions rather than concentrating on a few axes. A distribution constraint $\mathcal{L}_{d}$ pulls each dimension's marginal toward $\mathcal{N}(0,1)$ via a KL penalty, anchoring the statistics to a fixed reference irrespective of how many servers participate in the attention computation. The combined effect is that features maintain stable structure as $N$ varies, producing more confident policy outputs that resist the probability-flattening caused by a larger softmax denominator. Below we detail the two loss terms.

Let the set of attention features in the current mini-batch be $\boldsymbol{F}=[\boldsymbol{f}_{1},\ldots,\boldsymbol{f}_{B}]^{\top}\in\mathbb{R}^{B\times d}$, with batch mean $\bar{\boldsymbol{f}}=\frac{1}{B}\sum_{b}\boldsymbol{f}_{b}$ and empirical covariance matrix

$$\hat{\boldsymbol{C}}=\frac{1}{B-1}\bigl(\boldsymbol{F}-\boldsymbol{1}\bar{\boldsymbol{f}}^{\top}\bigr)^{\top}\bigl(\boldsymbol{F}-\boldsymbol{1}\bar{\boldsymbol{f}}^{\top}\bigr)\in\mathbb{R}^{d\times d}.$$

The decorrelation loss penalizes the off-diagonal elements of $\hat{\boldsymbol{C}}$, pushing the dimensions of $\boldsymbol{f}$ toward statistical independence:

$$\mathcal{L}_{c}=\sum_{i\neq j}\bigl\lvert(\hat{\boldsymbol{C}})_{ij}\bigr\rvert.$$
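The decorrelation loss above translates almost directly into NumPy. The sketch below is illustrative only (the function name is hypothetical), and uses two tiny hand-built batches so the result can be checked by hand.

```python
import numpy as np

def decorrelation_loss(F):
    # F: mini-batch of attention features, shape (B, d).
    B = F.shape[0]
    centered = F - F.mean(axis=0, keepdims=True)
    C = centered.T @ centered / (B - 1)          # empirical covariance, shape (d, d)
    off_diag = C - np.diag(np.diag(C))           # zero out the variances
    return np.abs(off_diag).sum()                # sum of |C_ij| over i != j

# Two perfectly correlated dimensions: covariance 1 appears at (0,1) and (1,0).
F_corr = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
# Two uncorrelated dimensions: the off-diagonal covariance is exactly zero.
F_indep = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])

loss_corr = decorrelation_loss(F_corr)
loss_indep = decorrelation_loss(F_indep)
print(loss_corr, loss_indep)
```

Because the sum runs over all ordered pairs $i\neq j$, each correlated pair is counted twice, which is why the first batch yields 2 rather than 1.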
Algorithm 2: Online Scheduling Decision

- 1: Input: Trained parameters $\theta^{*}$, target cluster $\mathcal{H}'=\{h_{1},\ldots,h_{N'}\}$ ($N'$ may differ from $N$)
- 2: Output: Assignment decision mapping each ready task $v^{*}$ to a server
- 3: Load $\theta^{*}$; set to evaluation mode
- 4: while system is running do
  - 5: Wait until ready set $\mathcal{R}(t)\neq\emptyset$
  - 6: First-level scheduler selects $v^{*}$ from $\mathcal{R}(t)$ (prioritizing the longest hop count to the workflow's terminal node)
  - 7: Construct state $\boldsymbol{s}_{t}=[\boldsymbol{\phi};\,\boldsymbol{\psi}^{(1)},\ldots,\boldsymbol{\psi}^{(N')}]$
  - 8: $\boldsymbol{q}^{(0)}\leftarrow g_{p}(\boldsymbol{\phi})$, $\boldsymbol{z}_{j}\leftarrow g_{s}(\boldsymbol{\psi}^{(j)})$, $j=1,\ldots,N'$
  - 9: for $\ell=1$ to $L$ do
    - 10: $\boldsymbol{q}^{(\ell)}\leftarrow\mathrm{CrossAttnBlock}(\boldsymbol{q}^{(\ell-1)},\{\boldsymbol{z}_{j}\}_{j=1}^{N'})$
  - 11: end for
  - 12: $\boldsymbol{f}\leftarrow\boldsymbol{q}^{(L)}$
  - 13: $a^{*}\leftarrow\arg\max_{j\in\{1,\ldots,N'\}}(\boldsymbol{W}_{q}\boldsymbol{f})^{\top}(\boldsymbol{W}_{k}\boldsymbol{z}_{j})$
  - 14: if $h_{a^{*}}$ is idle and $m_{v^{*}}\leq M_{a^{*}}$ then
    - 15: Assign $v^{*}$ to $h_{a^{*}}$; execute immediately
  - 16: else
    - 17: Mark as invalid action; advance simulation clock by $\Delta t$
  - 18: end if
- 19: end while
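Lines 13 to 17 of Algorithm 2 amount to a greedy argmax followed by a feasibility check. The sketch below illustrates that step in NumPy under a few assumptions: $M_{a^{*}}$ is read as the chosen server's memory capacity, the projections are identity matrices for readability, and `schedule_step` is a hypothetical helper name.

```python
import numpy as np

def schedule_step(f, Z, W_q, W_k, idle, mem_capacity, task_mem):
    # Line 13: greedy server choice from the pointer scores.
    scores = (Z @ W_k.T) @ (W_q @ f)
    a_star = int(np.argmax(scores))
    # Line 14: the action is valid only if the server is idle and the task fits in memory.
    valid = bool(idle[a_star]) and bool(task_mem <= mem_capacity[a_star])
    return a_star, valid

Z = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])   # three toy server embeddings
f = np.array([1.0, 0.0])                             # toy attention feature
I2 = np.eye(2)                                       # identity projections (assumption)
mem_capacity = np.array([8.0, 8.0, 32.0])

busy_case = schedule_step(f, Z, I2, I2, np.array([True, True, False]), mem_capacity, 16.0)
free_case = schedule_step(f, Z, I2, I2, np.array([True, True, True]), mem_capacity, 16.0)
print(busy_case, free_case)
```

In both calls the third server (index 2) scores highest. When it is busy the action is flagged invalid, which in Algorithm 2 corresponds to advancing the simulation clock by $\Delta t$ instead of assigning the task.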

The distribution constraint loss drives the marginal distribution of each dimension toward the standard normal, equivalent to minimizing the sum of per-dimension KL divergences from $\mathcal{N}(0,1)$:

$$\mathcal{L}_{d}=D_{KL}\bigl(p(\boldsymbol{f})\,\|\,\mathcal{N}(\boldsymbol{0},\boldsymbol{I})\bigr)=\frac{1}{2d}\sum_{k=1}^{d}\bigl(\hat{\mu}_{k}^{2}+\hat{\sigma}_{k}^{2}-\ln\hat{\sigma}_{k}^{2}-1\bigr),$$

where $\hat{\mu}_{k}=\frac{1}{B}\sum_{b}f_{bk}$ and $\hat{\sigma}_{k}^{2}=\frac{1}{B-1}\sum_{b}(f_{bk}-\hat{\mu}_{k})^{2}$. We apply both constraints to $\boldsymbol{f}$ rather than to the embedding layer $\{\boldsymbol{z}_{j}\}$. The rationale is that $\boldsymbol{f}$ feeds directly into Actor scoring and Critic estimation; stabilizing its distribution stabilizes the entire decision pipeline. The embedding layer, by contrast, operates on each server independently and its statistics do not inherently depend on $N$. The total training loss is

$$\mathcal{L}=\mathcal{L}_{\pi}+\lambda_{1}\,\mathcal{L}_{c}+\lambda_{2}\,\mathcal{L}_{d}, \tag{5}$$

with $\lambda_{1}=0.01$ and $\lambda_{2}=0.001$. SRR introduces no additional network parameters. To avoid disrupting early-stage policy learning while representations are still forming, we linearly ramp $\lambda_{1}$ and $\lambda_{2}$ from zero starting at epoch 10, reaching full strength by epoch 60. This ramp is the warmup factor $w$ in line 19 of Algorithm 1, with $E_{s}=10$ and $E_{w}=50$.
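The warmup schedule and the weighted total loss can be sketched in a few lines of plain Python. The weights $\lambda_{1}=0.01$, $\lambda_{2}=0.001$ and the epoch-10-to-60 ramp come from the text; the function names and the example loss values are hypothetical.

```python
def srr_warmup_weight(epoch, start_epoch=10, warmup_epochs=50):
    # w = min(1, max(0, (e - E_s) / E_w)), as in line 19 of Algorithm 1.
    return min(1.0, max(0.0, (epoch - start_epoch) / warmup_epochs))

def total_loss(policy_loss, decorr_loss, kl_loss, epoch, lam1=0.01, lam2=0.001):
    # Eq. (5) with the SRR terms scaled by the warmup factor.
    w = srr_warmup_weight(epoch)
    return policy_loss + w * (lam1 * decorr_loss + lam2 * kl_loss)

schedule = {e: srr_warmup_weight(e) for e in (0, 10, 35, 60, 100)}
print(schedule)  # {0: 0.0, 10: 0.0, 35: 0.5, 60: 1.0, 100: 1.0}

# Example values for the three loss terms are made up for illustration.
early = total_loss(1.0, 10.0, 100.0, epoch=5)    # SRR switched off
late = total_loss(1.0, 10.0, 100.0, epoch=60)    # SRR at full strength
print(early, late)
```

Before epoch 10 the objective reduces to the PPO loss alone, and from epoch 60 onward it matches Eq. (5) exactly.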

Once the constraints are approximately satisfied ($\mathbb{E}[\boldsymbol{f}]\approx\boldsymbol{0}$ and $\hat{\boldsymbol{C}}\approx\boldsymbol{I}_{d}$), each dimension of $\boldsymbol{W}_{q}\boldsymbol{f}$ contributes independently with stable magnitude. A larger $N$ changes the aggregation pattern but no longer concentrates information along shifted principal directions. SRR thus converts the architectural property of permutation invariance into actual runtime robustness across scales.
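The distribution constraint $\mathcal{L}_{d}$ defined earlier in this section can be sketched with NumPy as follows. The function name is hypothetical, and the sketch assumes every feature dimension has nonzero batch variance (otherwise the logarithm diverges); a practical implementation would likely add a small constant inside the log.

```python
import numpy as np

def kl_to_standard_normal(F):
    # F: mini-batch of attention features, shape (B, d).
    d = F.shape[1]
    mu = F.mean(axis=0)                    # per-dimension batch mean
    var = F.var(axis=0, ddof=1)            # unbiased per-dimension variance (1 / (B - 1))
    return np.sum(mu**2 + var - np.log(var) - 1.0) / (2 * d)

# Each column has mean 0 and unbiased variance 1, so the penalty vanishes.
F_ref = np.array([[-1.0, 1.0], [0.0, 0.0], [1.0, -1.0]])
# Shifting every feature by 2 leaves the variance at 1 but moves the mean to 2.
F_shift = F_ref + 2.0

loss_ref = kl_to_standard_normal(F_ref)
loss_shift = kl_to_standard_normal(F_shift)
print(loss_ref, loss_shift)
```

For the shifted batch each dimension contributes $\hat{\mu}_{k}^{2}=4$, so the loss is $(4+4)/(2\cdot 2)=2$, which shows how a systematic drift in the feature mean is penalized even when the spread is unchanged.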

## 5 Experiment

### 5.1 Experimental Setup

#### 5.1.1 Hardware

All simulation experiments are conducted on a device equipped with an Apple M4 processor and 32 GB of memory\.

#### 5.1.2 Simulation Setup

The simulated cluster has three tiers of servers at a base scale of $N=16$ (Table [1](https://arxiv.org/html/2606.06820#S5.T1)). For generalization tests, we scale to $N=32$ and $N=48$ by proportionally increasing each server type, keeping the small:medium:large ratio at roughly 2:5:1. Workflows arrive as a Poisson process with base rate $\lambda=1.0$; the rate scales with cluster size ($\lambda=2.0$ at $N=32$, $\lambda=3.0$ at $N=48$) to maintain comparable load pressure. Each workflow is a randomly generated DAG containing 3–8 primitives (uniform). Computation demands are drawn from $\mathcal{N}(10.0,3.0^{2})$ truncated below at 1.0. Memory demands follow $\mathcal{N}(16.0,8.0^{2})$ clipped to $[1.0,80.0]$. Inter-task data volumes follow $\mathcal{N}(0.5,0.2^{2})$ GB with a floor of 0.01 GB. Cross-server communication latency equals data size divided by bandwidth (10 GB/s) plus 0.01 s base latency; intra-server transfers are free.
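The workload model above can be sketched as a small NumPy generator. This is an illustrative reading of the setup, not the authors' simulator: the function names are hypothetical, "truncated below at 1.0" is interpreted as clipping at 1.0 rather than resampling, and DAG edge generation is left out because the text does not specify it.

```python
import numpy as np

def arrival_rate(n_servers, base_rate=1.0, base_n=16):
    # Rate scales linearly with cluster size: 1.0 at N=16, 2.0 at N=32, 3.0 at N=48.
    return base_rate * n_servers / base_n

def sample_workflow(rng):
    n = int(rng.integers(3, 9))                                  # 3 to 8 primitives, uniform
    compute = np.maximum(rng.normal(10.0, 3.0, size=n), 1.0)     # floor at 1.0 (assumed clipping)
    memory = np.clip(rng.normal(16.0, 8.0, size=n), 1.0, 80.0)   # clipped to [1, 80]
    return compute, memory

def sample_data_gb(rng):
    # Inter-task data volume in GB with a floor of 0.01 GB.
    return max(float(rng.normal(0.5, 0.2)), 0.01)

def comm_latency(data_gb, same_server, bandwidth=10.0, base_latency=0.01):
    # Intra-server transfers are free; otherwise size / bandwidth plus base latency.
    return 0.0 if same_server else data_gb / bandwidth + base_latency

rng = np.random.default_rng(42)
compute, memory = sample_workflow(rng)
volume = sample_data_gb(rng)
gap = rng.exponential(1.0 / arrival_rate(32))    # Poisson arrivals: exponential gaps
print(len(compute), volume, gap, comm_latency(0.5, same_server=False))
```

For the mean data volume of 0.5 GB, a cross-server transfer costs $0.5/10 + 0.01 = 0.06$ s, which gives a sense of how communication compares with the sampled computation demands.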

Table 1: Heterogeneous Server Configuration

#### 5.1.3 Training and Evaluation

All models are trained on the $N=16$ cluster using PPO. Table [2](https://arxiv.org/html/2606.06820#S5.T2) summarizes the key hyperparameters. For the cross-attention architecture, the embedding dimension is 16, with 4 attention heads and 2 cross-attention layers. The SRR regularization activates at epoch 10 with linear warmup over 50 epochs. Four parallel environments are used for data collection with a fixed random seed of 42. Each configuration is evaluated over 30 episodes with fixed random seeds, reporting the mean and standard deviation of the average response time as the primary metric.

Table 2: Training Hyperparameters

### 5.2 Baselines

We compare SCALE against five baselines:

- 1\. CrossAttn: The same cross-attention pointer network but without SRR, isolating the regularization's contribution.
- 2\. PPO: Standard PPO with a structured MLP policy that processes primitive and server features separately before concatenation. Output dimension is fixed at $N=16$; it cannot run on larger clusters.
- 3\. PPO+SRR: PPO augmented with the SRR loss. The output dimension remains fixed, so it still cannot scale, but it tests whether SRR helps a non-scalable architecture.
- 4\. DQN: Deep Q-Network with $\epsilon$-greedy exploration and dueling architecture.
- 5\. SAC: Soft Actor-Critic with entropy-regularized objective.

MLP-based methods (PPO, PPO+SRR, DQN, SAC) have fixed output dimensionality and can only run at $N=16$. Cross-attention methods support zero-shot deployment at arbitrary cluster sizes.

### 5.3 Main Results

Table 3: 16-Node Performance Comparison

Table [3](https://arxiv.org/html/2606.06820#S5.T3) compares all six methods at $N=16$. SCALE completes the most tasks per episode (8.37) at a response time of 4.58 s; CrossAttn follows closely with 8.10 tasks. SAC lands in between at 5.93 tasks. PPO, PPO+SRR, and DQN all hover around 4 tasks per episode, with DQN showing the worst response time (4.88 s). Adding SRR to PPO slightly improves throughput (4.03 $\to$ 4.30) without degrading response time, indicating that the regularization provides some benefit even on architectures that lack scale flexibility.

![Refer to caption](https://arxiv.org/html/2606.06820v1/x3.png)

Figure 3: 16-node completed primitives comparison.

![Refer to caption](https://arxiv.org/html/2606.06820v1/x4.png)

Figure 4: 16-node average response time comparison.

### 5.4 Zero-Shot Scale Generalization

Table 4: Zero-Shot Generalization.

![Refer to caption](https://arxiv.org/html/2606.06820v1/x5.png)

Figure 5: Average response time across cluster scales. SCALE maintains stable response times as the cluster grows from 16 to 48 nodes, while CrossAttn degrades more rapidly.

Table [4](https://arxiv.org/html/2606.06820#S5.T4) reports response time for the two cross-attention variants across scales (MLP-based methods cannot run at $N\neq 16$). At the training scale, the two are nearly tied: 4.58 s vs. 4.65 s. A gap opens at $N=32$ (5.57 vs. 5.70) and widens at $N=48$, where SCALE achieves 6.22 s against CrossAttn's 6.83 s, an 8.9% reduction. The pattern is consistent with feature drift accumulating as $N$ grows: without SRR, the pointer scores lose calibration; with SRR, they remain stable.
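The relative reductions behind these numbers can be reproduced with a few lines of Python. The response times are the ones quoted in the paragraph above; the helper name and dictionary layout are only for illustration.

```python
def relative_reduction(baseline, improved):
    # Fraction by which `improved` undercuts `baseline`.
    return (baseline - improved) / baseline

# Cluster size -> (CrossAttn response time, SCALE response time), in seconds.
results = {16: (4.65, 4.58), 32: (5.70, 5.57), 48: (6.83, 6.22)}
reductions = {n: relative_reduction(base, ours) for n, (base, ours) in results.items()}

for n, r in reductions.items():
    print(f"N={n}: {100 * r:.1f}% lower response time with SRR")
```

The gap grows from about 1.5% at $N=16$ to about 2.3% at $N=32$ and 8.9% at $N=48$, which is the widening trend the text describes.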

### 5.5 Training Results

![Refer to caption](https://arxiv.org/html/2606.06820v1/x6.png)

Figure 6: Training reward curves of all methods.

Fig. [6](https://arxiv.org/html/2606.06820#S5.F6) shows training curves. All methods rise quickly in the first 10% of training. CrossAttn then continues climbing with large fluctuations, eventually exceeding 1.1, likely overfitting to the 16-node configuration. SCALE plateaus near 1.0 with much smaller variance, consistent with SRR constraining the representation space. PPO and PPO+SRR converge between 0.93 and 0.95, limited by MLP capacity. SAC converges slowly to about 0.6, and DQN is the slowest and most unstable, matching its weak evaluation performance.

### 5.6 Attention Feature Quality Analysis

Table 5: Attention Feature Statistics

To see why SRR helps, we examine the statistics of the attention feature $\boldsymbol{f}\in\mathbb{R}^{d}$ directly.

Table [5](https://arxiv.org/html/2606.06820#S5.T5) reports two metrics across scales. Off-diagonal covariance measures inter-dimensional coupling. SCALE keeps this below 0.05 at all three scales; CrossAttn ranges from 0.09 to 0.27, indicating much stronger correlation among its feature dimensions. CrossAttn does have lower KL divergence at $N=16$ (its features happen to look roughly Gaussian without being forced to), but this property breaks down at larger $N$, which is exactly the failure mode we target. What matters for generalization is decorrelation: when dimensions are independent, a change in $N$ perturbs each dimension locally instead of cascading through the entire vector.

## 6 Discussion

### 6.1 Interpretation of Results

SCALE generalizes from $N=16$ to $N=48$ without retraining, cutting average response time by 8.9% relative to the unregularized baseline at the largest tested scale. At the training scale itself ($N=16$), SCALE does not outperform PPO or SAC on per-task latency: it completes more tasks per episode but at slightly higher per-task response time. This trade-off is expected: SRR constrains the representation in ways that sacrifice some in-distribution optimality for out-of-distribution robustness. The training curves corroborate this reading, since CrossAttn without SRR reaches higher reward but with large variance, consistent with overfitting to the 16-node setting.

### 6.2 Limitations

Several limitations should be noted. Our evaluation reaches only $N=48$ ($3\times$ the training size); behavior at $N=128$ or beyond is untested. The generalization experiments hold the server-type ratio (Small:Medium:Large) constant, so a cluster with a different hardware mix could behave differently. The DAGs are randomly generated with limited topological variety, whereas real agentic workflows may exhibit deeper chains or wider fan-outs. The first-level task selector uses a simple longest-path heuristic rather than a learned policy. All results are from simulation; deployment on a physical cluster running an actual agentic framework (e.g., LangGraph) would be needed to validate practical utility.

### 6.3 Future Work

The most immediate extension is testing at much larger scales ($N\geq 128$), where sparse attention approximations may become necessary for fast inference. Replacing the heuristic first-level scheduler with a learned joint task-and-server selection policy is another natural direction. Deployment inside a real agentic framework on physical hardware would clarify how well these simulation results transfer.

## 7 Conclusion

We presented SCALE, a DRL scheduler for agentic workflow DAGs that trains once on a small cluster and deploys on larger ones without retraining. The architecture (cross-attention between a task query and a variable-size server key set, with pointer-network scoring) handles arbitrary $N$ by construction. Our main finding is that architectural flexibility alone is insufficient: without SRR, performance degrades as $N$ grows due to distribution shift in the attention feature. The decorrelation and KL-to-normal penalties in SRR stabilize these statistics, producing an 8.9% response-time reduction over the unregularized baseline at $N=48$.

For elastic clusters where nodes are added or removed frequently, a scheduler requires both a size\-agnostic architecture and explicit regularization of its internal representations\. SCALE demonstrates this principle, though validation at much larger scales and on physical hardware remains to be done\.

## Author Biographies

![[Uncaptioned image]](https://arxiv.org/html/2606.06820v1/x7.png)

Zhifei Xu is currently pursuing a bachelor's degree in the Faculty of Arts and Sciences, Beijing Normal University at Zhuhai, China. His main research interests include edge computing, reinforcement learning, and their applications.

![[Uncaptioned image]](https://arxiv.org/html/2606.06820v1/x8.png)

Jierui Lan is currently pursuing a bachelor's degree in Applied Statistics at Beijing Normal University at Zhuhai, China. His main research interests include representation learning, optimization algorithms, reinforcement learning, and their applications.

![[Uncaptioned image]](https://arxiv.org/html/2606.06820v1/x9.png)

Zixuan Liang is currently pursuing a bachelor's degree in Applied Statistics at Beijing Normal University at Zhuhai, China. His main research interests include multiple testing, reinforcement learning, and their applications.

![[Uncaptioned image]](https://arxiv.org/html/2606.06820v1/x10.png)

Aiji Liang is currently pursuing a bachelor's degree in the Faculty of Arts and Sciences, Beijing Normal University at Zhuhai, China. His main research interests include multi-agent systems, edge computing, and their applications.

![[Uncaptioned image]](https://arxiv.org/html/2606.06820v1/x11.png)

Jinxi He is currently an undergraduate student in the School of Arts and Sciences at Beijing Normal University at Zhuhai, China. His research interests primarily focus on large language models, multi-agent systems, and real-time voice interaction.
