SkillFlow: Flow-Driven Recursive Skill Evolution for Agentic Orchestration
Summary
SkillFlow proposes a flow-driven recursive skill evolution framework for LLM-based agentic orchestration, using Tempered Trajectory Balance to prevent strategy collapse and provide transparent credit assignment. Experiments on 14 datasets show significant improvements over baselines in QA, math, code, and decision-making tasks.
# Flow-Driven Recursive Skill Evolution for Agentic Orchestration
Source: [https://arxiv.org/html/2605.14089](https://arxiv.org/html/2605.14089)
Mingda Zhang¹, Tiesunlong Shen², Haoran Luo³, Wenjin Liu³, Zikai Xiao⁴, Erik Cambria³, Xiaoying Tang¹
¹The Chinese University of Hong Kong, Shenzhen; ²National University of Singapore; ³Nanyang Technological University; ⁴Zhejiang University
###### Abstract
In recent years, a variety of powerful LLM-based agentic systems have been applied to automate complex tasks through task orchestration. However, existing orchestration methods still face key challenges, including strategy collapse under reward maximization, high gradient variance with opaque credit assignment, and unguided skill evolution whose decisions are typically made by directly prompting an LLM to judge rather than derived from principled training signals. To address these challenges, we propose SkillFlow, a flow-based framework in which a trainable Supervisor acts as the agent within a structured environment comprising a dynamic skill library and a frozen executor, automating task orchestration through multi-turn interaction. SkillFlow employs Tempered Trajectory Balance (TTB), a regression-based flow-matching loss under which trajectories are sampled in proportion to reward, preserving diverse orchestration strategies rather than collapsing to a single mode. The same flow objective yields a jointly learned backward policy that provides transparent per-step credit assignment at zero additional inference cost. Building on these flow diagnostics, a recursive skill evolution mechanism determines *when* to evolve, *what* skills to create or prune, and *where* decision gaps lie, closing the loop from training signal to autonomous capability growth. Experimental results on 14 datasets show that SkillFlow significantly outperforms baselines across question answering, mathematical reasoning, code generation, and real-world interactive decision making tasks. Our code is available at [https://anonymous.4open.science/r/SkillFlow-E850](https://anonymous.4open.science/r/SkillFlow-E850).
## 1 Introduction
Figure 1: SkillFlow at a glance. Left: before training, flow on the orchestration DAG is uniform. Right: after TTB, flow concentrates on reward-rich paths (colormap = flow magnitude). Inset: each edge is one action ($K_t$ tokens).
In recent years, a variety of powerful LLM-based agentic systems have been applied to solve a wide range of complex tasks (Yao et al., [2022b](https://arxiv.org/html/2605.14089#bib.bib13); Hong et al., [2023](https://arxiv.org/html/2605.14089#bib.bib19); Wang et al., [2024b](https://arxiv.org/html/2605.14089#bib.bib30); Dang et al., [2025](https://arxiv.org/html/2605.14089#bib.bib34)), gradually moving beyond single-turn question answering toward executable end-to-end task completion.
Figure 2: Three orchestration paradigms. (a) Heuristic dispatch over a locked library; (b) learning-based orchestration with terminal reward, prone to trajectory-level mode collapse; (c) SkillFlow co-trains a forward and a backward policy under TTB and uses flow diagnostics to drive recursive skill evolution.
In this process, task orchestration has become a key bridge from task goals to reproducible execution: by organizing primitive actions and reusable skills into structured trajectories on an action-level orchestration DAG (Fig. [1](https://arxiv.org/html/2605.14089#S1.F1), left), where each node is an interaction history and each edge is one orchestration action, agents can complete complex tasks with improved controllability, compositionality, and reusability (Zeng et al., [2024](https://arxiv.org/html/2605.14089#bib.bib17); Qian et al., [2024](https://arxiv.org/html/2605.14089#bib.bib25)). However, in practice, orchestration still heavily relies on manual skill design and rigid action definitions (Hu et al., [2024](https://arxiv.org/html/2605.14089#bib.bib26); Xu and Yan, [2026](https://arxiv.org/html/2605.14089#bib.bib52)), making it costly to transfer across new tasks, new skill configurations, or different agent capabilities.
To address these issues, early approaches rely on heuristic orchestration (Fig. [2](https://arxiv.org/html/2605.14089#S1.F2)a), retrieving skills from prebuilt libraries or searching over workflow graphs at test time, e.g., matching a tool to each subtask by description, or using tree search to explore multi-step action compositions (Wang et al., [2023](https://arxiv.org/html/2605.14089#bib.bib15), [2024a](https://arxiv.org/html/2605.14089#bib.bib24); Zhang et al., [2024](https://arxiv.org/html/2605.14089#bib.bib29); Zhuge et al., [2024](https://arxiv.org/html/2605.14089#bib.bib21)). Their orchestration policies, however, remain bound to pre-designed heuristics and fixed skill repertoires, with no mechanism to adapt from execution outcomes. Recent work has shifted to *learning-based orchestration* (Fig. [2](https://arxiv.org/html/2605.14089#S1.F2)b) (Zeng et al., [2024](https://arxiv.org/html/2605.14089#bib.bib17); Chen et al., [2023](https://arxiv.org/html/2605.14089#bib.bib16); Shao et al., [2024](https://arxiv.org/html/2605.14089#bib.bib20); Guo et al., [2025](https://arxiv.org/html/2605.14089#bib.bib31); Li et al., [2025b](https://arxiv.org/html/2605.14089#bib.bib67); Zhang et al., [2026b](https://arxiv.org/html/2605.14089#bib.bib48); Kong et al., [2026](https://arxiv.org/html/2605.14089#bib.bib66); Wang et al., [2025b](https://arxiv.org/html/2605.14089#bib.bib44); Xia et al., [2026](https://arxiv.org/html/2605.14089#bib.bib57)), which trains orchestration policies directly from task-completion signals: sampling orchestration trajectories, collecting terminal rewards such as answer correctness or task success rate, and updating the orchestration model accordingly.
However, these methods still face three challenges. (i) Strategy collapse. REINFORCE-family objectives, which uniformly raise or lower all action probabilities along a trajectory based on a single terminal reward, converge to a single reward-maximizing mode, forfeiting diverse, equally effective strategies (Yu et al., [2025](https://arxiv.org/html/2605.14089#bib.bib32); Guo et al., [2025](https://arxiv.org/html/2605.14089#bib.bib31); Zhang et al., [2026b](https://arxiv.org/html/2605.14089#bib.bib48); Kong et al., [2026](https://arxiv.org/html/2605.14089#bib.bib66); Wang et al., [2025b](https://arxiv.org/html/2605.14089#bib.bib44)). This brittleness is amplified when executor capabilities evolve: once a tool is upgraded, the agent has no suitable fallback strategy (Zhang et al., [2023](https://arxiv.org/html/2605.14089#bib.bib14); Li et al., [2025a](https://arxiv.org/html/2605.14089#bib.bib46)). (ii) High gradient variance and opaque credit assignment. Terminal-only rewards create long credit-assignment horizons, where policy-gradient variance and intra-group advantage collapse make per-step attribution opaque: when a multi-step trajectory succeeds or fails, it remains unclear which step is responsible (Schulman et al., [2015](https://arxiv.org/html/2605.14089#bib.bib9); Tan et al., [2026](https://arxiv.org/html/2605.14089#bib.bib50); Wang et al., [2026c](https://arxiv.org/html/2605.14089#bib.bib54); Li et al., [2025b](https://arxiv.org/html/2605.14089#bib.bib67); Wang et al., [2025b](https://arxiv.org/html/2605.14089#bib.bib44)). (iii) Unguided skill evolution. Existing frameworks with dynamic skill libraries rely on heuristic triggers, fixed schedules, or direct LLM-as-judge prompting, lacking a principled signal for *when* to update the library, *what* skills to prune or create, or *where* the critical decision points lie, so the library evolves blindly (Wang et al., [2023](https://arxiv.org/html/2605.14089#bib.bib15); Xia et al., [2026](https://arxiv.org/html/2605.14089#bib.bib57); Fang et al., [2025](https://arxiv.org/html/2605.14089#bib.bib39); Xu and Yan, [2026](https://arxiv.org/html/2605.14089#bib.bib52); Alzubi et al., [2026](https://arxiv.org/html/2605.14089#bib.bib58); Zhang et al., [2026a](https://arxiv.org/html/2605.14089#bib.bib59); Wang et al., [2025a](https://arxiv.org/html/2605.14089#bib.bib43), [2026a](https://arxiv.org/html/2605.14089#bib.bib60)).
To address these challenges, we propose SkillFlow (Fig. [2](https://arxiv.org/html/2605.14089#S1.F2)c), a flow-based framework (Bengio et al., [2021](https://arxiv.org/html/2605.14089#bib.bib10), [2023](https://arxiv.org/html/2605.14089#bib.bib12); Malkin et al., [2022](https://arxiv.org/html/2605.14089#bib.bib11)) for general task orchestration with recursive skill evolution. A trainable Supervisor interacts with a structured environment containing a dynamic skill library and a frozen executor. At the core of SkillFlow (Fig. [1](https://arxiv.org/html/2605.14089#S1.F1)) is Tempered Trajectory Balance (TTB), a regression-style flow-matching loss that drives each trajectory’s sampling probability to be *proportional to its reward*, rather than concentrated on a single best outcome. In contrast, REINFORCE-family objectives push nearly all probability mass onto one reward-maximizing trajectory; reward-proportional sampling instead keeps multiple high-reward sub-trajectories (distinct successful paths through the orchestration DAG) alive under a single loss, preserving strategic diversity. The same loss jointly trains a *backward policy* that, once an orchestration terminates, attributes the trajectory-level outcome to its individual steps at zero additional inference cost, flagging which decisions actually drove success and which were incidental. Building on these per-step and per-skill credit signals, a recursive skill evolution mechanism answers three questions directly from the training signal itself: *when* the current library starts limiting performance (signaled by TTB convergence), *what* skills to create or prune (ranked by the skill marginal flow $\hat{F}(s)$), and *where* decision gaps lie (localized by the step importance $I(t)$), closing the loop from training signal to autonomous capability growth.
We evaluate on benchmarks across question answering, mathematical reasoning, interactive decision making, and code generation. Results show SkillFlow outperforms direct LLMs, REINFORCE-style RL baselines (Li et al., [2025b](https://arxiv.org/html/2605.14089#bib.bib67); Zhang et al., [2026b](https://arxiv.org/html/2605.14089#bib.bib48); Xia et al., [2026](https://arxiv.org/html/2605.14089#bib.bib57)), and skill-evolution methods (Jiang et al., [2026a](https://arxiv.org/html/2605.14089#bib.bib61); Yang et al., [2026b](https://arxiv.org/html/2605.14089#bib.bib62); Ma et al., [2026](https://arxiv.org/html/2605.14089#bib.bib63); Alzubi et al., [2026](https://arxiv.org/html/2605.14089#bib.bib58)) in task accuracy, strategy diversity, and orchestration cost, providing a foundation for self-evolving, flow-based agentic orchestration.
## 2 Related Work
Agent Task Orchestration. LLM-era task orchestration has evolved from rule-based automation to feedback-driven plan–act–feedback loops (Yao et al., [2022b](https://arxiv.org/html/2605.14089#bib.bib13); Wang et al., [2024a](https://arxiv.org/html/2605.14089#bib.bib24), [b](https://arxiv.org/html/2605.14089#bib.bib30)). Existing work spans single-agent sequential decision making (Qin et al., [2023](https://arxiv.org/html/2605.14089#bib.bib23); Yao et al., [2022b](https://arxiv.org/html/2605.14089#bib.bib13)), LLM-controller-based routing and constrained API planning (Ong et al., [2024](https://arxiv.org/html/2605.14089#bib.bib27); Wang et al., [2024a](https://arxiv.org/html/2605.14089#bib.bib24); Su et al., [2026](https://arxiv.org/html/2605.14089#bib.bib40)), and multi-agent SOP/role collaboration (Hong et al., [2023](https://arxiv.org/html/2605.14089#bib.bib19); Wu et al., [2024](https://arxiv.org/html/2605.14089#bib.bib18); Dang et al., [2025](https://arxiv.org/html/2605.14089#bib.bib34)). Reusable skill memory (Wang et al., [2023](https://arxiv.org/html/2605.14089#bib.bib15); Xu and Yan, [2026](https://arxiv.org/html/2605.14089#bib.bib52); Jiang et al., [2026b](https://arxiv.org/html/2605.14089#bib.bib65); Wang et al., [2026a](https://arxiv.org/html/2605.14089#bib.bib60)) and self-evolving architectures (Fang et al., [2025](https://arxiv.org/html/2605.14089#bib.bib39); Li, [2026](https://arxiv.org/html/2605.14089#bib.bib53); Alzubi et al., [2026](https://arxiv.org/html/2605.14089#bib.bib58); Zhang et al., [2026a](https://arxiv.org/html/2605.14089#bib.bib59); Wang et al., [2025a](https://arxiv.org/html/2605.14089#bib.bib43), [2026b](https://arxiv.org/html/2605.14089#bib.bib64)) further improve efficiency, yet all these approaches operate with fixed skill sets and lack a principled mechanism to expand the action space during training. SkillFlow fills this gap by learning orchestration from execution feedback while autonomously evolving its skill repertoire.
Reinforcement Learning for Agents. Agent RL models multi-turn interaction as a long-horizon MDP (Zhou et al., [2023](https://arxiv.org/html/2605.14089#bib.bib22); Chen et al., [2025](https://arxiv.org/html/2605.14089#bib.bib41)). Credit-assignment advances include implicit step rewards (Liu et al., [2025](https://arxiv.org/html/2605.14089#bib.bib42)), hindsight advantage estimation (Tan et al., [2026](https://arxiv.org/html/2605.14089#bib.bib50)), plan-execute decomposition (Peng et al., [2026](https://arxiv.org/html/2605.14089#bib.bib51)), progressive reward shaping (Su et al., [2025](https://arxiv.org/html/2605.14089#bib.bib36)), and process reward models (Zhang et al., [2025](https://arxiv.org/html/2605.14089#bib.bib37); She et al., [2025](https://arxiv.org/html/2605.14089#bib.bib38)). Policy optimization has converged on GRPO-style objectives (Guo et al., [2025](https://arxiv.org/html/2605.14089#bib.bib31); Shao et al., [2024](https://arxiv.org/html/2605.14089#bib.bib20)) with extensions (DAPO (Yu et al., [2025](https://arxiv.org/html/2605.14089#bib.bib32)), VAPO (Yue et al., [2025](https://arxiv.org/html/2605.14089#bib.bib33)), multi-agent variants (Liu et al., [2026](https://arxiv.org/html/2605.14089#bib.bib35); Cang et al., [2026](https://arxiv.org/html/2605.14089#bib.bib55))), also applied to workflow orchestration (Zhang et al., [2026b](https://arxiv.org/html/2605.14089#bib.bib48); Li et al., [2025b](https://arxiv.org/html/2605.14089#bib.bib67)) and search-augmented reasoning (Jin et al., [2025](https://arxiv.org/html/2605.14089#bib.bib45); Yang et al., [2026a](https://arxiv.org/html/2605.14089#bib.bib56)). All rely on REINFORCE-family objectives that converge to a single mode, lacking diversity-preserving signals (Li et al., [2025a](https://arxiv.org/html/2605.14089#bib.bib46)). While reward-matching flow training has been explored in LLM reasoning (Yu et al., [2024](https://arxiv.org/html/2605.14089#bib.bib28)) and robust scheduling (Zhang et al., [2023](https://arxiv.org/html/2605.14089#bib.bib14)), SkillFlow targets a different setting, multi-turn agentic orchestration with a co-evolving skill library, and contributes reward-proportional trajectory sampling together with zero-cost per-step credit.
## 3 Preliminaries
Definition 1: Orchestration State Graph. We model the task orchestration process as a directed acyclic graph $\mathcal{G}=(\mathcal{V},\mathcal{A})$, where each vertex $H_t\in\mathcal{V}$ represents an interaction history and each edge $(H_{t-1},H_t)\in\mathcal{A}$ corresponds to an orchestration action. Acyclicity follows from the state update rule $H_t=H_{t-1}\oplus(r_t,a_t,o_t^{\text{exec}})$, which yields strict history growth $|H_t|>|H_{t-1}|$.
Definition 2: Orchestration Trajectory. A complete path through $\mathcal{G}$ from initial state $H_0$ to a terminal state defines an orchestration trajectory:
$$\tau=\bigl\{(r_t,\;a_t,\;o_t^{\text{exec}})\bigr\}_{t=1}^{T}\;\Rightarrow\;y_q, \tag{1}$$
where $r_t$ denotes the reasoning reflection at step $t$, $a_t=(\alpha_t,o_t)$ is the action with type $\alpha_t\in\{\texttt{skill},\,\texttt{act},\,\texttt{accept}\}$ and parameters $o_t$, and $o_t^{\text{exec}}$ is the execution feedback. The episode terminates when $\alpha_t=\texttt{accept}$ or $t=T_{\max}$.
Problem Statement. Given task $q\in\mathcal{D}$, environment $\mathcal{E}$, and executor $\mathcal{M}_{\text{exec}}$, we augment $\mathcal{G}$ with a non-negative flow function $F:\mathcal{V}\to\mathbb{R}_{\geq 0}$ satisfying conservation (incoming flow equals outgoing flow at each non-terminal state), with terminal condition $F(x)=\tilde{R}(\tau_x)^{\beta}$. The resulting flow network (Bengio et al., [2021](https://arxiv.org/html/2605.14089#bib.bib10), [2023](https://arxiv.org/html/2605.14089#bib.bib12)) (formal foundations in Appendix [A](https://arxiv.org/html/2605.14089#A1)) induces a policy that samples trajectories in proportion to reward:
$$\pi^{*}(\tau\mid q)\propto\tilde{R}(\tau)^{\beta},\quad\tilde{R}(\tau)=R(\tau)+\varepsilon_{\min}, \tag{2}$$
where $R(\tau)$ is task completion quality (per-task definitions in Appendix [G](https://arxiv.org/html/2605.14089#A7)), $\varepsilon_{\min}>0$ is a small constant that shifts rewards to be strictly positive, a prerequisite of flow networks, so that even zero-reward trajectories receive non-zero flow (details in Appendix [H](https://arxiv.org/html/2605.14089#A8)), and $\beta>0$ controls the diversity–quality tradeoff. This is equivalent to entropy-regularized RL with temperature $T=1/\beta$ (Appendix [C](https://arxiv.org/html/2605.14089#A3)).
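To make Eq. 2 concrete, the following minimal Python sketch normalizes tempered rewards over a small, hand-picked set of candidate trajectories. The reward values, the default `eps_min = 0.01`, and the function name `tempered_target` are illustrative assumptions, not values or identifiers from the paper.

```python
def tempered_target(rewards, beta, eps_min=0.01):
    """Return pi*(tau) proportional to (R(tau) + eps_min) ** beta over a finite candidate set."""
    if beta <= 0 or eps_min <= 0:
        raise ValueError("beta and eps_min must be strictly positive")
    weights = [(r + eps_min) ** beta for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical rewards for four candidate trajectories of one task.
rewards = [1.0, 0.9, 0.5, 0.0]

for beta in (0.5, 1.0, 4.0):
    probs = tempered_target(rewards, beta)
    print(beta, [round(p, 3) for p in probs])
```

In this toy setting, a larger $\beta$ shifts mass toward the highest-reward trajectory, a smaller $\beta$ flattens the distribution, and the zero-reward trajectory keeps a small non-zero probability because of $\varepsilon_{\min}$.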
Figure 3: SkillFlow architecture. The Supervisor rolls out a tree-structured DAG against a frozen Executor; TTB jointly trains the forward and hindsight backward policies via a flow-matching residual loss; flow diagnostics ($\bar{\Delta}^{*(k)}$, $\hat{F}(s)$, $I(t)$) drive recursive skill curation at phase boundaries.
## 4 Methodology: SkillFlow
As illustrated in Figure [3](https://arxiv.org/html/2605.14089#S3.F3), this section introduces the SkillFlow framework, including environment design and task modeling (Section [4.1](https://arxiv.org/html/2605.14089#S4.SS1)), flow-based end-to-end training (Section [4.2](https://arxiv.org/html/2605.14089#S4.SS2)), and flow-driven recursive skill evolution (Section [4.3](https://arxiv.org/html/2605.14089#S4.SS3)).
### 4.1 Environment Design and Task Modeling
SkillFlow follows a Supervisor-Executor paradigm (Yao et al., [2022b](https://arxiv.org/html/2605.14089#bib.bib13)): a trainable Supervisor $\pi_\theta$ interacts with the structured environment $\mathcal{E}$ to construct orchestration trajectories. Unlike static frameworks that operate on a fixed skill set, SkillFlow’s environment carries a dynamic skill library that co-evolves with the policy through the curation operator $\Phi$ formalised in §[4.3](https://arxiv.org/html/2605.14089#S4.SS3).
Structured Environment $\mathcal{E}$. The environment maintains the skill library, skill creator, and executor:
$$\mathcal{E}=(\mathcal{S},\;\Psi,\;\mathcal{M}_{\text{exec}}), \tag{3}$$
where $\mathcal{S}$ is the dynamic skill library, $\Psi$ is the Skill Creator that evolves $\mathcal{S}$ via $s_{\text{new}}=\Psi(c,\mathcal{T},\mathcal{S})$ using creation context $c$ and trajectory evidence $\mathcal{T}$ (Section [4.3](https://arxiv.org/html/2605.14089#S4.SS3)), and $\mathcal{M}_{\text{exec}}$ is the pluggable executor. $\Psi$ updates $\mathcal{S}$ only at phase boundaries, keeping the action space constant within each training phase.
State Space $\mathcal{H}$ and Orchestration Target. The initial state concatenates the task with retrieved context: $H_0=[q\oplus\mathcal{S}_{\text{ret}}\oplus\omega_q]$, where $\mathcal{S}_{\text{ret}}$ are retrieved skills and $\omega_q$ is a task-category-specific orchestration guideline. The state evolves via
$$H_t=H_{t-1}\oplus(r_t,\,a_t,\,o_t^{\text{exec}}). \tag{4}$$
The Supervisor interacts with $\mathcal{E}$ until $\alpha_t=\texttt{accept}$ (early termination) or $t=T_{\max}$ (budget exhausted); the resulting trajectory $\tau$ determines the answer $y_q$ via the terminal action’s output.
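The interaction loop described above can be sketched as follows. This is a toy illustration under stated assumptions: `supervisor_step` and `executor` are hypothetical stand-ins for the Supervisor $\pi_\theta$ and the frozen executor, the history is a plain Python list, and `T_MAX = 4` is an arbitrary budget.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

T_MAX = 4  # illustrative step budget, not a value from the paper


@dataclass(frozen=True)
class Step:
    reasoning: str      # r_t
    action_type: str    # alpha_t in {"skill", "act", "accept"}
    params: str         # o_t
    feedback: str       # o_t^exec


def rollout(h0: List[str],
            supervisor_step: Callable[[list], Tuple[str, str, str]],
            executor: Callable[[str, str], str],
            t_max: int = T_MAX) -> List[Step]:
    """Build H_t = H_{t-1} (+) (r_t, a_t, o_t^exec) until accept or the budget runs out."""
    history: list = list(h0)          # H_0 = [q, retrieved skills, guideline]
    trajectory: List[Step] = []
    for _ in range(t_max):
        reasoning, action_type, params = supervisor_step(history)
        feedback = executor(action_type, params)
        step = Step(reasoning, action_type, params, feedback)
        history.append(step)          # strict history growth keeps the graph acyclic
        trajectory.append(step)
        if action_type == "accept":   # early termination
            break
    return trajectory


# Hypothetical scripted Supervisor: call a skill, act once, then accept.
def scripted_supervisor(history):
    script = [("recall a tip", "skill", "decompose-question"),
              ("look it up", "act", "search('capital of France')"),
              ("answer found", "accept", "Paris")]
    n_prior_steps = sum(isinstance(item, Step) for item in history)
    return script[min(n_prior_steps, len(script) - 1)]


def echo_executor(action_type, params):
    return f"executed {action_type}: {params}"


traj = rollout(["q: capital of France?", "skills: []", "guideline: QA"],
               scripted_supervisor, echo_executor)
print(len(traj), traj[-1].action_type, traj[-1].params)
```

In this sketch the answer $y_q$ would be read from the parameters of the terminal step; a Supervisor that never emits `accept` is simply cut off after `t_max` steps.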
Step-wise Orchestration Policy. At each step $t$, the Supervisor generates reasoning $r_t$, selects action type $\alpha_t$, and produces parameters $o_t$, modeled as a hierarchical policy conditioned on $H_{t-1}$:
$$\pi_\theta(r_t,a_t\mid H_{t-1})=\pi_\theta(r_t\mid H_{t-1})\cdot\pi_\theta(\alpha_t\mid r_t,H_{t-1})\cdot\pi_\theta(o_t\mid\alpha_t,r_t,H_{t-1}). \tag{5}$$
Multi-Turn Interaction and Trajectory Distribution. Given action $a_t$, the executor returns feedback $o_t^{\text{exec}}\sim\mathcal{C}_{\text{exec}}(\cdot\mid H_{t-1},a_t)$, the state evolves via Eq. [4](https://arxiv.org/html/2605.14089#S4.E4), and the Supervisor continues until termination. Chaining the policy and executor draws over the $T$ steps yields the trajectory distribution:
$$P_\theta(\tau)=\prod_{t=1}^{T}\bigl[\pi_\theta(r_t,a_t\mid H_{t-1})\cdot\mathcal{C}_{\text{exec}}(o_t^{\text{exec}}\mid H_{t-1},a_t)\bigr], \tag{6}$$
where only $\pi_\theta$ is trainable and $\mathcal{C}_{\text{exec}}$ is frozen.
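Taking logarithms of Eq. 6 makes the role of the frozen executor explicit. The short derivation below is a worked restatement in the paper's own notation; it assumes only that $\mathcal{C}_{\text{exec}}$ does not depend on $\theta$, as stated above.

```latex
\begin{aligned}
\log P_\theta(\tau)
  &= \sum_{t=1}^{T}\log\pi_\theta(r_t,a_t\mid H_{t-1})
   + \sum_{t=1}^{T}\log\mathcal{C}_{\text{exec}}(o_t^{\text{exec}}\mid H_{t-1},a_t),\\
\nabla_\theta\log P_\theta(\tau)
  &= \sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(r_t,a_t\mid H_{t-1}).
\end{aligned}
```

The executor term is constant in $\theta$, so gradients flow only through the Supervisor's own log-probabilities, and executor likelihoods never need to be evaluated during training.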
###### Proposition 1.
The Supervisor–Executor environment admits a flow-conservative DAG structure suitable for end-to-end flow-based training.
*Proof.* Main-result gains (§[5.2](https://arxiv.org/html/2605.14089#S5.SS2), RQ1) and cross-backbone transferability (§[5.4](https://arxiv.org/html/2605.14089#S5.SS4), RQ3) provide empirical validation; formal acyclicity is established in Appendix [E](https://arxiv.org/html/2605.14089#A5). $\square$
### 4.2 Flow-Based End-to-End Training
Building on the DAG structure of $\mathcal{G}$ (Proposition [1](https://arxiv.org/html/2605.14089#Thmtheorem1)), we train the flow network introduced in §3 to make the action-sequence distribution reward-proportional given the realized reasoning and execution context: $\pi_\theta(a_{1:T}\mid r_{1:T},o^{\text{exec}}_{1:T},q)\propto\tilde{R}(\tau)^{\beta}$.
Forward Policy $P_F$ and Backward Policy $P_\phi$. Because $\mathcal{C}_{\text{exec}}$ is frozen and reasoning $r_t$ is fixed context, the forward policy reduces to action selection, while the backward policy $P_\phi$ conditions on the hindsight state $H_{t-1}\oplus o_t^{\text{exec}}$ to incorporate the execution observation unavailable to $\pi_\theta$:
$$P_F(H_t\mid H_{t-1})=\pi_\theta(a_t\mid r_t,H_{t-1}),\quad P_B(H_{t-1}\mid H_t)=P_\phi\!\left(a_t\;\middle|\;\underbrace{H_{t-1}\oplus o_t^{\text{exec}}}_{\text{hindsight state}}\right). \tag{7}$$
Tempered Trajectory Balance (TTB). Given $P_F$ and $P_\phi$, the (tempered, hindsight-conditioned) Trajectory Balance (TB) condition (Bengio et al., [2023](https://arxiv.org/html/2605.14089#bib.bib12)) requires for every trajectory:
$$\log Z_\theta(q)+\sum_{t=1}^{T}\log P_F(H_t\mid H_{t-1})=\beta\log\tilde{R}(\tau)+\sum_{t=1}^{T}\log P_B(H_{t-1}\mid H_t). \tag{8}$$
The TTB loss is the squared, length-normalized residual of this condition (Dall’Antonia et al., [2026](https://arxiv.org/html/2605.14089#bib.bib49)) (full derivation in Appendix [B](https://arxiv.org/html/2605.14089#A2)):
$$\begin{aligned}\Delta(\tau)&\coloneqq\log Z_\theta(q)+\sum_{t=1}^{T}\log\pi_\theta(a_t\mid r_t,H_{t-1})-\beta\log\tilde{R}(\tau)-\sum_{t=1}^{T}\log P_\phi\bigl(a_t\mid H_{t-1}\oplus o_t^{\text{exec}}\bigr),\\ \mathcal{L}_{\text{TTB}}(\tau)&=\bigl(\Delta(\tau)/T\bigr)^{2}.\end{aligned} \tag{9}$$
Here $T=|\tau|$, $Z_\theta(q)$ is a task-conditioned partition function, and each per-step log-probability is *per-token normalized* (details in Appendix [B.3](https://arxiv.org/html/2605.14089#A2.SS3)). At the optimum $\Delta(\tau)=0$, the conditional action-sequence distribution becomes reward-proportional, $\pi_\theta(a_{1:T}\mid r_{1:T},o^{\text{exec}}_{1:T},q)\propto\tilde{R}(\tau)^{\beta}$.
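As a numerical illustration of Eq. 9, the sketch below evaluates the residual and the loss for a single trajectory using plain floats. It is not the authors' training code: in practice the log-probabilities would be differentiable model outputs and $\log Z_\theta(q)$ a learned quantity. The function names, the sample values, and `eps_min = 0.01` are assumptions made for the example.

```python
import math


def ttb_residual(log_z, fwd_logps, bwd_logps, reward, beta, eps_min=0.01):
    """Delta(tau) from Eq. 9 for one trajectory.

    fwd_logps[t] stands for log pi_theta(a_t | r_t, H_{t-1}); bwd_logps[t] for
    log P_phi(a_t | H_{t-1} (+) o_t^exec). Both are assumed per-token normalized.
    """
    if len(fwd_logps) != len(bwd_logps) or not fwd_logps:
        raise ValueError("need one forward and one backward log-prob per step")
    log_r = math.log(reward + eps_min)  # log R~(tau), strictly positive argument
    return log_z + sum(fwd_logps) - beta * log_r - sum(bwd_logps)


def ttb_loss(log_z, fwd_logps, bwd_logps, reward, beta, eps_min=0.01):
    """Squared, length-normalized residual (Delta / T) ** 2."""
    delta = ttb_residual(log_z, fwd_logps, bwd_logps, reward, beta, eps_min)
    return (delta / len(fwd_logps)) ** 2


# Hypothetical three-step trajectory.
fwd = [-0.4, -1.1, -0.2]
bwd = [-0.3, -0.6, -0.1]
loss = ttb_loss(log_z=0.5, fwd_logps=fwd, bwd_logps=bwd, reward=1.0, beta=2.0)
print(round(loss, 6))
```

Dividing the residual by $T$ before squaring keeps long trajectories from dominating the loss purely because they accumulate more log-probability terms.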
Step Importance and Skill Marginal Flow. Each $H_t$ has a unique parent; $\mathcal{G}$ is therefore tree-structured, and TB convergence implies Detailed Balance $F(H)\,P_F(H'\mid H)=F(H')\,P_B(H\mid H')$ at every edge (formal treatment in Appendix [D](https://arxiv.org/html/2605.14089#A4)). Rearranging yields the step importance:
$$I(t)=\frac{F(H_t)}{F(H_{t-1})}=\frac{\pi_\theta(a_t\mid r_t,H_{t-1})}{P_\phi(a_t\mid H_{t-1}\oplus o_t^{\text{exec}})}, \tag{10}$$
which incurs no extra inference cost. The information asymmetry, in which $\pi_\theta$ decides without $o_t^{\text{exec}}$ while $P_\phi$ evaluates with it, makes $I(t)$ a credit signal: large $|\log I(t)|$ marks decisions whose appraisal shifted after execution. Telescoping from $F(H_0)=Z_\theta(q)$ gives the skill marginal flow:
$$\hat{F}(s)=\frac{1}{|\mathcal{B}_s|}\sum_{\tau\in\mathcal{B}_s}\;\sum_{t:\,a_t\text{ invokes }s}F(H_t),\qquad\log F(H_t)=\log Z_\theta(q)+\sum_{t'=1}^{t}\log I(t'), \tag{11}$$
where $\mathcal{B}_s\subseteq\mathcal{B}$ is the subset of trajectories invoking $s$ and the inner sum aggregates flow over each occurrence of $s$ within a trajectory.
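Eqs. 10 and 11 can be traced on a toy batch as follows. This is a minimal sketch, assuming a single shared `log_z` for all trajectories (the paper's $Z_\theta(q)$ is task-conditioned) and a hypothetical batch layout in which each trajectory lists its forward log-probabilities, backward log-probabilities, and the skill invoked at each step (`None` for non-skill actions).

```python
import math
from collections import defaultdict


def log_step_importance(fwd_logps, bwd_logps):
    """log I(t) = log pi_theta(a_t | ...) - log P_phi(a_t | hindsight state)."""
    return [f - b for f, b in zip(fwd_logps, bwd_logps)]


def log_flows(log_z, log_importance):
    """Telescoped log F(H_t) = log Z + sum_{t' <= t} log I(t') for t = 1..T."""
    out, running = [], log_z
    for li in log_importance:
        running += li
        out.append(running)
    return out


def skill_marginal_flow(batch, log_z):
    """F_hat(s): flow summed over every occurrence of s, averaged over |B_s|."""
    flow_sum = defaultdict(float)
    n_traj = defaultdict(int)
    for traj in batch:
        flows = log_flows(log_z, log_step_importance(traj["fwd"], traj["bwd"]))
        for skill, log_f in zip(traj["skills"], flows):
            if skill is not None:            # None marks a non-skill action
                flow_sum[skill] += math.exp(log_f)
        for skill in {s for s in traj["skills"] if s is not None}:
            n_traj[skill] += 1               # this trajectory belongs to B_s
    return {s: flow_sum[s] / n_traj[s] for s in flow_sum}


# Hypothetical batch of two trajectories; "skills" gives the skill invoked at each step.
batch = [
    {"fwd": [-0.5, -0.2], "bwd": [-0.9, -0.2], "skills": ["plan", None]},
    {"fwd": [-1.0, -0.3], "bwd": [-0.4, -0.3], "skills": ["plan", "verify"]},
]
print(skill_marginal_flow(batch, log_z=0.0))
```

Every quantity here is computed from log-probabilities that training already produces, which is the sense in which the credit signal needs no additional inference passes.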
###### Proposition 2.
TTB training induces reward-proportional sampling and yields per-step credit at no extra inference cost.
*Proof.* Main results (§[5.2](https://arxiv.org/html/2605.14089#S5.SS2), RQ1), OOD (§[5.3](https://arxiv.org/html/2605.14089#S5.SS3), RQ2), the $-$TTB ablation (§[5.5](https://arxiv.org/html/2605.14089#S5.SS5), RQ4), and algorithm comparison (§[5.6](https://arxiv.org/html/2605.14089#S5.SS6), RQ5) provide empirical validation; the full proof is in Appendix [J](https://arxiv.org/html/2605.14089#A10). $\square$
### 4.3 Flow-Driven Recursive Skill Evolution
Prior work on skill distillation and refinement (Wang et al., [2023](https://arxiv.org/html/2605.14089#bib.bib15); Alzubi et al., [2026](https://arxiv.org/html/2605.14089#bib.bib58); Zhang et al., [2026a](https://arxiv.org/html/2605.14089#bib.bib59); Wang et al., [2026b](https://arxiv.org/html/2605.14089#bib.bib64); Xia et al., [2026](https://arxiv.org/html/2605.14089#bib.bib57)) relies on heuristic schedules or LLM-as-judge prompting, leaving *when*, *what*, and *where* to evolve underspecified. SkillFlow derives all three from the flow signals of Section [4.2](https://arxiv.org/html/2605.14089#S4.SS2): the TTB residual $\Delta(\tau)$ signals *when*, while the step importance $I(t)$ and skill marginal flow $\hat{F}(s)$ localize *what* and *where*.
When: TTB Residual Floor. Within phase $k$, gradient descent drives $\Delta(\tau)^2$ toward $0$ (Proposition [2](https://arxiv.org/html/2605.14089#Thmtheorem2)). The squared-residual floor under the current library is
$$\bar{\Delta}^{*(k)}\;\coloneqq\;\inf_{\theta,\phi,Z_\theta}\;\mathbb{E}_{\tau}\!\bigl[\Delta(\tau\mid\mathcal{S}^{(k)},\,\theta,\phi,Z_\theta)^{2}\bigr]\;\geq\;0. \tag{12}$$
The floor satisfies $\bar{\Delta}^{*(k)}=0$ when $\mathcal{S}^{(k)}$ together with the policy class can express the reward-proportional flow; otherwise the loss plateaus at $\bar{\Delta}^{*(k)}>0$. Phase $k+1$ is triggered when the running mean of $\Delta(\tau)^2$ saturates against this plateau.
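The text does not spell out how saturation is detected, so the following is only one plausible sketch: it compares the mean of $\Delta(\tau)^2$ over two consecutive windows and fires when the relative improvement falls below a tolerance. The class name `PlateauTrigger`, the `window` and `rel_tol` parameters, and the synthetic loss curve are all assumptions.

```python
from collections import deque


class PlateauTrigger:
    """Signal a phase change when the running mean of Delta^2 stops improving."""

    def __init__(self, window=50, rel_tol=0.01):
        self.window = window
        self.rel_tol = rel_tol
        self.values = deque(maxlen=2 * window)  # previous window + current window

    def update(self, delta_sq):
        """Record one squared residual; return True once the loss has saturated."""
        self.values.append(delta_sq)
        if len(self.values) < 2 * self.window:
            return False                         # not enough history yet
        vals = list(self.values)
        prev_mean = sum(vals[:self.window]) / self.window
        curr_mean = sum(vals[self.window:]) / self.window
        improvement = prev_mean - curr_mean
        return improvement <= self.rel_tol * max(prev_mean, 1e-12)


# Hypothetical loss curve: geometric decay toward a floor of 0.2.
trigger = PlateauTrigger(window=20, rel_tol=0.01)
fired_at = None
for step in range(500):
    delta_sq = 0.2 + 0.8 * (0.95 ** step)
    if trigger.update(delta_sq):
        fired_at = step
        break
print(fired_at)
```

A relative rather than absolute tolerance is used here so that the same setting works whether the plateau sits near zero or at a strictly positive floor.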
What and Where: Flow-Guided Skill Curation via a CGF. Each skill $s\in\mathcal{S}$ is an *atomic tip*: a short, self-contained piece of strategic guidance, independently composable via the $\texttt{skill}$ action. We define the per-skill cumulant generating function (CGF) of the telescoped log-flow:
$$\Lambda^{(s)}_{\lambda}\;\coloneqq\;\log\!\left(\frac{1}{|\mathcal{B}_s|}\sum_{\tau\in\mathcal{B}_s}\sum_{t:\,a_t\text{ invokes }s}\!\exp\!\Bigl(\lambda\!\sum_{t'=1}^{t}\log I(t')\Bigr)\right),\quad\lambda\in\mathbb{R}\ \text{(moment order)},$$
which, by the telescoping identity $\log F(H_t)=\log Z_\theta(q)+\sum_{t'\leq t}\log I(t')$ (Eq. [11](https://arxiv.org/html/2605.14089#S4.E11)), is the log $\lambda$-th moment of $F(H_t)/Z_\theta(q)$ along occurrences of $s$. $\Lambda^{(s)}_{\lambda}$ is convex in $\lambda$, and two summaries derived from it drive library evolution. The mean log-flow $G(s)\coloneqq\tfrac{\partial\Lambda^{(s)}_{\lambda}}{\partial\lambda}\big|_{\lambda=0}$ measures the average flow $s$ attracts when invoked, and the centered log-flow share $\widetilde{\Lambda}(s)\coloneqq\Lambda^{(s)}_{1}-\mathbb{E}_{s'}[\Lambda^{(s')}_{1}]$ ranks $s$’s marginal contribution to the library’s reward-proportional sampling:
$$\mathcal{S}^{(k+1)}\;=\;\Phi\bigl(\mathcal{S}^{(k)};\,\{G(s),\,\widetilde{\Lambda}(s)\}_{s\in\mathcal{S}^{(k)}},\,\{\log I(t)\}_{t}\bigr), \tag{13}$$
where $\Lambda^{(s)}_{1}=\log\hat{F}(s)-\log Z_\theta(q)$ recovers the skill marginal flow in log-space. The operator $\Phi$ partitions $\mathcal{S}^{(k)}$ into three disjoint classes: *retain* (high $G(s)$, small Jensen gap), *refine* (high $G(s)$, persistently large Jensen gap), and *prune* (persistently negative $\widetilde{\Lambda}(s)$); on top of these it invokes the Skill Creator $\Psi$ to *generate* new atomic tips at high-$\log I(t)$ steps from same-query success/failure pairs $(\tau^{+},\tau^{-})$. The Jensen gap $\Lambda^{(s)}_{1}-G(s)$ equals the cross-visit variance of $\log F(H_t)$ plus higher-order cumulants, serving as a stability diagnostic that distinguishes *retain* from *refine* for context-inconsistent skills.
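The curation summaries above can be computed directly from per-occurrence values of $\log F(H_t)-\log Z_\theta(q)$. The sketch below is a single-phase simplification under several assumptions: the thresholds `g_min` and `gap_max` are invented, $\mathbb{E}_{s'}$ is taken as a uniform average over the library, "persistently" is not modeled across phases, skills with low $G(s)$ but non-negative centered share fall back to *retain*, and the *generate* step via $\Psi$ is omitted.

```python
import math


def cgf(log_ratios, n_traj, lam):
    """Lambda_lambda^(s): log of (1/|B_s|) * sum over occurrences of exp(lam * x)."""
    return math.log(sum(math.exp(lam * x) for x in log_ratios) / n_traj)


def mean_log_flow(log_ratios):
    """G(s) = dLambda/dlambda at lambda = 0, which reduces to the occurrence mean."""
    return sum(log_ratios) / len(log_ratios)


def curate(stats, g_min=0.0, gap_max=0.5):
    """Assign each skill to retain / refine / prune from its flow summaries.

    stats maps skill -> (log_ratios, n_traj). Thresholds are illustrative.
    """
    lam1 = {s: cgf(x, n, 1.0) for s, (x, n) in stats.items()}
    mean_lam1 = sum(lam1.values()) / len(lam1)
    labels = {}
    for s, (x, _) in stats.items():
        g = mean_log_flow(x)
        jensen_gap = lam1[s] - g          # large when flow varies across visits
        centered = lam1[s] - mean_lam1    # centered log-flow share
        if centered < 0:
            labels[s] = "prune"
        elif g >= g_min and jensen_gap > gap_max:
            labels[s] = "refine"
        else:
            labels[s] = "retain"
    return labels


# Hypothetical per-occurrence values of log F(H_t) - log Z for three skills,
# each invoked once per trajectory (so n_traj equals the number of occurrences).
stats = {
    "stable":   ([1.0, 1.1, 0.9], 3),
    "unstable": ([3.0, -1.0, 3.0, -1.0], 4),
    "weak":     ([-2.0, -2.2], 2),
}
print(curate(stats))
```

When every trajectory invokes a skill exactly once, the Jensen gap in this sketch is non-negative by Jensen's inequality and grows with the spread of the log-flow across visits, which is why it separates the consistent skill from the context-dependent one.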
###### Proposition 3\.
Flow\-driven recursive evolution autonomously expands the skill library while preserving its atomic composability\.
*Proof\.* Leave\-one\-out ablation \(§[5\.5](https://arxiv.org/html/2605.14089#S5.SS5), RQ4\) and skill\-evolution cost savings \(§[5\.6](https://arxiv.org/html/2605.14089#S5.SS6), RQ5\) provide empirical validation; the full proof is in Appendix [K](https://arxiv.org/html/2605.14089#A11)\. $\square$
Table 1: Main results on 14 IID and OOD benchmarks\. SkillFlow uses Qwen3\.5\-9B as the Supervisor\. “Agent\+RL” covers AgentFlow, FlowSteer, and SkillRL\. $\Delta\uparrow$ over Qwen3\.5\-9B\.
## 5 Experiments
We evaluate SkillFlow through the following research questions \(RQs\):

- RQ1: Can SkillFlow outperform existing workflow orchestration methods on in\-distribution benchmarks?
- RQ2: How does SkillFlow generalize to out\-of\-distribution benchmarks?
- RQ3: How transferable is SkillFlow across different LLM backbones?
- RQ4: What are the contributions of core components such as TTB training, backward policy, and skill evolution?
- RQ5: How does SkillFlow compare against other agent RL algorithms \(GRPO, Tree\-GRPO, HCAPO\) and skill\-evolution baselines in accuracy and computational cost?
### 5\.1 Experimental Setup
Datasets\. We evaluate SkillFlow on 14 benchmarks covering four task categories\. In\-distribution \(IID, 7\): HotpotQA \(Yang et al\., [2018](https://arxiv.org/html/2605.14089#bib.bib6)\), TriviaQA \(Joshi et al\., [2017](https://arxiv.org/html/2605.14089#bib.bib1)\), MedQA \(Jin et al\., [2021](https://arxiv.org/html/2605.14089#bib.bib2)\), AIME 2026, WebShop \(Yao et al\., [2022a](https://arxiv.org/html/2605.14089#bib.bib3)\), ALFWorld \(Shridhar et al\., [2020](https://arxiv.org/html/2605.14089#bib.bib4)\), SWE\-bench \(Jimenez et al\., [2023](https://arxiv.org/html/2605.14089#bib.bib5)\)\. Out\-of\-distribution \(OOD, 7\): MuSiQue \(Trivedi et al\., [2022](https://arxiv.org/html/2605.14089#bib.bib7)\), NQ\-Open, MATH\-Hard, GPQA Diamond, HumanEval \(Chen et al\., [2021](https://arxiv.org/html/2605.14089#bib.bib8)\), ScienceWorld, Mind2Web\. More details are in Appendix [M](https://arxiv.org/html/2605.14089#A13)\.
Baselines\. We compare against \(i\) direct LLMs \(Qwen3\.5\-9B, v4\-flash, Claude Haiku 4\.5\), \(ii\) fine\-tuning \(SFT, GRPO \(Shao et al\., [2024](https://arxiv.org/html/2605.14089#bib.bib20)\)\), \(iii\) search\-based workflows \(AFlow \(Zhang et al\., [2024](https://arxiv.org/html/2605.14089#bib.bib29)\)\), and \(iv\) RL agents \(AgentFlow \(Li et al\., [2025b](https://arxiv.org/html/2605.14089#bib.bib67)\), FlowSteer \(Zhang et al\., [2026b](https://arxiv.org/html/2605.14089#bib.bib48)\), SkillRL \(Xia et al\., [2026](https://arxiv.org/html/2605.14089#bib.bib57)\)\)\. See Appendix [N](https://arxiv.org/html/2605.14089#A14) for details\.
Evaluation Metrics\. F1 for QA, Accuracy for AIME/MedQA/MATH/GPQA, Average Score and Success Rate for WebShop/ALFWorld, Resolved Rate for SWE\-bench, pass@1 for HumanEval\. Details are in Appendix [O](https://arxiv.org/html/2605.14089#A15)\.
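For readers unfamiliar with the QA metric, the sketch below shows a common token\-level F1 in the SQuAD style\. The paper's exact answer normalization is given in its Appendix O and is not reproduced here, so the lowercasing and whitespace tokenization used below are assumptions, and `token_f1` is a hypothetical helper name\.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a reference answer string."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris France", "Paris"))
```

A prediction that contains the reference plus extra tokens keeps full recall but loses precision, which is why F1 rather than exact match is the usual choice for free\-form QA answers\.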
Table 2: Component ablation across IID and OOD benchmarks \(RQ4\)\. Each row removes one component\. *−TTB* replaces TTB with GRPO \(C1\); *−Backward policy* removes the hindsight $P_{\phi}$ \(C2\); the last three rows ablate the *when*/*where*/*what* signals of skill evolution \(C3\)\.

Figure 4: Backbone transferability\. \(a\) Per\-backbone radar across seven IID benchmarks \(six proprietary LLMs\), with vs\. without SkillFlow\. \(b\) Training dynamics on two trainable backbones\. \(c\) Aggregated gain by task category\. SkillFlow lifts every backbone, with weaker ones gaining most\.
### 5\.2 Main Results \(RQ1\)
Table [1](https://arxiv.org/html/2605.14089#S4.T1) shows that SkillFlow leads across all 14 benchmarks and surpasses even much stronger direct\-LLM baselines \(v4\-flash, Claude Haiku 4\.5\), indicating that the gains come from how the orchestration policy is trained rather than from backbone capacity\. Margins concentrate on WebShop, ALFWorld, and SWE\-bench, where REINFORCE\-family baselines suffer strategy collapse and AFlow’s static workflow runs out of moves, and they persist across both IID and OOD halves\. The gains compound from three mechanisms: TTB’s regression loss has lower variance than REINFORCE \(formal bound in Appendix [I](https://arxiv.org/html/2605.14089#A9)\), the backward policy supplies zero\-cost per\-step credit, and recursive skill evolution removes the static\-library bottleneck of AgentFlow, FlowSteer, and SkillRL\.
### 5\.3 Out\-of\-Distribution Generalization \(RQ2\)
The seven OOD benchmarks \(Table [1](https://arxiv.org/html/2605.14089#S4.T1)\(b\)\) share the four task categories with the IID set but contain no training data, and the skill library is frozen at end\-of\-training, which makes this a strict transfer test\. SkillFlow retains its IID lead with comparable margins; notably, the gap to REINFORCE\-style baselines *widens* on OOD because reward\-proportional sampling preserves multiple solution paths, whereas a single\-mode policy fails once surface forms shift\. On Avg\. \(OOD\) F1, the lift over FlowSteer reaches $+6.08\%$, roughly $1.5\times$ the IID gap\. That a frozen library still helps unseen tasks indicates the evolved skills capture transferable orchestration primitives rather than benchmark\-specific shortcuts\.
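The argument that reward\-proportional sampling survives a distribution shift better than a single\-mode policy can be illustrated with a toy calculation\. The strategy names and reward values below are invented for illustration and are not results from the paper; the sketch also ignores tempering and simply samples each strategy with probability proportional to its training reward\.

```python
def reward_proportional(rewards):
    """Policy that samples each strategy with probability R_i / sum_j R_j."""
    total = sum(rewards.values())
    return {name: r / total for name, r in rewards.items()}

def greedy(rewards):
    """Collapsed policy that always picks the single highest-reward strategy."""
    best = max(rewards, key=rewards.get)
    return {name: float(name == best) for name in rewards}

def expected_reward(policy, rewards):
    """Expected reward of a policy under a (possibly shifted) reward table."""
    return sum(policy[name] * rewards[name] for name in policy)

# Made-up rewards: the best training strategy stops working after the shift.
train = {"search_then_answer": 0.9, "decompose_then_verify": 0.8, "direct_answer": 0.3}
shifted = {"search_then_answer": 0.0, "decompose_then_verify": 0.8, "direct_answer": 0.3}

for name, policy in [("greedy", greedy(train)), ("proportional", reward_proportional(train))]:
    print(name, expected_reward(policy, train), expected_reward(policy, shifted))
```

The collapsed policy scores higher on the training rewards but drops to zero once its only mode fails, while the proportional policy keeps mass on the surviving strategies\.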
Figure 5: Mechanism analysis and algorithm comparison\. \(a\) Algorithm comparison on IID and OOD benchmarks \(RQ5\)\. \(b\) Training and per\-call cost vs\. five skill\-evolution baselines\. \(c\) Skill\-evolution events\. \(d\) Pass@$K$ vs\. diversity\. \(e\) Reward curves\. SkillFlow leads on accuracy, diversity, and cost simultaneously\.
### 5\.4 Backbone Transferability \(RQ3\)
Figure [4](https://arxiv.org/html/2605.14089#S5.F4) swaps the Supervisor for six proprietary LLMs\. Weaker backbones gain most, because explicit credit and diversity\-preserving sampling offset shaky base reasoning, while stronger ones still gain on agent\-style tasks where orchestration is the bottleneck\. The per\-category lift hierarchy \(Fig\. [4](https://arxiv.org/html/2605.14089#S5.F4)c\) is preserved across backbones, locating SkillFlow at the training\-recipe level rather than backbone\-specific tuning\.
### 5\.5 Component Ablation \(RQ4\)
Table [2](https://arxiv.org/html/2605.14089#S5.T2) reports a leave\-one\-out ablation mapping onto claims C1–C3\. *Loss \(C1\):* replacing TTB with GRPO hurts most on diversity\-sensitive tasks \(AIME, WebShop, ALFWorld\), which points to mode collapse, not reward\-curve shape, as the dominant failure mode\. *Credit \(C2\):* removing $P_{\phi}$ hurts harder multi\-step tasks \(AIME, WebShop, ScienceWorld, Mind2Web\), where per\-step credit matters most for identifying decisive decisions, more than fact\-based QA\. *Evolution \(C3\):* ablating any of the three flow signals \(*when*, *where*, or *what*\) independently degrades performance, indicating none is redundant\.
### 5\.6 Algorithm Comparison and Computational Cost \(RQ5\)
Holding the backbone and training data fixed, SkillFlow leads GRPO, Tree\-GRPO \(Ji et al\., [2025](https://arxiv.org/html/2605.14089#bib.bib47)\), and HCAPO \(Tan et al\., [2026](https://arxiv.org/html/2605.14089#bib.bib50)\) on all 14 benchmarks \(Figure [5](https://arxiv.org/html/2605.14089#S5.F5)\(a\)\)\. Figure [5](https://arxiv.org/html/2605.14089#S5.F5)\(d,e\) explains the gap: REINFORCE\-style methods reinforce a single mode and hit a Pass@5 ceiling at low diversity, while reward\-proportional sampling keeps multiple high\-reward paths alive \(the same ranking holds OOD\)\. Skill\-evolution events concentrate at TTB plateaus \(Fig\. [5](https://arxiv.org/html/2605.14089#S5.F5)\(c\)\), confirming the plateau\-driven trigger of §[4\.3](https://arxiv.org/html/2605.14089#S4.SS3)\. On cost \(Figure [5](https://arxiv.org/html/2605.14089#S5.F5)\(b\)\), SkillFlow uses the fewest tokens and the lowest time against five skill\-evolution baselines \(Jiang et al\., [2026a](https://arxiv.org/html/2605.14089#bib.bib61); Yang et al\., [2026b](https://arxiv.org/html/2605.14089#bib.bib62); Ma et al\., [2026](https://arxiv.org/html/2605.14089#bib.bib63); Alzubi et al\., [2026](https://arxiv.org/html/2605.14089#bib.bib58); Xia et al\., [2026](https://arxiv.org/html/2605.14089#bib.bib57)\); the $\sim$32% / $\sim$35% saving over SkillRL pins the gain on flow\-signal\-driven evolution\.
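Pass@$K$ figures such as the Pass@5 ceiling discussed above are commonly computed with the unbiased estimator introduced with HumanEval \(Chen et al\., 2021\): given $n$ samples per problem of which $c$ are correct, pass@$k$ is $1-\binom{n-c}{k}/\binom{n}{k}$\. Whether SkillFlow's evaluation uses exactly this estimator is an assumption; the sketch below only shows how the quantity is defined\.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations (c of them correct) is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with made-up counts: 10 generations, 3 of them correct.
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```

Under this definition a policy that has collapsed to a single mode gains little from larger $k$ on problems where that mode fails, since $c$ stays at zero however many samples are drawn\.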
## 6 Conclusion
SkillFlow unifies orchestration training and recursive skill evolution under TTB: reward\-proportional sampling preserves diversity, the hindsight backward policy gives zero\-cost per\-step credit, and flow diagnostics drive skill curation\. Across 14 benchmarks it outperforms direct\-LLM, RL, and skill\-evolution baselines on accuracy, diversity, and cost\.
## References
- Evoskill: automated skill discovery for multi\-agent systems\. arXiv preprint arXiv:2603\.02766\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p4.1), [§1](https://arxiv.org/html/2605.14089#S1.p6.1), [§2](https://arxiv.org/html/2605.14089#S2.p1.1), [§4\.3](https://arxiv.org/html/2605.14089#S4.SS3.p1.3), [§5\.6](https://arxiv.org/html/2605.14089#S5.SS6.p1.2)\.
- E\. Bengio, M\. Jain, M\. Korablyov, D\. Precup, and Y\. Bengio \(2021\) Flow network based generative models for non\-iterative diverse candidate generation\. Advances in neural information processing systems 34, pp\. 27381–27394\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p5.2), [§3](https://arxiv.org/html/2605.14089#S3.p3.6)\.
- Y\. Bengio, S\. Lahlou, T\. Deleu, E\. J\. Hu, M\. Tiwari, and E\. Bengio \(2023\) Gflownet foundations\. Journal of Machine Learning Research 24 \(210\), pp\. 1–55\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p5.2), [§3](https://arxiv.org/html/2605.14089#S3.p3.6), [§4\.2](https://arxiv.org/html/2605.14089#S4.SS2.p3.2)\.
- Y\. Cang, X\. Zhang, E\. Zhao, Z\. Ji, Y\. Liu, Y\. He, Z\. Ning, C\. Yijun, W\. Que, and L\. Shi \(2026\) Graph\-grpo: stabilizing multi\-agent topology learning via group relative policy optimization\. arXiv preprint arXiv:2603\.02701\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p2.1)\.
- B\. Chen, C\. Shu, E\. Shareghi, N\. Collier, K\. Narasimhan, and S\. Yao \(2023\) Fireact: toward language agent fine\-tuning\. arXiv preprint arXiv:2310\.05915\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p3.1)\.
- K\. Chen, M\. Cusumano\-Towner, B\. Huval, A\. Petrenko, J\. Hamburger, V\. Koltun, and P\. Krähenbühl \(2025\) Reinforcement learning for long\-horizon interactive llm agents\. arXiv preprint arXiv:2502\.01600\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p2.1)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. D\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, et al\. \(2021\) Evaluating large language models trained on code\. arXiv preprint arXiv:2107\.03374\. Cited by: [5th item](https://arxiv.org/html/2605.14089#A13.I2.i5.p1.1), [Appendix O](https://arxiv.org/html/2605.14089#A15.SS0.SSS0.Px5.p1.1), [§5\.1](https://arxiv.org/html/2605.14089#S5.SS1.p1.1)\.
- P\. Dall’Antonia, T\. da Silva, D\. Csillag, S\. Lahlou, and D\. Mesquita \(2026\) Avoid what you know: divergent trajectory balance for gflownets\. arXiv preprint arXiv:2602\.17827\. Cited by: [§4\.2](https://arxiv.org/html/2605.14089#S4.SS2.p3.7)\.
- Y\. Dang, C\. Qian, X\. Luo, J\. Fan, Z\. Xie, R\. Shi, W\. Chen, C\. Yang, X\. Che, Y\. Tian, et al\. \(2025\) Multi\-agent collaboration via evolving orchestration\. arXiv preprint arXiv:2505\.19591\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p1.1), [§2](https://arxiv.org/html/2605.14089#S2.p1.1)\.
- J\. Fang, Y\. Peng, X\. Zhang, Y\. Wang, X\. Yi, G\. Zhang, Y\. Xu, B\. Wu, S\. Liu, Z\. Li, et al\. \(2025\) A comprehensive survey of self\-evolving ai agents: a new paradigm bridging foundation models and lifelong agentic systems\. arXiv preprint arXiv:2508\.07407\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p4.1), [§2](https://arxiv.org/html/2605.14089#S2.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, et al\. \(2025\) Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\. arXiv preprint arXiv:2501\.12948\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p3.1), [§1](https://arxiv.org/html/2605.14089#S1.p4.1), [§2](https://arxiv.org/html/2605.14089#S2.p2.1)\.
- S\. Hong, M\. Zhuge, J\. Chen, X\. Zheng, Y\. Cheng, J\. Wang, C\. Zhang, Z\. Wang, S\. K\. S\. Yau, Z\. Lin, et al\. \(2023\) MetaGPT: meta programming for a multi\-agent collaborative framework\. In The twelfth international conference on learning representations\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p1.1), [§2](https://arxiv.org/html/2605.14089#S2.p1.1)\.
- S\. Hu, C\. Lu, and J\. Clune \(2024\) Automated design of agentic systems\. arXiv preprint arXiv:2408\.08435\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p2.1)\.
- Y\. Ji, Z\. Ma, Y\. Wang, G\. Chen, X\. Chu, and L\. Wu \(2025\) Tree search for llm agent reinforcement learning\. arXiv preprint arXiv:2509\.21240\. Cited by: [§5\.6](https://arxiv.org/html/2605.14089#S5.SS6.p1.2)\.
- G\. Jiang, Z\. Su, X\. Qu, and Y\. R\. Fung \(2026a\) Xskill: continual learning from experience and skills in multimodal agents\. arXiv preprint arXiv:2603\.12056\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p6.1), [§5\.6](https://arxiv.org/html/2605.14089#S5.SS6.p1.2)\.
- Y\. Jiang, D\. Li, H\. Deng, B\. Ma, X\. Wang, Q\. Wang, and G\. Yu \(2026b\) SoK: agentic skills–beyond tool use in llm agents\. arXiv preprint arXiv:2602\.20867\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p1.1)\.
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. R\. Narasimhan \(2023\) Swe\-bench: can language models resolve real\-world github issues?\. In The twelfth international conference on learning representations\. Cited by: [7th item](https://arxiv.org/html/2605.14089#A13.I1.i7.p1.1), [Appendix O](https://arxiv.org/html/2605.14089#A15.SS0.SSS0.Px4.p1.1), [§5\.1](https://arxiv.org/html/2605.14089#S5.SS1.p1.1)\.
- B\. Jin, H\. Zeng, Z\. Yue, J\. Yoon, S\. Arik, D\. Wang, H\. Zamani, and J\. Han \(2025\) Search\-r1: training llms to reason and leverage search engines with reinforcement learning\. arXiv preprint arXiv:2503\.09516\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p2.1)\.
- D\. Jin, E\. Pan, N\. Oufattole, W\. Weng, H\. Fang, and P\. Szolovits \(2021\) What disease does this patient have? a large\-scale open domain question answering dataset from medical exams\. Applied Sciences 11 \(14\), pp\. 6421\. Cited by: [3rd item](https://arxiv.org/html/2605.14089#A13.I1.i3.p1.1), [§5\.1](https://arxiv.org/html/2605.14089#S5.SS1.p1.1)\.
- M\. Joshi, E\. Choi, D\. S\. Weld, and L\. Zettlemoyer \(2017\) Triviaqa: a large scale distantly supervised challenge dataset for reading comprehension\. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 1601–1611\. Cited by: [2nd item](https://arxiv.org/html/2605.14089#A13.I1.i2.p1.1), [§5\.1](https://arxiv.org/html/2605.14089#S5.SS1.p1.1)\.
- M\. Kong, Z\. Qu, Z\. Zhou, P\. Liang, X\. Li, Z\. Shang, Z\. Hong, K\. Huang, Z\. Wang, and Z\. Dai \(2026\) Workflow\-r1: group sub\-sequence policy optimization for multi\-turn workflow construction\. arXiv preprint arXiv:2602\.01202\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p3.1), [§1](https://arxiv.org/html/2605.14089#S1.p4.1)\.
- L\. Li, Z\. Zhou, J\. Hao, J\. K\. Liu, Y\. Miao, W\. Pang, X\. Tan, W\. Chu, Z\. Wang, S\. Pan, et al\. \(2025a\) The choice of divergence: a neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward\. arXiv preprint arXiv:2509\.07430\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p4.1), [§2](https://arxiv.org/html/2605.14089#S2.p2.1)\.
- X\. Li \(2026\) When single\-agent with skills replace multi\-agent systems and when they fail\. arXiv preprint arXiv:2601\.04748\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p1.1)\.
- Z\. Li, H\. Zhang, S\. Han, S\. Liu, J\. Xie, Y\. Zhang, Y\. Choi, J\. Zou, and P\. Lu \(2025b\) In\-the\-flow agentic system optimization for effective planning and tool use\. arXiv preprint arXiv:2510\.05592\. Cited by: [1st item](https://arxiv.org/html/2605.14089#A14.I4.i1.p1.1), [§1](https://arxiv.org/html/2605.14089#S1.p3.1), [§1](https://arxiv.org/html/2605.14089#S1.p4.1), [§1](https://arxiv.org/html/2605.14089#S1.p6.1), [§2](https://arxiv.org/html/2605.14089#S2.p2.1), [§5\.1](https://arxiv.org/html/2605.14089#S5.SS1.p2.1)\.
- S\. Liu, Z\. Liang, X\. Lyu, and C\. Amato \(2026\) Llm collaboration with multi\-agent reinforcement learning\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 40, pp\. 32150–32158\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p2.1)\.
- X\. Liu, K\. Wang, Y\. Wu, F\. Huang, Y\. Li, J\. Zhang, and J\. Jiao \(2025\) Agentic reinforcement learning with implicit step rewards\. arXiv preprint arXiv:2509\.19199\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p2.1)\.
- Z\. Ma, S\. Yang, Y\. Ji, X\. Wang, Y\. Wang, Y\. Hu, T\. Huang, and X\. Chu \(2026\) SkillClaw: let skills evolve collectively with agentic evolver\. arXiv preprint arXiv:2604\.08377\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p6.1), [§5\.6](https://arxiv.org/html/2605.14089#S5.SS6.p1.2)\.
- N\. Malkin, M\. Jain, E\. Bengio, C\. Sun, and Y\. Bengio \(2022\) Trajectory balance: improved credit assignment in gflownets\. Advances in Neural Information Processing Systems 35, pp\. 5955–5967\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p5.2)\.
- I\. Ong, A\. Almahairi, V\. Wu, W\. Chiang, T\. Wu, J\. E\. Gonzalez, M\. W\. Kadous, and I\. Stoica \(2024\) Routellm: learning to route llms with preference data\. arXiv preprint arXiv:2406\.18665\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p1.1)\.
- J\. Peng, Y\. Liu, R\. Zhou, C\. Fleming, Z\. Wang, A\. Garcia, and M\. Hong \(2026\) Hiper: hierarchical reinforcement learning with explicit credit assignment for large language model agents\. arXiv preprint arXiv:2602\.16165\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p2.1)\.
- C\. Qian, Z\. Xie, Y\. Wang, W\. Liu, K\. Zhu, H\. Xia, Y\. Dang, Z\. Du, W\. Chen, C\. Yang, et al\. \(2024\) Scaling large language model\-based multi\-agent collaboration\. arXiv preprint arXiv:2406\.07155\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p2.1)\.
- Y\. Qin, S\. Liang, Y\. Ye, K\. Zhu, L\. Yan, Y\. Lu, Y\. Lin, X\. Cong, X\. Tang, B\. Qian, et al\. \(2023\) Toolllm: facilitating large language models to master 16000\+ real\-world apis\. In The twelfth international conference on learning representations\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p1.1)\.
- J\. Schulman, P\. Moritz, S\. Levine, M\. Jordan, and P\. Abbeel \(2015\) High\-dimensional continuous control using generalized advantage estimation\. arXiv preprint arXiv:1506\.02438\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p4.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu, et al\. \(2024\) Deepseekmath: pushing the limits of mathematical reasoning in open language models\. arXiv preprint arXiv:2402\.03300\. Cited by: [2nd item](https://arxiv.org/html/2605.14089#A14.I2.i2.p1.1), [§1](https://arxiv.org/html/2605.14089#S1.p3.1), [§2](https://arxiv.org/html/2605.14089#S2.p2.1), [§5\.1](https://arxiv.org/html/2605.14089#S5.SS1.p2.1)\.
- S\. She, J\. Liu, Y\. Liu, J\. Chen, X\. Huang, and S\. Huang \(2025\) R\-prm: reasoning\-driven process reward modeling\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp\. 13449–13462\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p2.1)\.
- M\. Shridhar, X\. Yuan, M\. Côté, Y\. Bisk, A\. Trischler, and M\. Hausknecht \(2020\) Alfworld: aligning text and embodied environments for interactive learning\. arXiv preprint arXiv:2010\.03768\. Cited by: [6th item](https://arxiv.org/html/2605.14089#A13.I1.i6.p1.1), [§5\.1](https://arxiv.org/html/2605.14089#S5.SS1.p1.1)\.
- J\. Su, X\. Zeng, L\. Liu, C\. Luo, Y\. Chen, and Z\. Zhuang \(2025\) Enhancing agentic rl with progressive reward shaping and value\-based sampling policy optimization\. arXiv preprint arXiv:2512\.07478\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p2.1)\.
- J\. Su, Q\. Lan, Y\. Xia, L\. Sun, W\. Tian, T\. Shi, and L\. He \(2026\) Difficulty\-aware agentic orchestration for query\-specific multi\-agent workflows\. In Proceedings of the ACM Web Conference 2026, pp\. 2060–2070\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p1.1)\.
- H\. Tan, X\. Yang, H\. Chen, J\. Shao, Y\. Wen, Y\. Shen, W\. Luo, X\. Du, L\. Guo, and Y\. Li \(2026\) Hindsight credit assignment for long\-horizon llm agents\. arXiv preprint arXiv:2603\.08754\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p4.1), [§2](https://arxiv.org/html/2605.14089#S2.p2.1), [§5\.6](https://arxiv.org/html/2605.14089#S5.SS6.p1.2)\.
- H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal \(2022\) ♫ MuSiQue: multihop questions via single\-hop question composition\. Transactions of the Association for Computational Linguistics 10, pp\. 539–554\. Cited by: [1st item](https://arxiv.org/html/2605.14089#A13.I2.i1.p1.1), [§5\.1](https://arxiv.org/html/2605.14089#S5.SS1.p1.1)\.
- C\. Wang, Z\. Yu, X\. Xie, W\. Yao, R\. Fang, S\. Qiao, K\. Cao, G\. Zheng, X\. Qi, P\. Zhang, et al\. \(2026a\) SkillX: automatically constructing skill knowledge bases for agents\. arXiv preprint arXiv:2604\.04804\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p4.1), [§2](https://arxiv.org/html/2605.14089#S2.p1.1)\.
- G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2023\) Voyager: an open\-ended embodied agent with large language models\. arXiv preprint arXiv:2305\.16291\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p3.1), [§1](https://arxiv.org/html/2605.14089#S1.p4.1), [§2](https://arxiv.org/html/2605.14089#S2.p1.1), [§4\.3](https://arxiv.org/html/2605.14089#S4.SS3.p1.3)\.
- H\. Wang, G\. Wang, H\. Xiao, Y\. Zhou, Y\. Pan, J\. Wang, K\. Xu, Y\. Wen, X\. Ruan, X\. Chen, et al\. \(2026b\) Skill\-sd: skill\-conditioned self\-distillation for multi\-turn llm agents\. arXiv preprint arXiv:2604\.10674\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p1.1), [§4\.3](https://arxiv.org/html/2605.14089#S4.SS3.p1.3)\.
- J\. Wang, Q\. Yan, Y\. Wang, Y\. Tian, S\. S\. Mishra, Z\. Xu, M\. Gandhi, P\. Xu, and L\. L\. Cheong \(2025a\) Reinforcement learning for self\-improving agent with skill library\. arXiv preprint arXiv:2512\.17102\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p4.1), [§2](https://arxiv.org/html/2605.14089#S2.p1.1)\.
- X\. Wang, W\. Wang, K\. Chen, N\. Nimalsiri, and S\. Halgamuge \(2026c\) Discovering process\-outcome credit in multi\-step llm reasoning\. arXiv preprint arXiv:2602\.01034\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p4.1)\.
- X\. Wang, Y\. Chen, L\. Yuan, Y\. Zhang, Y\. Li, H\. Peng, and H\. Ji \(2024a\) Executable code actions elicit better llm agents\. In Forty\-first International Conference on Machine Learning\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p3.1), [§2](https://arxiv.org/html/2605.14089#S2.p1.1)\.
- X\. Wang, B\. Li, Y\. Song, F\. F\. Xu, X\. Tang, M\. Zhuge, J\. Pan, Y\. Song, B\. Li, J\. Singh, et al\. \(2024b\) Openhands: an open platform for ai software developers as generalist agents\. arXiv preprint arXiv:2407\.16741\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p1.1), [§2](https://arxiv.org/html/2605.14089#S2.p1.1)\.
- Z\. Wang, K\. Wang, Q\. Wang, P\. Zhang, L\. Li, Z\. Yang, X\. Jin, K\. Yu, M\. N\. Nguyen, L\. Liu, et al\. \(2025b\) Ragen: understanding self\-evolution in llm agents via multi\-turn reinforcement learning\. arXiv preprint arXiv:2504\.20073\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p3.1), [§1](https://arxiv.org/html/2605.14089#S1.p4.1)\.
- Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, B\. Li, E\. Zhu, L\. Jiang, X\. Zhang, S\. Zhang, J\. Liu, et al\. \(2024\) Autogen: enabling next\-gen llm applications via multi\-agent conversations\. In First conference on language modeling\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p1.1)\.
- P\. Xia, J\. Chen, H\. Wang, J\. Liu, K\. Zeng, Y\. Wang, S\. Han, Y\. Zhou, X\. Zhao, H\. Chen, et al\. \(2026\) Skillrl: evolving agents via recursive skill\-augmented reinforcement learning\. arXiv preprint arXiv:2602\.08234\. Cited by: [3rd item](https://arxiv.org/html/2605.14089#A14.I4.i3.p1.1), [§1](https://arxiv.org/html/2605.14089#S1.p3.1), [§1](https://arxiv.org/html/2605.14089#S1.p4.1), [§1](https://arxiv.org/html/2605.14089#S1.p6.1), [§4\.3](https://arxiv.org/html/2605.14089#S4.SS3.p1.3), [§5\.1](https://arxiv.org/html/2605.14089#S5.SS1.p2.1), [§5\.6](https://arxiv.org/html/2605.14089#S5.SS6.p1.2)\.
- R\. Xu and Y\. Yan \(2026\) Agent skills for large language models: architecture, acquisition, security, and the path forward\. arXiv preprint arXiv:2602\.12430\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p2.1), [§1](https://arxiv.org/html/2605.14089#S1.p4.1), [§2](https://arxiv.org/html/2605.14089#S2.p1.1)\.
- M\. Y\. Yang, H\. Bai, I\. Wu, G\. Yang, A\. Setlur, and A\. Kumar \(2026a\) InT: self\-proposed interventions enable credit assignment in llm reasoning\. arXiv preprint arXiv:2601\.14209\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p2.1)\.
- Y\. Yang, J\. Li, Q\. Pan, B\. Zhan, Y\. Cai, L\. Du, J\. Zhou, K\. Chen, Q\. Chen, X\. Li, et al\. \(2026b\) Autoskill: experience\-driven lifelong learning via skill self\-evolution\. arXiv preprint arXiv:2603\.01145\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p6.1), [§5\.6](https://arxiv.org/html/2605.14089#S5.SS6.p1.2)\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\) HotpotQA: a dataset for diverse, explainable multi\-hop question answering\. In Proceedings of the 2018 conference on empirical methods in natural language processing, pp\. 2369–2380\. Cited by: [1st item](https://arxiv.org/html/2605.14089#A13.I1.i1.p1.1), [§5\.1](https://arxiv.org/html/2605.14089#S5.SS1.p1.1)\.
- S\. Yao, H\. Chen, J\. Yang, and K\. Narasimhan \(2022a\) Webshop: towards scalable real\-world web interaction with grounded language agents\. Advances in Neural Information Processing Systems 35, pp\. 20744–20757\. Cited by: [5th item](https://arxiv.org/html/2605.14089#A13.I1.i5.p1.1), [§5\.1](https://arxiv.org/html/2605.14089#S5.SS1.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2022b\) React: synergizing reasoning and acting in language models\. arXiv preprint arXiv:2210\.03629\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p1.1), [§2](https://arxiv.org/html/2605.14089#S2.p1.1), [§4\.1](https://arxiv.org/html/2605.14089#S4.SS1.p1.3)\.
- F\. Yu, L\. Jiang, H\. Kang, S\. Hao, and L\. Qin \(2024\) Flow of reasoning: training llms for divergent reasoning with minimal examples\. arXiv preprint arXiv:2406\.05673\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p2.1)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu, et al\. \(2025\) Dapo: an open\-source llm reinforcement learning system at scale\. arXiv preprint arXiv:2503\.14476\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p4.1), [§2](https://arxiv.org/html/2605.14089#S2.p2.1)\.
- Y\. Yue, Y\. Yuan, Q\. Yu, X\. Zuo, R\. Zhu, W\. Xu, J\. Chen, C\. Wang, T\. Fan, Z\. Du, et al\. \(2025\) Vapo: efficient and reliable reinforcement learning for advanced reasoning tasks\. arXiv preprint arXiv:2504\.05118\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p2.1)\.
- A\. Zeng, M\. Liu, R\. Lu, B\. Wang, X\. Liu, Y\. Dong, and J\. Tang \(2024\) Agenttuning: enabling generalized agent abilities for llms\. In Findings of the Association for Computational Linguistics: ACL 2024, pp\. 3053–3077\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p2.1), [§1](https://arxiv.org/html/2605.14089#S1.p3.1)\.
- D\. W\. Zhang, C\. Rainone, M\. Peschl, and R\. Bondesan \(2023\) Robust scheduling with gflownets\. arXiv preprint arXiv:2302\.05446\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p4.1), [§2](https://arxiv.org/html/2605.14089#S2.p2.1)\.
- H\. Zhang, S\. Fan, H\. P\. Zou, Y\. Chen, Z\. Wang, J\. Zhou, C\. Li, W\. Huang, Y\. Yao, K\. Zheng, et al\. \(2026a\) EvoSkills: self\-evolving agent skills via co\-evolutionary verification\. arXiv preprint arXiv:2604\.01687\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p4.1), [§2](https://arxiv.org/html/2605.14089#S2.p1.1), [§4\.3](https://arxiv.org/html/2605.14089#S4.SS3.p1.3)\.
- J\. Zhang, J\. Xiang, Z\. Yu, F\. Teng, X\. Chen, J\. Chen, M\. Zhuge, X\. Cheng, S\. Hong, J\. Wang, et al\. \(2024\) Aflow: automating agentic workflow generation\. In The Thirteenth International Conference on Learning Representations\. Cited by: [1st item](https://arxiv.org/html/2605.14089#A14.I3.i1.p1.1), [§1](https://arxiv.org/html/2605.14089#S1.p3.1), [§5\.1](https://arxiv.org/html/2605.14089#S5.SS1.p2.1)\.
- M\. Zhang, H\. Luo, T\. Shen, Q\. Lin, X\. Tang, R\. Mao, and E\. Cambria \(2026b\) FlowSteer: interactive agentic workflow orchestration via end\-to\-end reinforcement learning\. arXiv preprint arXiv:2602\.01664\. Cited by: [2nd item](https://arxiv.org/html/2605.14089#A14.I4.i2.p1.1), [§1](https://arxiv.org/html/2605.14089#S1.p3.1), [§1](https://arxiv.org/html/2605.14089#S1.p4.1), [§1](https://arxiv.org/html/2605.14089#S1.p6.1), [§2](https://arxiv.org/html/2605.14089#S2.p2.1), [§5\.1](https://arxiv.org/html/2605.14089#S5.SS1.p2.1)\.
- Z\. Zhang, C\. Zheng, Y\. Wu, B\. Zhang, R\. Lin, B\. Yu, D\. Liu, J\. Zhou, and J\. Lin \(2025\) The lessons of developing process reward models in mathematical reasoning\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 10495–10516\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p2.1)\.
- A\. Zhou, K\. Yan, M\. Shlapentokh\-Rothman, H\. Wang, and Y\. Wang \(2023\) Language agent tree search unifies reasoning acting and planning in language models\. arXiv preprint arXiv:2310\.04406\. Cited by: [§2](https://arxiv.org/html/2605.14089#S2.p2.1)\.
- M\. Zhuge, W\. Wang, L\. Kirsch, F\. Faccio, D\. Khizbullin, and J\. Schmidhuber \(2024\) Gptswarm: language agents as optimizable graphs\. In Forty\-first International Conference on Machine Learning\. Cited by: [§1](https://arxiv.org/html/2605.14089#S1.p3.1)\.
## Appendix: Theoretical Foundations of SkillFlow
This appendix provides a complete theoretical treatment of SkillFlow’s flow\-based learning framework, following the progressive development style of GFlowNet Foundations \(Bengio et al\., JMLR 2023\)\. We establish the mathematical foundations for flow networks on DAGs, derive the Tempered Trajectory Balance \(TTB\) loss, prove convergence properties, and detail all supporting theoretical results\. Each section builds explicitly on previous results, providing complete proofs with intermediate steps\.
## Appendix A Flow\-Theoretic Foundations
This section introduces the fundamental concepts of flow networks on directed acyclic graphs \(DAGs\) and establishes the connection to reward\-proportional sampling\.
### A\.1 Flow Networks on DAGs
We begin with the basic definitions of flow networks\.
###### Definition 1 \(Flow Network\)\.
A flow network is a tuple $(\mathcal{G},F)$ where $\mathcal{G}=(\mathcal{V},\mathcal{A})$ is a DAG with a unique source node $s_{0}\in\mathcal{V}$ having no incoming edges and a set of terminal nodes $\mathcal{X}\subset\mathcal{V}$ having no outgoing edges, and $F:\mathcal{A}\to\mathbb{R}_{\geq 0}$ is a non\-negative edge flow function assigning a non\-negative real number to each directed edge\.
For each state $s\in\mathcal{V}$, we denote by $\mathrm{Pa}(s)$ the set of predecessors \(parents\) and by $\mathrm{Ch}(s)$ the set of successors \(children\)\. We now introduce the crucial concept of state flow\.
###### Definition 2 \(State Flow and Flow Conservation\)\.
For any non\-terminal state $s\in\mathcal{V}\setminus\mathcal{X}$, the *state flow* $F(s)\in\mathbb{R}_{\geq 0}$ is defined implicitly by the flow conservation law:

$$F(s)\;\coloneqq\;\sum_{s'\in\mathrm{Ch}(s)}F(s\to s')\;=\;\sum_{s'\in\mathrm{Pa}(s)}F(s'\to s).\tag{14}$$

That is, the total flow entering a non\-terminal state equals the total flow leaving it\. For the source $s_{0}$, which has no parents, only the outgoing sum applies; for a terminal state $x\in\mathcal{X}$, which has no children, $F(x)$ denotes the incoming sum\. This is the fundamental conservation law of flow networks, analogous to Kirchhoff’s current law in electrical networks\.
The partition function represents the total flow circulating through the network\.
###### Definition 3 (Partition Function and Total Flow).
The partition function $Z$ is the total flow in the network, equal to the source flow:
$$Z\;\coloneqq\;F(s_{0})\;=\;\sum_{s'\in\mathrm{Ch}(s_{0})}F(s_{0}\to s')\;=\;\sum_{x\in\mathcal{X}}F(x).\tag{15}$$
The last equality (terminal flow equals source flow) follows from recursive application of flow conservation at each state: every unit of flow leaving the source must traverse some path and arrive at some terminal.
We now introduce policies derived from flow normalization\.
###### Definition 4 (Forward Policy and Backward Policy).
Given a flow $F$, we define:
*Forward policy*: $P_{F}(s'\mid s)=\frac{F(s\to s')}{F(s)}$ for each edge $s\to s'$ where $s$ is non-terminal.
*Backward policy*: $P_{B}(s\mid s')=\frac{F(s\to s')}{F(s')}$ for each edge $s\to s'$ where $s'\in\mathcal{V}\setminus\{s_{0}\}$.
By flow conservation (Definition [2](https://arxiv.org/html/2605.14089#Thmdefinition2)), both $P_{F}$ and $P_{B}$ are valid probability distributions: $\sum_{s'}P_{F}(s'\mid s)=1$ and $\sum_{s}P_{B}(s\mid s')=1$.
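Definitions 1 to 4 can be checked numerically on a small example. The following is a minimal sketch, not part of SkillFlow: the five-node DAG, its edge flows, and all function names are assumptions introduced here for illustration. It computes state flows, checks conservation at the interior states, and derives the forward and backward policies.

```python
from fractions import Fraction

# Hypothetical toy DAG (not from the paper): s0 -> {a, b}, a -> {x, y}, b -> y.
# Edge flows are chosen so that inflow equals outflow at the interior states a and b.
EDGE_FLOW = {
    ("s0", "a"): Fraction(3),
    ("s0", "b"): Fraction(2),
    ("a", "x"): Fraction(1),
    ("a", "y"): Fraction(2),
    ("b", "y"): Fraction(2),
}
SOURCE = "s0"
TERMINALS = {"x", "y"}
STATES = {node for edge in EDGE_FLOW for node in edge}


def out_flow(s):
    """Total flow on edges leaving s."""
    return sum(f for (u, _), f in EDGE_FLOW.items() if u == s)


def in_flow(s):
    """Total flow on edges entering s."""
    return sum(f for (_, v), f in EDGE_FLOW.items() if v == s)


def state_flow(s):
    """F(s): outgoing flow for non-terminals, incoming flow for terminals."""
    return in_flow(s) if s in TERMINALS else out_flow(s)


def conservation_holds():
    """Check inflow == outflow at every state that is neither source nor terminal."""
    interior = STATES - TERMINALS - {SOURCE}
    return all(in_flow(s) == out_flow(s) for s in interior)


def forward_policy():
    """P_F(s' | s) = F(s -> s') / F(s), keyed by edge (s, s')."""
    return {(u, v): f / state_flow(u) for (u, v), f in EDGE_FLOW.items()}


def backward_policy():
    """P_B(s | s') = F(s -> s') / F(s'), keyed by edge (s, s')."""
    return {(u, v): f / state_flow(v) for (u, v), f in EDGE_FLOW.items()}


Z = state_flow(SOURCE)  # partition function = source flow
```

On this toy flow, `Z` equals the sum of the two terminal flows, as Definition 3 requires, and each policy sums to one over the children (forward) or parents (backward) of a state.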
We now define trajectory flow, which is essential for understanding how the forward policy generates trajectories\.
###### Definition 5 (Trajectory and Trajectory Flow).
A complete trajectory is a path $\tau=(s_{0}\to s_{1}\to\cdots\to s_{T}=x)$ from the source $s_{0}$ to some terminal $x\in\mathcal{X}$. The trajectory flow is:
$$F(\tau)\;\coloneqq\;Z\cdot\prod_{t=1}^{T}P_{F}(s_{t}\mid s_{t-1}).\tag{16}$$
###### Lemma 4 (Trajectory Flow Identity).
On a general DAG, the trajectory flow $F(\tau)=Z\cdot\prod_{t=1}^{T}P_{F}(s_{t}\mid s_{t-1})$ admits no further factorization in terms of edge or state flows alone; in particular, $F(\tau)\neq\prod_{t=1}^{T}F(s_{t-1}\to s_{t})$ in general because edge flows and state flows are distinct quantities. When the DAG additionally satisfies that each non-source state has a unique parent (as in SkillFlow’s tree-structured $\mathcal{G}$), $F(s_{t-1}\to s_{t})=F(s_{t})$ holds for every edge (all flow into $s_{t}$ comes from the single edge $s_{t-1}\to s_{t}$), and the trajectory flow telescopes to the terminal state flow:
$$F(\tau)\;=\;Z\cdot\prod_{t=1}^{T}\frac{F(s_{t})}{F(s_{t-1})}\;=\;\frac{F(s_{T})}{F(s_{0})}\cdot Z\;=\;F(s_{T}),\tag{17}$$
using the standard ratio-product telescope on *state* flows together with $F(s_{0})=Z$ (Definition [3](https://arxiv.org/html/2605.14089#Thmdefinition3)).
We now establish the key decomposition of state flow\.
###### Proposition 5 (Flow Decomposition).
For any state $s\in\mathcal{V}$, the state flow equals the sum of flows of all trajectories passing through $s$:
$$F(s)=\sum_{\tau\ni s}F(\tau),\tag{18}$$
where $\{\tau\ni s\}$ denotes all complete trajectories passing through state $s$. In particular, for terminal states:
$$F(x)=\sum_{\tau:s_{T}=x}F(\tau),\tag{19}$$
i.e., the flow into a terminal equals the sum of flows of trajectories terminating at that state.
###### Proof.
We prove by structural induction on the depth of states in the DAG, where depth is the longest path length from the source.
*Base case* ($\text{depth}(s)=0$): $s=s_{0}$ is the source. By Definition [3](https://arxiv.org/html/2605.14089#Thmdefinition3), $F(s_{0})=Z=\sum_{\tau}F(\tau)$, since every trajectory starts at $s_{0}$ and summing over all complete trajectories gives the total flow.
*Inductive step*: Assume the claim holds for all states at depth $<d$. Consider a state $s$ at depth $d$. By Definition [2](https://arxiv.org/html/2605.14089#Thmdefinition2) (flow conservation):
$$F(s)=\sum_{s'\in\mathrm{Pa}(s)}F(s'\to s).\tag{20}$$
By Definition [4](https://arxiv.org/html/2605.14089#Thmdefinition4), $F(s'\to s)=F(s')\cdot P_{F}(s\mid s')$. By the inductive hypothesis, $F(s')=\sum_{\tau'\ni s'}F(\tau')$, so:
$$F(s)=\sum_{s'\in\mathrm{Pa}(s)}F(s')\cdot P_{F}(s\mid s')\tag{21}$$
$$\phantom{F(s)}=\sum_{s'\in\mathrm{Pa}(s)}\left(\sum_{\tau'\ni s'}F(\tau')\right)\cdot P_{F}(s\mid s').\tag{22}$$
Each trajectory $\tau'$ through $s'$ can be extended to a unique trajectory through $s$ by following the edge $s'\to s$. Since every complete trajectory through $s$ passes through exactly one parent $s'\in\mathrm{Pa}(s)$, the above sum equals:
$$F(s)=\sum_{\tau\ni s}F(\tau).\tag{23}$$
∎
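Proposition 5 can be verified by brute force on a small DAG. The sketch below is illustrative only: the toy graph, its flows, and the helper names are assumptions, not taken from the paper. It enumerates every complete trajectory, computes $F(\tau)=Z\prod_t P_F(s_t\mid s_{t-1})$ from Eq. (16), and sums these over the trajectories that visit a given state.

```python
from fractions import Fraction
from math import prod

# Hypothetical toy flow (not from the paper); conservation holds at a and b.
EDGE_FLOW = {
    ("s0", "a"): Fraction(3),
    ("s0", "b"): Fraction(2),
    ("a", "x"): Fraction(1),
    ("a", "y"): Fraction(2),
    ("b", "y"): Fraction(2),
}
TERMINALS = {"x", "y"}
STATES = {node for edge in EDGE_FLOW for node in edge}


def children(s):
    return [v for (u, v) in EDGE_FLOW if u == s]


def state_flow(s):
    """F(s): incoming flow for terminals, outgoing flow otherwise."""
    if s in TERMINALS:
        return sum(f for (_, v), f in EDGE_FLOW.items() if v == s)
    return sum(f for (u, _), f in EDGE_FLOW.items() if u == s)


def complete_trajectories(state="s0"):
    """Yield every path from `state` to a terminal as a tuple of states."""
    if state in TERMINALS:
        yield (state,)
        return
    for child in children(state):
        for suffix in complete_trajectories(child):
            yield (state,) + suffix


def trajectory_flow(tau):
    """F(tau) = Z * prod_t P_F(s_t | s_{t-1})  (Eq. 16)."""
    z = state_flow("s0")
    return z * prod(EDGE_FLOW[(u, v)] / state_flow(u) for u, v in zip(tau, tau[1:]))


def flow_through(s):
    """Sum of trajectory flows over all complete trajectories visiting s."""
    return sum(trajectory_flow(tau) for tau in complete_trajectories() if s in tau)
```

Exact rational arithmetic (`Fraction`) is used so that the equality `flow_through(s) == state_flow(s)` can be tested without floating-point tolerance.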
### A.2 Reward-Matching and GFlowNet Sampling
We now establish when a flow induces sampling proportional to a reward function\.
###### Definition 6 (Reward-Matching Flow).
Given a strictly positive reward function $R:\mathcal{X}\to\mathbb{R}_{>0}$ on terminal states, a flow $F$ is *reward-matching* if:
$$F(x)=R(x)\quad\text{for all }x\in\mathcal{X}.\tag{24}$$
That is, the flow into each terminal state equals the reward of that terminal.
The central theorem of GFlowNets states that reward\-matching flows induce a distribution over terminals proportional to rewards\.
###### Theorem 6 (GFlowNet Sampling Property).
If $F$ is reward-matching with respect to $R:\mathcal{X}\to\mathbb{R}_{>0}$, then sampling a complete trajectory by following the forward policy $P_{F}$ produces a distribution over terminals given by:
$$P_{F}(x)\;\coloneqq\;\Pr_{\tau\sim P_{F}}[s_{T}=x]=\frac{R(x)}{Z},\tag{25}$$
where $Z=\sum_{x'\in\mathcal{X}}R(x')$ is the partition function (total reward).
###### Proof.
Let $\tau_{x}$ denote any complete trajectory ending at $x$. The probability of sampling a trajectory ending at $x$ is:
$$P_{F}(x)=\sum_{\tau:s_{T}=x}\prod_{t=1}^{|\tau|}P_{F}(s_{t}\mid s_{t-1}).\tag{26}$$
Substituting Equation ([16](https://arxiv.org/html/2605.14089#A1.E16)):
$$P_{F}(x)=\sum_{\tau:s_{T}=x}\frac{F(\tau)}{Z}.\tag{27}$$
By Proposition [5](https://arxiv.org/html/2605.14089#Thmtheorem5), $F(x)=\sum_{\tau:s_{T}=x}F(\tau)$, so:
$$P_{F}(x)=\frac{F(x)}{Z}.\tag{28}$$
Since $F$ is reward-matching, $F(x)=R(x)$, thus:
$$P_{F}(x)=\frac{R(x)}{Z}=\frac{R(x)}{\sum_{x'}R(x')}.\tag{29}$$
∎
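Theorem 6 can be illustrated by simulation. The sketch below is a hedged toy example, not the SkillFlow training procedure: the graph, the rewards, the uniform split of each state's flow among its parents, and all function names are assumptions made here. It builds one reward-matching flow by pushing terminal rewards back to the source, then samples trajectories with the induced forward policy and reports the empirical terminal frequencies.

```python
import random
from collections import Counter

# Hypothetical toy problem: parents of each non-source state and terminal rewards.
PARENTS = {"a": ["s0"], "b": ["s0"], "x": ["a"], "y": ["a", "b"]}
REWARD = {"x": 1.0, "y": 4.0}
CHILDREN_FIRST = ["x", "y", "a", "b"]  # any order that visits children before parents


def build_reward_matching_flow():
    """Push terminal rewards back to the source, splitting evenly among parents.

    The even split is one arbitrary choice of backward policy (an assumption).
    """
    state_flow = dict(REWARD)
    edge_flow = {}
    for child in CHILDREN_FIRST:
        share = state_flow[child] / len(PARENTS[child])
        for parent in PARENTS[child]:
            edge_flow[(parent, child)] = share
            state_flow[parent] = state_flow.get(parent, 0.0) + share
    return edge_flow, state_flow


def sample_terminal(edge_flow, rng):
    """Follow P_F(s'|s) from the source until a terminal is reached.

    Outgoing edge flows are used as unnormalized weights, which is
    equivalent to sampling from F(s -> s') / F(s).
    """
    state = "s0"
    while state not in REWARD:
        options = [(v, f) for (u, v), f in edge_flow.items() if u == state]
        successors = [v for v, _ in options]
        weights = [f for _, f in options]
        state = rng.choices(successors, weights=weights)[0]
    return state


def empirical_terminal_distribution(num_samples=20000, seed=0):
    """Estimate the terminal distribution induced by the forward policy."""
    edge_flow, _ = build_reward_matching_flow()
    rng = random.Random(seed)
    counts = Counter(sample_terminal(edge_flow, rng) for _ in range(num_samples))
    return {x: counts[x] / num_samples for x in REWARD}
```

With these rewards the theorem predicts terminal probabilities $R(x)/Z = 1/5$ and $R(y)/Z = 4/5$; the empirical frequencies should approach these values as the number of samples grows.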
### A.3 Application to Task Orchestration
We now instantiate the abstract flow network framework to SkillFlow’s task orchestration setting\.
###### Proposition 7 (Flow Network Instantiation for Task Orchestration).
The orchestration trajectory distribution is a flow network $(\mathcal{G},F)$ where:
- Vertices $\mathcal{V}$: Each vertex is an interaction history state $H_{t}$ (Definition 1, main text).
- Edges $\mathcal{A}$: Each directed edge $(H_{t-1},H_{t})$ corresponds to an orchestration action $a_{t}=(\alpha_{t},o_{t})$, where the action type $\alpha_{t}\in\{\text{skill},\text{act},\text{accept}\}$ is selected by the supervisor.
- Source $s_{0}$: The initial state $H_{0}=q\oplus\mathcal{S}_{\mathrm{ret}}\oplus\omega_{q}$, formed by concatenating the task $q$, the retrieved skills, and the orchestration guideline.
- Terminals $\mathcal{X}$: States where $\alpha_{t}=\text{accept}$ or $t=T_{\max}$ (maximum trajectory length).
- Forward policy: $P_{F}(H_{t}\mid H_{t-1})=\pi_{\theta}(a_{t}\mid H_{t-1})$, the learned supervisor policy.
- Edge flow: $F(H_{t-1}\to H_{t})=F(H_{t-1})\cdot P_{F}(H_{t}\mid H_{t-1})$ for edges in sampled trajectories, consistent with Definition [4](https://arxiv.org/html/2605.14089#Thmdefinition4); at the source this reads $Z\cdot P_{F}(H_{1}\mid H_{0})$ because $F(H_{0})=Z$.
- Terminal flow: $F(x)=\tilde{R}(\tau_{x})^{\beta}$ for terminal $x$ reached by trajectory $\tau_{x}$, where $\tilde{R}(\tau)=R(\tau)+\varepsilon_{\min}$ is the smoothed reward and $\beta>0$ is the temperature.
The key property is acyclicity, established by the strictly increasing state representation\.
###### Lemma 8 (DAG Acyclicity by State Growth).
Under the definition $H_{t}=H_{t-1}\oplus(r_{t},a_{t},o_{t}^{\text{exec}})$, each edge strictly increases the state size: $|H_{t}|>|H_{t-1}|$. Therefore, the orchestration graph is acyclic: no path can visit the same state twice, as that would require returning to a state with smaller or equal size.
This acyclicity is essential for applying flow matching algorithms, which require the DAG structure\.
## Appendix B Trajectory Balance and TTB Derivation
This section derives the Trajectory Balance \(TB\) condition and the Tempered Trajectory Balance \(TTB\) loss, the core objective for SkillFlow training\.
#### Convention (DAG edge = action; node = state $H_{t}$).
Throughout this appendix, all flow-theoretic quantities are defined at the *action level* of the orchestration DAG: each node $s_{t}$ corresponds to the interaction history $H_{t}$ (Definition 1, main text), and each directed edge $(s_{t-1},s_{t})$ corresponds to one orchestration action $a_{t}=(\alpha_{t},o_{t})$ that itself comprises $K_{t}$ tokens produced autoregressively by the LLM. The *per-token normalization* introduced in Definition [9](https://arxiv.org/html/2605.14089#Thmdefinition9) is a within-edge device that averages over the $K_{t}$ tokens of one action to produce a single, length-robust edge log-probability; it does *not* switch the DAG to a token-level granularity.
### B.1 Backward Policy Definition
The backward policy is the reverse\-direction distribution derived from reward\-matching flows\.
###### Definition 7 (Backward Policy).
Given a reward-matching flow $F$, the backward policy is:
$$P_{B}(s\mid s')=\frac{F(s\to s')}{F(s')},\quad\text{for }s\in\mathrm{Pa}(s').\tag{30}$$
By flow conservation (Definition [2](https://arxiv.org/html/2605.14089#Thmdefinition2)), $\sum_{s\in\mathrm{Pa}(s')}P_{B}(s\mid s')=\frac{1}{F(s')}\sum_{s}F(s\to s')=\frac{F(s')}{F(s')}=1$, so $P_{B}(\cdot\mid s')$ is a valid probability distribution over parents.
The backward policy encodes the reverse probability of transitioning from $s'$ back to each parent $s$. For learning, we will use a learned backward policy $P_{\phi}$ parameterized by $\phi$.
### B.2 Trajectory Balance Condition and Full Derivation
The Trajectory Balance \(TB\) condition is the equivalence characterizing when a flow is reward\-matching\.
###### Theorem 9 (Trajectory Balance (Malkin et al., 2022)).
A flow $F$ on a DAG is reward-matching iff for every complete trajectory $\tau=(s_{0},s_{1},\ldots,s_{T})$ with $s_{T}\in\mathcal{X}$:
$$Z\cdot\prod_{t=1}^{T}P_{F}(s_{t}\mid s_{t-1})=F(s_{T})\cdot\prod_{t=1}^{T}P_{B}(s_{t-1}\mid s_{t}).\tag{32}$$
Equivalently, in log-space:
$$\log Z+\sum_{t=1}^{T}\log P_{F}(s_{t}\mid s_{t-1})=\log F(s_{T})+\sum_{t=1}^{T}\log P_{B}(s_{t-1}\mid s_{t}).\tag{33}$$
###### Proof.
We prove both directions explicitly.
Forward direction ($F$ reward-matching $\Rightarrow$ TB holds):
Assume $F$ is reward-matching, so $F(x)=R(x)$ for all terminals $x$. Expand the LHS and RHS of Equation ([32](https://arxiv.org/html/2605.14089#A2.E32)):
$$\text{LHS}=Z\cdot\prod_{t=1}^{T}\frac{F(s_{t-1}\to s_{t})}{F(s_{t-1})},\tag{34}$$
$$\text{RHS}=R(s_{T})\cdot\prod_{t=1}^{T}\frac{F(s_{t-1}\to s_{t})}{F(s_{t})}.\tag{35}$$
Taking the ratio $\text{LHS}/\text{RHS}$:
$$\frac{\text{LHS}}{\text{RHS}}=\frac{Z}{R(s_{T})}\cdot\prod_{t=1}^{T}\frac{F(s_{t})}{F(s_{t-1})}.\tag{36}$$
The product telescopes:
$$\prod_{t=1}^{T}\frac{F(s_{t})}{F(s_{t-1})}=\frac{F(s_{1})}{F(s_{0})}\cdot\frac{F(s_{2})}{F(s_{1})}\cdots\frac{F(s_{T})}{F(s_{T-1})}=\frac{F(s_{T})}{F(s_{0})}=\frac{R(s_{T})}{Z},\tag{37}$$
where we used $F(s_{0})=Z$ (Definition [3](https://arxiv.org/html/2605.14089#Thmdefinition3)) and $F(s_{T})=R(s_{T})$ (reward-matching). Thus $\text{LHS}/\text{RHS}=\frac{Z}{R(s_{T})}\cdot\frac{R(s_{T})}{Z}=1$, so LHS $=$ RHS.
Reverse direction (TB holds $\Rightarrow$ $F$ reward-matching):
Assume TB holds for every trajectory. Sum Equation ([32](https://arxiv.org/html/2605.14089#A2.E32)) over all trajectories ending at a fixed terminal $x$:
$$Z\cdot\sum_{\tau:s_{T}=x}\prod_{t=1}^{|\tau|}P_{F}(s_{t}\mid s_{t-1})=\sum_{\tau:s_{T}=x}F(x)\cdot\prod_{t=1}^{|\tau|}P_{B}(s_{t-1}\mid s_{t}).\tag{38}$$
The LHS is $Z\cdot P_{F}(x)$ by definition of the forward policy probability. For the RHS, by the product structure and properties of the backward policy (which forms a distribution over reverse trajectories), the sum $\sum_{\tau:s_{T}=x}\prod_{t}P_{B}(s_{t-1}\mid s_{t})=1$ (the reverse-direction probability of reaching $x$ from all trajectories sums to 1). Thus:
$$Z\cdot P_{F}(x)=F(x).\tag{39}$$
By Theorem [6](https://arxiv.org/html/2605.14089#Thmtheorem6), this implies $P_{F}(x)=F(x)/Z$, which combined with $Z=\sum_{x'}F(x')$ gives $F(x)=R(x)$. ∎
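The trajectory balance identity of Eq. (32) can be checked trajectory by trajectory on a small flow. The sketch below is illustrative: the toy DAG, the flow values, and the function names are assumptions introduced here, and the terminal flows play the role of rewards. It returns both sides of Eq. (32) for a given trajectory and a candidate value of the partition function.

```python
from fractions import Fraction
from math import prod

# Hypothetical toy flow (not from the paper); terminal flows act as rewards.
EDGE_FLOW = {
    ("s0", "a"): Fraction(3),
    ("s0", "b"): Fraction(2),
    ("a", "x"): Fraction(1),
    ("a", "y"): Fraction(2),
    ("b", "y"): Fraction(2),
}
TERMINALS = {"x", "y"}
TRAJECTORIES = [("s0", "a", "x"), ("s0", "a", "y"), ("s0", "b", "y")]


def state_flow(s):
    """F(s): incoming flow for terminals, outgoing flow otherwise."""
    if s in TERMINALS:
        return sum(f for (_, v), f in EDGE_FLOW.items() if v == s)
    return sum(f for (u, _), f in EDGE_FLOW.items() if u == s)


def tb_sides(tau, z):
    """Return (LHS, RHS) of Eq. 32 for trajectory tau and candidate partition value z."""
    edges = list(zip(tau, tau[1:]))
    p_forward = prod(EDGE_FLOW[e] / state_flow(e[0]) for e in edges)
    p_backward = prod(EDGE_FLOW[e] / state_flow(e[1]) for e in edges)
    return z * p_forward, state_flow(tau[-1]) * p_backward


def tb_holds(z):
    """True if both sides of Eq. 32 agree on every complete trajectory."""
    return all(lhs == rhs for lhs, rhs in (tb_sides(tau, z) for tau in TRAJECTORIES))
```

With the correct partition value $Z=F(s_0)=5$ the two sides agree on all three trajectories, while any other value of `z` breaks the identity, which is why a learned $\log Z$ can be trained by regression on the residual.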
### B.3 TTB Loss Derivation with Temperature and Length Normalization
We now derive the Tempered Trajectory Balance loss as the squared, length-normalized log-space TB residual on the action-DAG, with two technical devices: (i) reasoning $r_{t}$ enters the forward conditional but is treated as fixed context (Lemma [13](https://arxiv.org/html/2605.14089#Thmtheorem13), B.4); (ii) each action edge’s log-probability is averaged across its $K_{t}$ tokens (per-token tempering, Lemma [15](https://arxiv.org/html/2605.14089#Thmtheorem15), B.4). Both devices are formally justified in B.4 and do *not* alter the action-level DAG structure.
###### Definition 8 (Per-Token-Tempered Edge Log-Probabilities).
Let action $a_{t}$ comprise $K_{t}$ tokens $\mathrm{tok}^{(t)}_{1},\ldots,\mathrm{tok}^{(t)}_{K_{t}}$ produced autoregressively. The per-token-tempered forward and backward edge log-probabilities are:
$$\widetilde{\log\pi}_{\theta}(a_{t}\mid r_{t},H_{t-1})\coloneqq\frac{1}{K_{t}}\sum_{j=1}^{K_{t}}\log\pi_{\theta}\bigl(\mathrm{tok}^{(t)}_{j}\,\big|\,\mathrm{tok}^{(t)}_{<j},\,r_{t},\,H_{t-1}\bigr),\tag{40}$$
$$\widetilde{\log P}_{\phi}(a_{t}\mid H_{t-1}\oplus o_{t}^{\text{exec}})\coloneqq\frac{1}{K_{t}}\sum_{j=1}^{K_{t}}\log P_{\phi}\bigl(\mathrm{tok}^{(t)}_{j}\,\big|\,\mathrm{tok}^{(t)}_{<j},\,H_{t-1}\oplus o_{t}^{\text{exec}}\bigr).\tag{41}$$
Each is the logarithm of the geometric mean of the token probabilities along the $K_{t}$ tokens of one action edge; the edge thus carries a single, length-robust log-probability.
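The per-token averaging of Eqs. (40) and (41) amounts to a mean over token log-probabilities. The sketch below is a minimal illustration with hypothetical helper names and made-up token probabilities; it does not call any LLM and assumes the token log-probabilities have already been obtained.

```python
import math


def tempered_edge_logprob(token_logprobs):
    """Average the token log-probabilities of one action edge (Eqs. 40-41)."""
    if not token_logprobs:
        raise ValueError("an action must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def geometric_mean(probs):
    """Geometric mean of a non-empty list of probabilities."""
    return math.prod(probs) ** (1.0 / len(probs))


# Two hypothetical actions whose tokens all have probability 0.5.
short_action = [math.log(0.5)] * 2
long_action = [math.log(0.5)] * 20

# The raw (summed) log-probability scales with length; the tempered one does not.
raw_short, raw_long = sum(short_action), sum(long_action)
tempered_short = tempered_edge_logprob(short_action)
tempered_long = tempered_edge_logprob(long_action)
```

Exponentiating the tempered value recovers the geometric mean of the token probabilities, so a 20-token action and a 2-token action with the same per-token confidence receive the same edge log-probability.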
###### Definition 9 (TTB Residual).
Given the per-token-tempered edge log-probabilities of Definition [8](https://arxiv.org/html/2605.14089#Thmdefinition8), the trajectory balance residual is:
$$\Delta(\tau)\;\coloneqq\;\log Z_{\theta}(q)+\sum_{t=1}^{T}\widetilde{\log\pi}_{\theta}(a_{t}\mid r_{t},H_{t-1})-\beta\log\tilde{R}(\tau)-\sum_{t=1}^{T}\widetilde{\log P}_{\phi}(a_{t}\mid H_{t-1}\oplus o_{t}^{\text{exec}}),\tag{42}$$
where:
- $Z_{\theta}(q)$ is a task-conditioned partition function (learned parameter),
- $\beta>0$ is the temperature parameter controlling diversity,
- $\tilde{R}(\tau)=R(\tau)+\varepsilon_{\min}$ is the smoothed reward,
- $r_{t}$ is the reasoning emitted at step $t$, treated as fixed conditioning context (Lemma [13](https://arxiv.org/html/2605.14089#Thmtheorem13)),
- $T=|\tau|$ is the trajectory length, i.e., the number of action edges.
At the optimum $\Delta(\tau)=0$, the induced *tempered* forward distribution satisfies $\tilde{\pi}_{\theta}(\tau)\propto\tilde{R}(\tau)^{\beta}$ (Lemma [15](https://arxiv.org/html/2605.14089#Thmtheorem15)).
The TB condition in log-space (Equation ([33](https://arxiv.org/html/2605.14089#A2.E33))) requires $\Delta(\tau)=0$. To enforce this across all trajectories, we use a regression loss:
###### Definition 10 (Tempered Trajectory Balance Loss).
The TTB loss is:
$$\mathcal{L}_{\mathrm{TTB}}(\tau)=\left(\frac{\Delta(\tau)}{T}\right)^{2},\tag{43}$$
with $\Delta(\tau)$ as in Definition [9](https://arxiv.org/html/2605.14089#Thmdefinition9) and division by $T$ comparing trajectories of different lengths on equal footing.
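Definitions 8 to 10 combine into a short computation. The following is a sketch under stated assumptions rather than the authors' implementation: the function names, argument layout, default `eps_min`, and the example numbers are all hypothetical, and token log-probabilities are passed in as plain lists (one inner list per action step) instead of being produced by a model.

```python
import math


def tempered(token_logprobs):
    """Per-token average of one action's token log-probabilities (Definition 8)."""
    return sum(token_logprobs) / len(token_logprobs)


def ttb_residual(log_z, fwd_token_logprobs, bwd_token_logprobs, reward,
                 beta=1.0, eps_min=1e-3):
    """Delta(tau) from Eq. 42; each list holds one list of token log-probs per step."""
    if len(fwd_token_logprobs) != len(bwd_token_logprobs):
        raise ValueError("forward and backward must cover the same steps")
    forward = sum(tempered(step) for step in fwd_token_logprobs)
    backward = sum(tempered(step) for step in bwd_token_logprobs)
    return log_z + forward - beta * math.log(reward + eps_min) - backward


def ttb_loss(log_z, fwd_token_logprobs, bwd_token_logprobs, reward,
             beta=1.0, eps_min=1e-3):
    """(Delta / T)^2 from Eq. 43, with T the number of action edges."""
    num_steps = len(fwd_token_logprobs)
    delta = ttb_residual(log_z, fwd_token_logprobs, bwd_token_logprobs,
                         reward, beta, eps_min)
    return (delta / num_steps) ** 2


# Hypothetical two-step trajectory: every forward token has probability 0.5,
# every backward token has probability 1.0, and the smoothed reward is 1.0.
fwd = [[math.log(0.5), math.log(0.5)], [math.log(0.5)]]
bwd = [[0.0, 0.0], [0.0]]
balanced_log_z = 2 * math.log(2.0)  # value of log Z that makes the residual vanish
```

In this example the forward sum is $2\log 0.5$, the backward sum and $\beta\log\tilde{R}$ are zero, so the residual vanishes exactly when $\log Z = 2\log 2$; shifting $\log Z$ by one gives $\Delta=1$ and a loss of $(1/2)^2$.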
###### Lemma 10 (Length Normalization Property).
Each tempered edge log-probability is $O(1)$ by per-token averaging (Definition [8](https://arxiv.org/html/2605.14089#Thmdefinition8)), so $\bigl|\sum_{t=1}^{T}\widetilde{\log\pi}_{\theta}(a_{t}\mid r_{t},H_{t-1})\bigr|=O(T)$ and the unnormalized $|\Delta(\tau)|\propto T$ for typical trajectories. Dividing by $T$ before squaring penalizes the *average* per-step balance rather than the cumulative balance, keeping $|\Delta(\tau)/T|=O(1)$ and stabilizing gradients across length variations.
###### Proposition 11 (TTB Self-Annealing Property).
Treating $T=|\tau|$ as fixed for a given $\tau$, the gradient of the TTB loss satisfies:
$$\nabla_{\theta}\mathcal{L}_{\mathrm{TTB}}(\tau)=2\cdot\frac{\Delta(\tau)}{T}\cdot\nabla_{\theta}\!\left(\frac{\Delta(\tau)}{T}\right)=\underbrace{\frac{2\,\Delta(\tau)}{T^{2}}}_{\text{length-scaled annealing factor}}\cdot\,\nabla_{\theta}\Delta(\tau).\tag{44}$$
As $\Delta(\tau)\to 0$, the prefactor shrinks and $\|\nabla_{\theta}\mathcal{L}_{\mathrm{TTB}}\|\to 0$ automatically. Because $T$ varies across trajectories, the $1/T^{2}$ factor is a per-trajectory length normalizer and cannot be absorbed into a global constant.
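The prefactor in Eq. (44) can be checked against a finite-difference gradient. The sketch below makes a simplifying assumption: it differentiates only with respect to the scalar $\log Z_{\theta}(q)$, for which $\partial\Delta/\partial\log Z = 1$, and lumps every other term of $\Delta(\tau)$ into a constant called `rest`. All names are hypothetical.

```python
def ttb_loss(log_z, rest, num_steps):
    """(Delta/T)^2 with Delta = log_z + rest; `rest` collects all terms not involving log_z."""
    return ((log_z + rest) / num_steps) ** 2


def analytic_grad_log_z(log_z, rest, num_steps):
    """Eq. 44 with d(Delta)/d(log_z) = 1: gradient = 2 * Delta / T^2."""
    return 2.0 * (log_z + rest) / num_steps ** 2


def numeric_grad_log_z(log_z, rest, num_steps, h=1e-6):
    """Central finite difference of the loss with respect to log_z."""
    upper = ttb_loss(log_z + h, rest, num_steps)
    lower = ttb_loss(log_z - h, rest, num_steps)
    return (upper - lower) / (2.0 * h)


# Gradient magnitude for a fixed length T = 4 as the residual shrinks toward zero.
annealing_curve = [abs(analytic_grad_log_z(delta, 0.0, 4)) for delta in (4.0, 2.0, 1.0, 0.5, 0.0)]
```

The values in `annealing_curve` decrease monotonically to zero, which is the self-annealing behavior stated in the proposition: updates fade out on trajectories that already satisfy the balance condition.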
The temperature $\beta$ is a critical hyperparameter controlling the diversity-quality tradeoff:
###### Lemma 12 (Temperature Effect on Policy).
At convergence, $\pi^{*}(\tau)\propto\tilde{R}(\tau)^{\beta}$.
- As $\beta\to 0^{+}$: the policy approaches the uniform distribution over trajectories (maximum diversity, ignoring rewards).
- As $\beta\to\infty$: the policy concentrates on the single highest-reward trajectory (maximum quality, no diversity).
- Intermediate $\beta$ trades off diversity and quality.
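The three regimes of Lemma 12 can be seen directly by normalizing $\tilde{R}(\tau)^{\beta}$ over a small set of trajectories. The rewards, the default `eps_min`, and the function names below are assumptions chosen for illustration only.

```python
import math


def tempered_distribution(rewards, beta, eps_min=1e-3):
    """Normalize (R + eps_min)^beta over a finite list of trajectory rewards."""
    log_weights = [beta * math.log(r + eps_min) for r in rewards]
    shift = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(w - shift) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical rewards of three candidate orchestration trajectories.
REWARDS = [0.1, 0.5, 1.0]
near_uniform = tempered_distribution(REWARDS, beta=1e-6)
proportional = tempered_distribution(REWARDS, beta=1.0)
near_greedy = tempered_distribution(REWARDS, beta=50.0)
entropies = [entropy(tempered_distribution(REWARDS, b)) for b in (0.1, 1.0, 5.0, 50.0)]
```

Small $\beta$ gives nearly equal probabilities, $\beta=1$ gives probabilities proportional to the smoothed rewards, and large $\beta$ places almost all mass on the best trajectory; the entropy values decrease as $\beta$ grows.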
### B.4 Conditional TB, Hindsight Backward, and Per-Token Tempering
The TTB residual in Definition [9](https://arxiv.org/html/2605.14089#Thmdefinition9) departs from textbook TB in three ways: (i) the forward policy conditions on reasoning $r_{t}$; (ii) the backward policy conditions on the hindsight state $H_{t-1}\oplus o_{t}^{\text{exec}}$; (iii) each edge log-probability is per-token averaged. We establish three lemmas showing that none of these alters the action-DAG structure or the reward-matching guarantee.
#### \(i\) Reasoning as fixed context\.
###### Lemma 13 (TB on the Action Sub-DAG with Reasoning as Fixed Context).
The hierarchical step policy (Eq. [5](https://arxiv.org/html/2605.14089#S4.E5), main text) factorizes as
$$\pi_{\theta}(r_{t},a_{t}\mid H_{t-1})\;=\;\pi_{\theta}(r_{t}\mid H_{t-1})\cdot\pi_{\theta}(a_{t}\mid r_{t},H_{t-1}).\tag{45}$$
Conditioning on the realized reasoning sequence $\{r_{t}\}_{t=1}^{T}$ and execution observations $\{o_{t}^{\text{exec}}\}_{t=1}^{T}$, only the action choices $\{a_{t}\}$ remain random. The conditional trajectory probability is
$$P_{\theta}\!\bigl(\tau\,\big|\,\{r_{t}\},\{o_{t}^{\text{exec}}\}\bigr)\;=\;\prod_{t=1}^{T}\pi_{\theta}(a_{t}\mid r_{t},H_{t-1}),\tag{46}$$
which is precisely the forward policy on the *action sub-DAG* $\mathcal{G}^{\mathrm{act}}\subset\mathcal{G}$. The TB condition (Theorem [9](https://arxiv.org/html/2605.14089#Thmtheorem9)) is a per-trajectory equality and therefore applies pointwise to each realized $(\{r_{t}\},\{o_{t}^{\text{exec}}\})$ context, yielding the log-space form
$$\log Z_{\theta}(q)+\sum_{t=1}^{T}\log\pi_{\theta}(a_{t}\mid r_{t},H_{t-1})\;=\;\log F(\tau)+\sum_{t=1}^{T}\log P_{B}(s_{t-1}\mid s_{t}),\tag{47}$$
which matches Eq. [8](https://arxiv.org/html/2605.14089#S4.E8) with $r_{t}$ in the conditional. Reward-matching of the resulting flow on $\mathcal{G}^{\mathrm{act}}$ is therefore equivalent to TB holding on every trajectory at every realized reasoning context.
###### Proof.
Direct from the hierarchical factorization and Theorem [9](https://arxiv.org/html/2605.14089#Thmtheorem9). Reasoning $r_{t}$ enters $\pi_{\theta}(a_{t}\mid r_{t},H_{t-1})$ as a determining context; once realized, it is treated as part of the conditioning state for the action edge, exactly as the source state $s_{t-1}=H_{t-1}$ is. The sub-DAG $\mathcal{G}^{\mathrm{act}}$ inherits its DAG structure from $\mathcal{G}$ (Theorem [26](https://arxiv.org/html/2605.14089#Thmtheorem26)). ∎
#### \(ii\) Hindsight\-asymmetric backward policy\.
###### Lemma 14 (Hindsight-Conditioned Backward Equals Standard Backward on the Augmented State).
Define the hindsight-enriched pre-action state $\tilde{s}_{t}\coloneqq H_{t-1}\oplus o_{t}^{\text{exec}}$. By Eq. [4](https://arxiv.org/html/2605.14089#S4.E4), the post-action state $s_{t}=H_{t}$ deterministically encodes the joint $(H_{t-1},a_{t},o_{t}^{\text{exec}})$ given $s_{t-1}=H_{t-1}$. Therefore conditioning on $s_{t}$ is equivalent to conditioning on $(\tilde{s}_{t},a_{t})$, and
$$P_{B}(s_{t-1}\mid s_{t})\;=\;P_{\phi}(a_{t}\mid\tilde{s}_{t})\;=\;P_{\phi}(a_{t}\mid H_{t-1}\oplus o_{t}^{\text{exec}}),\tag{48}$$
in the sense that both express the same conditional reverse mass. Substituting this equality into Eq. ([47](https://arxiv.org/html/2605.14089#A2.E47)) yields the SkillFlow TB condition
$$\log Z_{\theta}(q)+\sum_{t=1}^{T}\log\pi_{\theta}(a_{t}\mid r_{t},H_{t-1})\;=\;\beta\log\tilde{R}(\tau)+\sum_{t=1}^{T}\log P_{\phi}(a_{t}\mid H_{t-1}\oplus o_{t}^{\text{exec}}),\tag{49}$$
where $\log F(\tau)=\beta\log\tilde{R}(\tau)$ at the reward-matching terminal. The information asymmetry $\tilde{s}_{t}\supset s_{t-1}$ (post-execution observation $o_{t}^{\text{exec}}$ added) is precisely what gives the step-importance ratio $I(t)=\pi_{\theta}(a_{t}\mid r_{t},H_{t-1})/P_{\phi}(a_{t}\mid\tilde{s}_{t})$ a non-trivial credit signal: high $I(t)$ marks decisions whose quality became clear only after execution.
###### Proof.
By the strict-history-growth lemma (Lemma [8](https://arxiv.org/html/2605.14089#Thmtheorem8)), $H_{t}$ uniquely determines its parent $H_{t-1}$ and the appended triple $(r_{t},a_{t},o_{t}^{\text{exec}})$. Hence $s_{t}\mapsto(s_{t-1},r_{t},a_{t},o_{t}^{\text{exec}})$ is a bijection. Marginalizing over $r_{t}$ (which is fixed in the conditional TB of Lemma [13](https://arxiv.org/html/2605.14089#Thmtheorem13)) leaves the joint $(s_{t-1},a_{t},o_{t}^{\text{exec}})$, equivalent to $(\tilde{s}_{t},a_{t})$. The reverse conditional therefore equals $P_{\phi}(a_{t}\mid\tilde{s}_{t})$ by definition of $P_{\phi}$ as a learned discriminative reverse model on the hindsight state. ∎
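The step-importance ratio $I(t)$ from Lemma 14 is a per-step quotient of forward and backward probabilities, which is most safely computed in log-space. The sketch below uses made-up probabilities and hypothetical function names; it only illustrates the arithmetic and makes no claim about how SkillFlow consumes the ranking.

```python
import math


def step_importance(forward_logprobs, backward_logprobs):
    """I(t) = pi_theta(a_t | r_t, H_{t-1}) / P_phi(a_t | s~_t), computed via log-space."""
    if len(forward_logprobs) != len(backward_logprobs):
        raise ValueError("need one forward and one backward log-prob per step")
    return [math.exp(f - b) for f, b in zip(forward_logprobs, backward_logprobs)]


def rank_steps(forward_logprobs, backward_logprobs):
    """Step indices (1-based) sorted by decreasing I(t)."""
    ratios = step_importance(forward_logprobs, backward_logprobs)
    return sorted(range(1, len(ratios) + 1), key=lambda t: ratios[t - 1], reverse=True)


# Hypothetical three-step trajectory: per-step forward and backward probabilities.
forward = [math.log(p) for p in (0.5, 0.2, 0.9)]
backward = [math.log(p) for p in (0.5, 0.8, 0.3)]
ratios = step_importance(forward, backward)
ranking = rank_steps(forward, backward)
```

Because both log-probabilities are already evaluated when the TTB residual is computed, the ratio requires no extra model calls, in line with the zero-additional-cost claim for credit assignment.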
#### \(iii\) Per\-token tempering preserves convergence semantics\.
###### Lemma 15 (Per-Token Tempering as Edge-Level Geometric-Mean Rescaling).
Let $\widetilde{\log\pi}_{\theta}(a_{t}\mid r_{t},H_{t-1})$ and $\widetilde{\log P}_{\phi}(a_{t}\mid H_{t-1}\oplus o_{t}^{\text{exec}})$ be as in Definition [8](https://arxiv.org/html/2605.14089#Thmdefinition8), and define the tempered edge probabilities
$$\tilde{\pi}_{\theta}(a_{t}\mid r_{t},H_{t-1})\coloneqq\exp\!\bigl(\widetilde{\log\pi}_{\theta}(a_{t}\mid r_{t},H_{t-1})\bigr),\quad\tilde{P}_{\phi}(a_{t}\mid\tilde{s}_{t})\coloneqq\exp\!\bigl(\widetilde{\log P}_{\phi}(a_{t}\mid\tilde{s}_{t})\bigr).\tag{50}$$
Then:
1. (i) $\tilde{\pi}_{\theta}$ is the geometric mean of token probabilities: $\tilde{\pi}_{\theta}(a_{t})=\prod_{j}\pi_{\theta}(\mathrm{tok}_{j})^{1/K_{t}}$, a length-normalized policy on the action edge.
2. (ii) Substituting $\tilde{\pi}_{\theta},\tilde{P}_{\phi}$ into the TB condition (Eq. ([49](https://arxiv.org/html/2605.14089#A2.E49))) and setting $\Delta(\tau)=0$ defines a tempered flow $\tilde{F}$ with $\tilde{F}(x)=\tilde{R}(\tau_{x})^{\beta}$. By Theorem [6](https://arxiv.org/html/2605.14089#Thmtheorem6) applied to $\tilde{F}$, the induced sampling distribution satisfies $\tilde{\pi}_{\theta}(\tau\mid q)\propto\tilde{R}(\tau)^{\beta}$.
3. (iii) Geometric-mean tempering is a strictly monotone positive rescaling of edge probabilities; trajectory probabilities are products of edge probabilities, so the trajectory-level reward-proportional ranking is preserved: $\tilde{\pi}_{\theta}(\tau_{1})>\tilde{\pi}_{\theta}(\tau_{2})\Leftrightarrow\tilde{R}(\tau_{1})^{\beta}>\tilde{R}(\tau_{2})^{\beta}\Leftrightarrow R(\tau_{1})>R(\tau_{2})$ (Proposition [28](https://arxiv.org/html/2605.14089#Thmtheorem28)).
###### Proof.
(i) Direct from $\exp\!\bigl(\tfrac{1}{K_{t}}\sum_{j}\log\pi_{\theta}(\mathrm{tok}_{j})\bigr)=\prod_{j}\pi_{\theta}(\mathrm{tok}_{j})^{1/K_{t}}$. (ii) The TTB residual is the same algebraic identity as TB with $P_{F},P_{B}$ replaced by $\tilde{\pi}_{\theta},\tilde{P}_{\phi}$; the proof of Theorem [9](https://arxiv.org/html/2605.14089#Thmtheorem9) carries through verbatim with these tempered policies, defining a reward-matching tempered flow. (iii) Monotone rescaling preserves $\arg\max$ and ordering. ∎
Together, Lemmas [13](https://arxiv.org/html/2605.14089#Thmtheorem13), [14](https://arxiv.org/html/2605.14089#Thmtheorem14), and [15](https://arxiv.org/html/2605.14089#Thmtheorem15) provide the formal justification for the SkillFlow TTB residual (Definition [9](https://arxiv.org/html/2605.14089#Thmdefinition9), equivalently main-text Eq. [9](https://arxiv.org/html/2605.14089#S4.E9)): it is the standard log-space TB residual on the action-DAG, evaluated with an $r_{t}$-conditioned forward policy, a hindsight-conditioned backward policy, and per-token-tempered edge log-probabilities.
## Appendix C Entropy-Regularized RL Equivalence
This section proves the fundamental equivalence between GFlowNet training and entropy\-regularized reinforcement learning\.
###### Theorem 16 (GFlowNet-RL Equivalence).
The optimal policy induced by GFlowNet training with temperature $\beta$ is identical to the optimal policy of entropy-regularized maximum expected reward, with temperature parameter $T=1/\beta$:
$$\pi^{*}_{\mathrm{GFN}}(\tau\mid q)=\frac{\tilde{R}(\tau)^{\beta}}{\sum_{\tau'}\tilde{R}(\tau')^{\beta}}=\frac{\exp\bigl[\beta\log\tilde{R}(\tau)\bigr]}{Z_{\beta}(q)}.\tag{51}$$
This matches the optimal policy from the entropy-regularized RL objective:
$$\pi^{*}_{\mathrm{RL}}(\tau\mid q)=\arg\max_{\pi}\;\mathbb{E}_{\tau\sim\pi}\left[\log R(\tau)\right]+\frac{1}{\beta}\mathcal{H}[\pi],\tag{52}$$
where $\mathcal{H}[\pi]=-\mathbb{E}_{\tau}[\log\pi(\tau)]$ is the policy entropy.
###### Proof\.
The entropy-regularized RL objective (with temperature $T=1/\beta$) is:

$$J(\pi)=\mathbb{E}_{\tau\sim\pi}\left[\log R(\tau)\right]+T\cdot\mathcal{H}[\pi]=\mathbb{E}_{\tau\sim\pi}\left[\log R(\tau)-T\log\pi(\tau)\right],\tag{53}$$

to be maximized subject to the normalization constraint $\sum_{\tau}\pi(\tau)=1$, which we enforce with a Lagrange multiplier $\nu$.

Taking the functional derivative of the Lagrangian $J(\pi)-\nu\bigl(\sum_{\tau}\pi(\tau)-1\bigr)$ with respect to $\pi(\tau)$ and setting it to zero:

$$\log R(\tau)-T\log\pi^{*}(\tau)-T-\nu=0.\tag{54}$$

Solving for $\pi^{*}(\tau)$:

$$\log\pi^{*}(\tau)=\frac{1}{T}\log R(\tau)-1-\frac{\nu}{T}=\frac{1}{T}\log R(\tau)+\text{const}.\tag{55}$$

Exponentiating:

$$\pi^{*}(\tau)\propto\exp\left(\frac{1}{T}\log R(\tau)\right)=R(\tau)^{1/T}=R(\tau)^{\beta}.\tag{56}$$

Normalizing (which fixes the constant $\nu$):

$$\pi^{*}(\tau)=\frac{R(\tau)^{\beta}}{\sum_{\tau'}R(\tau')^{\beta}}.\tag{57}$$

Because $J$ is strictly concave in $\pi$ (a linear term plus $T$ times the strictly concave entropy), this stationary point is the unique maximizer. With the smoothed reward $\tilde{R}(\tau)=R(\tau)+\varepsilon_{\min}$ in place of $R(\tau)$, this becomes:

$$\pi^{*}(\tau)=\frac{\tilde{R}(\tau)^{\beta}}{\sum_{\tau'}\tilde{R}(\tau')^{\beta}}=\frac{\exp[\beta\log\tilde{R}(\tau)]}{Z_{\beta}(q)},\tag{58}$$

which exactly matches the GFlowNet sampling distribution from Theorem [6](https://arxiv.org/html/2605.14089#Thmtheorem6). ∎
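As a numerical sanity check of Theorem 16, the following sketch evaluates the objective of Eq. (52) on a small finite trajectory set and compares the tempered distribution of Eq. (57) against randomly drawn alternatives. The reward values, the choice `beta = 2.0`, and all function names are illustrative assumptions, not part of the SkillFlow code.

```python
import math
import random

def entropy_regularized_objective(pi, rewards, beta):
    """J(pi) = E_pi[log R] + (1/beta) * H[pi] over a finite trajectory set."""
    expected_log_r = sum(p * math.log(r) for p, r in zip(pi, rewards))
    entropy = -sum(p * math.log(p) for p in pi if p > 0.0)
    return expected_log_r + entropy / beta

def tempered_distribution(rewards, beta):
    """pi*(tau) = R(tau)^beta / sum_tau' R(tau')^beta."""
    weights = [r ** beta for r in rewards]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical smoothed rewards (all strictly positive) and temperature.
rewards = [0.05, 0.4, 1.0, 2.5]
beta = 2.0

pi_star = tempered_distribution(rewards, beta)
j_star = entropy_regularized_objective(pi_star, rewards, beta)

# Substituting log pi* = beta * log R - log Z_beta into J gives
# the closed form J(pi*) = (1/beta) * log Z_beta.
closed_form = math.log(sum(r ** beta for r in rewards)) / beta

# No randomly drawn distribution should score higher than pi*.
rng = random.Random(0)
best_random = -math.inf
for _ in range(2000):
    raw = [rng.random() + 1e-9 for _ in rewards]
    total = sum(raw)
    candidate = [x / total for x in raw]
    best_random = max(
        best_random, entropy_regularized_objective(candidate, rewards, beta)
    )

print(j_star, closed_form, best_random)
```

The random search is only a spot check; the strict concavity of the objective is what guarantees that the tempered distribution is the global maximizer.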
## Appendix D Detailed Balance and Flow Metrics
This section establishes the Detailed Balance condition and derives the flow-based credit assignment metrics.
### D.1 Detailed Balance Theorem
###### Theorem 17 (Detailed Balance).
For a reward-matching flow $F$ satisfying the Trajectory Balance condition (Theorem [9](https://arxiv.org/html/2605.14089#Thmtheorem9)), the Detailed Balance (DB) condition holds at every edge:

$$F(s)\cdot P_{F}(s'\mid s)=F(s')\cdot P_{B}(s\mid s'),\tag{59}$$

where $P_{F}(s'\mid s)=F(s\to s')/F(s)$ and $P_{B}(s\mid s')=F(s\to s')/F(s')$.
###### Proof\.
By definition of $P_{F}$ and $P_{B}$ from Definition [4](https://arxiv.org/html/2605.14089#Thmdefinition4):

$$\begin{align}
\text{LHS}&=F(s)\cdot\frac{F(s\to s')}{F(s)}=F(s\to s'),\tag{60}\\
\text{RHS}&=F(s')\cdot\frac{F(s\to s')}{F(s')}=F(s\to s').\tag{61}
\end{align}$$

Both sides equal the edge flow $F(s\to s')$. ∎
Detailed Balance is a pointwise (per-edge) condition, stronger than Trajectory Balance (which is trajectory-wise). For a reward-matching flow, DB emerges as a consequence. In SkillFlow's tree-structured DAG (each $H_{t}$ has a unique parent by strict history growth), the path from the source to any state is unique, so TB at convergence uniquely determines the edge flows and DB follows. The next lemma makes this uniqueness explicit.
###### Lemma 18 (Tree-DAG Specialization: Uniqueness of the Reward-Matching Flow and Per-Edge DB).
On a tree-structured DAG (every non-source node has a unique parent), as is the case for SkillFlow's orchestration graph by Lemma [8](https://arxiv.org/html/2605.14089#Thmtheorem8), a reward-matching flow $F$ with terminal values $\{F(x)=R(x)\}_{x\in\mathcal{X}}$ is uniquely determined: for every non-terminal state $s$,

$$F(s)\;=\;\sum_{x\in\mathrm{Desc}(s)\,\cap\,\mathcal{X}}R(x),\tag{62}$$

where $\mathrm{Desc}(s)$ is the set of descendants of $s$. Consequently, TB convergence (Theorem [9](https://arxiv.org/html/2605.14089#Thmtheorem9)) on a tree-DAG simultaneously enforces (i) reward-matching at terminals, (ii) flow uniqueness at every state, and (iii) per-edge Detailed Balance (Theorem [17](https://arxiv.org/html/2605.14089#Thmtheorem17)) at every edge.
###### Proof\.
We prove Eq. ([62](https://arxiv.org/html/2605.14089#A4.E62)) by reverse induction on the depth of $s$ (longest path from the source), under the convention, implicit in the base case, that a state counts among its own descendants.
*Base case* (terminal $s=x\in\mathcal{X}$): $F(x)=R(x)$ by reward-matching, and $\mathrm{Desc}(x)\cap\mathcal{X}=\{x\}$, so Eq. ([62](https://arxiv.org/html/2605.14089#A4.E62)) is the trivial identity $R(x)=R(x)$.
*Inductive step*: Suppose Eq. ([62](https://arxiv.org/html/2605.14089#A4.E62)) holds for all states deeper than $s$. By flow conservation (Definition [2](https://arxiv.org/html/2605.14089#Thmdefinition2)),

$$F(s)=\sum_{s'\in\mathrm{Ch}(s)}F(s\to s')=\sum_{s'\in\mathrm{Ch}(s)}F(s'),\tag{63}$$

where the second equality uses the tree structure: each child $s'$ has $s$ as its unique parent, so all flow into $s'$ comes from the single edge $s\to s'$, i.e., $F(s')=F(s\to s')$. By the inductive hypothesis,

$$F(s)=\sum_{s'\in\mathrm{Ch}(s)}\;\sum_{x\in\mathrm{Desc}(s')\,\cap\,\mathcal{X}}R(x)=\sum_{x\in\mathrm{Desc}(s)\,\cap\,\mathcal{X}}R(x),\tag{64}$$

where the last equality uses the disjoint partition $\mathrm{Desc}(s)\setminus\{s\}=\bigsqcup_{s'\in\mathrm{Ch}(s)}\bigl(\{s'\}\cup\mathrm{Desc}(s')\bigr)$, valid because subtrees of distinct children of $s$ are disjoint on a tree.
For (iii), Theorem [17](https://arxiv.org/html/2605.14089#Thmtheorem17) (proved for general DAGs) yields DB at every edge once $F$ is reward-matching. ∎
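To make Lemma 18 concrete, the sketch below computes the state flows of Eq. (62) on a toy tree and checks Detailed Balance (Eq. (59)) on every edge. The tree, the node names, and the reward values are hypothetical; on a tree the backward policy is trivially $P_{B}(s\mid s')=1$ because each child has a single parent.

```python
def terminal_descendants(tree, node):
    """Terminal nodes in the subtree rooted at `node` (the node itself if it is a leaf)."""
    children = tree.get(node, [])
    if not children:
        return [node]
    leaves = []
    for child in children:
        leaves.extend(terminal_descendants(tree, child))
    return leaves

def state_flow(tree, rewards, node):
    """Eq. (62): F(s) is the sum of R(x) over the terminal descendants x of s."""
    return sum(rewards[x] for x in terminal_descendants(tree, node))

# Hypothetical toy tree: "s0" is the source, "x1".."x3" are terminals.
tree = {"s0": ["a", "b"], "a": ["x1", "x2"], "b": ["x3"]}
rewards = {"x1": 1.0, "x2": 3.0, "x3": 4.0}

nodes = ["s0", "a", "b", "x1", "x2", "x3"]
flows = {n: state_flow(tree, rewards, n) for n in nodes}

def forward_prob(parent, child):
    """P_F(s'|s) = F(s -> s') / F(s), with F(s -> s') = F(s') on a tree."""
    return flows[child] / flows[parent]

# Detailed Balance gap |F(s) P_F(s'|s) - F(s') P_B(s|s')| on every edge.
db_gaps = []
for parent, children in tree.items():
    for child in children:
        lhs = flows[parent] * forward_prob(parent, child)
        rhs = flows[child] * 1.0  # P_B(s|s') = 1: unique parent
        db_gaps.append(abs(lhs - rhs))

# Probability of reaching terminal x2 by following P_F from the source.
p_x2 = forward_prob("s0", "a") * forward_prob("a", "x2")
print(flows, max(db_gaps), p_x2)
```

Following $P_{F}$ from the source reaches each terminal with probability $R(x)/F(s_{0})$, which is the reward-proportional sampling property that the flow is meant to encode.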
### D.2 Flow Ratio and Step Importance
###### Corollary 19 (Flow Ratio).
From Detailed Balance (Theorem [17](https://arxiv.org/html/2605.14089#Thmtheorem17)), $F(s_{t-1})\cdot P_{F}(s_{t}\mid s_{t-1})=F(s_{t})\cdot P_{B}(s_{t-1}\mid s_{t})$. Dividing both sides by $F(s_{t-1})\cdot P_{B}(s_{t-1}\mid s_{t})$:

$$\frac{F(s_{t})}{F(s_{t-1})}=\frac{P_{F}(s_{t}\mid s_{t-1})}{P_{B}(s_{t-1}\mid s_{t})}.\tag{65}$$

Substituting the SkillFlow policy realizations from Lemmas [13](https://arxiv.org/html/2605.14089#Thmtheorem13) and [14](https://arxiv.org/html/2605.14089#Thmtheorem14):

$$\frac{F(s_{t})}{F(s_{t-1})}\;=\;\frac{\pi_{\theta}(a_{t}\mid r_{t},H_{t-1})}{P_{\phi}(a_{t}\mid H_{t-1}\oplus o_{t}^{\text{exec}})}.\tag{66}$$
This leads to the key quantity for credit assignment:
###### Definition 11 (Step Importance).
The step importance at time $t$ is:

$$I(t)\;\coloneqq\;\frac{F(s_{t})}{F(s_{t-1})}\;=\;\frac{\pi_{\theta}(a_{t}\mid r_{t},H_{t-1})}{P_{\phi}(a_{t}\mid H_{t-1}\oplus o_{t}^{\text{exec}})}.\tag{67}$$

This ratio quantifies the "flow amplification" at step $t$: how much the state flow increases (if $I(t)>1$) or decreases (if $I(t)<1$) due to action $a_{t}$, measured against the hindsight-asymmetric backward policy (Lemma [14](https://arxiv.org/html/2605.14089#Thmtheorem14)).
Intuitively, $I(t)$ measures the information value of decision $a_{t}$:
- $I(t)\gg 1$: The decision was high-probability for the forward policy but would have been low-probability in hindsight; this decision had high impact.
- $I(t)\approx 1$: The forward and backward policies agree; the decision had typical impact.
- $I(t)\ll 1$: The forward policy assigned the decision low probability relative to the hindsight backward policy, so the state flow drops at this step; this decision was sub-optimal in retrospect.
### D.3 Skill Marginal Flow
###### Definition 12 (Skill Marginal Flow).
For a skill $s\in\mathcal{S}$, the marginal flow is the average post-action *state flow* at nodes where $s$ is invoked, aggregated over trajectories in the batch:

$$\hat{F}(s)\;\coloneqq\;\frac{1}{|\mathcal{B}_{s}|}\sum_{\tau\in\mathcal{B}_{s}}\;\sum_{t:\,a_{t}\text{ invokes }s}F(s_{t}),\tag{68}$$

where $\mathcal{B}_{s}\subseteq\mathcal{B}$ is the subset of sampled trajectories that invoke skill $s$, and the post-action state flow is recovered by telescoping the step importance from the source $F(s_{0})=Z_{\theta}(q)$:

$$F(s_{t})\;=\;Z_{\theta}(q)\cdot\prod_{t'=1}^{t}I(t')\quad\Longleftrightarrow\quad\log F(s_{t})\;=\;\log Z_{\theta}(q)+\sum_{t'=1}^{t}\log I(t').\tag{69}$$

This matches main-text Eq. [11](https://arxiv.org/html/2605.14089#S4.E11) exactly.
Skill marginal flow is the key signal for detecting which skills contribute most to reward-proportional sampling. High $\hat{F}(s)$ indicates that $s$ is invoked at high-flow states, which by Theorem [6](https://arxiv.org/html/2605.14089#Thmtheorem6) are visited with probability proportional to downstream reward. Low $\hat{F}(s)$ indicates that $s$ either (i) is rarely invoked, (ii) is invoked along low-flow / low-reward trajectories, or (iii) is invoked at decision points where the cumulative information advantage of the policy is small (small $\sum_{t'\leq t}\log I(t')$).
### D.4 Zero-Cost Computation
###### Proposition 20 (Zero-Cost Flow Metrics).
All flow metrics (state flows $F(s_{t})$, step importances $I(t)$, and skill marginal flows $\hat{F}(s)$) are computable from the (per-token-tempered) forward and backward log-probabilities already evaluated during the TTB loss computation (Eq. ([43](https://arxiv.org/html/2605.14089#A2.E43))).
Concretely:

$$\begin{align}
\log I(t)&=\widetilde{\log\pi}_{\theta}(a_{t}\mid r_{t},H_{t-1})-\widetilde{\log P}_{\phi}(a_{t}\mid H_{t-1}\oplus o_{t}^{\text{exec}}),\tag{70}\\
\log F(s_{t})&=\log Z_{\theta}(q)+\sum_{t'=1}^{t}\log I(t'),\tag{71}\\
\log\hat{F}(s)&=\log Z_{\theta}(q)+\log\!\left(\frac{1}{|\mathcal{B}_{s}|}\sum_{\tau\in\mathcal{B}_{s}}\;\sum_{t:\,a_{t}\text{ invokes }s}\exp\!\Bigl(\sum_{t'=1}^{t}\log I(t')\Bigr)\right).\tag{72}
\end{align}$$

Equation ([72](https://arxiv.org/html/2605.14089#A4.E72)) is the log of an arithmetic mean (matching the linear-scale definition in Eq. ([68](https://arxiv.org/html/2605.14089#A4.E68))), *not* the mean-of-log: by the telescoping identity each visit contributes $F(s_{t})=Z_{\theta}(q)\cdot\exp\bigl(\sum_{t'\leq t}\log I(t')\bigr)$, and we average those visit-flows in linear space before re-taking the log. This expression is precisely the per-skill CGF at $\lambda=1$ shifted by $\log Z_{\theta}(q)$ (cf. main-text Eq. 13 and the equality $\Lambda^{(s)}_{1}=\log\hat{F}(s)-\log Z_{\theta}(q)$). No additional forward or backward passes through the policy models are required.
This zero-cost property is crucial for SkillFlow: the flow signals that drive skill evolution incur no computational overhead beyond the standard TTB loss evaluation.
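The three quantities in Eqs. (70) to (72) can be sketched as plain bookkeeping over log-probabilities that are already available. In the sketch below the batch layout (dictionaries with `fwd`, `bwd`, and `skills` lists), the skill names, and the numeric values are assumptions made for illustration; the real SkillFlow data structures may differ.

```python
import math

def log_step_importances(fwd_logps, bwd_logps):
    """Eq. (70): log I(t) = tempered forward log-prob minus tempered backward log-prob."""
    return [f - b for f, b in zip(fwd_logps, bwd_logps)]

def log_state_flows(log_z, log_imps):
    """Eq. (71): log F(s_t) = log Z + running sum of log I(t')."""
    flows, running = [], log_z
    for li in log_imps:
        running += li
        flows.append(running)
    return flows

def log_skill_marginal_flow(log_z, batch, skill):
    """Eq. (72): log of the arithmetic mean of visit flows for `skill`.

    The mean divides by |B_s|, the number of trajectories that invoke the skill.
    Returns None when no trajectory in the batch invokes it.
    """
    centered_visits, n_traj = [], 0
    for traj in batch:
        log_imps = log_step_importances(traj["fwd"], traj["bwd"])
        centered = log_state_flows(0.0, log_imps)  # log F(s_t) - log Z
        visits = [c for c, s in zip(centered, traj["skills"]) if s == skill]
        if visits:
            n_traj += 1
            centered_visits.extend(visits)
    if n_traj == 0:
        return None
    m = max(centered_visits)  # log-sum-exp shift for numerical stability
    total = sum(math.exp(c - m) for c in centered_visits)
    return log_z + m + math.log(total / n_traj)

# Hypothetical two-trajectory batch; None marks a step that invokes no skill.
batch = [
    {"fwd": [-0.2, -1.0], "bwd": [-0.5, -0.4], "skills": ["plan", "verify"]},
    {"fwd": [-0.7, -0.3], "bwd": [-0.1, -0.9], "skills": ["verify", None]},
]
log_z = 1.5

log_f_hat_verify = log_skill_marginal_flow(log_z, batch, "verify")
log_f_hat_plan = log_skill_marginal_flow(log_z, batch, "plan")
print(log_f_hat_verify, log_f_hat_plan)
```

Averaging happens in linear space (inside the `exp`), which is the distinction between the log of a mean and the mean of logs that the text draws for Eq. (72).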
### D.5 Cumulant Generating Function (CGF) Properties
Section [4.3](https://arxiv.org/html/2605.14089#S4.SS3) of the main text defines, for each skill $s\in\mathcal{S}$, the per-skill cumulant generating function (CGF) of the telescoped log step-importance:

$$\Lambda^{(s)}_{\lambda}\;\coloneqq\;\log\!\left(\frac{1}{|\mathcal{B}_{s}|}\sum_{\tau\in\mathcal{B}_{s}}\;\sum_{t:\,a_{t}\text{ invokes }s}\exp\!\Bigl(\lambda\sum_{t'=1}^{t}\log I(t')\Bigr)\right),\qquad\lambda\in\mathbb{R}.\tag{73}$$

Let $V_{s}\coloneqq\{(\tau,t)\,:\,\tau\in\mathcal{B}_{s},\,a_{t}\text{ invokes }s\}$ denote the set of $s$-*visits* in the batch, and for each visit $v=(\tau,t)\in V_{s}$ define the telescoped log-flow share

$$X_{v}\;\coloneqq\;\sum_{t'=1}^{t}\log I(t')\;\overset{\text{Eq. (69)}}{=}\;\log F(s_{t})\,-\,\log Z_{\theta}(q).\tag{74}$$

This appendix proves the four properties of $\Lambda^{(s)}_{\lambda}$ used by the curation operator $\Phi$ (Eq. [13](https://arxiv.org/html/2605.14089#S4.E13), main text): convexity, $G(s)$ as the visit-mean of $X$, $\Lambda^{(s)}_{1}$ as the centered log skill marginal flow, and the Jensen-gap cumulant expansion.
#### Atomicity assumption.
By the *atomic-tip* property in main-text §4.3, each tip $s$ is self-contained and independently composable; we assume each atomic tip is invoked *at most once* per trajectory in $\mathcal{B}_{s}$. Under this assumption, $|V_{s}|=|\mathcal{B}_{s}|$ and Eq. ([73](https://arxiv.org/html/2605.14089#A4.E73)) reduces to the standard sample CGF over the empirical visit-distribution:

$$\Lambda^{(s)}_{\lambda}\;=\;\log\!\left(\frac{1}{|V_{s}|}\sum_{v\in V_{s}}e^{\lambda X_{v}}\right)\;=\;\log\,\mathbb{E}_{V_{s}}\!\bigl[e^{\lambda X}\bigr].\tag{75}$$

We treat $\Lambda^{(s)}_{\lambda}$ as the standard CGF over $\{X_{v}\}_{v\in V_{s}}$ throughout.
###### Lemma 21 (Convexity of the CGF in $\lambda$).
$\Lambda^{(s)}_{\lambda}$ is convex in $\lambda\in\mathbb{R}$.
###### Proof.
By Eq. ([75](https://arxiv.org/html/2605.14089#A4.E75)), $\Lambda^{(s)}_{\lambda}=\log\sum_{v}e^{\lambda X_{v}}-\log|V_{s}|$. Each $\lambda X_{v}$ is affine in $\lambda$, $\log\sum e^{(\cdot)}$ (log-sum-exp) is a standard convex function of its arguments, and convexity is preserved under affine pre-composition and constant shift. Hence $\Lambda^{(s)}_{\lambda}$ is convex in $\lambda$. ∎
###### Lemma 22 (Mean Log-Flow Identity for $G(s)$).

$$G(s)\;\coloneqq\;\frac{\partial\Lambda^{(s)}_{\lambda}}{\partial\lambda}\bigg|_{\lambda=0}\;=\;\frac{1}{|V_{s}|}\sum_{v\in V_{s}}X_{v}\;=\;\mathbb{E}_{V_{s}}\!\bigl[\log F(s_{t})\bigr]\,-\,\log Z_{\theta}(q).\tag{76}$$

That is, $G(s)$ is the visit-average of the centered log state-flow $\log F(s_{t})-\log Z_{\theta}(q)$ over occurrences of $s$.
###### Proof.
Differentiate Eq. ([75](https://arxiv.org/html/2605.14089#A4.E75)) with respect to $\lambda$:

$$\frac{\partial}{\partial\lambda}\log\!\Bigl(\tfrac{1}{|V_{s}|}\sum_{v}e^{\lambda X_{v}}\Bigr)\;=\;\frac{\sum_{v}X_{v}\,e^{\lambda X_{v}}}{\sum_{v}e^{\lambda X_{v}}}.\tag{77}$$

At $\lambda=0$ this evaluates to $\frac{\sum_{v}X_{v}}{\sum_{v}1}=\frac{1}{|V_{s}|}\sum_{v}X_{v}$, the empirical visit-mean. Substituting Eq. ([74](https://arxiv.org/html/2605.14089#A4.E74)) for $X_{v}$ gives the second equality. ∎
###### Lemma 23 (CGF at $\lambda=1$ Recovers the Skill Marginal Flow).

$$\Lambda^{(s)}_{1}\;=\;\log\hat{F}(s)\,-\,\log Z_{\theta}(q),\tag{78}$$

where $\hat{F}(s)$ is the skill marginal flow (Definition [12](https://arxiv.org/html/2605.14089#Thmdefinition12)). This recovers main-text Eq. [11](https://arxiv.org/html/2605.14089#S4.E11).
###### Proof.
Substituting $X_{v}=\log F(s_{t}^{(v)})-\log Z_{\theta}(q)$ into Eq. ([75](https://arxiv.org/html/2605.14089#A4.E75)) at $\lambda=1$:

$$\begin{aligned}
\Lambda^{(s)}_{1}&=\log\!\left(\frac{1}{|V_{s}|}\sum_{v}e^{X_{v}}\right)=\log\!\left(\frac{1}{|V_{s}|}\sum_{v}\frac{F(s_{t}^{(v)})}{Z_{\theta}(q)}\right)\\
&=\log\!\left(\frac{1}{Z_{\theta}(q)}\cdot\frac{1}{|V_{s}|}\sum_{v}F(s_{t}^{(v)})\right)\;=\;\log\hat{F}(s)-\log Z_{\theta}(q),
\end{aligned}$$

where the last step uses Definition [12](https://arxiv.org/html/2605.14089#Thmdefinition12) and the atomicity identity $|V_{s}|=|\mathcal{B}_{s}|$. ∎
###### Lemma 24 (Jensen Inequality on the CGF).
For every skill $s$ with $V_{s}\neq\emptyset$,

$$G(s)\;\leq\;\Lambda^{(s)}_{1},\tag{79}$$

with equality if and only if $X_{v}$ is constant across all visits $v\in V_{s}$.
###### Proof.
Apply Jensen's inequality to the convex function $\exp$:

$$\exp\!\Bigl(\tfrac{1}{|V_{s}|}\sum_{v}X_{v}\Bigr)\;\leq\;\tfrac{1}{|V_{s}|}\sum_{v}e^{X_{v}},\tag{80}$$

with equality iff $X_{v}$ is constant. Taking $\log$ on both sides yields $G(s)\leq\log\bigl(\tfrac{1}{|V_{s}|}\sum_{v}e^{X_{v}}\bigr)=\Lambda^{(s)}_{1}$, where the equality on the right uses Eq. ([75](https://arxiv.org/html/2605.14089#A4.E75)). ∎
###### Proposition 25 (Jensen Gap as Cumulant Expansion).
The Jensen gap admits the cumulant expansion

$$\Lambda^{(s)}_{1}-G(s)\;=\;\tfrac{1}{2}\,\operatorname{Var}_{V_{s}}\!\bigl[\log F(s_{t})\bigr]\;+\;\sum_{k\geq 3}\frac{\kappa_{k}(s)}{k!},\tag{81}$$

where $\kappa_{k}(s)$ is the $k$-th empirical cumulant of $\{X_{v}\}_{v\in V_{s}}$ (equivalently, of $\{\log F(s_{t}^{(v)})\}_{v}$, since $\log Z_{\theta}(q)$ is a visit-independent shift). The leading term is one-half the cross-visit variance of the log state-flow at occurrences of $s$.
###### Proof.
The cumulant generating function admits the standard Taylor expansion

$$\log\mathbb{E}\!\bigl[e^{\lambda X}\bigr]\;=\;\sum_{k\geq 1}\frac{\kappa_{k}}{k!}\,\lambda^{k}\;=\;\kappa_{1}\lambda+\frac{\kappa_{2}}{2}\lambda^{2}+\sum_{k\geq 3}\frac{\kappa_{k}}{k!}\lambda^{k},\tag{82}$$

where $\kappa_{1}=\mu=\mathbb{E}[X]$, $\kappa_{2}=\sigma^{2}=\operatorname{Var}[X]$, and $\{\kappa_{k}\}_{k\geq 3}$ are the higher cumulants. Apply this expansion to Eq. ([75](https://arxiv.org/html/2605.14089#A4.E75)) (which expresses $\Lambda^{(s)}_{\lambda}$ as a CGF):

$$\Lambda^{(s)}_{\lambda}\;=\;G(s)\,\lambda+\frac{\operatorname{Var}_{V_{s}}[X]}{2}\,\lambda^{2}+\sum_{k\geq 3}\frac{\kappa_{k}(s)}{k!}\,\lambda^{k}.\tag{83}$$

Setting $\lambda=1$ and rearranging gives Eq. ([81](https://arxiv.org/html/2605.14089#A4.E81)). The variance of $X_{v}$ equals the variance of $\log F(s_{t})$ over $V_{s}$ since the additive constant $-\log Z_{\theta}(q)$ does not affect variance. ∎
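The CGF statistics of Lemmas 21 to 24 and Proposition 25 are cheap to compute from the visit values $X_{v}$. The following sketch, with hypothetical visit values and helper names of our own choosing, evaluates $G(s)$, $\Lambda^{(s)}_{1}$, and the Jensen gap, and compares the gap with its leading half-variance term.

```python
import math

def cgf(xs, lam):
    """Sample CGF of Eq. (75): log of the mean of exp(lam * X_v)."""
    m = max(lam * x for x in xs)  # shift for numerical stability
    return m + math.log(sum(math.exp(lam * x - m) for x in xs) / len(xs))

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Population (empirical) variance, matching the empirical cumulant kappa_2."""
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

# Hypothetical centered log-flows X_v = log F(s_t) - log Z at four visits.
xs = [-0.30, -0.10, 0.05, 0.20]

g = mean(xs)                   # G(s), Lemma 22
lambda_1 = cgf(xs, 1.0)        # Lambda_1, Lemma 23
jensen_gap = lambda_1 - g      # non-negative by Lemma 24
half_var = 0.5 * variance(xs)  # leading term of Eq. (81)

# Finite-difference check that G(s) is the derivative of the CGF at lambda = 0.
h = 1e-5
slope_at_zero = (cgf(xs, h) - cgf(xs, -h)) / (2 * h)

print(g, lambda_1, jensen_gap, half_var, slope_at_zero)
```

For visit values with a small spread, as here, the gap is close to the half-variance term; the difference is the contribution of the cumulants of order three and above.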
## Appendix E DAG Acyclicity Proof
###### Theorem 26 (Orchestration Graph is a DAG).
Under the SkillFlow environment definition with three-way policy factorization (Equation ([5](https://arxiv.org/html/2605.14089#S4.E5)), main text) and frozen skill library within each training phase, the orchestration state graph $\mathcal{G}$ is a directed acyclic graph.
###### Proof.
Define a strict order on states by their history length: $\text{depth}(s):=|H_{s}|$, i.e., the number of tokens in the interaction history at state $s$.
By the state-update rule (main-text Eq. ([4](https://arxiv.org/html/2605.14089#S4.E4))), $H_{t}=H_{t-1}\oplus(r_{t},a_{t},o_{t}^{\text{exec}})$ strictly appends three non-empty components to the history. Therefore:

$$\text{depth}(s_{t})=|H_{t}|>|H_{t-1}|=\text{depth}(s_{t-1}).\tag{85}$$

For any directed edge $s\to s'$ in the orchestration graph, we have:

$$\text{depth}(s')>\text{depth}(s).\tag{86}$$

A cycle would be a path $s_{0}\to s_{1}\to\cdots\to s_{k}\to s_{0}$ with $k\geq 1$. Following the edges:

$$\text{depth}(s_{1})>\text{depth}(s_{0}),\quad\text{depth}(s_{2})>\text{depth}(s_{1}),\quad\ldots,\quad\text{depth}(s_{0})>\text{depth}(s_{k}).\tag{87}$$

Combining: $\text{depth}(s_{0})>\text{depth}(s_{k})>\cdots>\text{depth}(s_{0})$, a contradiction.
Therefore, no cycle exists, and $\mathcal{G}$ is a DAG. ∎
## Appendix F Skill Curation Details
This appendix specifies (i) the phase-boundary detection rule, (ii) the curation classes $\mathcal{D}^{-},\mathcal{R},\mathcal{U}$ used by the operator $\Phi$ in main-text Eq. [13](https://arxiv.org/html/2605.14089#S4.E13), (iii) the Skill Creator $\Psi$ with its trigger steps, (iv) the resulting curation algorithm, and (v) the formal definition of *atomic composability* together with the proof that $\Phi$ preserves it.
### F.1 Phase-Boundary Detection Rule
Within phase $k$, the squared TTB residual $\Delta(\tau)^{2}$ is tracked as a running mean over a sliding window of $W$ training steps:

$$\overline{\Delta^{2}}_{w}^{(k)}\;\coloneqq\;\frac{1}{W}\sum_{i=w-W+1}^{w}\frac{1}{|\mathcal{B}_{i}|}\sum_{\tau\in\mathcal{B}_{i}}\Delta(\tau\mid\mathcal{S}^{(k)})^{2},\tag{88}$$

where $w$ is the current step within phase $k$ and $\mathcal{B}_{i}$ is the mini-batch at step $i$. By Proposition [2](https://arxiv.org/html/2605.14089#Thmtheorem2) and the residual-floor identity $\bar{\Delta}^{*(k)}=\inf_{\theta}\mathbb{E}_{\tau}[\Delta(\tau\mid\mathcal{S}^{(k)},\theta)^{2}]$ (main-text Eq. [12](https://arxiv.org/html/2605.14089#S4.E12)), $\overline{\Delta^{2}}_{w}^{(k)}$ asymptotes to a value $\geq\bar{\Delta}^{*(k)}$, and gradient descent cannot reduce it further once that floor is reached.
###### Definition 13 (Plateau Trigger).
Let $\rho>0$ be a relative-decrease tolerance and $M$ a window-count budget. We say training has *plateaued* at step $w$ within phase $k$ if

$$\frac{\overline{\Delta^{2}}_{w-W}^{(k)}\,-\,\overline{\Delta^{2}}_{w}^{(k)}}{\overline{\Delta^{2}}_{w-W}^{(k)}}\;<\;\rho.\tag{89}$$

Phase $k+1$ is triggered at the first step $w$ for which Eq. ([89](https://arxiv.org/html/2605.14089#A6.E89)) holds for $M$ consecutive non-overlapping windows.
Concrete values for $W,\rho,M$ are fixed throughout training and detailed in the supplementary code.
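One way to read the plateau trigger operationally is sketched below. It assumes a list of per-step batch-mean squared residuals (the inner average of Eq. (88)) and interprets "$M$ consecutive non-overlapping windows" as the test of Eq. (89) holding at steps $w, w-W, \dots, w-(M-1)W$. That interpretation, the function names, and the values of `W`, `rho`, and `M` are assumptions for illustration, not the settings used in the supplementary code.

```python
def window_mean(residuals, w, W):
    """Eq. (88): mean of the per-step residuals over steps w-W+1..w (1-indexed)."""
    return sum(residuals[w - W:w]) / W

def plateaued(residuals, w, W, rho):
    """Eq. (89): relative decrease between the windows ending at w-W and w is below rho."""
    prev = window_mean(residuals, w - W, W)
    curr = window_mean(residuals, w, W)
    if prev <= 0.0:
        return True  # nothing left to reduce; treat as plateau
    return (prev - curr) / prev < rho

def first_trigger_step(residuals, W, rho, M):
    """First step w at which Eq. (89) holds for M windows spaced W steps apart."""
    # The earliest testable step needs M checks, each comparing two windows.
    for w in range((M + 1) * W, len(residuals) + 1):
        if all(plateaued(residuals, w - j * W, W, rho) for j in range(M)):
            return w
    return None

# Hypothetical residual trace: fast decay for 10 steps, then a flat floor.
residuals = [0.5 ** i for i in range(10)] + [0.001] * 30
trigger = first_trigger_step(residuals, W=5, rho=0.05, M=2)

# A trace that keeps decaying geometrically never satisfies the plateau test.
still_learning = first_trigger_step([0.9 ** i for i in range(60)], W=5, rho=0.05, M=2)
print(trigger, still_learning)
```

Requiring several consecutive windows makes the trigger robust to a single noisy window, at the price of delaying the phase boundary by roughly $M\cdot W$ steps after the residual flattens.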
### F.2 Curation Classes
At every triggered phase boundary $k\to k+1$, each existing skill $s\in\mathcal{S}^{(k)}$ is classified into one of three disjoint sets via the CGF statistics of Lemmas [22](https://arxiv.org/html/2605.14089#Thmtheorem22)–[24](https://arxiv.org/html/2605.14089#Thmtheorem24) (Appendix [D.5](https://arxiv.org/html/2605.14089#A4.SS5)). Let $\Phi^{G}_{\text{thr}},\Phi^{J}_{\text{thr}}\in\mathbb{R}$ be hyperparameters; let $n^{-}(s)$ count the cumulative number of past phase boundaries at which $\widetilde{\Lambda}(s)<0$, and let $K^{-}$ be the number of such boundaries after which a skill is pruned.
###### Definition 14 (Curation Classes).
The library is partitioned into three disjoint subsets:

$$\begin{align}
\mathcal{D}^{-}_{k}&\;\coloneqq\;\bigl\{\,s\in\mathcal{S}^{(k)}\;:\;n^{-}(s)\geq K^{-}\bigr\},\tag{90}\\
\mathcal{R}_{k}&\;\coloneqq\;\bigl\{\,s\in\mathcal{S}^{(k)}\setminus\mathcal{D}^{-}_{k}\;:\;G(s)\geq\Phi^{G}_{\text{thr}},\;\Lambda^{(s)}_{1}-G(s)\leq\Phi^{J}_{\text{thr}}\bigr\},\tag{91}\\
\mathcal{U}_{k}&\;\coloneqq\;\mathcal{S}^{(k)}\setminus\bigl(\mathcal{D}^{-}_{k}\cup\mathcal{R}_{k}\bigr).\tag{92}
\end{align}$$

The three roles are: $\mathcal{D}^{-}_{k}$ *prunes* skills with persistently negative centered share; $\mathcal{R}_{k}$ *retains* skills with high mean log-flow and low Jensen gap; $\mathcal{U}_{k}$ *refines* the remaining skills (high Jensen gap or low $G$).
The Jensen gap threshold $\Phi^{J}_{\text{thr}}$ implements the stability diagnostic of Remark [6](https://arxiv.org/html/2605.14089#Thmremark6): a skill with consistent flow contribution across visits has small gap and is retained; a context-inconsistent skill has large gap and is refined. Concrete threshold values are detailed in the supplementary code.
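The partition of Definition 14 amounts to a three-way rule applied in order: prune first, then retain, and refine whatever is left. A minimal sketch follows, assuming per-skill statistics are stored in a dictionary; the skill names, statistics, and threshold values are hypothetical.

```python
def classify_skills(stats, k_minus, g_thr, j_thr):
    """Partition skills into (prune, retain, refine) following Eqs. (90)-(92).

    stats maps each skill name to a dict with keys:
      "n_minus"  - number of past boundaries with negative centered share,
      "G"        - mean centered log-flow G(s),
      "lambda_1" - CGF at lambda = 1.
    """
    prune, retain, refine = set(), set(), set()
    for skill, s in stats.items():
        jensen_gap = s["lambda_1"] - s["G"]
        if s["n_minus"] >= k_minus:                  # Eq. (90)
            prune.add(skill)
        elif s["G"] >= g_thr and jensen_gap <= j_thr:  # Eq. (91)
            retain.add(skill)
        else:                                        # Eq. (92)
            refine.add(skill)
    return prune, retain, refine

# Hypothetical library statistics at a phase boundary.
stats = {
    "decompose": {"n_minus": 0, "G": 0.40, "lambda_1": 0.45},
    "retry":     {"n_minus": 3, "G": 0.50, "lambda_1": 0.52},
    "verify":    {"n_minus": 1, "G": 0.30, "lambda_1": 0.90},
    "guess":     {"n_minus": 2, "G": -0.20, "lambda_1": -0.15},
}

prune, retain, refine = classify_skills(stats, k_minus=3, g_thr=0.1, j_thr=0.2)
print(sorted(prune), sorted(retain), sorted(refine))
```

Note that pruning takes precedence: in this example `retry` has good flow statistics but is still pruned because its negative-share count has reached the threshold.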
### F.3 Skill Creation from Success/Failure Trajectory Pairs
The Skill Creator $\Psi$ is invoked at high-step-importance positions where successful and failed trajectories from the same query diverge.
###### Definition 15 (Trigger Steps).
For each query $q$ in the validation pool, let $\tau^{+}$ be a successful trajectory ($R(\tau^{+})=1$) and $\tau^{-}$ a same-query failed trajectory ($R(\tau^{-})=0$) sampled under $\mathcal{S}^{(k)}$ with the current policy $\pi_{\theta}$. The set of *trigger steps* for $(q,\tau^{+},\tau^{-})$ is

$$\mathcal{T}_{q}^{\,\text{trig}}\;\coloneqq\;\Bigl\{\,t\in\{1,\dots,|\tau^{+}|\}\;:\;\log I(t)\big|_{\tau^{+}}\geq\zeta_{\text{trig}}\;\wedge\;t\notin\mathrm{cov}(\mathcal{R}_{k}\cup\mathcal{U}'_{k})\Bigr\},\tag{93}$$

where $\zeta_{\text{trig}}$ is a high-importance threshold and $\mathrm{cov}(\cdot)$ marks steps already covered by surviving (or refined) skills. By Lemma [14](https://arxiv.org/html/2605.14089#Thmtheorem14), $\log I(t)\big|_{\tau^{+}}\gg 0$ marks decisions whose quality became clear only under the hindsight backward policy; these are precisely the gap candidates for new tips.
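Eq. (93) is a filter over the steps of the successful trajectory. A minimal sketch, assuming the per-step values $\log I(t)$ on $\tau^{+}$ and the set of already covered steps are given; how $\mathrm{cov}(\cdot)$ is computed is not specified here, so it is passed in as a plain set, and the numbers are hypothetical.

```python
def trigger_steps(log_importances, covered_steps, zeta_trig):
    """Eq. (93): 1-indexed steps of tau+ with log I(t) >= zeta_trig that are not covered."""
    return [
        t
        for t, log_i in enumerate(log_importances, start=1)
        if log_i >= zeta_trig and t not in covered_steps
    ]

# Hypothetical log step-importances along a successful trajectory of length 5.
log_importances = [0.1, 1.4, -0.3, 2.0, 0.9]
covered_steps = {4}  # step 4 is assumed to be handled by a surviving skill

steps = trigger_steps(log_importances, covered_steps, zeta_trig=0.8)
print(steps)
```

Step 4 has the largest importance in this example but is excluded because an existing skill already covers it, so new tips are only proposed where the library has a gap.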
###### Definition 16 (Skill Creator $\Psi$).
The Skill Creator $\Psi$ is a frozen LLM-based generator constrained to render *atomic tips* (Definition [18](https://arxiv.org/html/2605.14089#Thmdefinition18)). Given creation context $c=(q,\tau^{+},\tau^{-},t)$ for each $t\in\mathcal{T}_{q}^{\,\text{trig}}$, $\Psi$ outputs a textual atomic tip

$$s_{\text{new}}\;=\;\Psi(c,\mathcal{T},\mathcal{S}^{(k)}),\tag{94}$$

where $\mathcal{T}$ denotes the validation buffer of $(q,\tau^{+},\tau^{-})$ trajectory pairs available at the phase boundary. The tip captures the strategic decision differentiating $\tau^{+}$ from $\tau^{-}$ at step $t$. The output is constrained at the prompt level to be (a) self-contained (no reference to other tips at runtime), (b) of bounded textual length $L_{\max}$, and (c) phrased as strategic guidance (not as a literal action). $\Psi$ also operates in a *refine mode* on $\mathcal{U}_{k}$, rewriting context-inconsistent tips under the same atomic constraints.
### F.4 The Curation Operator $\Phi$
###### Definition 17 (Evolution Operator).
The CGF-based curation operator $\Phi$ produces the next-phase library by combining three sets:

$$\mathcal{S}^{(k+1)}\;\coloneqq\;\Phi\bigl(\mathcal{S}^{(k)};\,\{(G(s),\widetilde{\Lambda}(s))\}_{s\in\mathcal{S}^{(k)}},\,\{\log I(t)\}_{t}\bigr)\;=\;\mathcal{R}_{k}\,\cup\,\mathcal{U}'_{k}\,\cup\,\Psi^{\text{new}}_{k},\tag{95}$$

where $\mathcal{R}_{k}$ is the retained set (Definition [14](https://arxiv.org/html/2605.14089#Thmdefinition14)); $\mathcal{U}'_{k}\coloneqq\{\Psi(s,\,\text{refine}):s\in\mathcal{U}_{k}\}$ is the refined set; and

$$\Psi^{\text{new}}_{k}\;\coloneqq\;\bigcup_{(q,\tau^{+},\tau^{-})}\;\;\bigcup_{t\in\mathcal{T}_{q}^{\,\text{trig}}}\;\bigl\{\Psi\bigl((q,\tau^{+},\tau^{-},t),\,\mathcal{T},\,\mathcal{S}^{(k)}\bigr)\bigr\}\tag{96}$$

is the newly-created tip set. Pruned tips $\mathcal{D}^{-}_{k}$ are absent from $\mathcal{S}^{(k+1)}$ by construction.
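The set algebra of Eqs. (95) and (96) can be sketched with the Skill Creator replaced by stand-in functions. In the sketch below, `refine_fn` and `create_fn` are hypothetical placeholders for the two modes of the frozen LLM generator $\Psi$, and the skill names and trigger contexts are made up; only the way the three sets are combined reflects Definition 17.

```python
def evolve_library(retained, to_refine, trigger_contexts, refine_fn, create_fn):
    """Eq. (95): next library = retained set, refined set, and newly created tips."""
    refined = {refine_fn(skill) for skill in to_refine}     # U'_k
    created = {create_fn(ctx) for ctx in trigger_contexts}  # Psi^new_k, Eq. (96)
    return set(retained) | refined | created

# Stand-ins for the LLM-based Skill Creator (assumptions, not the real Psi).
def refine_fn(skill):
    return skill + "_v2"

def create_fn(ctx):
    query_id, step = ctx
    return f"tip_{query_id}_step{step}"

library = {"decompose", "retry", "verify"}
retained = {"decompose"}   # R_k
to_refine = {"verify"}     # U_k; "retry" is in D^-_k and is simply never copied
trigger_contexts = [("q1", 2), ("q1", 5)]

next_library = evolve_library(retained, to_refine, trigger_contexts, refine_fn, create_fn)
print(sorted(next_library))
```

Pruning needs no explicit step: a skill in $\mathcal{D}^{-}_{k}$ is dropped because it is not in any of the three sets that make up the union.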
### F.5 Atomic Composability and Its Preservation
###### Definition 18 (Atomic Tip).
A skill $s$ is an *atomic tip* if (i) $s$ is a textual prompt fragment of bounded length $\leq L_{\max}$, (ii) the action $\texttt{skill}(s)$ deterministically appends $s$ as a strategic-guidance segment to $H_{t-1}$ without reading or modifying any other skill in $\mathcal{S}$ at runtime, and (iii) the cost of invoking $s$ (in tokens and wall-clock) is bounded by a constant independent of $|\mathcal{S}|$.
###### Definition 19 (Atomic Composability of a Library).
A skill library $\mathcal{S}$ is *atomically composable* if every $s\in\mathcal{S}$ is an atomic tip (Definition [18](https://arxiv.org/html/2605.14089#Thmdefinition18)).
###### Lemma 27 ($\Phi$ Preserves Atomic Composability).
If $\mathcal{S}^{(k)}$ is atomically composable, and $\Psi$ is constrained to produce only atomic tips (Definition [16](https://arxiv.org/html/2605.14089#Thmdefinition16), in both creation and refine modes), then $\mathcal{S}^{(k+1)}=\Phi(\mathcal{S}^{(k)};\ldots)$ is atomically composable.
###### Proof.
By Definition [17](https://arxiv.org/html/2605.14089#Thmdefinition17), $\mathcal{S}^{(k+1)} = \mathcal{R}_k \cup \mathcal{U}'_k \cup \Psi^{\text{new}}_k$. We verify atomicity for each constituent.
Retained skills ($\mathcal{R}_k \subseteq \mathcal{S}^{(k)}$). By the hypothesis on $\mathcal{S}^{(k)}$, every $s \in \mathcal{S}^{(k)}$ is atomic; the subset $\mathcal{R}_k$ inherits atomicity unchanged.
Refined skills ($\mathcal{U}'_k = \Psi(\mathcal{U}_k, \text{refine})$). By Definition [16](https://arxiv.org/html/2605.14089#Thmdefinition16), $\Psi$ in refine mode is constrained to produce atomic tips. Each $s' \in \mathcal{U}'_k$ is therefore atomic.
New skills ($\Psi^{\text{new}}_k$). Each $s_{\text{new}} \in \Psi^{\text{new}}_k$ is the output of $\Psi$ applied at a trigger step (Eq. ([94](https://arxiv.org/html/2605.14089#A6.E94))); by the same constraint it is atomic.
Pruned tips $\mathcal{D}^{-}_k$ are removed from $\mathcal{S}^{(k+1)}$ entirely. Atomicity is a per-tip structural property (Definition [18](https://arxiv.org/html/2605.14089#Thmdefinition18)); it is preserved under set union, removal of subsets, and replacement of subsets by other atomic tips. Therefore every $s \in \mathcal{S}^{(k+1)}$ is atomic, and $\mathcal{S}^{(k+1)}$ is atomically composable. ∎
### F.6 Curation Algorithm
The full curation procedure at phase boundary $k \to k+1$ is summarized below.
Algorithm 1: Skill-Library Curation at Phase Boundary $k \to k+1$.
1. For each $s \in \mathcal{S}^{(k)}$, compute the per-skill CGF $\Lambda^{(s)}_{\lambda}$ at $\lambda \in \{0, 1\}$ over the recent batch $\mathcal{B}_s$ via the zero-cost formulas of Proposition [20](https://arxiv.org/html/2605.14089#Thmtheorem20).
2. Derive the summaries $G(s)$, $\Lambda^{(s)}_{1}$, and $\widetilde{\Lambda}(s) = \Lambda^{(s)}_{1} - \mathbb{E}_{s'}[\Lambda^{(s')}_{1}]$ via Lemmas [22](https://arxiv.org/html/2605.14089#Thmtheorem22), [23](https://arxiv.org/html/2605.14089#Thmtheorem23) and Remark [7](https://arxiv.org/html/2605.14089#Thmremark7).
3. Classify each $s \in \mathcal{S}^{(k)}$ into $\mathcal{D}^{-}_k$, $\mathcal{R}_k$, or $\mathcal{U}_k$ via Definition [14](https://arxiv.org/html/2605.14089#Thmdefinition14).
4. Refine each $s \in \mathcal{U}_k$ via $\Psi$ in refine mode to produce $\mathcal{U}'_k$.
5. From the validation buffer, sample same-query success/failure pairs $(\tau^{+}, \tau^{-})$; identify trigger steps $\mathcal{T}_q^{\,\text{trig}}$ via Definition [15](https://arxiv.org/html/2605.14089#Thmdefinition15).
6. For each trigger step, invoke $\Psi$ in creation mode to obtain new atomic tips $\Psi^{\text{new}}_k$ (Eq. ([96](https://arxiv.org/html/2605.14089#A6.E96))).
7. Assemble $\mathcal{S}^{(k+1)} = \mathcal{R}_k \cup \mathcal{U}'_k \cup \Psi^{\text{new}}_k$ (Eq. ([95](https://arxiv.org/html/2605.14089#A6.E95))).
8. Warm-start $\pi_{\theta}$ and $P_{\phi}$ from phase $k$; reinitialize the partition function $Z_{\theta}(q)$ for the new action space.
By Lemma [27](https://arxiv.org/html/2605.14089#Thmtheorem27), this procedure preserves atomic composability across all phase transitions; together with Lemma [8](https://arxiv.org/html/2605.14089#Thmtheorem8), the post-evolution graph $\mathcal{G}$ remains a tree-structured DAG, satisfying the prerequisites for TB-based training within phase $k+1$.
Full prompt templates for $\Psi$ (creation and refine modes) and complete hyperparameter values are provided in the supplementary code.
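The assembly part of Algorithm 1 (Steps 3, 4, 6, and 7) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Tip`, `curate`, `is_atomic`, and the value of `L_MAX` are hypothetical names and settings, the classifier and the two modes of $\Psi$ are passed in as plain callables, and `is_atomic` checks only the bounded-length condition (i) of Definition 18.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

L_MAX = 200  # hypothetical length bound (characters); the paper's L_max value is not stated here


@dataclass(frozen=True)
class Tip:
    text: str


def is_atomic(tip: Tip, l_max: int = L_MAX) -> bool:
    """Check only the bounded-length condition (i) of Definition 18."""
    return 0 < len(tip.text) <= l_max


def curate(
    library: Iterable[Tip],
    classify: Callable[[Tip], str],  # returns "prune", "retain", or "refine"
    refine: Callable[[Tip], Tip],    # stand-in for Psi in refine mode
    new_tips: Iterable[Tip],         # stand-in for Psi_k^new (creation mode)
) -> set[Tip]:
    """Assemble S^(k+1) = R_k | U'_k | Psi_k^new; pruned tips are simply dropped."""
    retained: set[Tip] = set()
    refined: set[Tip] = set()
    for tip in library:
        label = classify(tip)
        if label == "retain":
            retained.add(tip)
        elif label == "refine":
            refined.add(refine(tip))
        elif label != "prune":
            raise ValueError(f"unknown label: {label!r}")
    next_library = retained | refined | set(new_tips)
    # Guard mirroring the hypothesis of Lemma 27: every surviving tip must be atomic.
    if not all(is_atomic(t) for t in next_library):
        raise ValueError("curation produced a non-atomic tip")
    return next_library
```

Because the next library is built purely from set unions over tips that individually pass the atomicity check, the sketch mirrors the structure of the proof of Lemma 27: no step can introduce a tip that depends on another tip.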
## Appendix G Reward Function Details
$R(\tau)$ is an outcome-based scalar reward. For multi-hop QA tasks, $R(\tau) = \text{EM}(y_q, y^{*})$ (exact match); for mathematical reasoning, $R(\tau) = \text{Acc}(y_q, y^{*})$; and for code generation, $R(\tau) = \text{Pass@1}(y_q)$. All rewards lie in $[0, 1]$. We apply $\varepsilon$-smoothing (Appendix [H](https://arxiv.org/html/2605.14089#A8)) to ensure positive support.
## Appendix H $\varepsilon$-Smoothing Analysis
Theorem [6](https://arxiv.org/html/2605.14089#Thmtheorem6) requires strictly positive rewards, but $R(\tau)$ can be zero (e.g., for a failed orchestration). We handle this via smoothing.
###### Definition 20 ($\varepsilon$-Smoothing).
We define the smoothed reward as:
$$\tilde{R}(\tau) \;\coloneqq\; R(\tau) + \varepsilon_{\min}, \tag{97}$$
where $\varepsilon_{\min} > 0$ is a small constant.
###### Proposition 28 (Ordering Preservation).
Smoothing preserves the strict reward ordering: for any $\beta > 0$ and any $\tau_1, \tau_2$ with $R(\tau_1) > R(\tau_2) \geq 0$:
$$\tilde{R}(\tau_1)^{\beta} > \tilde{R}(\tau_2)^{\beta}, \tag{98}$$
hence the relative quality ranking of trajectories is preserved.
###### Proof.
Since $R(\tau_1) > R(\tau_2) \geq 0$, we have:
$$\tilde{R}(\tau_1) = R(\tau_1) + \varepsilon_{\min} > R(\tau_2) + \varepsilon_{\min} = \tilde{R}(\tau_2). \tag{99}$$
Both $\tilde{R}(\tau_1), \tilde{R}(\tau_2) > 0$. Since $x \mapsto x^{\beta}$ is strictly increasing on $\mathbb{R}_{>0}$ for $\beta > 0$:
$$\tilde{R}(\tau_1)^{\beta} > \tilde{R}(\tau_2)^{\beta}. \tag{100}$$
∎
###### Proposition 29 (Flow Perturbation Bound).
For $\beta \geq 1$, by the mean-value theorem applied to the convex function $x \mapsto x^{\beta}$ on $[R(\tau_x), \tilde{R}(\tau_x)]$,
$$\bigl|\tilde{R}(\tau_x)^{\beta} - R(\tau_x)^{\beta}\bigr| \;\leq\; \beta\,\varepsilon_{\min}\,\max\!\bigl(\tilde{R}(\tau_x), R(\tau_x)\bigr)^{\beta-1}, \qquad \beta \geq 1. \tag{101}$$
Since $R(\tau_x) \in [0, 1]$, the maximum equals $\tilde{R}(\tau_x) \leq 1 + \varepsilon_{\min}$, so the perturbation is at most $\beta\,\varepsilon_{\min}(1 + \varepsilon_{\min})^{\beta-1}$, i.e., $O(\varepsilon_{\min})$ for fixed $\beta$.
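Definition 20 and the two propositions above can be checked numerically with a few lines of Python. This is a minimal sketch: the function names and the default `eps_min = 1e-3` are illustrative assumptions, not values taken from the paper.

```python
def smoothed_tempered_reward(reward: float, beta: float, eps_min: float = 1e-3) -> float:
    """Return (R + eps_min) ** beta, the tempered smoothed reward."""
    if not 0.0 <= reward <= 1.0:
        raise ValueError("reward must lie in [0, 1]")
    if beta <= 0 or eps_min <= 0:
        raise ValueError("beta and eps_min must be positive")
    return (reward + eps_min) ** beta


def perturbation_bound(reward: float, beta: float, eps_min: float = 1e-3) -> float:
    """Right-hand side of Eq. (101); only valid for beta >= 1."""
    if beta < 1:
        raise ValueError("the bound is stated for beta >= 1")
    # max(R~, R) is always R~ because eps_min > 0.
    return beta * eps_min * (reward + eps_min) ** (beta - 1)


# Example: a failed trajectory (R = 0) still gets strictly positive flow mass,
# and a better trajectory keeps a strictly larger tempered reward.
failed = smoothed_tempered_reward(0.0, beta=2.0)
partial = smoothed_tempered_reward(0.5, beta=2.0)
```

Smoothing keeps the log-reward term $\beta \log \tilde{R}(\tau)$ finite for failed trajectories, which is what a log-domain regression loss needs in order to be well defined.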
## Appendix I Gradient Variance Analysis
This section analyzes the TTB gradient variance in detail and compares it to standard policy-gradient methods.
### I.1 TTB Self-Annealing Theorem with Chain Rule Expansion
###### Theorem 30 (TTB Gradient Self-Annealing).
Treating $T = |\tau|$ as constant in $\theta$ for each given $\tau$, the gradient of the TTB loss with respect to $\theta$ satisfies:
$$\nabla_{\theta}\mathcal{L}_{\mathrm{TTB}}(\tau) \;=\; \frac{2\,\Delta(\tau)}{T^{2}} \cdot \nabla_{\theta}\Delta(\tau). \tag{102}$$
Expanding $\nabla_{\theta}\Delta(\tau)$ by the chain rule on Definition [9](https://arxiv.org/html/2605.14089#Thmdefinition9):
$$\nabla_{\theta}\Delta(\tau) \;=\; \nabla_{\theta}\log Z_{\theta}(q) + \sum_{t=1}^{T} \nabla_{\theta}\,\widetilde{\log\pi}_{\theta}(a_t \mid r_t, H_{t-1}) \;-\; 0 \;-\; 0, \tag{103}$$
where the last two terms vanish because $\beta \log \tilde{R}(\tau)$ and $\widetilde{\log P}_{\phi}$ do not depend on $\theta$.
Therefore:
$$\nabla_{\theta}\mathcal{L}_{\mathrm{TTB}}(\tau) \;=\; \frac{2\,\Delta(\tau)}{T^{2}} \cdot \left[\nabla_{\theta}\log Z_{\theta}(q) + \sum_{t=1}^{T} \nabla_{\theta}\,\widetilde{\log\pi}_{\theta}(a_t \mid r_t, H_{t-1})\right]. \tag{104}$$
As $\Delta(\tau) \to 0$, the prefactor $2\Delta(\tau)/T^{2}$ shrinks and $\|\nabla_{\theta}\mathcal{L}_{\mathrm{TTB}}\| \to 0$ for every fixed-length trajectory; trajectories of different $T$ contribute on a comparable scale because of the $1/T^{2}$ length normalizer.
###### Proof.
Apply the chain rule to $\mathcal{L}_{\mathrm{TTB}}(\tau) = (\Delta(\tau)/T)^{2} = \Delta(\tau)^{2}/T^{2}$:
$$\nabla_{\theta}\mathcal{L}_{\mathrm{TTB}}(\tau) \;=\; \nabla_{\theta}\!\left[\frac{\Delta(\tau)^{2}}{T^{2}}\right] \;=\; \frac{2\,\Delta(\tau)}{T^{2}}\,\nabla_{\theta}\Delta(\tau), \tag{105}$$
where $T$ is determined by $\tau$ (not by $\theta$) and is constant under $\nabla_{\theta}$. Since $T$ varies across trajectories, $1/T^{2}$ remains a per-trajectory length normalizer. ∎
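The self-annealing factor can be made concrete with a small numeric sketch. It assumes the residual has the form $\Delta = \log Z + \sum_t \log \pi_\theta - \beta \log \tilde{R} - \sum_t \log P_\phi$, which is reconstructed here from the four terms of Eq. (103) rather than quoted from Definition 9, and it treats the tilded log-probabilities as plain log-probabilities. All function names and the default `eps_min` are hypothetical. The gradient is taken with respect to $\log Z$ only, where $\partial \Delta / \partial \log Z = 1$.

```python
import math
from typing import Sequence


def ttb_residual(log_z: float, log_pf_steps: Sequence[float], log_pb_steps: Sequence[float],
                 reward: float, beta: float, eps_min: float = 1e-3) -> float:
    """Delta = log Z + sum_t log pi_theta - beta * log(R + eps) - sum_t log P_phi (assumed form)."""
    if not log_pf_steps or len(log_pf_steps) != len(log_pb_steps):
        raise ValueError("need one forward and one backward log-prob per step")
    return (log_z + sum(log_pf_steps)
            - beta * math.log(reward + eps_min)
            - sum(log_pb_steps))


def ttb_loss(log_z: float, log_pf_steps: Sequence[float], log_pb_steps: Sequence[float],
             reward: float, beta: float, eps_min: float = 1e-3) -> float:
    """Length-normalized squared residual (Delta / T) ** 2."""
    delta = ttb_residual(log_z, log_pf_steps, log_pb_steps, reward, beta, eps_min)
    return (delta / len(log_pf_steps)) ** 2


def ttb_grad_wrt_log_z(log_z: float, log_pf_steps: Sequence[float], log_pb_steps: Sequence[float],
                       reward: float, beta: float, eps_min: float = 1e-3) -> float:
    """Eq. (102) specialized to the log Z parameter: 2 * Delta / T**2 (since dDelta/dlogZ = 1)."""
    delta = ttb_residual(log_z, log_pf_steps, log_pb_steps, reward, beta, eps_min)
    return 2.0 * delta / len(log_pf_steps) ** 2
```

A central finite difference of `ttb_loss` in `log_z` should agree with `ttb_grad_wrt_log_z`, and both go to zero as the residual does, which is the self-annealing behaviour the theorem describes.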
### I.2 Variance Bound and Comparison to REINFORCE
###### Proposition 31 (Variance Bound under Bounded Gradient).
Assume the residual gradient is uniformly bounded along training: there exists $G < \infty$ such that $\|\nabla_{\theta}\Delta(\tau)\| \leq G$ for all $\tau$, and assume $T \geq T_{\min}$ for a fixed $T_{\min} \geq 1$. Then the per-trajectory TTB gradient norm satisfies
$$\bigl\|\nabla_{\theta}\mathcal{L}_{\mathrm{TTB}}(\tau)\bigr\| \;=\; \frac{2|\Delta(\tau)|}{T^{2}}\,\|\nabla_{\theta}\Delta(\tau)\| \;\leq\; \frac{2G\,|\Delta(\tau)|}{T_{\min}^{2}}, \tag{106}$$
which yields the variance bound
$$\operatorname{Var}_{\tau}\!\bigl[\nabla_{\theta}\mathcal{L}_{\mathrm{TTB}}(\tau)\bigr] \;\leq\; \mathbb{E}_{\tau}\!\bigl[\|\nabla_{\theta}\mathcal{L}_{\mathrm{TTB}}(\tau)\|^{2}\bigr] \;\leq\; \frac{4G^{2}}{T_{\min}^{4}}\,\mathbb{E}_{\tau}\!\bigl[\Delta(\tau)^{2}\bigr]. \tag{107}$$
###### Proof.
Eq. ([106](https://arxiv.org/html/2605.14089#A9.E106)) follows from Theorem [30](https://arxiv.org/html/2605.14089#Thmtheorem30) together with the bounded-gradient assumption. Squaring and taking expectation gives Eq. ([107](https://arxiv.org/html/2605.14089#A9.E107)); the variance bound uses $\operatorname{Var} \leq \mathbb{E}[\|\cdot\|^{2}]$. ∎
As training converges, $\mathbb{E}_{\tau}[\Delta(\tau)^{2}] \to 0$ and Eq. ([107](https://arxiv.org/html/2605.14089#A9.E107)) forces $\operatorname{Var}_{\tau}[\nabla_{\theta}\mathcal{L}_{\mathrm{TTB}}] \to 0$.
In contrast, the REINFORCE estimator $g_{\text{PG}}(\tau) \coloneqq (R(\tau) - b)\,\nabla_{\theta}\log\pi_{\theta}(\tau)$ has
$$\operatorname{Var}_{\tau}[g_{\text{PG}}(\tau)] \;=\; \operatorname{Var}_{\tau}\!\bigl[(R(\tau) - b)\,\nabla_{\theta}\log\pi_{\theta}(\tau)\bigr], \tag{108}$$
which is *not forced* to vanish at convergence: it can stay strictly positive whenever the converged policy remains stochastic and sampled trajectories carry heterogeneous rewards (Lemma [32](https://arxiv.org/html/2605.14089#Thmtheorem32) below). Vanishing only occurs in the special cases of a deterministic optimal policy or rewards being identical across all sampled trajectories.
###### Lemma 32 (REINFORCE Variance Is Not Forced to Vanish).
In REINFORCE, gradient estimates are $(R(\tau) - b)\,\nabla_{\theta}\log\pi_{\theta}(\tau)$, where $b$ is a baseline. Whenever the converged policy remains stochastic and the sampled trajectories carry heterogeneous rewards, $\operatorname{Var}_{\tau}[R(\tau)] > 0$ and the estimator's variance can remain strictly positive: it is *not forced* to vanish at convergence. (Vanishing can occur in the special cases of a deterministic optimal policy, or rewards being identical across all sampled trajectories.)
By contrast, TTB uses a regression loss whose residual $\Delta$ is itself the quantity driven to zero at convergence. When $\Delta(\tau) \to 0$ pointwise, Theorem [30](https://arxiv.org/html/2605.14089#Thmtheorem30) forces the gradient norm $\|\nabla_{\theta}\mathcal{L}_{\mathrm{TTB}}\| \to 0$; under the bounded-gradient assumption of Proposition [31](https://arxiv.org/html/2605.14089#Thmtheorem31), the variance also vanishes.
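The contrast can be illustrated on a toy problem that is small enough to compute exactly. The setting below is an assumption made purely for illustration: a single-step episode with two trajectories, rewards $1$ and $1/2$, $\beta = 1$, no smoothing, $P_\phi \equiv 1$, a two-way softmax policy with one logit $\theta$, and the mean-reward baseline $b = \mathbb{E}[R]$. Both estimators are evaluated at the reward-proportional policy, which is TTB's fixed point but not REINFORCE's own optimum; the point is only that the REINFORCE estimator has nonzero variance at a stochastic policy with heterogeneous rewards, while every per-trajectory TTB gradient is zero there.

```python
import math
from fractions import Fraction


def reinforce_variance(probs, rewards, score):
    """Exact variance of g = (R - b) * dlog(pi)/dtheta with baseline b = E[R] (scalar theta)."""
    b = sum(p * r for p, r in zip(probs, rewards))
    g = [(r - b) * s for r, s in zip(rewards, score)]
    mean = sum(p * gi for p, gi in zip(probs, g))
    return sum(p * (gi - mean) ** 2 for p, gi in zip(probs, g))


rewards = [Fraction(1), Fraction(1, 2)]
z = sum(rewards)                      # partition function Z = 3/2
probs = [r / z for r in rewards]      # reward-proportional policy: [2/3, 1/3]

# Two-way softmax with one logit theta: dlog(pi_1)/dtheta = 1 - pi_1, dlog(pi_2)/dtheta = -pi_1.
score = [1 - probs[0], -probs[0]]

pg_var = reinforce_variance(probs, rewards, score)

# TTB residual per trajectory with T = 1: Delta = log Z + log pi - log R.
deltas = [math.log(float(z)) + math.log(float(p)) - math.log(float(r))
          for p, r in zip(probs, rewards)]
# Per-trajectory TTB gradient in theta: 2 * Delta / T**2 * dDelta/dtheta, with T = 1.
ttb_grads = [2.0 * d * float(s) for d, s in zip(deltas, score)]
```

In this toy case `pg_var` evaluates to the exact fraction `1/162`, while every entry of `ttb_grads` is zero up to floating-point rounding.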
## Appendix J Proof of Proposition 2 (Main Text)
Proposition 2 (main text): *TTB training induces reward-proportional sampling and yields per-step credit at no extra inference cost.*
We prove a slightly more detailed version: at convergence the gradient variance vanishes (residual $\to 0$), the learned policy samples trajectories in proportion to tempered reward, and the step importance ratio gives a multiplicative per-step decomposition of trajectory-level credit.
###### Proof.
*(i) TTB gradient variance vanishes.*
By Theorem [30](https://arxiv.org/html/2605.14089#Thmtheorem30), $\nabla_{\theta}\mathcal{L}_{\mathrm{TTB}}(\tau) = (2\Delta(\tau)/T^{2}) \cdot \nabla_{\theta}\Delta(\tau)$. As $\Delta(\tau) \to 0$ during training (which occurs at the optimum by the definition of the loss), the prefactor $2\Delta(\tau)/T^{2}$ shrinks, causing $\|\nabla_{\theta}\mathcal{L}_{\mathrm{TTB}}\| \to 0$. By Proposition [31](https://arxiv.org/html/2605.14089#Thmtheorem31), the variance is bounded by a constant multiple of $\mathbb{E}[\Delta^{2}]$, which vanishes, so $\operatorname{Var}[\nabla_{\theta}\mathcal{L}_{\mathrm{TTB}}] \to 0$.
In contrast, by Lemma [32](https://arxiv.org/html/2605.14089#Thmtheorem32), the variance of REINFORCE-style gradients (including GRPO) is not forced to vanish and can persist at convergence whenever sampled rewards remain heterogeneous.
*(ii) Policy samples in proportion to tempered reward.*
At convergence $\Delta(\tau) = 0$ for all $\tau$, so the TB condition (Theorem [9](https://arxiv.org/html/2605.14089#Thmtheorem9)) is satisfied with reward $\tilde{R}(\tau)^{\beta}$. By the GFlowNet sampling theorem (Theorem [6](https://arxiv.org/html/2605.14089#Thmtheorem6)), the conditional action-sequence distribution then satisfies $\pi_{\theta}(a_{1:T} \mid r_{1:T}, o^{\text{exec}}_{1:T}, q) \propto \tilde{R}(\tau)^{\beta}$ (matching main-text §[4.2](https://arxiv.org/html/2605.14089#S4.SS2)); marginalizing over reasoning and execution context recovers the unconditional form $\pi_{\theta}(\tau \mid q) = \tilde{R}(\tau)^{\beta}/Z_{\theta}(q)$.
*(iii) Step importance provides per-step credit decomposition.*
By Corollary [19](https://arxiv.org/html/2605.14089#Thmtheorem19), the step importance is:
$$I(t) = \frac{F(s_t)}{F(s_{t-1})} = \frac{\pi_{\theta}(a_t \mid r_t, H_{t-1})}{P_{\phi}(a_t \mid H_{t-1} \oplus o_t^{\text{exec}})}. \tag{109}$$
This decomposes the trajectory-level flow $F(\tau)$ multiplicatively:
$$F(\tau) = Z_{\theta}(q) \cdot \prod_{t=1}^{T} I(t). \tag{110}$$
Each factor $I(t)$ quantifies the amplification of flow at decision $t$, providing an interpretable per-step attribution. Unlike Monte-Carlo rollout baselines (which require additional samples), $I(t)$ is computed from the forward and backward policies already evaluated in the TTB loss. ∎
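Eqs. (109) and (110) amount to simple arithmetic in log space, sketched below. The function names are hypothetical; the inputs are the per-step forward and backward log-probabilities that the TTB loss already evaluates.

```python
import math
from typing import Sequence


def log_step_importance(log_pf_steps: Sequence[float], log_pb_steps: Sequence[float]) -> list[float]:
    """log I(t) = log pi_theta(a_t | ...) - log P_phi(a_t | ...), one value per step (Eq. 109)."""
    if len(log_pf_steps) != len(log_pb_steps):
        raise ValueError("forward and backward log-probs must align step by step")
    return [f - b for f, b in zip(log_pf_steps, log_pb_steps)]


def log_trajectory_flow(log_z: float, log_importance: Sequence[float]) -> float:
    """log F(tau) = log Z + sum_t log I(t), the log of Eq. (110)."""
    return log_z + sum(log_importance)


# Example with two steps: I(1) = 0.5 / 0.25 = 2.0 and I(2) = 0.8 / 1.0 = 0.8.
log_i = log_step_importance([math.log(0.5), math.log(0.8)], [math.log(0.25), 0.0])
flow = math.exp(log_trajectory_flow(math.log(1.5), log_i))  # 1.5 * 2.0 * 0.8
```

Working in log space keeps the product in Eq. (110) numerically stable for long trajectories, and the per-step values $\log I(t)$ are the same quantities that the curation operator $\Phi$ takes as input in Eq. (95).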
## Appendix K Proof of Proposition 3 (Main Text)
Proposition 3 (main text): *Flow-driven recursive evolution autonomously expands the skill library while preserving its atomic composability.*
We prove a slightly more detailed version with three parts: (i) saturation of the running TTB residual against the library-conditional floor $\bar{\Delta}^{*(k)}$ is a sufficient diagnostic of joint expressiveness limits and triggers phase $k \to k+1$; (ii) the curation operator $\Phi$ preserves atomic composability; (iii) training within a phase keeps flow conservation under the frozen library.
###### Proof.
*(i) Plateau saturation is a sufficient diagnostic of joint expressiveness limits.*
Within training phase $k$, the per-trajectory squared residual $\Delta(\tau \mid \mathcal{S}^{(k)}, \theta, \phi, Z_{\theta})^{2}$ enters the TTB loss as $\mathcal{L}_{\mathrm{TTB}}(\tau) = (\Delta/T)^{2}$ (Definition [10](https://arxiv.org/html/2605.14089#Thmdefinition10)). By Theorem [30](https://arxiv.org/html/2605.14089#Thmtheorem30) the gradient drives $\Delta(\tau)^{2} \to 0$ on every trajectory subject to the expressiveness of the parameterization. The infimal expected squared residual under the current library is the floor $\bar{\Delta}^{*(k)}$ (main-text Eq. [12](https://arxiv.org/html/2605.14089#S4.E12)):
$$\bar{\Delta}^{*(k)} \;\coloneqq\; \inf_{\theta, \phi, Z_{\theta}} \; \mathbb{E}_{\tau}\!\bigl[\Delta(\tau \mid \mathcal{S}^{(k)}, \theta, \phi, Z_{\theta})^{2}\bigr] \;\geq\; 0. \tag{111}$$
The two directions are asymmetric:
(Sufficiency, $\Rightarrow$) If the running mean $\overline{\Delta^{2}}_{w}^{(k)}$ (Eq. ([88](https://arxiv.org/html/2605.14089#A6.E88))) saturates, i.e., its relative decrease across $M$ consecutive windows falls below the tolerance $\rho$ (Eq. ([89](https://arxiv.org/html/2605.14089#A6.E89))), then under standard SGD descent assumptions $\overline{\Delta^{2}}_{w}^{(k)}$ has approached $\bar{\Delta}^{*(k)}$. If $\bar{\Delta}^{*(k)} > 0$, this directly evidences that the joint ($\mathcal{S}^{(k)}$, policy class, $P_{\phi}$ class, $Z_{\theta}$ family) cannot represent the reward-matching TB condition; phase $k+1$ is triggered, and skill evolution attributes this insufficiency *a fortiori* to $\mathcal{S}^{(k)}$, expanding the action space to enable further residual reduction.
(Necessity, $\Leftarrow$, only as a heuristic) Plateau saturation is *not* an exclusive signature of library inadequacy: a saturated $\overline{\Delta^{2}}_{w}^{(k)}$ may also reflect limited policy or backward-policy capacity, an under-trained partition function, exploration deficits, optimizer stagnation, or finite-batch noise. We therefore treat the plateau as a *sufficient diagnostic* for triggering evolution, with the implicit operating assumption that other capacity bottlenecks have been controlled by standard practice (warm-start, learning-rate schedules, replay buffers).
This is the precise sense in which the stagnation criterion “triggers” skill evolution: it provides a sufficient signal under controlled conditions, not an iff-equivalence with library inadequacy alone.
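The stagnation trigger can be sketched as a small stateful detector. Eqs. (88) and (89) are not reproduced in this section, so the details below are assumptions: non-overlapping windows of `window` squared residuals, relative decrease measured between consecutive window means, and a trigger once `patience` (playing the role of $M$) consecutive decreases fall below `rho`. The class and parameter names are hypothetical.

```python
class PlateauDetector:
    """Flags a plateau when the windowed mean of Delta^2 stops decreasing (assumed criterion)."""

    def __init__(self, window: int, patience: int, rho: float):
        if window < 1 or patience < 1 or rho <= 0:
            raise ValueError("window and patience must be >= 1 and rho must be > 0")
        self.window = window
        self.patience = patience
        self.rho = rho
        self._buffer: list[float] = []
        self._prev_mean: float | None = None
        self._streak = 0

    def update(self, squared_residual: float) -> bool:
        """Feed one Delta^2 value; return True only when a window closes on a plateau."""
        self._buffer.append(squared_residual)
        if len(self._buffer) < self.window:
            return False
        mean = sum(self._buffer) / self.window
        self._buffer.clear()
        if self._prev_mean is not None:
            prev = self._prev_mean
            # A zero previous mean means the residual is already at its floor.
            rel_decrease = (prev - mean) / prev if prev > 0 else 0.0
            self._streak = self._streak + 1 if rel_decrease < self.rho else 0
        self._prev_mean = mean
        return self._streak >= self.patience
```

As the necessity discussion above notes, such a detector cannot distinguish library inadequacy from optimizer stagnation or finite-batch noise, so in practice `window`, `patience`, and `rho` would need to be chosen conservatively.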
*(ii) The CGF-based curation operator $\Phi$ preserves atomic composability.*
By Lemma [27](https://arxiv.org/html/2605.14089#Thmtheorem27) (Appendix [F.5](https://arxiv.org/html/2605.14089#A6.SS5)), if $\mathcal{S}^{(k)}$ is atomically composable (Definition [19](https://arxiv.org/html/2605.14089#Thmdefinition19)) and the Skill Creator $\Psi$ is constrained to produce atomic tips in both creation and refine modes (Definition [16](https://arxiv.org/html/2605.14089#Thmdefinition16)), then $\mathcal{S}^{(k+1)} = \Phi(\mathcal{S}^{(k)};\, \{(G(s), \widetilde{\Lambda}(s))\}_{s},\, \{\log I(t)\}_{t})$ is atomically composable. The base case $\mathcal{S}^{(0)}$ is atomically composable by initialization (the seed library consists of bounded-length, self-contained tips). Induction on $k$ then yields atomic composability of $\mathcal{S}^{(k)}$ for every phase $k \geq 0$. The CGF inputs $G(s)$ and $\widetilde{\Lambda}(s)$ used to drive $\Phi$'s classification are well-defined (Lemmas [22](https://arxiv.org/html/2605.14089#Thmtheorem22)–[23](https://arxiv.org/html/2605.14089#Thmtheorem23)) and $\Phi$ acts only by partitioning $\mathcal{S}^{(k)}$ and adjoining $\Psi$'s atomic outputs; no operation introduces non-atomic tips.
*(iii) Frozen libraries within a phase guarantee flow conservation.*
Within phase $k$, the library $\mathcal{S}^{(k)}$ is fixed (Remark [8](https://arxiv.org/html/2605.14089#Thmremark8)); the environment $\mathcal{E}^{(k)}$, action space, and DAG structure $\mathcal{G}^{(k)}$ are therefore fixed. By Theorem [26](https://arxiv.org/html/2605.14089#Thmtheorem26) (DAG acyclicity) and Lemma [18](https://arxiv.org/html/2605.14089#Thmtheorem18) (tree-DAG uniqueness), $\mathcal{G}^{(k)}$ admits a unique reward-matching flow $F$ when terminal values $\{F(x) = \tilde{R}(\tau_x)^{\beta}\}_{x \in \mathcal{X}}$ are prescribed. Training $\pi_{\theta}$ and $P_{\phi}$ to satisfy the SkillFlow TB condition (Eq. ([49](https://arxiv.org/html/2605.14089#A2.E49)), Lemma [14](https://arxiv.org/html/2605.14089#Thmtheorem14)) is equivalent to solving for that flow (Theorem [9](https://arxiv.org/html/2605.14089#Thmtheorem9)). At convergence, flow conservation holds at every non-terminal state:
$$F(s) \;=\; \sum_{s' \in \mathrm{Ch}(s)} F(s \to s') \;=\; \sum_{s' \in \mathrm{Pa}(s)} F(s' \to s), \tag{112}$$
which is the defining property of a valid flow (Definition [2](https://arxiv.org/html/2605.14089#Thmdefinition2)).
At the phase transition $k \to k+1$, $\Phi$ produces $\mathcal{S}^{(k+1)}$ (Definition [17](https://arxiv.org/html/2605.14089#Thmdefinition17)); the new graph $\mathcal{G}^{(k+1)}$ retains its tree-DAG structure by Lemma [8](https://arxiv.org/html/2605.14089#Thmtheorem8) (strict history growth is unaffected by library change). To smooth the transition we warm-start $\pi_{\theta}, P_{\phi}$ from phase $k$ and reinitialize $Z_{\theta}(q)$ for the expanded action space (Algorithm 1, Step 8). Within phase $k+1$, the same reasoning yields flow conservation at convergence. By induction on $k$, flow conservation is guaranteed within every phase. ∎
## Appendix L Backward Policy Design and Implementation
This section details the design principles and implementation of the backward policy $P_{\phi}$.
### L.1 Hindsight Conditioning
The key insight is that $P_{\phi}$ conditions on information unavailable to the forward policy:
###### Definition 21 (Hindsight-Enriched State).
The hindsight-enriched state at step $t$ is:
$$H_{t-1}^{\text{hindsight}} \;\coloneqq\; H_{t-1} \oplus o_t^{\text{exec}}, \tag{113}$$
which augments the forward state $H_{t-1}$ with the execution observation $o_t^{\text{exec}}$ that $\pi_{\theta}$ could not access when selecting $a_t$.
This information asymmetry is what makes step importance meaningful:
###### Lemma 33 (Information Asymmetry and Credit Signal).
At step $t$, the forward policy $\pi_{\theta}$ selects action $a_t$ given $H_{t-1}$ (before execution). The backward policy $P_{\phi}$ evaluates the same action given $H_{t-1} \oplus o_t^{\text{exec}}$ (after execution).
If $P_{\phi}(a_t \mid H_{t-1}^{\text{hindsight}}) \ll \pi_{\theta}(a_t \mid H_{t-1})$, the action appeared good before execution but turned out poorly in hindsight. If $P_{\phi}(a_t \mid H_{t-1}^{\text{hindsight}}) \gg \pi_{\theta}(a_t \mid H_{t-1})$, the action was initially uncertain but validated by execution.
The ratio $I(t) = \pi_{\theta}/P_{\phi}$ captures this update in assessment, providing a meaningful credit signal without additional samples.
### L.2 Think-Action Separation
To focus flow balance on the decision space rather than reasoning verbosity:
###### Definition 22 (Action Token vs. Reasoning Token).
Each step $t$ contains:
- Reasoning tokens: Free-form text in $r_t$ for chain-of-thought.
- Action tokens: Structured JSON in $a_t = (\alpha_t, o_t)$ specifying the decision.
Only action tokens contribute to log-probabilities in $\pi_{\theta}$ and $P_{\phi}$:
$$\log \pi_{\theta}(a_t \mid H_{t-1}) = \sum_{j \in \mathcal{A}_t} \log \pi_{\theta}(\text{token}_j \mid \text{token}_{<j}, H_{t-1}), \tag{114}$$
where $\mathcal{A}_t$ denotes the set of positions of action tokens at step $t$. Reasoning tokens are part of the context; their content influences future decisions but does not participate in the flow balance.
This separation ensures that flow balance operates on decision quality, not reasoning length.
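Eq. (114) is a masked sum over per-token log-probabilities. The sketch below assumes the caller already has one log-probability per generated token and a boolean mask marking the action-token positions $\mathcal{A}_t$; the function name and the way the mask is obtained are hypothetical.

```python
from typing import Sequence


def action_log_prob(token_log_probs: Sequence[float], action_mask: Sequence[bool]) -> float:
    """Sum log-probs over action-token positions only; reasoning tokens are skipped."""
    if len(token_log_probs) != len(action_mask):
        raise ValueError("one mask entry is required per token")
    return sum(lp for lp, is_action in zip(token_log_probs, action_mask) if is_action)


# Example: two reasoning tokens followed by two action tokens.
step_log_prob = action_log_prob([-0.1, -2.0, -0.5, -0.25], [False, False, True, True])
```

In this example only the last two tokens count, so a longer or shorter chain of thought leaves the step's contribution to the flow balance unchanged as long as the action tokens and their conditional probabilities are the same.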
### L.3 Implementation Details
Architecture. $P_{\phi}$ shares the base LLM (Qwen3.5-9B) with $\pi_{\theta}$ but uses a separate LoRA adapter. Sharing the base model reduces memory and computation; LoRA adapters are swapped at inference time by named adapter selection.
Adapter Configuration. $\phi$-LoRA uses rank 32, targeting the $\{q, v\}$ projections in attention layers. This is smaller than the $\theta$-LoRA (rank 64, targeting $\{q, k, v, o\}$).
Token-Level Conditioning. $P_{\phi}$ receives the hindsight-enriched state as context, then evaluates the log-probability of action tokens autoregressively. This respects the causal structure: earlier action tokens do not condition on later ones.
## Appendix M Dataset Details
We evaluate SkillFlow on 14 public benchmarks covering four task categories: question answering, mathematical reasoning, interactive decision making, and code generation. Seven datasets are used as in-distribution (IID) benchmarks for training and evaluation; the remaining seven are held out as out-of-distribution (OOD) generalization tests, with the skill library frozen at end-of-training.
### M.1 In-Distribution Datasets
- HotpotQA \[Yang et al., [2018](https://arxiv.org/html/2605.14089#bib.bib6)\]: A large-scale multi-hop QA corpus where each question requires reasoning across multiple Wikipedia paragraphs to derive the answer.
- TriviaQA \[Joshi et al., [2017](https://arxiv.org/html/2605.14089#bib.bib1)\]: A reading-comprehension dataset of trivia question–answer pairs paired with evidence documents.
- MedQA \[Jin et al., [2021](https://arxiv.org/html/2605.14089#bib.bib2)\]: A multiple-choice medical QA benchmark drawn from professional medical-licensing examinations, requiring multi-step clinical reasoning.
- AIME 2026: Problems from the 2026 American Invitational Mathematics Examination, used as a difficult mathematical-reasoning benchmark.
- WebShop \[Yao et al., [2022a](https://arxiv.org/html/2605.14089#bib.bib3)\]: A simulated e-commerce environment where the agent interprets a natural-language instruction, navigates web pages, and selects the matching product.
- ALFWorld \[Shridhar et al., [2020](https://arxiv.org/html/2605.14089#bib.bib4)\]: A text-based interactive environment grounded in household tasks; the agent issues actions in language and receives observation feedback.
- SWE-bench \[Jimenez et al., [2023](https://arxiv.org/html/2605.14089#bib.bib5)\]: A benchmark of real-world GitHub issues that require generating code patches whose application resolves the issue.
### M.2 Out-of-Distribution Datasets
- MuSiQue \[Trivedi et al., [2022](https://arxiv.org/html/2605.14089#bib.bib7)\]: A composite multi-hop QA dataset built by chaining single-hop questions, designed to stress reasoning composition.
- NQ-Open: An open-domain QA derivative of Natural Questions, where the model must produce free-form answers from user-issued queries.
- MATH-Hard: The hard subset of competition mathematics problems (algebra, geometry, number theory, etc.), evaluated by exact-match correctness on the final answer.
- GPQA Diamond: The most challenging split of GPQA, containing graduate-level science questions across physics, chemistry, and biology.
- HumanEval \[Chen et al., [2021](https://arxiv.org/html/2605.14089#bib.bib8)\]: A code-generation benchmark of hand-written Python problems with hidden unit tests; correctness is evaluated by execution.
- ScienceWorld: A text-based interactive science environment that requires the agent to perform multi-step experiments and reason over physical-world dynamics.
- Mind2Web: A web-navigation benchmark where the agent performs multi-step actions across real-world websites to complete user tasks.
## Appendix NBaseline Details
We compare SkillFlow against four categories of baselines\. Unless otherwise stated, all baselines that involve fine\-tuning or reinforcement learning use the same Qwen3\.5\-9B backbone and the same training data as SkillFlow, isolating the orchestration objective and skill\-evolution mechanism as the source of any performance gap\.
### N\.1Direct\-LLM Baselines
- •Qwen3\.5\-9B: The same backbone we use as Supervisor, queried directly without any orchestration training, providing a faithful no\-orchestration reference point\.
- •v4\-flash: A strong proprietary instruction\-tuned LLM, queried in a single\-turn ReAct\-style prompting setup as a high\-capacity baseline\.
- •Claude Haiku 4\.5: Another high\-capacity proprietary baseline used to bound how much of SkillFlow’s gain arises from orchestration versus raw model strength\.
### N\.2Fine\-Tuning Baselines
- •SFT \(Qwen3\.5\-9B\): Supervised fine\-tuning on demonstration trajectories of orchestration plans, with no exploration or reward feedback\.
- •GRPO \(Qwen3\.5\-9B\)\[Shaoet al\.,[2024](https://arxiv.org/html/2605.14089#bib.bib20)\]: Group Relative Policy Optimization on the same backbone, using terminal rewards and the same initial skill library as SkillFlow\.
### N.3 Search-Based Workflow Baselines
- AFlow \[Zhang et al., [2024](https://arxiv.org/html/2605.14089#bib.bib29)\]: A workflow-search method that explores compositions of pre-defined operators using MCTS guided by an LLM judge, with no parameter learning.
### N.4 RL Agent Baselines
- AgentFlow \[Li et al., [2025b](https://arxiv.org/html/2605.14089#bib.bib67)\]: An in-the-flow agentic system that jointly optimizes planning and tool use under reinforcement learning over multi-turn interactions.
- FlowSteer \[Zhang et al., [2026b](https://arxiv.org/html/2605.14089#bib.bib48)\]: An end-to-end RL framework for interactive agentic workflow orchestration on a fixed skill library.
- SkillRL \[Xia et al., [2026](https://arxiv.org/html/2605.14089#bib.bib57)\]: A skill-augmented RL agent that grows its action space using heuristic skill-distillation triggers, representative of the static-library RL paradigm.
## Appendix O Evaluation Metrics
We adopt task\-appropriate evaluation metrics consistent with each benchmark’s standard protocol\.
#### F1 Score\.
For QA-style benchmarks (HotpotQA, TriviaQA, MuSiQue, NQ-Open), we report the token-level F1 between the predicted answer $y_i$ and the ground truth $y_i^*$ after standard text normalization:

$$\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \tag{115}$$

where $\text{Precision} = |\text{tok}(y_i) \cap \text{tok}(y_i^*)| / |\text{tok}(y_i)|$ and $\text{Recall} = |\text{tok}(y_i) \cap \text{tok}(y_i^*)| / |\text{tok}(y_i^*)|$.
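As an illustration of Eq. (115), the following is a minimal sketch of token-level F1. The normalization shown (lowercasing, stripping punctuation and English articles) is an assumption in the style of common QA evaluation scripts, not necessarily the exact normalization used in our evaluation; the function names `normalize` and `token_f1` are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Assumed normalization: lowercase, drop punctuation and articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, truth: str) -> float:
    """Token-level F1 between a predicted answer and a ground-truth answer."""
    pred, gold = normalize(prediction), normalize(truth)
    if not pred or not gold:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == gold)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For instance, a prediction with one extra token beyond a single-token ground truth has precision $0.5$ and recall $1$, giving F1 $= 2/3$.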
#### Exact Match \(EM\) / Accuracy\.
For mathematical reasoning \(AIME, MATH\-Hard\) and multi\-choice QA \(MedQA, GPQA Diamond\), we report exact\-match accuracy after canonicalization:
$$\text{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\bigl(\text{norm}(y_i) = \text{norm}(y_i^*)\bigr), \tag{116}$$

where $\text{norm}(\cdot)$ standardizes whitespace, casing, and final-answer extraction.
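Eq. (116) can be sketched as follows. The canonicalization in `norm` (taking the text after a final `answer:` marker, lowercasing, collapsing whitespace) is a hypothetical stand-in for the benchmark-specific extraction rules, which are not spelled out here.

```python
import re


def norm(answer: str) -> str:
    """Hypothetical canonicalization: final-answer extraction, casing, whitespace."""
    lowered = answer.lower()
    marker = "answer:"  # assumed marker; real extractors are benchmark-specific
    idx = lowered.rfind(marker)
    if idx != -1:
        lowered = lowered[idx + len(marker):]
    return re.sub(r"\s+", " ", lowered).strip()


def accuracy(predictions: list[str], truths: list[str]) -> float:
    """Fraction of items whose normalized prediction equals the normalized truth."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not predictions:
        return 0.0
    hits = sum(norm(p) == norm(t) for p, t in zip(predictions, truths))
    return hits / len(predictions)
```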
#### Average Score and Success Rate\.
For interactive decision-making (WebShop, ALFWorld, ScienceWorld), we follow each environment’s standard evaluator: *Average Score* reports the environment-defined task-specific score, while *Success Rate (SR)* is the fraction of episodes terminating in a fully solved state.
#### Resolved Rate\.
For SWE-bench \[Jimenez et al., [2023](https://arxiv.org/html/2605.14089#bib.bib5)\], we report the fraction of issues whose generated code patch passes the held-out test suite (the standard “resolved” metric).
#### pass@1\.
For HumanEval \[Chen et al., [2021](https://arxiv.org/html/2605.14089#bib.bib8)\], we report *pass@1*, the fraction of problems for which the first generated program passes all hidden unit tests, computed by sampling one program per problem with deterministic decoding.
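A small sketch of how this relates to the general pass@$k$ estimator of Chen et al. (2021), $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ pass: with one sample per problem ($n = k = 1$) it reduces to the plain fraction of solved problems. The helper names below are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem with n samples, c of them correct."""
    if n - c < k:
        # Every size-k subset contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_1_single_sample(passed: list[bool]) -> float:
    """pass@1 with exactly one (deterministically decoded) program per problem."""
    return sum(pass_at_k(1, int(p), 1) for p in passed) / len(passed)
```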
#### Step\-Level Metrics for Web Navigation\.
For Mind2Web, we report *Step Accuracy* (fraction of steps with correctly predicted target element and action) and *Action F1* (token-level F1 over the predicted action string against the ground truth).
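Step Accuracy as defined above can be sketched as below. The `Step` record and its two fields are illustrative assumptions about how a predicted step might be represented, not the benchmark's actual data schema.

```python
from typing import NamedTuple


class Step(NamedTuple):
    element: str  # identifier of the target element (hypothetical field)
    action: str   # action string, e.g. "CLICK" (hypothetical field)


def step_accuracy(predicted: list[Step], gold: list[Step]) -> float:
    """Fraction of steps where both the target element and the action are correct."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of steps")
    if not gold:
        return 0.0
    correct = sum(
        p.element == g.element and p.action == g.action
        for p, g in zip(predicted, gold)
    )
    return correct / len(gold)
```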
## Appendix P Computational Resources
We report the hardware and the main\-run wall\-clock and GPU\-hour cost of training SkillFlow on Qwen3\.5\-9B\.
### P.1 Hardware
All training and on-policy rollout were performed on a single server with $4\times$ NVIDIA A100-SXM4 (80 GB) GPUs, 32 logical CPU cores, 549 GB RAM, and 1.6 TB local NVMe. Software stack: CUDA 12.1, PyTorch 2.4 with DeepSpeed and PEFT for training, vLLM 0.5 for the executor, and SGLang for the supervisor.
### P.2 Main run: SkillFlow on Qwen3.5-9B
Wall-clock 73 h, $\approx 292$ GPU-hours, 250 training steps (LoRA checkpoint saved every 10 steps, 25 checkpoints in total).
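The reported figures are mutually consistent, as the short check below shows. The per-step time is a derived average that assumes steps of roughly equal duration, which is a simplification.

```python
NUM_GPUS = 4            # A100-SXM4 GPUs on the single server
WALL_CLOCK_HOURS = 73   # main-run wall-clock time
TRAINING_STEPS = 250
CHECKPOINT_EVERY = 10   # LoRA checkpoint interval, in steps

gpu_hours = NUM_GPUS * WALL_CLOCK_HOURS                    # 4 * 73 = 292
num_checkpoints = TRAINING_STEPS // CHECKPOINT_EVERY       # 250 / 10 = 25
# Derived average only; individual steps need not take equal time.
minutes_per_step = WALL_CLOCK_HOURS * 60 / TRAINING_STEPS  # 17.52
```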
## Appendix Q Case Studies
This appendix presents five complementary case studies illustrating SkillFlow’s behaviour at different granularities: training dynamics across phases \(Q\.1\), library\-level boom\-and\-prune cycles \(Q\.2\), three real evolved skills with signal attribution to claims C1–C3 \(Q\.3\), per\-step importance signals on a successful trajectory \(Q\.4\), and multi\-trajectory success/failure comparison \(Q\.5\)\. Numerical values are representative of one Qwen3\.5\-9B run; raw logs and trajectory dumps are released with the supplementary code\.
### Q.1 Training Dynamics: A Four-Phase Trajectory
Table [3](https://arxiv.org/html/2605.14089#A17.T3) samples eight representative steps spanning the full 250-step training run. Four phases are visible:
1. Bootstrap (steps 0–25): the skill library is empty (WS $= 0$); only the base policy drives reward, and $\mathcal{L}_{\text{TTB}}$ falls steeply as $Z_\theta$ adjusts.
2. Emergence (steps 25–75): the first plateau on $\mathcal{L}_{\text{TTB}}$ triggers the curation operator $\Phi$, which begins generating skills (WS grows $0 \to 14$). Reward variance is high but $\log Z_\theta$ keeps rising.
3. Maturity (steps 75–175): the boom-and-prune cycle (Q.2) operates; WS oscillates between 8 and 14 as $\hat{F}(s)$ drives prune/refine decisions.
4. Steady state (steps 175–250): WS stabilises around 11; flow entropy stays above 3.0, indicating that reward-proportional sampling preserves multiple high-reward sub-trajectories rather than collapsing to a single mode.

Table 3: Training dynamics on Qwen3.5-9B (eight representative steps from a 250-step run). $\mathcal{L}_{\text{TTB}}$: TTB loss; avg. $R$: average reward; avg. $\hat{y}$: average answer-correctness rate; avg. $|\tau|$: average trajectory length; flow ent.: $-\sum_{a} \pi_\theta(a) \log \pi_\theta(a)$ at terminal step; WS: skill-library size.
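To give a feel for the flow-entropy column, the sketch below computes $-\sum_a \pi_\theta(a) \log \pi_\theta(a)$ for a discrete distribution. It assumes the natural logarithm (entropy in nats), which the caption does not state explicitly; under that assumption an entropy above $3.0$ corresponds to an effective support $e^{H}$ of more than about $20$ equally likely actions.

```python
import math


def policy_entropy(probs: list[float]) -> float:
    """Shannon entropy in nats of a discrete action distribution."""
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing (limit of p * log p as p -> 0).
    return -sum(p * math.log(p) for p in probs if p > 0)


def effective_actions(probs: list[float]) -> float:
    """exp(H): size of a uniform distribution with the same entropy."""
    return math.exp(policy_entropy(probs))
```

A collapsed policy that puts all mass on one action has entropy $0$, whereas a uniform policy over $21$ actions has entropy $\ln 21 \approx 3.04$.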
### Q.2 Skill Library Evolution: Boom-and-Prune Cycles
The library size traces a characteristic boom-and-prune cycle rather than monotonically growing. Table [4](https://arxiv.org/html/2605.14089#A17.T4) shows snapshots at eight phase boundaries; two “boom” peaks (WS $= 22$ at step 50; WS $= 18$ at step 130) are followed by single-step prune sweeps that remove $14$ and $8$ skills respectively. The mature library settles around $11$ active skills covering all seven IID task categories.

Boom mechanism. A boom is the result of a single $\Phi$ invocation at a phase boundary: when the running mean of $\Delta(\tau)^2$ saturates against $\bar{\Delta}^{*(k)}$ (Eq. [12](https://arxiv.org/html/2605.14089#S4.E12)), the curator drains the accumulated $(\tau^+, \tau^-)$ pair buffer at high-$|\log I(t)|$ steps and asks $\Psi$ to synthesise candidate tips for every uncovered decision gap. Early phases accumulate the largest buffers because the policy is still exploring, which is why peak #1 (step 50, WS $= 22$) is larger than peak #2 (step 130, WS $= 18$).

Prune mechanism and convergence. Newly created skills enter the library with reset $\hat{F}(s)$; on the next batch the centred log-flow share $\widetilde{\Lambda}(s) = \Lambda^{(s)}_1 - \mathbb{E}_{s'}[\Lambda^{(s')}_1]$ flags those that fail to attract flow and removes them ($-14$ after peak #1, $-8$ after peak #2). Once the policy class is expressive enough for the current task mix, $\widetilde{\Lambda}(s)$ stops promoting new candidates and the library stabilises: steps 195 and 250 show *identical* $11$-skill compositions, indicating the minimum sufficient set under the trained $\pi_\theta$. Cumulative creation across the run ($\approx 63$ skills) is $\approx 5.7\times$ the final library size, evidencing that aggressive over-creation followed by flow-driven pruning is more effective than monotonic growth and confirming the $\hat{F}(s)$-driven design of §[4.3](https://arxiv.org/html/2605.14089#S4.SS3).
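The pruning rule can be illustrated with a minimal sketch of the centring step $\widetilde{\Lambda}(s) = \Lambda^{(s)}_1 - \mathbb{E}_{s'}[\Lambda^{(s')}_1]$. The skill names, the input values, and the pruning threshold below are all hypothetical; in particular, the threshold is an assumption introduced for illustration and is not a value taken from our configuration.

```python
def centred_log_flow(log_flow_share: dict[str, float]) -> dict[str, float]:
    """Subtract the library-wide mean log-flow share from each skill's share."""
    mean = sum(log_flow_share.values()) / len(log_flow_share)
    return {skill: value - mean for skill, value in log_flow_share.items()}


def skills_to_prune(log_flow_share: dict[str, float], threshold: float = -1.0) -> list[str]:
    """Flag skills whose centred share falls below an (assumed) threshold."""
    centred = centred_log_flow(log_flow_share)
    return sorted(skill for skill, value in centred.items() if value < threshold)


# Illustrative values only: one new skill attracts far less flow than the others.
library = {"tip-a": -0.5, "tip-b": -0.7, "tip-c": -4.2}
flagged = skills_to_prune(library)
```

Centring makes the rule relative: a skill is flagged because it attracts less flow than its peers, not because its absolute log-flow is low.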
Table 4: Skill-library snapshots at phase boundaries (all per-task sums match WS exactly). Cumulative creation count over the run: alfworld $\approx 16$, math $\approx 12$, webshop $\approx 11$, multi-hop $\approx 8$, factual $\approx 7$, code $\approx 6$, science $\approx 3$. Harder, longer-horizon tasks attract more creation activity, consistent with the $\hat{F}(s)$-driven curation in §[4.3](https://arxiv.org/html/2605.14089#S4.SS3). Snapshot timestamps are chosen at phase boundaries; intermediate steps sampled in Table [3](https://arxiv.org/html/2605.14089#A17.T3) (e.g., $t = 150$ with WS $= 11$) lie between boom-prune events and reflect post-curation states.
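The aggregate numbers quoted in this subsection follow from the per-task creation counts in the caption and the peak/prune sizes in the text, as the check below shows (all counts are the approximate values reported above).

```python
# Approximate cumulative creation counts per task category (from the Table 4 caption).
created = {
    "alfworld": 16, "math": 12, "webshop": 11, "multi-hop": 8,
    "factual": 7, "code": 6, "science": 3,
}
FINAL_LIBRARY_SIZE = 11

total_created = sum(created.values())                     # 63
creation_to_final = total_created / FINAL_LIBRARY_SIZE    # about 5.7

# Library size immediately after each single-step prune sweep.
after_peak_1 = 22 - 14   # 8 skills remain after peak #1
after_peak_2 = 18 - 8    # 10 skills remain after peak #2
```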
### Q.3 Real Evolved Skills with Signal Attribution
Beyond aggregate statistics, the most direct evidence of SkillFlow’s contribution is the *content* of the skills it produces. We show three library members with the highest mean log-flow $G(s)$ at the end of training; each is paired with the flow-signal that drove its emergence and the paper claim it evidences.
Skill A: `tip-webshop` (success $78.5\%$ over $200$ invocations; claim C1, TTB diversity preservation)

Curation trigger. $I(t)$ flagged `click[back_to_search]` as a high-importance step that almost always coincided with terminal $R = 0$; $\Phi$ created the skill from the resulting success/failure pair set.

Rule Zero: never go back. Once you leave search results, you are committed. NEVER `click[back_to_search]`: every trajectory using it scored $R = 0.0$.

Option selection. If instruction says “pink” and options are \[pink \|\| pink light blue \|\| pink purple\], click *only* “pink”. Each click in the same category *overwrites* the previous selection (clicking “pink” then “pink light blue” $\to$ bought “pink light blue”, scoring $0.667$ instead of $1.0$).

Counter-intuitive trade-off. If a product has no matching options, `click[buy_now]` anyway: a partial match ($R \in [0.03, 0.75]$) beats a back-to-search loop ($R = 0$).

Why baselines miss this. The trade-off “buy partial $\succ$ loop” is unreachable from REINFORCE: it requires keeping a $0.03$-reward trajectory alive in the batch long enough to discover it dominates a $0.0$ loop. Reward-proportional sampling preserves this low-but-non-zero mode; mode-collapsing baselines never see it.
Skill B: `tip-alfworld-desklamp` (success $42.5\%$ over $212$ invocations; claim C2, zero-cost per-step credit)

Hidden game-engine rule discovered. The ALFWorld documentation does not state this; SkillFlow extracted it from per-step credit signals on success/failure trajectory pairs.

Win condition. You are *holding the target item* *and* *the desklamp is on*. The task auto-completes *instantly* when both are true. No manual completion command exists, and `examine X` *never* triggers completion.

Fatal mistakes (from evidence). (i) `move X to Y` drops the item and breaks the win condition. (ii) `examine X with desklamp 1` after lamp-on never completes ($5$–$7$ wasted retries observed in failed trajectories). (iii) `go to desklamp 1` fails: desklamp is not a navigable location ($4{+}$ wasted retries). $\Rightarrow$ You do NOT need the item and the lamp at the same location.

Why baselines miss this. The hidden rule surfaces because $P_\phi$, conditioned on the post-execution observation, assigns identical hindsight log-probabilities to `take` actions *regardless of the agent’s location*, while wasted `examine` / repeated `go to` actions receive vanishing $\log I(t)$. The backward policy thus localises which decisions were actually responsible for the reward, something terminal-only baselines cannot recover.
Skill C: `tip-factual-qa` (success $77.2\%$ over $224$ invocations; claim C3, flow-driven curation)

Quantified by sibling-query A/B comparisons sampled in the same batch.

Always search before answering. Skipping search yields $R \approx 0.05$; searching first yields $R \geq 1.13$.

Query construction (reward gap up to $1.07$ between siblings). Lead with the most distinctive entity. Example: “TV detective George Toolan DS” beats “DS George Toolan TV detective”. Include concrete numbers/units; do *not* pre-commit to a wrong candidate name in the query, as this poisons retrieval.

Hard rules. Never use `lookup` (returns `NO_MATCH`; $R = 0.16$ at 6 steps vs $R = 1.13$ at 3 steps). On a `[REPEATED]` search, stop and `answer` immediately.

Why baselines miss this. Sibling-query reward gaps are observable only when both token-orderings appear in the same batch, a direct consequence of reward-proportional sampling. $I(t)$ then localises the search step as the high-importance decision, and $\hat{F}(s)$ ranks “query-construction” as a high-flow skill family worth retaining and refining.
#### Take\-away\.
The three skills cover three distinct emergence modes: explicit forbidden actions from $(\tau^+, \tau^-)$ pairs (A), hidden environment dynamics from per-step credit (B), and micro-decision sensitivity from in-batch siblings (C). Together they directly instantiate the three SkillFlow claims.
### Q.4 Per-Step Importance Signals: An ALFWorld Trajectory
Table [5](https://arxiv.org/html/2605.14089#A17.T5) traces a successful 9-step trajectory $\tau^+$ on the ALFWorld task “*put two watches on shelf*” (reward $\tilde{R}(\tau^+) = 0.86$). The step importance $I(t)$ separates two regimes:
- Confirmed steps ($I(t) \ll 1$, marked $\diamond$): forward and backward policies agree, indicating routine deterministic moves whose role becomes obvious after the action is taken.
- Critical steps ($I(t) \gg 1$, marked $\star$): the forward policy chose a low-prior action that the hindsight backward strongly endorses; these are the decisions that drove success.

The four high-$I(t)$ steps ($3, 4, 6, 8$) cleanly identify the *navigate–pickup–place* pattern that distinguishes $\tau^+$ from same-query failure trajectories.
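The two regimes can be sketched as a simple thresholding of $\log I(t) = \log \pi_\theta - \log P_\phi$, following the definition of $I(t)$ given with Table 5. The symmetric threshold of $1$ on $|\log I(t)|$ and the intermediate "neutral" label are assumptions made for illustration, as are the log-probability values in the usage note.

```python
import math


def step_importance(log_pi: float, log_p_back: float) -> float:
    """I(t) = exp(log pi_theta - log P_phi), computed from the two log-probabilities."""
    return math.exp(log_pi - log_p_back)


def classify_step(log_pi: float, log_p_back: float, log_threshold: float = 1.0) -> str:
    """Label a step by its log-importance; the threshold is an assumed value."""
    log_i = log_pi - log_p_back
    if log_i > log_threshold:
        return "critical"    # I(t) >> 1
    if log_i < -log_threshold:
        return "confirmed"   # I(t) << 1
    return "neutral"         # neither regime (label introduced for illustration)
```

For example, a step with forward log-probability $-2$ and backward log-probability $-14$ has $\log I(t) = 12$ and is labelled critical.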
Critical-step interpretation. The two highly-critical ($\star\star$) decisions `take watch 1` (step 4) and `move watch 1 to shelf 1` (step 6) are both *state-changing physical interactions* rather than navigation. Their backward log-probabilities ($\log P_\phi \approx -14$) are an order of magnitude below their forward values, reflecting that, conditional on the post-execution observation, hindsight assigns very high credit to actions that actually advanced the world state toward the goal. The single-star navigation steps connect these criticals into the navigate $\to$ pickup $\to$ place chain, while step 2 (`go to shelf 1`, $I(t) = 0.001$, marked $\diamond$) flags a wasted early navigation, exactly the kind of redundancy that the `tip-alfworld-routing` skill (Q.3, Skill B) is designed to prevent.

Telescoping closure. The cumulative log-flow $\log F(H_t)$ telescopes from $\log Z_\theta(q) = -2.30$ at $t = 0$ to $+47.95$ at the terminal step, matching the sum $\sum_{t' \leq t} \log I(t')$ (Eq. [11](https://arxiv.org/html/2605.14089#S4.E11)) to within rounding. This trajectory-level identity is the empirical analogue of the Detailed-Balance condition $F(H)\,P_F(H' \mid H) = F(H')\,P_B(H \mid H')$ at every edge (Appendix [D.1](https://arxiv.org/html/2605.14089#A4.SS1)), and serves as a sanity check that the per-step $I(t)$ values are mutually consistent, a property that ablating $P_\phi$ (Table [2](https://arxiv.org/html/2605.14089#S5.T2), $-$Backward policy) immediately breaks.

Table 5: Per-step decomposition of a successful ALFWorld trajectory. $K_t$: action-token count; $I(t) = \pi_\theta(a_t \mid r_t, H_{t-1}) / P_\phi(a_t \mid H_{t-1} \oplus o_t^{\text{exec}}) = \exp(\log \pi_\theta - \log P_\phi)$; $\log F(H_t) = \log Z_\theta(q) + \sum_{t' \leq t} \log I(t')$ (main-text Eq. [11](https://arxiv.org/html/2605.14089#S4.E11)). $\diamond$: confirmed step; $\star$ / $\star\star$: critical / highly-critical decision. Cumulative log-flow at $t = 9$ matches $-2.30 + \sum_{t'=1}^{9} \log I(t') = 47.95$ exactly.
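The telescoping relation $\log F(H_t) = \log Z_\theta(q) + \sum_{t' \leq t} \log I(t')$ amounts to a running sum, sketched below. Only the two endpoints ($-2.30$ and $47.95$) come from the text; the three per-step values in `toy_log_importance` are toy placeholders and are not the entries of Table 5.

```python
from itertools import accumulate


def cumulative_log_flow(log_z: float, log_importance: list[float]) -> list[float]:
    """Return [log F(H_0), ..., log F(H_T)] as a running sum starting at log Z."""
    return list(accumulate(log_importance, initial=log_z))


# Toy placeholder values, chosen only to show the mechanics.
toy_log_importance = [0.5, -1.0, 2.0]
toy_flows = cumulative_log_flow(-2.30, toy_log_importance)

# From the reported endpoints, the nine log-importances must sum to 50.25.
implied_total_log_importance = 47.95 - (-2.30)
```

Because each $\log F(H_t)$ depends on all earlier $\log I(t')$, a single inconsistent per-step value shifts every later entry, which is what makes the terminal match a useful sanity check.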
### Q.5 Multi-Trajectory Comparison: SWE-bench Code Generation
Table [6](https://arxiv.org/html/2605.14089#A17.T6) contrasts four trajectories sampled from a single SWE-bench query (a `pydap` signed-bytes patch). The two successes ($\tau^+_1$, $\tau^+_3$) and two failures ($\tau^-_2$, $\tau^-_4$) reveal a clean structural difference: every successful trajectory contains at least two high-$I(t)$ `edit_file` steps, whereas failures stall in repeated `search_code` / `view_file` loops without ever issuing a successful edit.

Table 6: Four-trajectory comparison on a single SWE-bench query. Successful trajectories carry a balanced TTB residual ($\Delta/T$ near zero) and concentrate $|\log I(t)| > 1$ on `edit_file` / `verify` steps; failures show large positive residuals and place high importance on early-stage search/view actions that never lead to a code change. The critical-step pattern is the signal that $\Psi$ uses to synthesise new code-generation tips at phase boundaries.

The reward gap between the success and failure clusters is roughly $0.4$–$0.5$, and the failure trajectories’ high importance on non-editing actions is exactly the “*where* are the gaps?” signal that drives $\Psi$ to create a new `tip-code-generation` skill specifying the search $\to$ edit $\to$ verify ordering.
## Appendix R Limitations
SkillFlow targets one specific question: *can flow-based training drive recursive skill evolution from end-to-end task feedback alone?* Within that scope, our results substantiate the three claims (TTB convergence, zero-cost per-step credit, plateau-driven curation). The method’s effectiveness, however, inherits two properties of the underlying language model that are themselves the subject of separate research lines.

Reliance on long-context memory. Multi-turn orchestration grows the supervisor’s input by the full history $H_t = H_{t-1} \oplus (r_t, a_t, o_t^{\text{exec}})$ at every step. SkillFlow therefore relies on the backbone’s ability to attend to and faithfully use long histories: backbones with weaker long-context fidelity see the per-step credit signal degrade as trajectories grow, since the hindsight backward $P_\phi$ must condition on increasingly distant context. Improving long-context modeling itself is an active research line whose advances stack with our training recipe but lie outside its scope.

Reliance on backbone reasoning capacity. As shown in §[5.4](https://arxiv.org/html/2605.14089#S5.SS4) (RQ3), SkillFlow lifts every backbone but cannot replace the underlying base capability: orchestration amplifies what the model already knows about decomposition and tool use rather than installing new reasoning skills. Where the bottleneck is core reasoning rather than orchestration, additional pretraining or distillation is required; this remains complementary to, but outside the scope of, the present paper.
## NeurIPS Paper Checklist
1. Claims
   - Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
   - Answer: \[Yes\]
   - Justification: The abstract and Introduction (§[1](https://arxiv.org/html/2605.14089#S1)) list three contributions (reward-proportional TTB training, a hindsight backward policy with zero-cost per-step credit, and flow-driven recursive skill evolution), each formalised in §[4](https://arxiv.org/html/2605.14089#S4) and validated by RQ1–RQ5 in §[5](https://arxiv.org/html/2605.14089#S5).
2. Limitations
   - Question: Does the paper discuss the limitations of the work performed by the authors?
   - Answer: \[Yes\]
   - Justification: An explicit Limitations appendix (Appendix [R](https://arxiv.org/html/2605.14089#A18)) discusses the method’s reliance on the backbone’s long-context memory and on its underlying reasoning capacity.
3. Theory assumptions and proofs
   - Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
   - Answer: \[Yes\]
   - Justification: All three propositions (Prop. [1](https://arxiv.org/html/2605.14089#Thmtheorem1), Prop. [2](https://arxiv.org/html/2605.14089#Thmtheorem2), Prop. [3](https://arxiv.org/html/2605.14089#Thmtheorem3)) state their assumptions inline; their full proofs appear in Appendices [E](https://arxiv.org/html/2605.14089#A5), [J](https://arxiv.org/html/2605.14089#A10), and [K](https://arxiv.org/html/2605.14089#A11), with supporting derivations in Appendices A–D, F, H, and I.
4. Experimental result reproducibility
   - Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
   - Answer: \[Yes\]
   - Justification: §[5.1](https://arxiv.org/html/2605.14089#S5.SS1) reports the experimental setup; Appendices [M](https://arxiv.org/html/2605.14089#A13), [N](https://arxiv.org/html/2605.14089#A14), [O](https://arxiv.org/html/2605.14089#A15), and [L](https://arxiv.org/html/2605.14089#A12) provide datasets, baselines, evaluation metrics, and backward-policy implementation details, while Appendix [Q](https://arxiv.org/html/2605.14089#A17) gives per-step traces that allow direct verification.
5. Open access to data and code
   - Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
   - Answer: \[Yes\]
   - Justification: An anonymised repository containing training/evaluation code, configuration files, and skill-library snapshots is available at [https://anonymous.4open.science/r/SkillFlow-E850](https://anonymous.4open.science/r/SkillFlow-E850). All datasets used are public and cited in §[5.1](https://arxiv.org/html/2605.14089#S5.SS1) and Appendix [M](https://arxiv.org/html/2605.14089#A13).
6. Experimental setting/details
   - Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer) necessary to understand the results?
   - Answer: \[Yes\]
   - Justification: §[5.1](https://arxiv.org/html/2605.14089#S5.SS1) states the backbone, baselines, and metric definitions, and Appendices [M](https://arxiv.org/html/2605.14089#A13)–[O](https://arxiv.org/html/2605.14089#A15) extend these with dataset, baseline, and metric details; full hyperparameters and optimizer settings appear in the supplementary code.
7. Experiment statistical significance
   - Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
   - Answer: \[Yes\]
   - Justification: Tables [1](https://arxiv.org/html/2605.14089#S4.T1) and [2](https://arxiv.org/html/2605.14089#S5.T2) report mean $\pm$ standard deviation across multiple training runs; the variability source is randomness in initialisation and trajectory sampling.
36. 8\. Experiments compute resources
37. Question: For each experiment, does the paper provide sufficient information on the computer resources \(type of compute workers, memory, time of execution\) needed to reproduce the experiments?
38. Answer: \[Yes\]
39. Justification: Hardware specifications, main\-run wall\-clock time, GPU\-hours, and hyperparameters are reported in Appendix [P](https://arxiv.org/html/2605.14089#A16)\.
40. Guidelines:
    - The answer \[N/A\] means that the paper does not include experiments\.
    - The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage\.
    - The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute\.
    - The paper should disclose whether the full research project required more compute than the experiments reported in the paper \(e\.g\., preliminary or failed experiments that didn’t make it into the paper\)\.
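41. 9\. Code of ethics
42. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?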
43. Answer: \[Yes\]
44. Justification: The work uses publicly released datasets and open\-source LLMs, involves no human subjects, and complies with the NeurIPS Code of Ethics throughout\.
45. Guidelines:
    - The answer \[N/A\] means that the authors have not reviewed the NeurIPS Code of Ethics\.
    - If the authors answer \[No\], they should explain the special circumstances that require a deviation from the Code of Ethics\.
    - The authors should make sure to preserve anonymity \(e\.g\., if there is a special consideration due to laws or regulations in their jurisdiction\)\.
46. 10\. Broader impacts
47. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
48. Answer: \[Yes\]
49. Justification: On the positive side, SkillFlow lowers the engineering and compute cost of agentic systems and may reduce data\-collection burden by reusing distilled skills\. On the negative side, it inherits the dual\-use concerns of capable LLM\-based agents, while introducing no novel attack vector beyond existing autonomous\-agent stacks\.
50. Guidelines:
    - The answer \[N/A\] means that there is no societal impact of the work performed\.
    - If the authors answer \[N/A\] or \[No\], they should explain why their work has no societal impact or why the paper does not address societal impact\.
    - Examples of negative societal impacts include potential malicious or unintended uses \(e\.g\., disinformation, generating fake profiles, surveillance\), fairness considerations \(e\.g\., deployment of technologies that could make decisions that unfairly impact specific groups\), privacy considerations, and security considerations\.
    - The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments\. However, if there is a direct path to any negative applications, the authors should point it out\. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate Deepfakes for disinformation\. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster\.
    - The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from \(intentional or unintentional\) misuse of the technology\.
    - If there are negative societal impacts, the authors could also discuss possible mitigation strategies \(e\.g\., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML\)\.
51. 11\. Safeguards
52. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse \(e\.g\., pre\-trained language models, image generators, or scraped datasets\)?
53. Answer: \[N/A\]
54. Justification: SkillFlow is a training\-time framework built on existing public LLMs and benchmarks; no new high\-risk pretrained models or scraped datasets are released\.
55. Guidelines:
    - The answer \[N/A\] means that the paper poses no such risks\.
    - Released models that have a high risk for misuse or dual\-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters\.
    - Datasets that have been scraped from the Internet could pose safety risks\. The authors should describe how they avoided releasing unsafe images\.
    - We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort\.
56. 12\. Licenses for existing assets
57. Question: Are the creators or original owners of assets \(e\.g\., code, data, models\), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
58. Answer: \[Yes\]
59. Justification: All datasets and backbone models used are cited with their original publications in §[5\.1](https://arxiv.org/html/2605.14089#S5.SS1) and Appendix [M](https://arxiv.org/html/2605.14089#A13); each asset is used in compliance with its publicly stated license terms\.
60. Guidelines:
    - The answer \[N/A\] means that the paper does not use existing assets\.
    - The authors should cite the original paper that produced the code package or dataset\.
    - The authors should state which version of the asset is used and, if possible, include a URL\.
    - The name of the license \(e\.g\., CC\-BY 4\.0\) should be included for each asset\.
    - For scraped data from a particular source \(e\.g\., website\), the copyright and terms of service of that source should be provided\.
    - If assets are released, the license, copyright information, and terms of use in the package should be provided\. For popular datasets, [paperswithcode\.com/datasets](https://paperswithcode.com/datasets) has curated licenses for some datasets\. Their licensing guide can help determine the license of a dataset\.
    - For existing datasets that are re\-packaged, both the original license and the license of the derived asset \(if it has changed\) should be provided\.
    - If this information is not available online, the authors are encouraged to reach out to the asset’s creators\.
61. 13\. New assets
62. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
63. Answer: \[Yes\]
64. Justification: The SkillFlow framework, evolved skill libraries, and example trajectories are released in the anonymised supplementary material with documentation covering training scripts, configuration, and per\-skill metadata\.
65. Guidelines:
    - The answer \[N/A\] means that the paper does not release new assets\.
    - Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates\. This includes details about training, license, limitations, etc\.
    - The paper should discuss whether and how consent was obtained from people whose asset is used\.
    - At submission time, remember to anonymize your assets \(if applicable\)\. You can either create an anonymized URL or include an anonymized zip file\.
66. 14\. Crowdsourcing and research with human subjects
67. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation \(if any\)?
68. Answer: \[N/A\]
69. Justification: The study involves no crowdsourced annotation and no human subjects\.
70. Guidelines:
    - The answer \[N/A\] means that the paper does not involve crowdsourcing nor research with human subjects\.
    - Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper\.
    - According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector\.
71. 15\. Institutional review board \(IRB\) approvals or equivalent for research with human subjects
72. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board \(IRB\) approvals \(or an equivalent approval/review based on the requirements of your country or institution\) were obtained?
73. Answer: \[N/A\]
74. Justification: No human\-subjects research is conducted; IRB approval is therefore not required\.
75. Guidelines:
    - The answer \[N/A\] means that the paper does not involve crowdsourcing nor research with human subjects\.
    - Depending on the country in which research is conducted, IRB approval \(or equivalent\) may be required for any human subjects research\. If you obtained IRB approval, you should clearly state this in the paper\.
    - We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution\.
    - For initial submissions, do not include any information that would break anonymity \(if applicable\), such as the institution conducting the review\.
76. 16\. Declaration of LLM usage
77. Question: Does the paper describe the usage of LLMs if it is an important, original, or non\-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required\.
78. Answer: \[Yes\]
79. Justification: LLMs are central to the methodology: the Supervisor and Executor are LLM\-based components, with Qwen3\.5\-9B \(LoRA\-tuned\) as the primary trainable backbone and frontier proprietary models as alternative backbones \(§[5\.1](https://arxiv.org/html/2605.14089#S5.SS1), Appendix [L](https://arxiv.org/html/2605.14089#A12)\)\. The forward and hindsight backward policies, the TTB objective, and skill creation all operate over LLM next\-token distributions\.
80. Guidelines:
    - The answer \[N/A\] means that the core method development in this research does not involve LLMs as any important, original, or non\-standard components\.
    - Please refer to our LLM policy in the NeurIPS handbook for what should or should not be described\.