AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs
Summary
AstraFlow is a dataflow-oriented RL system that enables efficient multi-policy collaborative training and elastic scaling for agentic LLMs, speeding up multi-policy collaborative training by 2.7x compared with existing RL systems.
Cached at: 05/19/26, 06:33 PM
Source: https://huggingface.co/papers/2605.15565
Abstract
AstraFlow is a dataflow-oriented reinforcement learning system that enables efficient multi-policy collaborative training and elastic scaling across diverse compute resources for large language model agents.
Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow across math, code, search, and AgentBench workloads, showing that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves comparable or better accuracy than existing RL systems while speeding up training time by 2.7x.
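The abstract gives no API details, so the following is only a minimal Python sketch of what decoupling rollout services, dataflow management, and training might look like. It is not AstraFlow's actual interface. Every name in it (`Trajectory`, `DataflowManager`, `RolloutService`, `Trainer`, and the policy names `planner` and `solver`) is hypothetical, and the placeholder rewards stand in for real agent rollouts.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Trajectory:
    policy_id: str       # which policy produced this trajectory
    policy_version: int  # weights version the rollout service was using
    reward: float        # placeholder scalar reward


class DataflowManager:
    """Hypothetical buffer between rollout services and trainers.

    It routes trajectories by policy id, so neither side needs a
    reference to the other.
    """

    def __init__(self):
        self._queues = defaultdict(deque)

    def put(self, traj):
        self._queues[traj.policy_id].append(traj)

    def take(self, policy_id, max_items):
        """Remove and return up to `max_items` trajectories for one policy."""
        queue = self._queues[policy_id]
        batch = []
        while queue and len(batch) < max_items:
            batch.append(queue.popleft())
        return batch


class RolloutService:
    """Produces trajectories on its own schedule; knows nothing about trainers."""

    def __init__(self, policy_id, dataflow):
        self.policy_id = policy_id
        self.dataflow = dataflow
        self.policy_version = 0

    def generate(self, n):
        for i in range(n):
            # Assumption: a real service would run the agent in an environment.
            traj = Trajectory(self.policy_id, self.policy_version, float(i))
            self.dataflow.put(traj)


class Trainer:
    """Consumes data for one policy; does not drive the rollout services."""

    def __init__(self, policy_id, dataflow, batch_size=4):
        self.policy_id = policy_id
        self.dataflow = dataflow
        self.batch_size = batch_size
        self.version = 0

    def step(self):
        """Consume one batch; return how many trajectories were used."""
        batch = self.dataflow.take(self.policy_id, self.batch_size)
        if batch:
            # Assumption: a real trainer would compute a policy-gradient update here.
            self.version += 1
        return len(batch)


# Two policies share one dataflow manager; no component calls another directly.
dataflow = DataflowManager()
rollouts = [RolloutService("planner", dataflow), RolloutService("solver", dataflow)]
trainers = [Trainer("planner", dataflow), Trainer("solver", dataflow)]

# Scaling out is just registering another producer; trainers are unchanged.
rollouts.append(RolloutService("solver", dataflow))

for service in rollouts:
    service.generate(4)
for trainer in trainers:
    trainer.step()
```

In this sketch, adding a policy or another rollout worker only means constructing another object against the shared buffer, which is the kind of extension the abstract says should not require system-level code changes. Weight synchronization, scheduling, and cross-region transport are deliberately left out.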
Similar Articles
When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs
This paper studies when end-to-end reinforcement learning training improves multi-agent LLM workflows, comparing shared-policy and isolated-policy training across different workflows, tasks, and model scales, revealing conditional tradeoffs.
Learning Agentic Policy from Action Guidance
The paper proposes ActGuide-RL, a method for training agentic policies in LLMs by using human action data as guidance to overcome exploration barriers in reinforcement learning without extensive supervised fine-tuning.
AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning
AgentJet is a distributed swarm training framework for LLM agent reinforcement learning that decouples agent rollouts from model optimization, enabling heterogeneous multi-agent RL, multi-task training, fault tolerance, and live code iteration with 1.5-10x training speedup. It also introduces an automated research system capable of autonomously conducting multi-day RL studies on large-scale clusters.
UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering
UniSteer introduces a text-guided activation flow matching method to learn a universal conditional velocity field in activation space, enabling versatile LLM behavior control and classification tasks without task-specific intervention modules.
SkillFlow: Flow-Driven Recursive Skill Evolution for Agentic Orchestration
SkillFlow proposes a flow-driven recursive skill evolution framework for LLM-based agentic orchestration, using Tempered Trajectory Balance to prevent strategy collapse and provide transparent credit assignment. Experiments on 14 datasets show significant improvements over baselines in QA, math, code, and decision-making tasks.