AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

Hugging Face Daily Papers

Summary

AstraFlow is a dataflow-oriented RL system that enables efficient multi-policy collaborative training and elastic scaling for agentic LLMs, achieving a 2.7x training speedup over existing systems.

Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow across math, code, search, and AgentBench workloads, showing that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves comparable or better accuracy than existing RL systems while speeding up training time by 2.7x.
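
The decoupling described above can be illustrated with a small sketch. This is not AstraFlow's code or API: every class and method name here (`Trajectory`, `DataflowBuffer`, `RolloutService`, `Trainer`, `max_staleness`) is hypothetical, and the staleness filter is only an assumed example of a data algorithm that could live in the dataflow layer. The point is that rollout services and trainers never call each other; they interact only through a shared dataflow component, so adding a second policy means adding components, not changing a central training loop.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Trajectory:
    policy_id: str       # which policy produced this rollout
    policy_version: int  # weight version used during generation
    reward: float


class DataflowBuffer:
    """Hypothetical stand-in for a dataflow-management component."""

    def __init__(self, max_staleness: int = 1):
        # Assumed data algorithm: drop rollouts that are too many versions old.
        self.max_staleness = max_staleness
        self._queues = defaultdict(deque)  # one FIFO queue per policy

    def put(self, traj: Trajectory) -> None:
        self._queues[traj.policy_id].append(traj)

    def get_batch(self, policy_id: str, current_version: int, batch_size: int) -> list:
        """Pop up to batch_size sufficiently fresh trajectories for one policy."""
        batch = []
        queue = self._queues[policy_id]
        while queue and len(batch) < batch_size:
            traj = queue.popleft()
            if current_version - traj.policy_version <= self.max_staleness:
                batch.append(traj)
            # Otherwise the trajectory is stale and is silently discarded.
        return batch


class RolloutService:
    """Produces trajectories; knows only the buffer, not the trainer."""

    def __init__(self, policy_id: str, buffer: DataflowBuffer):
        self.policy_id = policy_id
        self.buffer = buffer

    def generate(self, policy_version: int, rewards: list) -> None:
        # Rewards are passed in directly to keep the sketch deterministic.
        for reward in rewards:
            self.buffer.put(Trajectory(self.policy_id, policy_version, reward))


class Trainer:
    """Consumes trajectories; knows only the buffer, not the rollout service."""

    def __init__(self, policy_id: str, buffer: DataflowBuffer):
        self.policy_id = policy_id
        self.buffer = buffer
        self.version = 0

    def step(self, batch_size: int) -> int:
        batch = self.buffer.get_batch(self.policy_id, self.version, batch_size)
        if batch:
            self.version += 1  # placeholder for a real gradient update
        return len(batch)
```

In this sketch, a two-policy setup is just two `RolloutService` and two `Trainer` instances sharing one `DataflowBuffer`, each keyed by its own `policy_id`. A real system would run these components as separate processes or services and would replace the version bump with an actual optimizer step.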

Cached at: 05/19/26, 06:33 PM


Source: https://huggingface.co/papers/2605.15565

Abstract

AstraFlow is a dataflow-oriented reinforcement learning system that enables efficient multi-policy collaborative training and elastic scaling across diverse compute resources for large language model agents.





Similar Articles

Learning Agentic Policy from Action Guidance

arXiv cs.CL

The paper proposes ActGuide-RL, a method for training agentic policies in LLMs by using human action data as guidance to overcome exploration barriers in reinforcement learning without extensive supervised fine-tuning.

AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

arXiv cs.AI

AgentJet is a distributed swarm training framework for LLM agent reinforcement learning that decouples agent rollouts from model optimization, enabling heterogeneous multi-agent RL, multi-task training, fault tolerance, and live code iteration with 1.5-10x training speedup. It also introduces an automated research system capable of autonomously conducting multi-day RL studies on large-scale clusters.

SkillFlow: Flow-Driven Recursive Skill Evolution for Agentic Orchestration

arXiv cs.AI

SkillFlow proposes a flow-driven recursive skill evolution framework for LLM-based agentic orchestration, using Tempered Trajectory Balance to prevent strategy collapse and provide transparent credit assignment. Experiments on 14 datasets show significant improvements over baselines in QA, math, code, and decision-making tasks.