
# D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
Source: [https://arxiv.org/html/2605.13276](https://arxiv.org/html/2605.13276)
Yucheng Guo⁵, Yongjian Guo¹˒⁵, Zhong Guan³˒⁵, Wen Huang¹˒⁵, Haoran Sun²˒⁵, Haodong Yue¹, Xiaolong Xiang⁴˒⁵, Shuai Di⁵, Zhen Sun⁴˒⁵, Luqiao Wang⁴˒⁵, Junwu Xiong⁵, Yicheng Gong⁵
¹Tsinghua University, ²Peking University, ³Tianjin University, ⁴Beihang University, ⁵JDT AI Infra

###### Abstract

The rapid evolution of Embodied AI has enabled Vision-Language-Action (VLA) models to excel in multimodal perception and task execution. However, applying Reinforcement Learning (RL) to these massive models in large-scale distributed environments faces severe systemic bottlenecks, primarily due to the resource conflict between high-fidelity physical simulation and the intensive VRAM/bandwidth demands of deep learning. This conflict often leaves overall throughput constrained by execution-phase inefficiencies. To address these challenges, we propose D-VLA, a high-concurrency, low-latency distributed RL framework for large-scale embodied foundation models. D-VLA introduces "Plane Decoupling," physically isolating high-frequency training data from low-frequency weight control to eliminate interference between simulation and optimization. We further design a four-thread asynchronous "Swimlane" pipeline, enabling full parallel overlap of sampling, inference, gradient computation, and parameter distribution. Additionally, a dual-pool VRAM management model and topology-aware replication resolve memory fragmentation and optimize communication efficiency. Experiments on benchmarks like LIBERO show that D-VLA significantly outperforms mainstream RL frameworks in throughput and sampling efficiency for billion-parameter VLA models. In trillion-parameter scalability tests, our framework maintains exceptional stability and linear speedup, providing a robust system for high-performance general-purpose embodied agents.

## 1 Introduction

![Placement Strategies across Different Training Frameworks](https://arxiv.org/html/2605.13276v1/figures/placement.png)

Figure 1: Placement Strategies across Different Training Frameworks

Embodied AI, regarded as a pivotal pathway toward Artificial General Intelligence (AGI), is undergoing a profound paradigm shift driven by the emergence of Vision-Language-Action (VLA) models (Zitkovich et al., 2023; Black et al., 2024; Gemini Robotics Team, 2025; Kim et al., 2024; Shukor et al., 2025) such as OpenVLA (Kim et al., 2024), π0 (Black et al., 2024), and GR00T (Bjorck et al., 2025). These models achieve a significant transition from manually designed explicit models to data-driven implicit models by integrating visual perception, language understanding, and action generation into a unified end-to-end framework (Ma et al., 2024; Zhong et al., 2025; Zhang et al., 2025). Through continuous scaling across massive computational resources and datasets, VLA models have demonstrated unprecedented potential in cross-task and cross-morphology adaptation. However, despite these notable advancements, current training paradigms remain heavily reliant on imitation learning based on Supervised Fine-Tuning (SFT). Existing frameworks like LeRobot (Cadene et al., 2026) and GR00T (Bjorck et al., 2025) primarily utilize expert demonstration data to fine-tune policies via behavior cloning. This SFT-centric path faces multiple severe challenges in practical applications: first, large-scale human-collected robot trajectory data is both costly and difficult to obtain, which strictly limits the scaling of models (Sheng et al., 2025; Hu et al., 2024); second, processing high-dimensional data and complex multimodal architectures imposes a heavy burden on training cycles and inference latency. Furthermore, constrained by the limited diversity of offline datasets, SFT models often exhibit weak generalization capabilities when encountering distribution shifts and unseen tasks (Chen et al., 2025; Su et al., 2025). Unlike human learning mechanisms, SFT cannot support agents in discovering new action patterns through autonomous exploration beyond demonstration data. Consequently, the research community is increasingly pivoting toward Reinforcement Learning (RL) frameworks to break through these limitations via online interaction.

In response to the deficiencies of SFT, several RL frameworks have emerged, each optimizing different dimensions of the training pipeline. For instance, RLinf (Zang et al., 2025) provides a universal distributed training and evaluation interface, offering a unified framework for multimodal agents and VLAs. RL-VLA³ (Guan et al., 2026) implements a three-stage asynchronous pipeline to decouple data collection, policy inference, and model updates, thereby maximizing hardware utilization and training throughput. SimpleVLA-RL (Li et al., 2025) designs rule-based outcome rewards and interactive trajectory sampling for VLA models, proving that performance can surpass SFT even with minimal demonstrations. Additionally, Dexbotic (Xie et al., 2025) focuses on integrating tactile feedback for high-degree-of-freedom dexterous hands, while VLAb (Dana Aubakirova et al., 2025) is dedicated to building specialized sim-to-real transfer environments.

Unlike traditional online RL, Embodied AI training involves deep coupling between high-fidelity physics simulation and large-scale deep learning models. The former features high-frequency, fragmented occupation of computational resources, while the latter demands extreme throughput from GPU memory capacity and communication bandwidth. Existing distributed training frameworks, such as RLinf-VLA (Zang et al., 2025) and RL-VLA³ (Guan et al., 2026), introduce hybrid resource allocation and fine-grained asynchronous mechanisms to alleviate computational pressure. However, they have yet to fundamentally resolve the resource contention and execution conflicts between simulation tasks and model optimization at the underlying architecture level. As a result, overall system throughput remains bottlenecked by the slowest physics stepping or synchronization overhead.

The current performance bottlenecks in embodied RL systems primarily stem from the high degree of coupling between simulation logic and learning logic on the execution plane. On one hand, frequent memory allocation and deallocation by physics engines easily lead to severe memory fragmentation in deep learning frameworks. On the other hand, the frequent transfer of massive multimodal environment data (e.g., high-resolution images) between sampling and inference components introduces significant serialization overhead and communication latency. This systemic "blocking" effect is particularly severe when handling long-sequence interaction tasks, restricting the sample acquisition efficiency of agents in complex scenarios.

To address these challenges, we propose D-VLA, a high-performance distributed embodied reinforcement learning framework. The core innovation of this framework lies in the design philosophy of "Plane Decoupling," which physically isolates the high-frequency Data Plane from the low-frequency weight Control Plane during the training process, eliminating interference between simulation and training tasks from the ground up. Based on this concept, we construct a four-thread asynchronous execution pipeline, the "Swimlane" model. By parallelizing sampling, weight reception, gradient training, and parameter distribution, we achieve full overlap of computation and communication. To further optimize heterogeneous resource utilization, D-VLA introduces a dual-pool memory management model and a zero-copy data exchange mechanism, supporting various flexible placement strategies including co-located, separated, and hybrid deployments, as shown in Figure 1. By combining Group Relative Policy Optimization (GRPO) (Shao et al., 2024) with local topology replication scaling techniques, D-VLA breaks through the scalability bottlenecks of large-scale interactive data processing, providing stable support for the training of ultra-large-scale VLA models.

The primary contributions of this paper are summarized as follows:

- •"Plane Decoupling" and Four\-Thread Asynchronous Pipeline Architecture:We propose a system design that physically isolates the high\-frequency data interaction plane from the low\-frequency weight control plane\. Through an innovative four\-thread "Swimlane" parallel mechanism, we achieve complete computational overlap of data sampling, policy inference, gradient training, and parameter distribution, resolving resource conflicts between embodied simulation and model optimization at the architectural level\.
- •Hierarchical Memory Management and Topology\-Aware Scaling Strategies:we introduce a dual\-pool GPU memory management model and a zero\-copy data exchange mechanism to effectively alleviate memory fragmentation caused by physics engines\. Meanwhile, through local topology replication and control plane offloading techniques, we significantly reduce cross\-node communication latency and optimize the communication\-to\-computation ratio while maintaining global consistency for trillion\-parameter models\.
- •Performance Breakthrough and Validation in Large\-Scale Embodied Tasks:By integrating the GRPO algorithm into theD\-VLAframework and conducting extensive validation on complex benchmarks such as LIBERO, we demonstrate that the system possesses superior stability and sampling efficiency when handling long\-sequence, large\-scale interactive data, significantly outperforming existing mainstream distributed RL baselines\.

## 2 Related Work

#### Evolution of Vision-Language-Action Models

In the field of Embodied AI, the evolution of Vision-Language-Action (VLA) models is undergoing a paradigm shift from foundational Supervised Fine-Tuning (SFT) (Zhou et al., 2026) toward Reinforcement Learning (RL) (Dana Aubakirova et al., 2025) frameworks with enhanced generalization capabilities. Early foundational VLA models such as RT-2 (Zitkovich et al., 2023), OpenVLA (Kim et al., 2024), and π0 (Black et al., 2024) achieved preliminary robotic manipulation capabilities by training on large-scale datasets like Open X-Embodiment (O'Neill et al., 2024). Subsequently, models like π0.5 (Intelligence et al., 2025) and SmolVLA (Shukor et al., 2025) further optimized parameter efficiency and inference speed while maintaining performance. Concurrently, general-purpose humanoid control models such as GR00T N1.5 (Bjorck et al., 2025) have demonstrated broad application prospects. However, due to the limitations of SFT methods in handling out-of-distribution data, research focus is gradually shifting toward utilizing RL to enable continuous dynamic environmental adaptation in complex benchmarks such as ManiSkill (Mu et al., 2021) and LIBERO (Liu et al., 2023).

#### Frameworks for VLA Training and Optimization

To support efficient training of large-scale VLAs, the academic community has developed a series of optimization frameworks spanning low-level control to high-level learning. Frameworks such as LeRobot (Cadene et al., 2026) and Dexbotic (Xie et al., 2025) provide vertically integrated toolchains for end-to-end learning, covering the entire process from data processing to policy fine-tuning for algorithms like ACT and Diffusion Policy. Early attempts such as SimpleVLA-RL (Li et al., 2025) and RLinf-VLA (Zang et al., 2025) sought to migrate RLHF workflows from Large Language Models (LLMs) to embodied tasks. Nevertheless, the high latency introduced by physics simulators causes synchronous training pipelines to face severe throughput bottlenecks. To address this, RL-VLA³ (Guan et al., 2026) proposed a fully asynchronous distributed architecture that decouples simulation, inference, and training processes. By drawing on the design philosophies of LLM training systems like veRL (Sheng et al., 2025), OpenRLHF (Hu et al., 2024), and RollArt (Gao et al., 2025), it significantly enhances hardware utilization and training efficiency for policy optimization.

#### System Challenges in Embodied AI

Embodied AI systems face system-level challenges distinct from those of traditional language models during training, primarily stemming from the uncertainties inherent in physics simulators (Zhou et al., 2026). Unlike the relatively stable neural reward models used in LLM training (Yu et al., 2025), VLA training requires frequent calls to simulation environments like RoboCasa (Nasiriany et al., 2024). These environments exhibit high and fluctuating CPU and GPU resource consumption, which can easily lead to computational idling and resource waste (Mu et al., 2021; Nasiriany et al., 2024). To address these pain points, frontier research such as RL-VLA³ (Guan et al., 2026) has introduced dynamic batch scheduling and fine-grained environment sharding techniques, aiming to resolve the asynchronous bottlenecks caused by simulators while maximizing both sample efficiency and system throughput.

## 3 System Design of D-VLA

![The D-VLA Framework](https://arxiv.org/html/2605.13276v1/figures/framwork3.png)

Figure 2: The D-VLA Framework: overview of the asynchronous embodied RL training architecture of D-VLA. The GPU pool is partitioned into rollout workers and actor workers. Rollout GPUs co-locate PhysX-accelerated parallel environments with a frozen inference policy copy, eliminating inter-process observation transfer and model offload overhead. Upon completing a fixed-horizon rollout epoch, trajectory data is dispatched to actor GPUs via NCCL all-to-all, where GRPO advantages and clipped policy gradients are computed under FSDP. Updated weights are broadcast back through a background Gloo channel, deliberately decoupled from CUDA to avoid stream contention with PhysX. The pipeline achieves nearly 2× the throughput of synchronous alternation, with single-step weight staleness.

### 3.1 D-VLA Framework Overview

We designed and implemented the D-VLA framework (Figure 2), whose core philosophy is to build a high-concurrency, low-latency training system on top of distributed execution tools. Unlike traditional frameworks that attempt to balance simulation and learning within the same execution flow, D-VLA proposes a "Plane Decoupling" design. By physically isolating high-frequency data exchange from low-frequency control logic, the system effectively utilizes heterogeneous computing resources. This architecture aims to break through performance bottlenecks in processing large-scale interactive data while ensuring sample efficiency for complex embodied tasks. D-VLA introduces a four-thread asynchronous execution pipeline that allows data collection, model inference, gradient computation, and parameter distribution to overlap completely by constructing parallel execution logic within compute nodes.

### 3.2 Flexible Deployment and Placement Strategies

To address the diverse resource requirements of embodied tasks at different scales, D-VLA provides a highly flexible component placement scheme, including co-located, separated, and hybrid deployments. In embodied RL, environment simulation often involves fragmented memory operations, while model inference requires continuous high bandwidth. Through distributed management tools, D-VLA dynamically adjusts the physical location of components based on hardware topology. For instance, in hybrid mode, environment instances and data sampling workflows are assigned to the same device group to minimize cross-device communication latency.
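To make the three placement options concrete, the sketch below expresses them as a small configuration object. The names (`PlacementMode`, `ComponentPlan`) are illustrative stand-ins, not D-VLA's actual API.

```python
from dataclasses import dataclass
from enum import Enum, auto


class PlacementMode(Enum):
    CO_LOCATED = auto()  # simulation, rollout, and actor share every GPU
    SEPARATED = auto()   # env/rollout and actor live on disjoint GPU groups
    HYBRID = auto()      # env + rollout co-located, actor isolated


@dataclass
class ComponentPlan:
    mode: PlacementMode
    rollout_gpus: list[int]  # devices hosting environments and sampling
    actor_gpus: list[int]    # devices hosting gradient computation


# Example: a hybrid 4+4 split of the kind evaluated in Section 4.
plan = ComponentPlan(PlacementMode.HYBRID,
                     rollout_gpus=[0, 1, 2, 3],
                     actor_gpus=[4, 5, 6, 7])
```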

To optimize memory overhead, we designed a zero-copy data exchange mechanism for co-located deployments. Traditional distributed communication involves multiple rounds of data serialization, which creates significant overhead when handling high-resolution images. By maintaining different execution threads within the same process space, D-VLA allows observation results from the simulation environment to be accessed directly by inference components, achieving "seamless" data flow at the memory level and reducing bandwidth consumption. For tasks requiring massive parallel simulation, D-VLA utilizes a separated deployment strategy with pipeline communication, allowing the environment side to perform physics stepping while the inference side concurrently processes the previous decision logic.
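A minimal sketch of the zero-copy idea, assuming a hypothetical `env.step_and_render()` that produces observations as CUDA tensors: because simulation and inference run as threads of one process, the inference side consumes the simulator's tensor by reference, with no serialization or transport hop.

```python
import queue

import torch

obs_q: "queue.Queue[torch.Tensor]" = queue.Queue(maxsize=8)


def simulation_thread(env):
    while True:
        obs = env.step_and_render()  # hypothetical: CUDA tensor straight from the simulator
        obs_q.put(obs)               # hands over a reference, not a serialized copy


def inference_thread(policy):
    while True:
        obs = obs_q.get()            # same device memory the simulator wrote
        with torch.no_grad():
            action = policy(obs)     # no pickle, no IPC, no host round-trip
```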

### 3.3 Decoupled Communication Architecture

A key innovation of the D-VLA architecture is its physical plane decoupling technology, which distinguishes the "Training Data Plane" from the "Weight Control Plane." In embodied AI training, the environment interaction data generated during the rollout phase constitutes the primary load of the Data Plane due to its high update frequency and volume. We utilize high-performance collective communication libraries and dynamically switch communication paradigms based on resource ratios. In symmetric deployments, efficient point-to-point communication is used; in asymmetric deployments, dynamic networking enables global data exchange and metadata synchronization via master-node broadcasting.

Concurrently, model weight synchronization is assigned to the Weight Control Plane. Unlike the Data Plane, weight distribution is less frequent but requires high determinism. D-VLA offloads this logic to a CPU-based communication backend, using host-side contiguous buffers for broadcasting. This design avoids the GPU synchronization calls common in physical simulation paths, which often cause stream contention or deadlocks in existing frameworks. Through this dual-plane isolation, the Data Plane focuses on peak throughput while the Control Plane maintains global model consistency.
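The dual-plane split can be approximated with stock `torch.distributed`: a NCCL group carries high-frequency rollout tensors on GPU, while a separate Gloo group broadcasts weights through a contiguous host buffer. This is a hedged sketch of the pattern the paper describes, not D-VLA's implementation; it assumes launch under a standard launcher such as `torchrun`.

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")         # data plane: GPU tensors over NCCL
control_group = dist.new_group(backend="gloo")  # control plane: CPU-side broadcast


def broadcast_weights(model: torch.nn.Module, src: int = 0) -> None:
    """Control-plane weight sync via a contiguous host-side buffer."""
    # Flatten all parameters into one CPU buffer for a single broadcast.
    flat = torch.cat([p.detach().reshape(-1).cpu() for p in model.parameters()])
    dist.broadcast(flat, src=src, group=control_group)  # Gloo: no CUDA streams touched
    offset = 0
    for p in model.parameters():
        n = p.numel()
        p.data.copy_(flat[offset:offset + n].view_as(p))  # scatter back into the model
        offset += n
```

Because the broadcast runs entirely on host memory through Gloo, it never enqueues work on the CUDA streams that the physics engine and rollout inference are using, which is the stream-contention problem the paragraph above describes.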

### 3.4 End-to-End Asynchronous Pipeline and Data Flow

![Schematic of Multi-Node Communication in D-VLA](https://arxiv.org/html/2605.13276v1/figures/mutinode.png)

Figure 3: Schematic of Multi-Node Communication in D-VLA

D-VLA implements a fully non-blocking end-to-end data flow throughout its execution cycle. Environmental features collected by Rollout components are pushed to Actor components in real time. To mitigate the speed mismatch between data production and training, a resource buffer queue based on host memory is constructed on the Actor side, ensuring continuous sampling without blocking the simulators. At the algorithmic level, D-VLA integrates Group Relative Policy Optimization (GRPO) combined with micro-batch training, which captures sparse reward signals in embodied tasks and fits naturally with the asynchronous architecture.
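For reference, the group-relative advantage at the heart of GRPO (Shao et al., 2024) normalizes each rollout's return against the statistics of its own task group, avoiding a learned critic. A minimal sketch with illustrative tensor shapes:

```python
import torch


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_groups, group_size] episode returns, one group per task."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each rollout's advantage is its return standardized within its own group.
    return (rewards - mean) / (std + eps)
```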

We define this four-thread architecture as the "Swimlane" model. The main sampling thread, asynchronous weight-receiving thread, training execution thread, and weight distribution thread each run on their own physical resource track, synchronized via lightweight semaphores. This ensures that hardware never idles while waiting for specific signals. Compared to traditional synchronous distributed RL, this design elevates hardware utilization by hiding communication within the computation process.
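The sketch below condenses the four swimlanes into plain Python threads that hand off work through bounded queues, which stand in for the paper's lightweight semaphores. The callables (`collect`, `update`, `recv`, `send`) are hypothetical hooks for the rollout, training, and control-plane components; this illustrates the overlap pattern, not D-VLA's internals.

```python
import queue
import threading

traj_q = queue.Queue(maxsize=4)    # sampling lane  -> training lane
w_in_q = queue.Queue(maxsize=1)    # receiving lane -> sampling lane (fresh weights)
w_out_q = queue.Queue(maxsize=1)   # training lane  -> distribution lane


def sampling_lane(collect, policy):
    while True:
        if not w_in_q.empty():
            policy.load_state_dict(w_in_q.get())  # tolerate single-step staleness
        traj_q.put(collect(policy))               # never blocks on the trainer


def training_lane(update):
    while True:
        w_out_q.put(update(traj_q.get()))         # GRPO step; returns new state_dict


def receiving_lane(recv):
    while True:
        w_in_q.put(recv())                        # weights arriving on the control plane


def distribution_lane(send):
    while True:
        send(w_out_q.get())                       # weights leaving on the control plane


def start(*lanes):
    """lanes: (function, args) pairs, one thread per swimlane."""
    for fn, args in lanes:
        threading.Thread(target=fn, args=args, daemon=True).start()
```

Because each lane blocks only on its own queue, a slow trainer never stalls sampling (up to the queue depth), and weight updates slip into the sampler between rollouts, which is the single-step staleness noted in Figure 2.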

### 3.5 Memory Management and Large-Scale Node Scaling

To prevent memory fragmentation caused by non-framework components such as physics engines, D-VLA proposes a dual-pool memory management model. Memory is explicitly partitioned into a "Model Computation Pool" (managed by Torch's caching allocator for weights and gradients) and an "Environment Auxiliary Pool" (reserved for physics-engine temporary objects such as contact points), as shown in Figure 2. This physical isolation prevents framework memory crashes during the frequent allocation and deallocation performed by simulation components.
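One way to approximate the dual-pool split with stock PyTorch is to cap the caching allocator so a fixed slice of VRAM stays outside Torch's control, available for the physics engine's transient allocations. The 70/30 split below is an illustrative number, not D-VLA's configuration.

```python
import torch

device = torch.device("cuda:0")

# Model Computation Pool: Torch's caching allocator may claim at most 70% of
# this device's VRAM for weights, gradients, and activations (illustrative).
torch.cuda.set_per_process_memory_fraction(0.7, device=device)

# The remaining ~30% acts as the Environment Auxiliary Pool: Torch never
# touches it, so the simulator's frequent alloc/free cycles cannot fragment
# the model pool or trigger allocator OOMs inside the training framework.
```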

For cross-node scaling, D-VLA employs a local topology replication strategy, as shown in Figure 3. Recognizing that most communication occurs between sampling and inference, D-VLA builds a complete sampling-inference closed loop within a node and replicates this topology as a basic unit across the cluster. This limits high-frequency tensor flows to local high-speed interconnects. For global gradient reduction in large clusters, D-VLA combines this strategy with Fully Sharded Data Parallel (FSDP) technology. Since weight broadcasting is offloaded to the control plane, global synchronization does not degrade local sampling efficiency, optimizing the communication-to-computation ratio and increasing overall throughput.
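A hedged sketch of topology-aware replication using standard PyTorch primitives: each node forms a local subgroup for its sampling-inference loop, while FSDP handles globally sharded gradient reduction. This shows the general pattern under stated assumptions, not D-VLA's actual wiring.

```python
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")

# One subgroup per machine (the default): high-frequency sampling/inference
# tensors stay on the node's NVLink/PCIe fabric instead of crossing the
# cluster network.
local_group, _all_groups = dist.new_subgroups()


def build_actor(model):
    # Globally sharded parameters and gradient reduction across all nodes;
    # weight broadcasting rides the separate control plane, so this reduction
    # does not contend with the local sampling traffic in `local_group`.
    return FSDP(model)
```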

## 4 Experiments

In this section, we conduct a comprehensive empirical evaluation of D-VLA against several state-of-the-art (SOTA) orchestration frameworks. Our analysis focuses on computational efficiency, throughput scalability, and hardware utilization in the context of Vision-Language-Action (VLA) training. The results demonstrate that D-VLA significantly optimizes the training pipeline across diverse model architectures and resource configurations, effectively mitigating the hardware underutilization commonly encountered in heterogeneous embodied AI workloads.

### 4.1 Experimental Setup

Model Architectures. To evaluate the generalizability of D-VLA, we select two representative VLA paradigms: π0.5, a diffusion-based model utilizing iterative denoising processes, and OpenVLA-OFT, an auto-regressive Transformer-based model employing Parameter-Efficient Fine-Tuning (PEFT). Both models are configured for action-chunking prediction, predicting a sequence of future actions rather than single-step control signals, to raise environment interaction rates and maximize system efficiency in high-frequency control tasks.
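Action chunking amortizes one expensive VLA inference over several environment transitions. A minimal sketch, assuming a gym-style `env.step` and a policy that returns a `[chunk_size, action_dim]` tensor per call:

```python
import torch


@torch.no_grad()
def chunked_rollout(policy, env, obs, num_chunks: int = 4):
    """Each policy call predicts a whole chunk of actions, executed open-loop."""
    for _ in range(num_chunks):
        actions = policy(obs)          # one inference -> [chunk_size, action_dim]
        for a in actions:              # replay the chunk without re-inferring
            obs, reward, done, info = env.step(a)
    return obs
```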

Simulation Environment. All experiments are conducted within the ManiSkill physical simulation framework. Unlike traditional environments that rely on CPU-bound physics (e.g., Gym or MuJoCo), ManiSkill leverages high-concurrency GPU rendering and parallelized physics kernels for agent-environment interactions. This introduces significant heterogeneous resource contention, as the simulation process and VLA model inference must simultaneously compete for GPU memory and compute cycles, posing a substantial challenge to the underlying orchestration system.

Baselines and Placement Strategies. We benchmark our framework against two state-of-the-art orchestration frameworks under their representative resource configurations on a 16-GPU cluster:

RLinf-VLA, a versatile baseline supporting three deployment modes with its standard protocols: colocated (RLinf-co), where all components share the GPU pool; disaggregated (RLinf-dis), which allocates 2 GPUs for Rollout/Environment and 4 GPUs for the Actor; and hybrid (RLinf-hyper), where the Actor utilizes all GPUs during the training phase.

RL-VLA³, a fully asynchronous framework that adopts a three-stage "Environment-Rollout-Actor" pipeline. Its placement follows a disaggregated design similar to RLinf-dis, segregating Environment/Rollout and Actor onto 2 and 4 GPUs, respectively.

D-VLA utilizes a Hybrid Asynchronous Orchestration strategy: 4 GPUs are shared by the Rollout and Environment components to minimize data transfer overhead via locality, while the Actor components, responsible for heavy gradient updates and large-model inference, are isolated on a dedicated 4-GPU group. This architectural isolation is a core design principle of D-VLA, aimed at preventing kernel-level computational interference and maximizing hardware occupancy.

Metrics. To evaluate system efficiency, we use throughput as the core metric, defined as the total number of environment state transitions processed per unit of time. Given a constant action chunk size, this is directly proportional to the number of action inference steps executed by the rollout generators per second. To ensure the reliability of the experimental results, we maintain identical configurations across all evaluated methods, specifically the rollout batch size and the actor training batch size.
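In other words, with chunk size k, each inference call contributes k environment transitions, so transitions per second and inference calls per second differ only by the constant factor k. A tiny worked example with illustrative numbers:

```python
def throughput_steps_per_s(num_inferences: int, chunk_size: int,
                           wall_clock_s: float) -> float:
    """Environment transitions per second; each inference emits chunk_size steps."""
    return num_inferences * chunk_size / wall_clock_s


# Illustrative numbers only: 10,000 inference calls at chunk size 4 over
# 200 s of wall clock correspond to 200 environment steps per second.
print(throughput_steps_per_s(10_000, 4, 200.0))  # -> 200.0
```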

![Performance Benchmarking of π0.5 under Different Distributed Strategies](https://arxiv.org/html/2605.13276v1/pi_data.png)

Figure 4: Performance Benchmarking of π0.5 under Different Distributed Strategies. (Left) System throughput measured in steps per second; (Middle) average inference latency per step in milliseconds; (Right) percentage breakdown of execution time between Rollout and Actor components. Ratios (3:1 and 1:1) represent the resource partitioning between rollout/environment and actor modules.

![Performance Evaluation of OpenVLA-OFT across Various Scaling Configurations](https://arxiv.org/html/2605.13276v1/openvla-data.png)

Figure 5: Performance Evaluation of OpenVLA-OFT across Various Scaling Configurations. (Left) Comparison of throughput efficiency; (Middle) comparison of single-step processing time; (Right) proportional distribution of time consumption for Rollout and Actor processes.

System Throughput Enhancement. As shown in Figure 4 and Figure 5, π0.5 and OpenVLA-OFT exhibit distinct throughput characteristics due to their inherent computational densities. The traditional RLinf-co (colocated) scheme, while providing a stable baseline, suffers from severe resource utilization bottlenecks due to its inherent "lock-step" synchronization mode. This leads to significant "GPU bubbles" during high-concurrency environment sampling.

D-VLA achieves substantial performance breakthroughs across both model architectures. In π0.5 experiments, the 1:1 configuration reaches a throughput of 147.0 steps/s, a 22.25% improvement over RLinf-co (127.24 steps/s). By optimizing the resource ratio to 3:1, the throughput surges to 237.0 steps/s, an 86.26% increase over the baseline. For the more parameter-heavy OpenVLA-OFT, D-VLA consistently outperforms the competition, achieving 156.0 steps/s, surpassing RLinf-co (108.24 steps/s) and RL-VLA³ (110.88 steps/s) by 44.44%. This improvement confirms that our asynchronous strategy, combined with communication optimization, effectively mitigates GPU idling during heterogeneous task switching.

Pipeline Latency and Asynchronous Overlapping. Figures 4 and 5 provide a decoupled comparison of Step Time, Rollout Time, and Actor Time to clarify the optimization mechanisms. D-VLA demonstrates superior latency control: in π0.5 tasks, the total step time is only 566.41 s, a 50.43% reduction compared to RLinf-dis (1006.8 s). Even for OpenVLA-OFT, which has high inherent inference latency, D-VLA restricts the total time to 520.3 s, significantly outperforming RLinf-hyper. This shows that efficient scheduling can effectively "mask" the computational burden of large-model inference by overlapping it with the subsequent environment sampling phase through refined pipeline design.

Bottleneck Analysis and Resource Balancing. Decoupled timing in Figure 4 reveals the dynamic shift of system bottlenecks. In the π0.5 3:1 configuration, the Rollout and Actor times are relatively balanced. This workload symmetry allows the asynchronous pipeline to achieve near-perfect mutual masking, maximizing overall throughput.

However, the OpenVLA-OFT experiment highlights the negative impact of resource imbalance. In the 3:1 configuration, the Actor time reaches 542.12 s due to the heavy inference load, becoming the primary bottleneck. This forced wait causes the system to degenerate into a quasi-synchronous mode. To address this, D-VLA applies an adaptive resource adjustment (a 1:1 ratio) to align both components at approximately 200 s. This re-establishes pipeline symmetry and restores mutual masking efficiency. Our analysis indicates that the efficacy of asynchronous mechanisms depends heavily on temporal alignment between components. By precisely compressing GPU bubbles and eliminating idle waiting, D-VLA ensures optimal throughput across varying VLA model scales.

Multi-node Scalability. To verify the robustness of our framework in large-scale settings, we extend our evaluation to a 16-GPU multi-node environment (Table 1). The experimental results are consistent with our single-node findings, confirming that by leveraging the underlying network fabric (e.g., InfiniBand) for asynchronous weight and data transfers, D-VLA achieves efficient scaling in large-scale distributed scenarios without being bottlenecked by inter-node communication.

Table 1: Comparison of throughput performance across different methods for π0.5 and OpenVLA-OFT within the ManiSkill simulation environment using 16 GPUs. Here, Thr, Step, Roll, and Act denote Throughput, Step Time, Rollout Time, and Actor Time, respectively. Ratios (3:1 and 1:1) represent the resource partitioning between rollout/environment and actor modules.

Learning Performance and Convergence. The success rate curves in Figure 6 demonstrate that while D-VLA significantly accelerates the training process, it maintains competitive performance levels consistent with existing baselines. This confirms that our asynchronous orchestration and plane-decoupling mechanisms do not compromise the training stability or the final policy quality of VLA models.

![Training Success Rate on ManiSkill with π0.5](https://arxiv.org/html/2605.13276v1/figures/success_1.png)

Figure 6: Training Success Rate on ManiSkill with π0.5.

### 4.2 Scalability Analysis and Bottleneck Exploration

To further investigate the capacity and performance evolution of D-VLA under large-scale parallel workloads, we conduct a systematic evaluation using the π0.5 model as a benchmark with a 3:1 resource placement strategy. We scale the environment count from 384 to 3,072 and monitor the dynamic changes in system throughput and sub-component latencies. The results demonstrate that while the asynchronous pipeline mechanism yields significant performance gains under high-concurrency loads, it also reveals performance saturation points dictated by hardware topology and computational density.

Throughput Evolution and Saturation Analysis. As illustrated in Figure 7 (left), system throughput exhibits a non-linear progression: it climbs rapidly at first, then plateaus, and eventually declines slightly as the environment count increases. As the environment scale expands from 384 to 768, throughput achieves a significant leap, reaching a peak of 379 steps/s at the 768 scale. This reflects the ability of our asynchronous framework to efficiently release GPU parallel potential by masking initial computational bubbles. However, as the scale further expands to 3,072, throughput gradually recedes and stabilizes around 360 steps/s. This degradation is caused not by scheduling inefficiencies but by the saturation of GPU memory bandwidth and compute units under high-concurrency rendering, where the increased latency per environment instance eventually offsets the benefits of higher parallelism. This defines the optimal environmental workload range for π0.5 training under the current hardware configuration.

Component Decoupling and Pipeline Efficiency. The evolution of Total Step Time, Actor Time, and Rollout Time in Figure 7 (middle) reveals varying sensitivities to system load across components. As the environment count increases, both Rollout and Actor times exhibit highly stable linear growth, confirming the predictability and rigor of the D-VLA scheduling strategy. Crucially, although both Actor and Rollout times scale with the load, their magnitudes remain relatively balanced across all tested scales. According to asynchronous pipeline theory, this temporal symmetry between components is a prerequisite for efficient masking. The experimental data confirm that through precise pipeline alignment, our framework ensures a high degree of overlap between large-model inference and large-scale environment simulation, maintaining a superior system duty cycle even as total step time increases with load.

Workload Backlog and Bottleneck Shifting. The stacked latency analysis in Figure 7 (right) further reveals the shifting of system bottlenecks under heavy workloads. At low environment counts (e.g., 384), the absolute latencies of Rollout and Actor are short and the system possesses redundant compute capacity; in this phase, throughput is primarily constrained by the rollout process. As the environment count increases to 768, the Actor and Rollout times converge, allowing optimal mutual masking and thus peak throughput. When the load exceeds 1,536, the growth rate of Actor time slightly outpaces that of Rollout, gradually becoming the core factor limiting further throughput gains. This shift reflects the extreme pressure that the complexity of diffusion-model computational graphs exerts on compute units when handling massive concurrent inference requests. While the asynchronous masking mechanism significantly compresses the absolute span of a single step cycle, it cannot eliminate the physical latency growth dictated by hardware limits. The scalability of π0.5 on D-VLA demonstrates the framework's engineering feasibility for supporting ultra-large-scale parallel training, while highlighting the importance of identifying the "compute-latency" equilibrium by dynamically adjusting environment scales in heterogeneous pipeline designs.

![Performance scaling of D-VLA on π0.5 across varying environment counts](https://arxiv.org/html/2605.13276v1/figures/scale.png)

Figure 7: Performance scaling of D-VLA on π0.5 across varying environment counts. The plots illustrate (Left) throughput scaling trends with a peak at 768 environments, (Middle) the linear growth of decoupled time components, and (Right) the stacked breakdown of Rollout and Actor latencies.

## 5 Conclusion and Discussion

Conclusion. This paper presents D-VLA, a high-performance distributed RL framework for VLA training. By introducing "Plane Decoupling," D-VLA physically isolates high-frequency data exchange from low-frequency weight control, resolving resource contention between simulation and optimization. Our four-thread "Swimlane" pipeline enables full overlap of sampling, inference, and training, reducing GPU idle time. Experiments on π0.5 and OpenVLA-OFT show up to 86% throughput gains over SOTA baselines without compromising convergence or policy quality.

Discussion. Our analysis reveals that VLA training efficiency depends on temporal alignment between components. While D-VLA mitigates synchronization bottlenecks, scaling to trillion-parameter models requires even finer pipeline symmetry. Future work will investigate dynamic, load-aware resource reallocation to adaptively adjust GPU partitioning based on real-time latencies. We also plan to extend D-VLA to multi-agent scenarios and more heterogeneous embodied platforms to further scale generalist foundation models.

## References

- J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al. (2025) GR00T N1: an open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734.
- K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. (2024) π0: a vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164.
- R. Cadene, S. Aliberts, F. Capuano, M. Aractingi, A. Zouitine, P. Kooijmans, J. Choghari, M. Russi, C. Pascal, S. Palma, et al. (2026) LeRobot: an open-source library for end-to-end robot learning. arXiv preprint arXiv:2602.22818.
- A. Chen, A. Li, B. Gong, B. Jiang, B. Fei, B. Yang, B. Shan, C. Yu, C. Wang, C. Zhu, et al. (2025) MiniMax-M1: scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585.
- M. S. Dana Aubakirova, J. Cholgari, and L. von Werra (2025) VLAb: your laboratory for pretraining VLAs. GitHub repository: https://github.com/huggingface/vlab
- W. Gao, Y. Zhao, T. Wu, S. Xiong, W. Wang, D. An, L. Cao, D. Muhtar, Z. Liu, H. Zhao, et al. (2025) RollArt: scaling agentic RL training via disaggregated infrastructure. arXiv preprint arXiv:2512.22560.
- Gemini Robotics Team (2025) Gemini Robotics 1.5: pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer. arXiv preprint arXiv:2510.03342.
- Z. Guan, H. Sun, Y. Guo, S. Di, X. Bai, J. Long, T. Zhao, M. Luo, C. Zhou, Y. Guo, et al. (2026) RL-VLA³: reinforcement learning VLA accelerating via full asynchronism. arXiv preprint arXiv:2602.05765.
- J. Hu, X. Wu, Z. Zhu, W. Wang, D. Zhang, Y. Cao, et al. (2024) OpenRLHF: an easy-to-use, scalable and high-performance RLHF framework. arXiv preprint arXiv:2405.11143.
- P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. (2025) π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054.
- M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. (2024) OpenVLA: an open-source vision-language-action model. arXiv preprint arXiv:2406.09246.
- H. Li, Y. Zuo, J. Yu, Y. Zhang, Z. Yang, K. Zhang, X. Zhu, Y. Zhang, T. Chen, G. Cui, et al. (2025) SimpleVLA-RL: scaling VLA training via reinforcement learning. arXiv preprint arXiv:2509.09674.
- B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone (2023) LIBERO: benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems 36, pp. 44776–44791.
- Y. Ma, Z. Song, Y. Zhuang, J. Hao, and I. King (2024) A survey on vision-language-action models for embodied AI. arXiv preprint arXiv:2405.14093.
- T. Mu, Z. Ling, F. Xiang, D. Yang, X. Li, S. Tao, Z. Huang, Z. Jia, and H. Su (2021) ManiSkill: generalizable manipulation skill benchmark with large-scale demonstrations. In 35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
- S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y. Zhu (2024) RoboCasa: large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523.
- A. O'Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. (2024) Open X-Embodiment: robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 6892–6903.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, et al. (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu (2025) HybridFlow: a flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems, pp. 1279–1297.
- M. Shukor, D. Aubakirova, F. Capuano, P. Kooijmans, S. Palma, A. Zouitine, M. Aractingi, C. Pascal, M. Russi, A. Marafioti, S. Alibert, M. Cord, T. Wolf, and R. Cadene (2025) SmolVLA: a vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844.
- Z. Su, L. Pan, X. Bai, D. Liu, G. Dong, J. Huang, M. Lv, W. Hu, F. Zhang, K. Gai, et al. (2025) Klear-Reasoner: advancing reasoning capability via gradient-preserving clipping policy optimization. arXiv preprint arXiv:2508.07629.
- B. Xie, E. Zhou, F. Jia, H. Shi, H. Fan, H. Zhang, H. Li, J. Sun, J. Bin, J. Huang, et al. (2025) Dexbotic: open-source vision-language-action toolbox. arXiv preprint arXiv:2510.23511.
- R. Yu, S. Wan, Y. Wang, C. Gao, L. Gan, Z. Zhang, and D. Zhan (2025) Reward models in deep reinforcement learning: a survey. arXiv preprint arXiv:2506.15421.
- H. Zang, M. Wei, S. Xu, Y. Wu, Z. Guo, Y. Wang, H. Lin, L. Shi, Y. Xie, Z. Xu, et al. (2025) RLinf-VLA: a unified and efficient framework for VLA + RL training. arXiv preprint arXiv:2510.06710.
- D. Zhang, J. Sun, C. Hu, X. Wu, Z. Yuan, R. Zhou, F. Shen, and Q. Zhou (2025) Pure vision-language-action (VLA) models: a comprehensive survey. arXiv preprint arXiv:2509.19012.
- Y. Zhong, F. Bai, S. Cai, X. Huang, Z. Chen, X. Zhang, Y. Wang, S. Guo, T. Guan, K. N. Lui, et al. (2025) A survey on vision-language-action models: an action tokenization perspective. arXiv preprint arXiv:2507.01925.
- C. Zhou, H. Sun, H. Yang, J. Long, J. Xiong, L. Wang, M. Luo, Q. Yang, S. Di, S. Wang, et al. (2026) Thousand-GPU large-scale training and optimization recipe for AI-native cloud embodied intelligence infrastructure. arXiv preprint arXiv:2603.11101.
- B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. (2023) RT-2: vision-language-action models transfer web knowledge to robotic control. In Proceedings of The 7th Conference on Robot Learning, PMLR 229, pp. 2165–2183.
