RLDX-1 Technical Report

Hugging Face Daily Papers

Summary

RLDX-1 is a general-purpose robotic policy for dexterous manipulation that uses a Multi-Stream Action Transformer architecture to integrate heterogeneous modalities, outperforming existing VLA models in real-world tasks.

While Vision-Language-Action models (VLAs) have shown remarkable progress toward human-like generalist robotic policies through the versatile intelligence (i.e., broad scene understanding and language-conditioned generalization) inherited from pre-trained Vision-Language Models, they still struggle with complex real-world tasks requiring broader functional capabilities (e.g., motion awareness, memory-aware decision making, and physical sensing). To address this, we introduce RLDX-1, a general-purpose robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT), an architecture that unifies these capabilities by integrating heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. RLDX-1 further combines this architecture with system-level design choices, including synthesized training data for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. Through empirical evaluation, we show that RLDX-1 consistently outperforms recent frontier VLAs (e.g., π_{0.5} and GR00T N1.6) across both simulation benchmarks and real-world tasks that require broad functional capabilities beyond general versatility. In particular, RLDX-1 shows its superiority on ALLEX humanoid tasks by achieving a success rate of 86.8% while π_{0.5} and GR00T N1.6 achieve around 40%, highlighting the ability of RLDX-1 to control a high-DoF humanoid robot under diverse functional demands. Together, these results position RLDX-1 as a promising step toward reliable VLAs for complex, contact-rich, and dynamic real-world dexterous manipulation.
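To make the MSAT idea concrete: one plausible reading of "modality-specific streams with cross-modal joint self-attention" is a transformer block in which every modality keeps its own normalization and feed-forward path, while all token streams are concatenated for a single shared self-attention. The PyTorch sketch below is a minimal illustration under that assumption; the class name, module layout, and token shapes are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MultiStreamBlock(nn.Module):
    """Hypothetical MSAT-style block (illustrative only): modality-specific
    streams plus one cross-modal joint self-attention over all tokens."""

    def __init__(self, dim, num_heads, modalities):
        super().__init__()
        # Shared joint self-attention across the concatenated token sequence.
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Modality-specific streams: per-modality norms and feed-forward MLPs.
        self.norms = nn.ModuleDict({m: nn.LayerNorm(dim) for m in modalities})
        self.mlps = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for m in modalities
        })

    def forward(self, streams):
        # streams: {modality_name: tensor of shape (batch, n_tokens, dim)}
        names = list(streams)
        lengths = [streams[m].shape[1] for m in names]
        joint = torch.cat([streams[m] for m in names], dim=1)
        q = self.attn_norm(joint)
        attn_out, _ = self.attn(q, q, q, need_weights=False)
        joint = joint + attn_out
        # Split the joint sequence back and run each modality's own MLP stream.
        out, offset = {}, 0
        for m, n in zip(names, lengths):
            tok = joint[:, offset:offset + n]
            out[m] = tok + self.mlps[m](self.norms[m](tok))
            offset += n
        return out

# Example: vision, language, proprioception, and action tokens mixed jointly.
block = MultiStreamBlock(dim=256, num_heads=8,
                         modalities=["vision", "language", "proprio", "action"])
tokens = {m: torch.randn(2, n, 256)
          for m, n in [("vision", 64), ("language", 16), ("proprio", 8), ("action", 32)]}
outputs = block(tokens)  # same keys and shapes as the input streams
```

Under this reading, the joint self-attention lets action tokens attend to proprioceptive and language tokens in every block, while the per-modality MLPs preserve modality-specific processing.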

Paper page - RLDX-1 Technical Report

Source: https://huggingface.co/papers/2605.03269



Get this paper in your agent:

hf papers read 2605.03269

Don’t have the latest CLI? Run: curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper (9)

- RLWRLD/RLDX-1-PT (Robotics, 7B) • Updated 2 days ago • 52 • 3
- RLWRLD/RLDX-1-FT-ROBOCASA (Robotics, 7B) • Updated 2 days ago • 51 • 1
- RLWRLD/RLDX-1-MT-ALLEX (Robotics, 8B) • Updated 2 days ago • 55 • 1
- RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX (Robotics, 7B) • Updated 2 days ago • 29 • 1

Browse 9 models citing this paper.


Similar Articles

Learning dexterity

OpenAI Blog

OpenAI announces Dactyl, a system that learns robotic hand dexterity through simulation and reinforcement learning, using LSTMs to generalize across different physical environments and the Rapid PPO implementation to train policies that transfer to real-world manipulation tasks.

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Hugging Face Daily Papers

HiVLA introduces a hierarchical vision-language-action framework that decouples semantic planning from motor control using a diffusion transformer action expert for improved robotic manipulation. The system combines a VLM planner for task decomposition and visual grounding with a specialized DiT action expert using cascaded cross-attention, outperforming end-to-end baselines particularly in long-horizon tasks and fine-grained manipulation.
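The "cascaded cross-attention" mentioned in this summary can be pictured as two cross-attention stages applied in sequence inside the action expert. The snippet below is a hypothetical PyTorch sketch, assuming the action tokens attend first to planner (subtask) tokens and then to visual-grounding tokens; the names and structure are illustrative assumptions, not HiVLA's released code.

```python
import torch
import torch.nn as nn

class CascadedCrossAttention(nn.Module):
    """Illustrative sketch of a DiT-style action expert conditioning block
    that attends to plan tokens, then grounding tokens, in cascade."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.plan_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ground_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, action_tokens, plan_tokens, grounding_tokens):
        # Stage 1: condition noisy action tokens on the planner's subtask tokens.
        x = self.norm1(action_tokens)
        x = action_tokens + self.plan_attn(x, plan_tokens, plan_tokens)[0]
        # Stage 2: refine the result against visual-grounding features.
        y = self.norm2(x)
        return x + self.ground_attn(y, grounding_tokens, grounding_tokens)[0]

expert = CascadedCrossAttention(dim=256, num_heads=8)
actions = torch.randn(2, 16, 256)   # noisy action tokens
plan = torch.randn(2, 8, 256)       # planner subtask tokens
ground = torch.randn(2, 32, 256)    # visual grounding tokens
refined = expert(actions, plan, ground)
```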

RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

Hugging Face Daily Papers

RoboLab is a high-fidelity simulation benchmarking framework for evaluating task-generalist robotic policies, introducing the RoboLab-120 benchmark with 120 tasks across visual, procedural, and relational competency axes. It enables scalable, realistic task generation and systematic analysis of policy behavior under controlled perturbations to assess true generalization capabilities.

EasyVideoR1: Easier RL for Video Understanding

Hugging Face Daily Papers

EasyVideoR1 is an efficient reinforcement learning framework for training large vision-language models on video understanding tasks, featuring offline preprocessing with tensor caching for 1.47x throughput improvement, a task-aware reward system covering 11 problem types, and evaluation across 22 video benchmarks. It also supports joint image-video training and a mixed offline-online data training paradigm.