RLDX-1 Technical Report

Hugging Face Daily Papers

Summary

RLDX-1 is a general-purpose robotic policy for dexterous manipulation that uses a Multi-Stream Action Transformer architecture to integrate heterogeneous modalities, outperforming existing VLA models in real-world tasks.

While Vision-Language-Action models (VLAs) have shown remarkable progress toward human-like generalist robotic policies through the versatile intelligence (i.e., broad scene understanding and language-conditioned generalization) inherited from pre-trained Vision-Language Models, they still struggle with complex real-world tasks requiring broader functional capabilities (e.g., motion awareness, memory-aware decision making, and physical sensing). To address this, we introduce RLDX-1, a general-purpose robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT), an architecture that unifies these capabilities by integrating heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. RLDX-1 further combines this architecture with system-level design choices, including synthesized training data for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. Through empirical evaluation, we show that RLDX-1 consistently outperforms recent frontier VLAs (e.g., π_{0.5} and GR00T N1.6) across both simulation benchmarks and real-world tasks that require broad functional capabilities beyond general versatility. In particular, RLDX-1 shows its superiority on ALLEX humanoid tasks by achieving a success rate of 86.8% while π_{0.5} and GR00T N1.6 achieve around 40%, highlighting the ability of RLDX-1 to control a high-DoF humanoid robot under diverse functional demands. Together, these results position RLDX-1 as a promising step toward reliable VLAs for complex, contact-rich, and dynamic real-world dexterous manipulation.
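To make the MSAT idea concrete: one plausible reading of "modality-specific streams with cross-modal joint self-attention" is a transformer block in which every modality keeps its own normalization and feed-forward path, while all token streams are concatenated for a single shared self-attention. The PyTorch sketch below is a minimal illustration under that assumption; the class name, module layout, and token shapes are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MultiStreamBlock(nn.Module):
    """Hypothetical MSAT-style block (illustrative only): modality-specific
    streams plus one cross-modal joint self-attention over all tokens."""

    def __init__(self, dim, num_heads, modalities):
        super().__init__()
        # Shared joint self-attention across the concatenated token sequence.
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Modality-specific streams: per-modality norms and feed-forward MLPs.
        self.norms = nn.ModuleDict({m: nn.LayerNorm(dim) for m in modalities})
        self.mlps = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for m in modalities
        })

    def forward(self, streams):
        # streams: {modality_name: tensor of shape (batch, n_tokens, dim)}
        names = list(streams)
        lengths = [streams[m].shape[1] for m in names]
        joint = torch.cat([streams[m] for m in names], dim=1)
        q = self.attn_norm(joint)
        attn_out, _ = self.attn(q, q, q, need_weights=False)
        joint = joint + attn_out
        # Split the joint sequence back and run each modality's own MLP stream.
        out, offset = {}, 0
        for m, n in zip(names, lengths):
            tok = joint[:, offset:offset + n]
            out[m] = tok + self.mlps[m](self.norms[m](tok))
            offset += n
        return out

# Example: vision, language, proprioception, and action tokens mixed jointly.
block = MultiStreamBlock(dim=256, num_heads=8,
                         modalities=["vision", "language", "proprio", "action"])
tokens = {m: torch.randn(2, n, 256)
          for m, n in [("vision", 64), ("language", 16), ("proprio", 8), ("action", 32)]}
outputs = block(tokens)  # same keys and shapes as the input streams
```

Under this reading, the joint self-attention lets action tokens attend to proprioceptive and language tokens in every block, while the per-modality MLPs preserve modality-specific processing.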

Paper page - RLDX-1 Technical Report

Source: https://huggingface.co/papers/2605.03269



Get this paper in your agent:

hf papers read 2605.03269

Don’t have the latest CLI? Run: curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper (9)

- RLWRLD/RLDX-1-PT (Robotics, 7B) • Updated 2 days ago • 52 • 3
- RLWRLD/RLDX-1-FT-ROBOCASA (Robotics, 7B) • Updated 2 days ago • 51 • 1
- RLWRLD/RLDX-1-MT-ALLEX (Robotics, 8B) • Updated 2 days ago • 55 • 1
- RLWRLD/RLDX-1-FT-SIMPLER-WIDOWX (Robotics, 7B) • Updated 2 days ago • 29 • 1

Browse 9 models citing this paper.


Similar Articles

Learning dexterity

OpenAI Blog

OpenAI announces Dactyl, a system that learns robotic hand dexterity through simulation and reinforcement learning, using LSTMs to generalize across different physical environments and the Rapid PPO implementation to train policies that transfer to real-world manipulation tasks.

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Hugging Face Daily Papers

HiVLA introduces a hierarchical vision-language-action framework that decouples semantic planning from motor control using a diffusion transformer action expert for improved robotic manipulation. The system combines a VLM planner for task decomposition and visual grounding with a specialized DiT action expert using cascaded cross-attention, outperforming end-to-end baselines particularly in long-horizon tasks and fine-grained manipulation.
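The "cascaded cross-attention" mentioned in this summary can be pictured as two cross-attention stages applied in sequence inside the action expert. The snippet below is a hypothetical PyTorch sketch, assuming the action tokens attend first to planner (subtask) tokens and then to visual-grounding tokens; the names and structure are illustrative assumptions, not HiVLA's released code.

```python
import torch
import torch.nn as nn

class CascadedCrossAttention(nn.Module):
    """Illustrative sketch of a DiT-style action expert conditioning block
    that attends to plan tokens, then grounding tokens, in cascade."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.plan_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ground_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, action_tokens, plan_tokens, grounding_tokens):
        # Stage 1: condition noisy action tokens on the planner's subtask tokens.
        x = self.norm1(action_tokens)
        x = action_tokens + self.plan_attn(x, plan_tokens, plan_tokens)[0]
        # Stage 2: refine the result against visual-grounding features.
        y = self.norm2(x)
        return x + self.ground_attn(y, grounding_tokens, grounding_tokens)[0]

expert = CascadedCrossAttention(dim=256, num_heads=8)
actions = torch.randn(2, 16, 256)   # noisy action tokens
plan = torch.randn(2, 8, 256)       # planner subtask tokens
ground = torch.randn(2, 32, 256)    # visual grounding tokens
refined = expert(actions, plan, ground)
```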

RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

Hugging Face Daily Papers

RoboLab is a high-fidelity simulation benchmarking framework for evaluating task-generalist robotic policies, introducing the RoboLab-120 benchmark with 120 tasks across visual, procedural, and relational competency axes. It enables scalable, realistic task generation and systematic analysis of policy behavior under controlled perturbations to assess true generalization capabilities.

EasyVideoR1: Easier RL for Video Understanding

Hugging Face Daily Papers

EasyVideoR1 is an efficient reinforcement learning framework for training large vision-language models on video understanding tasks, featuring offline preprocessing with tensor caching for 1.47x throughput improvement, a task-aware reward system covering 11 problem types, and evaluation across 22 video benchmarks. It also supports joint image-video training and a mixed offline-online data training paradigm.