IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
Summary
IntentVLA is a history-conditioned visual-language-action framework that improves robot imitation learning stability by encoding short-horizon intents from visual observations, addressing challenges from partial observability and ambiguous observations. It also introduces AliasBench, an ambiguity-aware benchmark for evaluating such methods.
Source: https://huggingface.co/papers/2605.14712
Abstract
Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.
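To make the mechanism concrete, below is a minimal PyTorch-style sketch of the idea described in the abstract: a small attention pool compresses the last K observation embeddings into a compact intent vector, which then conditions an action-chunk head together with the current observation and instruction. All module names, dimensions, and the pooling scheme here are illustrative assumptions, not the paper's actual architecture.

import torch
import torch.nn as nn

class IntentEncoder(nn.Module):
    # Pools the last K observation embeddings into one compact intent vector
    # via a single learned attention query (an assumed pooling scheme).
    def __init__(self, obs_dim=512, intent_dim=64, n_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, obs_dim))
        self.attn = nn.MultiheadAttention(obs_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(obs_dim, intent_dim)

    def forward(self, obs_history):               # (B, K, obs_dim)
        q = self.query.expand(obs_history.size(0), -1, -1)
        pooled, _ = self.attn(q, obs_history, obs_history)
        return self.proj(pooled.squeeze(1))       # (B, intent_dim)

class ChunkPolicy(nn.Module):
    # Predicts a chunk of actions conditioned on the current observation,
    # the instruction embedding, and the short-horizon intent vector.
    def __init__(self, obs_dim=512, lang_dim=512, intent_dim=64,
                 act_dim=7, chunk_len=8):
        super().__init__()
        self.chunk_len, self.act_dim = chunk_len, act_dim
        self.mlp = nn.Sequential(
            nn.Linear(obs_dim + lang_dim + intent_dim, 1024), nn.GELU(),
            nn.Linear(1024, chunk_len * act_dim),
        )

    def forward(self, obs, lang, intent):
        x = torch.cat([obs, lang, intent], dim=-1)
        return self.mlp(x).view(-1, self.chunk_len, self.act_dim)

# At each replanning step the intent pooled from recent frames, rather than a
# fresh guess from a single frame, conditions the next chunk, which is what
# keeps adjacent chunks from committing to conflicting intents.
encoder, policy = IntentEncoder(), ChunkPolicy()
obs_hist = torch.randn(2, 4, 512)                 # last K=4 observation embeddings
obs_now, lang = obs_hist[:, -1], torch.randn(2, 512)
chunk = policy(obs_now, lang, encoder(obs_hist))  # (2, 8, 7) action chunk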
Get this paper in your agent:
hf papers read 2605.14712
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA introduces a hierarchical vision-language-action framework that decouples semantic planning from motor control using a diffusion transformer action expert for improved robotic manipulation. The system combines a VLM planner for task decomposition and visual grounding with a specialized DiT action expert using cascaded cross-attention, outperforming end-to-end baselines, particularly in long-horizon tasks and fine-grained manipulation.
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
D-VLA proposes a high-concurrency distributed asynchronous reinforcement learning framework for Vision-Language-Action models, using plane decoupling and a swimlane pipeline to improve throughput and efficiency in large-scale embodied AI training.
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world model decoders, achieving state-of-the-art trajectory prediction accuracy for autonomous driving while matching the inference latency of answer-only decoding. It is the first latent CoT method to surpass explicit CoT across four benchmarks.
Learning Visual Feature-Based World Models via Residual Latent Action
This paper introduces RLA-WM, a visual feature-based world model that leverages residual latent actions and flow matching to efficiently predict future visual states. The method outperforms existing video-diffusion and feature-based approaches while enabling novel robot learning techniques from offline, actionless demonstration videos.
Just open-sourced FastVLA
FastVLA, an open-source Vision-Language-Action model, now runs robot control at 5 Hz on an L4 GPU.