AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
Summary
Proposes AR-VLA, an autoregressive action expert that generates continuous action sequences with long-term memory for context-aware robotic policy training, improving trajectory smoothness and task success rates over reactive VLA models.
Cached at: 05/19/26, 06:33 PM
Source: https://huggingface.co/papers/2603.10126
Abstract
An autoregressive action expert generates continuous action sequences conditioned on vision-language prefixes, maintaining long-term memory for context-aware robotic policy training with improved trajectory smoothness and task success rates.
We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. In contrast to existing Vision-Language-Action (VLA) models and diffusion policies that reset temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning, enabling efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, naturally ensuring spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid V-L-A modalities, we utilize a re-anchoring mechanism that mathematically accounts for perception staleness during both training and inference. Experiments on simulated and real-robot manipulation tasks demonstrate that the proposed method can effectively replace traditional chunk-based action heads for both specialist and generalist policies. AR-VLA exhibits superior history awareness and substantially smoother action trajectories while maintaining or exceeding the task success rates of state-of-the-art reactive VLAs. Overall, our work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies. Code and Videos available at https://arvla.insait.ai
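To make the frequency mismatch concrete, here is a toy sketch of a fast action loop fed by a slowly refreshed prefix. It is not the paper's architecture or its re-anchoring formula: `ToyActionExpert`, `run`, the 1-D point dynamics, the proportional gain, and the rule "subtract the motion commanded since the observation was captured" are all assumptions chosen only to illustrate why a persistent action history helps when perception is stale.

```python
from collections import deque


class ToyActionExpert:
    """Hypothetical stand-in for an action expert with its own action history."""

    def __init__(self, gain=0.5, memory_len=16):
        self.gain = gain
        # Long-lived memory: kept across prefix refreshes instead of being reset.
        self.history = deque(maxlen=memory_len)

    def step(self, prefix_offset, staleness, reanchor=True):
        """Return the next action given a possibly stale goal offset."""
        moved = 0.0
        if reanchor and staleness > 0:
            # Assumed re-anchoring rule: shift the stale offset by the motion
            # commanded in the `staleness` steps since the prefix was captured.
            moved = sum(list(self.history)[-staleness:])
        action = self.gain * (prefix_offset - moved)
        self.history.append(action)
        return action


def run(steps=40, refresh_every=5, goal=1.0, reanchor=True):
    """Simulate a 1-D point controlled every step with slow prefix refreshes."""
    expert = ToyActionExpert()
    position = 0.0
    prefix_offset, captured_at = goal - position, 0
    trajectory = []
    for t in range(steps):
        if t % refresh_every == 0:
            # Slow path: perception refreshes the prefix only occasionally.
            prefix_offset, captured_at = goal - position, t
        action = expert.step(prefix_offset, t - captured_at, reanchor)
        position += action  # toy dynamics: the action is a displacement
        trajectory.append(position)
    return trajectory


if __name__ == "__main__":
    print("re-anchored final position:", run(reanchor=True)[-1])
    print("reactive final position:   ", run(reanchor=False)[-1])
```

In this toy setting, the re-anchored controller settles at the goal, while the variant that reuses the stale offset unchanged between refreshes overshoots by a growing amount. The contrast depends on the chosen gain and refresh interval and says nothing about the paper's reported results.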
Similar Articles
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
D-VLA proposes a high-concurrency distributed asynchronous reinforcement learning framework for Vision-Language-Action models, using plane decoupling and a swimlane pipeline to improve throughput and efficiency in large-scale embodied AI training.
IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
IntentVLA is a history-conditioned visual-language-action framework that improves robot imitation learning stability by encoding short-horizon intents from visual observations, addressing challenges from partial observability and ambiguous observations. It also introduces AliasBench, an ambiguity-aware benchmark for evaluating such methods.
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Qwen-VLA is a unified vision-language-action model for embodied decision-making, integrating manipulation, navigation, and trajectory prediction across different robot platforms. It uses a DiT-based action decoder and embodiment-aware prompt conditioning, achieving strong performance and out-of-distribution generalization.
StableVLA: Towards Robust Vision-Language-Action Models without Extra Data
This paper introduces an Information Bottleneck Adapter (IB-Adapter) for Vision-Language-Action (VLA) models to improve robustness against unseen visual disturbances without requiring extra data, achieving up to 30% improvement with minimal parameter overhead.