AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

Hugging Face Daily Papers 05/11/26, 12:00 AM Papers

Summary

Proposes AR-VLA, an autoregressive action expert that generates continuous action sequences with long-term memory for context-aware robotic policy training, improving trajectory smoothness and task success rates over reactive VLA models.

We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. In contrast to existing Vision-Language-Action (VLA) models and diffusion policies that reset temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning, enabling efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, naturally ensuring spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid V-L-A modalities, we utilize a re-anchoring mechanism that mathematically accounts for perception staleness during both training and inference. Experiments on simulated and real-robot manipulation tasks demonstrate that the proposed method can effectively replace traditional chunk-based action heads for both specialist and generalist policies. AR-VLA exhibits superior history awareness and substantially smoother action trajectories while maintaining or exceeding the task success rates of state-of-the-art reactive VLAs. Overall, our work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies. Code and Videos available at https://arvla.insait.ai
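The control pattern described above (a fast action loop that keeps its own history while a slower vision-language prefix is refreshed only occasionally) can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: `Prefix`, `ToyActionExpert`, `run`, the scalar action, the `gain` smoothing rule, and the refresh interval are all hypothetical. The abstract does not specify the form of the re-anchoring mechanism, so the staleness counter below only stands in for the idea of tracking how old the current perception is.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Prefix:
    """Hypothetical vision-language prefix: a scalar target plus the step it was computed at."""
    target: float
    computed_at: int


@dataclass
class ToyActionExpert:
    """Toy stand-in for an action head whose memory persists across perception updates."""
    gain: float = 0.2
    memory: deque = field(default_factory=lambda: deque([0.0], maxlen=64))

    def step(self, prefix: Prefix, now: int) -> Tuple[float, int]:
        # Control steps elapsed since the prefix was computed (assumed staleness measure).
        staleness = now - prefix.computed_at
        # The previous action comes from the expert's own history,
        # which is not reset when a new prefix arrives.
        prev = self.memory[-1]
        action = prev + self.gain * (prefix.target - prev)
        self.memory.append(action)
        return action, staleness


def run(num_steps: int = 12, refresh_every: int = 4) -> List[Tuple[int, float, int]]:
    expert = ToyActionExpert()
    prefix = Prefix(target=1.0, computed_at=0)
    log = []
    for t in range(num_steps):
        if t > 0 and t % refresh_every == 0:
            # Slow perception path: a fresh prefix replaces the stale one.
            prefix = Prefix(target=1.0 + 0.1 * (t // refresh_every), computed_at=t)
        action, staleness = expert.step(prefix, t)
        log.append((t, round(action, 4), staleness))
    return log


if __name__ == "__main__":
    for row in run():
        print(row)
```

In this toy setup the action changes gradually at every prefix refresh because each new action is computed from the previous one in memory. A reactive head that restarted from scratch with each observation would have no such carry-over.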

Cached at: 05/19/26, 06:33 PM

Source: https://huggingface.co/papers/2603.10126

Abstract

An autoregressive action expert generates continuous action sequences conditioned on vision-language prefixes, maintaining long-term memory for context-aware robotic policy training with improved trajectory smoothness and task success rates.

Similar Articles

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

TBD-VLA: Temporal Block Diffusion Vision Language Action Model

D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
