StableVLA: Towards Robust Vision-Language-Action Models without Extra Data
Summary
This paper introduces an Information Bottleneck Adapter (IB-Adapter) for Vision-Language-Action (VLA) models that improves robustness to unseen visual disturbances without extra data or augmentation, reporting an average improvement of 30% over the baseline while adding fewer than 10M parameters.
Cached at: 05/19/26, 06:30 AM
Source: https://huggingface.co/papers/2605.18287 Published on May 18
Submitted by yfdeng (https://huggingface.co/yfdeng10) on May 19
Abstract
Vision-Language-Action models exhibit degraded performance under unseen visual disturbances, but a lightweight information-theoretic adapter module significantly improves robustness with minimal parameter overhead.
It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of Vision-Language-Action (VLA) models when encountering unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study based on recent state-of-the-art VLA models and reveal a significant performance drop when visual disturbances absent from the training data are introduced. To mitigate this issue, we propose a lightweight adapter module grounded in information theory, termed the Information Bottleneck Adapter (IB-Adapter), which selectively filters potential noise from visual inputs. Without requiring any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by an average of 30%, while adding fewer than 10M parameters, demonstrating notable efficiency and effectiveness. Furthermore, even with a 14x smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, our model StableVLA achieves robustness competitive with 7B-scale state-of-the-art VLAs. With negligible parameter overhead (<10M), our approach maintains accuracy on long-horizon tasks and surpasses OpenPi under both synthetic and physical visual corruptions.
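The abstract does not spell out how the IB-Adapter is formulated, so the following is only a minimal sketch of the general idea behind an information-bottleneck layer, assuming the standard variational form with a diagonal Gaussian code and a standard normal prior. It is not the authors' implementation. The function names (`kl_to_standard_normal`, `sample_bottleneck`, `ib_loss`) and the weight `beta` are hypothetical, chosen for illustration.

```python
import math
import random


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ).

    In a variational information bottleneck this term penalizes how much
    information the code z keeps about the (possibly noisy) visual input.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def sample_bottleneck(mu, logvar, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def ib_loss(task_loss, mu, logvar, beta=1e-3):
    """Task loss plus beta-weighted compression penalty (beta is an assumed hyperparameter)."""
    return task_loss + beta * kl_to_standard_normal(mu, logvar)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, logvar = [0.3, -0.1, 0.8], [-1.0, -2.0, -0.5]  # hypothetical adapter outputs
    z = sample_bottleneck(mu, logvar, rng)
    print("z:", z)
    print("loss:", ib_loss(task_loss=0.42, mu=mu, logvar=logvar))
```

In this sketch a larger `beta` pushes the code toward the prior and discards more of the input, which is the trade-off an adapter of this kind would have to tune between filtering disturbances and keeping task-relevant detail.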
Similar Articles
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
Proposes AR-VLA, an autoregressive action expert that generates continuous action sequences with long-term memory for context-aware robotic policy training, improving trajectory smoothness and task success rates over reactive VLA models.
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Proposes a training-free inference-time method for Vision-Language-Action models to correct pace and path dynamics, improving success rates by up to 28.8% in dynamic environments.
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
D-VLA proposes a high-concurrency distributed asynchronous reinforcement learning framework for Vision-Language-Action models, using plane decoupling and a swimlane pipeline to improve throughput and efficiency in large-scale embodied AI training.
IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
IntentVLA is a history-conditioned visual-language-action framework that improves robot imitation learning stability by encoding short-horizon intents from visual observations, addressing challenges from partial observability and ambiguous observations. It also introduces AliasBench, an ambiguity-aware benchmark for evaluating such methods.