Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Summary
Proposes a training-free, inference-time method that corrects pace and path dynamics for Vision-Language-Action models, improving success rates by up to 28.8% in absolute terms in dynamic-only environments.
Cached at: 05/15/26, 08:24 AM
Source: https://huggingface.co/papers/2605.11459
Abstract
Vision-Language-Action models suffer from temporal blindness in dynamic environments, but a training-free correction method using quadratic optimization improves performance by addressing pace and path dynamics simultaneously.
Vision-Language-Action (VLA) models achieve remarkable flexibility and generalization beyond classical control paradigms. However, most prevailing VLAs are trained under a single-frame observation paradigm, which leaves them structurally blind to temporal dynamics. Consequently, these models degrade severely in non-stationary scenarios, even when trained or finetuned on dynamic datasets. Existing approaches either require expensive retraining or suffer from latency bottlenecks and poor temporal consistency across action chunks. We propose Pace-and-Path Correction, a training-free, closed-form inference-time operator that wraps any chunked-action VLA. From a single quadratic cost, joint minimization yields a unified solution that decomposes orthogonally into two distinct channels. The pace channel compresses execution along the planned direction, while the path channel applies an orthogonal spatial offset, jointly absorbing the perceived dynamics within the chunk window. We evaluate our approach on a comprehensive diagnostic benchmark MoveBench designed to isolate motion as the sole controlled variable. Empirical results demonstrate that our framework consistently outperforms state-of-the-art training-free wrappers and dynamic-adaptive methods and improves success rates by up to 28.8% and 25.9% in absolute terms over foundational VLA models in dynamic-only and static-dynamic mixed environments, respectively.
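The abstract does not give the quadratic cost itself, so the following is only an illustrative sketch of the orthogonal split it describes, not the authors' operator. It assumes a hypothetical planned direction `plan_dir` and a hypothetical perceived displacement `drift` of the target over the chunk window, and uses a plain least-squares projection: the component of `drift` along the plan would feed a pace-style adjustment, and the orthogonal remainder would feed a path-style offset.

```python
import math


def decompose_drift(plan_dir, drift):
    """Split `drift` into a scalar along `plan_dir` and an orthogonal remainder.

    Both names are hypothetical; this is a generic projection, assumed here
    only to illustrate a pace (along-plan) / path (orthogonal) split.
    """
    norm = math.sqrt(sum(c * c for c in plan_dir))
    if norm == 0.0:
        raise ValueError("plan_dir must be non-zero")
    # Unit vector along the planned direction.
    u = [c / norm for c in plan_dir]
    # Signed length of the drift along the plan (pace-style component).
    along = sum(d * c for d, c in zip(drift, u))
    # What is left after removing the along-plan part (path-style component).
    offset = [d - along * c for d, c in zip(drift, u)]
    return along, offset


# Example: the plan moves along +x while the target drifts diagonally.
along, offset = decompose_drift([1.0, 0.0, 0.0], [0.03, 0.04, 0.0])
print(along, offset)
```

By construction the two components are orthogonal, so adjusting one does not change the other; how the paper maps them to actual timing and spatial corrections within an action chunk is not specified in the abstract.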
Similar Articles
StableVLA: Towards Robust Vision-Language-Action Models without Extra Data
This paper introduces an Information Bottleneck Adapter (IB-Adapter) for Vision-Language-Action (VLA) models to improve robustness against unseen visual disturbances without requiring extra data, achieving up to 30% improvement with minimal parameter overhead.
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
D-VLA proposes a high-concurrency distributed asynchronous reinforcement learning framework for Vision-Language-Action models, using plane decoupling and a swimlane pipeline to improve throughput and efficiency in large-scale embodied AI training.
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
Proposes AR-VLA, an autoregressive action expert that generates continuous action sequences with long-term memory for context-aware robotic policy training, improving trajectory smoothness and task success rates over reactive VLA models.
Closed-Loop Neural Activation Control in Vision-Language-Action Models
Proposes CTRL-STEER, a closed-loop framework for adaptive steering of vision-language-action models using time-varying control signals, achieving better trade-off between concept regulation and task success without retraining.
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.