Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time
Summary
This paper introduces a retrieval-augmented vision-language-action policy that eliminates per-task fine-tuning by using pre-trained models with indexed demonstrations, enabling efficient cross-embodiment generalization and task adaptation at test time.
View Cached Full Text
Cached at: 06/16/26, 11:34 AM
Paper page - Retrieve, Don’t Retrain: Extending Vision Language Action Models to New Tasks at Test Time
Source: https://huggingface.co/papers/2606.15631 Published on Jun 14
·
Submitted byhttps://huggingface.co/Jeongeun
Parkon Jun 16
Abstract
Retrieval-augmented vision-language-action policies eliminate per-task fine-tuning costs by using pre-trained models with indexed demonstrations, enabling efficient cross-embodiment generalization and task adaptation.
Extending a vision-language-action (VLA) policy to a new task typically requires task-specificteleoperated demonstrationsandper-task fine-tuning, making adaptation costly in both data collection and compute. In this paper, we show that this target-side per-task adaptation cost can be replaced by retrieval. Ourretrieval-augmented policyis trained once on paired demonstrations from the target embodiment (query) and a cheaper embodiment (pool, e.g., human-hand video), then frozen. New tasks are added at deployment by appending pool-side demonstrations to aretrieval pool. Thefrozen policyconditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than updating parameters. Fine-tuning is needed only to take on a new, unseen embodiment, not for each new task. We show that retrieval improves policies beyond a specific backbone, including standard VLA policies, but its effect is especially pronounced inCosmos Policy, avideo-generation-based world-action model(WAM). In this setting, retrieval supplies coarse task progression, while the WAM’sfuture-image objectiveprovides an additional visual consistency signal that strengthens the retrieval-conditioned actions. On PushT, we study how retrieval provides a reusable high-level motion prior forcross-embodiment generalizationto unseen goal angles, while on RoboTwin 2.0 our method outperforms cross-embodiment baselines on unseen tasks, and we additionally demonstrate the method on a real robot.
View arXiv pageView PDFProject pageAdd to collection
Get this paper in your agent:
hf papers read 2606\.15631
Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash
Models citing this paper0
No model linking this paper
Cite arxiv.org/abs/2606.15631 in a model README.md to link it from this page.
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2606.15631 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2606.15631 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies
Researchers propose APT, a two-stage training method that pretrains action experts on vision-action pairs before integrating language conditioning, significantly improving out-of-distribution instruction generalization for Vision-Language-Action policies.
StableVLA: Towards Robust Vision-Language-Action Models without Extra Data
This paper introduces an Information Bottleneck Adapter (IB-Adapter) for Vision-Language-Action (VLA) models to improve robustness against unseen visual disturbances without requiring extra data, achieving up to 30% improvement with minimal parameter overhead.
PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models
PolicyTrim is a reinforcement learning-based post-training framework that improves action chunk utilization by 3× and reduces physical execution steps by 51.4% in Vision-Language-Action models, delivering up to 5.83× deployment speedup.
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
Proposes AR-VLA, an autoregressive action expert that generates continuous action sequences with long-term memory for context-aware robotic policy training, improving trajectory smoothness and task success rates over reactive VLA models.
Boosting Visual Instruction Tuning with Self-Supervised Guidance
This paper proposes augmenting visual instruction tuning in multimodal language models with self-supervised tasks expressed as natural language instructions, improving vision-centric reasoning without additional architecture or annotations. By reformulating classical self-supervised pretext tasks as image-instruction-response triplets, the method achieves consistent performance improvements across multiple benchmarks by injecting only 3-10% visually grounded instructions into the training data.