Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time

Hugging Face Daily Papers Papers

Summary

This paper introduces a retrieval-augmented vision-language-action policy that eliminates per-task fine-tuning by using pre-trained models with indexed demonstrations, enabling efficient cross-embodiment generalization and task adaptation at test time.

Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, making adaptation costly in both data collection and compute. In this paper, we show that this target-side per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (query) and a cheaper embodiment (pool, e.g., human-hand video), then frozen. New tasks are added at deployment by appending pool-side demonstrations to a retrieval pool. The frozen policy conditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than updating parameters. Fine-tuning is needed only to take on a new, unseen embodiment, not for each new task. We show that retrieval improves policies beyond a specific backbone, including standard VLA policies, but its effect is especially pronounced in Cosmos Policy, a video-generation-based world-action model (WAM). In this setting, retrieval supplies coarse task progression, while the WAM's future-image objective provides an additional visual consistency signal that strengthens the retrieval-conditioned actions. On PushT, we study how retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles, while on RoboTwin 2.0 our method outperforms cross-embodiment baselines on unseen tasks, and we additionally demonstrate the method on a real robot.
Original Article
View Cached Full Text

Cached at: 06/16/26, 11:34 AM

Paper page - Retrieve, Don’t Retrain: Extending Vision Language Action Models to New Tasks at Test Time

Source: https://huggingface.co/papers/2606.15631 Published on Jun 14

·

Submitted byhttps://huggingface.co/Jeongeun

Parkon Jun 16

Abstract

Retrieval-augmented vision-language-action policies eliminate per-task fine-tuning costs by using pre-trained models with indexed demonstrations, enabling efficient cross-embodiment generalization and task adaptation.

Extending a vision-language-action (VLA) policy to a new task typically requires task-specificteleoperated demonstrationsandper-task fine-tuning, making adaptation costly in both data collection and compute. In this paper, we show that this target-side per-task adaptation cost can be replaced by retrieval. Ourretrieval-augmented policyis trained once on paired demonstrations from the target embodiment (query) and a cheaper embodiment (pool, e.g., human-hand video), then frozen. New tasks are added at deployment by appending pool-side demonstrations to aretrieval pool. Thefrozen policyconditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than updating parameters. Fine-tuning is needed only to take on a new, unseen embodiment, not for each new task. We show that retrieval improves policies beyond a specific backbone, including standard VLA policies, but its effect is especially pronounced inCosmos Policy, avideo-generation-based world-action model(WAM). In this setting, retrieval supplies coarse task progression, while the WAM’sfuture-image objectiveprovides an additional visual consistency signal that strengthens the retrieval-conditioned actions. On PushT, we study how retrieval provides a reusable high-level motion prior forcross-embodiment generalizationto unseen goal angles, while on RoboTwin 2.0 our method outperforms cross-embodiment baselines on unseen tasks, and we additionally demonstrate the method on a real robot.

View arXiv pageView PDFProject pageAdd to collection

Get this paper in your agent:

hf papers read 2606\.15631

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2606.15631 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2606.15631 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2606.15631 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles