CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models
Summary
This paper introduces CapVector, a method that decouples auxiliary training objectives from standard supervised finetuning in Vision-Language-Action models. By extracting transferable capability vectors and applying orthogonal regularization, it enhances model performance and generalization while significantly reducing computational overhead.
Source: https://huggingface.co/papers/2605.10903
Abstract
A novel approach decouples auxiliary training objectives from standard supervised finetuning to enhance model capabilities while reducing computational overhead through capability vector merging and orthogonal regularization.
This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from the auxiliary objectives. To combine the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary-objective SFT within the parameter space: enhancing general capabilities and fitting task-specific action distributions. To achieve this, we only need to train the model to convergence on a small-scale task set using two distinct training strategies, yielding two finetuned models. The parameter difference between the two models can then be interpreted as a capability vector contributed by the auxiliary objectives. These vectors are then merged with the pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary-finetuned baselines at reduced computational overhead. Internal and external experiments demonstrate that our capability vectors (1) are effective and versatile across diverse models, and (2) generalize to novel environments and embodiments out of the box.
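The pipeline sketched in the abstract (finetune twice, subtract parameters, merge into the pretrained weights, then run SFT with an orthogonality penalty) is straightforward to express in code. Below is a minimal PyTorch sketch of that idea; all function names, the merge coefficient `alpha`, and the exact form of the orthogonality penalty are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
# Minimal sketch of capability-vector extraction, merging, and a plausible
# orthogonal regularizer. Names and the exact penalty form are assumptions.
import torch

@torch.no_grad()
def extract_capability_vector(aux_model, sft_model):
    """tau = theta_aux - theta_sft: the parameter difference between the
    auxiliary-objective finetune and the plain-SFT finetune of the same
    pretrained model, both trained to convergence on a small task set."""
    sft_params = dict(sft_model.named_parameters())
    return {name: p - sft_params[name]
            for name, p in aux_model.named_parameters()}

@torch.no_grad()
def build_meta_model(pretrained_model, cap_vector, alpha=1.0):
    """theta_meta = theta_pre + alpha * tau. `alpha` is a hypothetical
    merge coefficient; the paper may combine the vectors differently."""
    for name, p in pretrained_model.named_parameters():
        p.add_(alpha * cap_vector[name])
    return pretrained_model

def orthogonal_regularization(model, meta_params, cap_vector, weight=1e-3):
    """Penalize alignment between the ongoing SFT update (theta - theta_meta)
    and the capability vector tau, so fitting task-specific actions does not
    erase the merged capability. One plausible reading of the paper's
    'lightweight orthogonal regularization loss'."""
    dot = 0.0
    tau_sq = 0.0
    for name, p in model.named_parameters():
        delta = p - meta_params[name]   # SFT drift from the meta model
        tau = cap_vector[name]
        dot = dot + (delta * tau).sum()
        tau_sq = tau_sq + (tau * tau).sum()
    # zero when the accumulated update is orthogonal to the capability vector
    return weight * dot.pow(2) / (tau_sq + 1e-8)
```

During standard SFT, the total loss would then be the usual action-prediction loss plus this penalty, e.g. `loss = action_loss + orthogonal_regularization(model, meta_params, tau)`, where `meta_params` are detached copies of the merged weights taken before SFT begins.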
Get this paper in your agent:
hf papers read 2605.10903
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper (1)
haofuly/capvector_models_collection (Robotics), updated about 3 hours ago
Similar Articles
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
D-VLA proposes a high-concurrency distributed asynchronous reinforcement learning framework for Vision-Language-Action models, using plane decoupling and a swimlane pipeline to improve throughput and efficiency in large-scale embodied AI training.
Vokenization: Multimodal Learning for Vision and Language
The article explains 'Vokenization,' a multimodal learning technique that bridges computer vision and natural language processing by using weak supervision to link visual data with language tokens. It contrasts this approach with text-only models like GPT-3 and BERT, highlighting how visual grounding can improve language understanding.
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
This paper introduces ReAD, a reinforcement-guided capability distillation framework that optimizes token budgets by accounting for cross-capability transfer in large language models. It demonstrates improved downstream utility and reduced harmful spillover compared to existing baselines.
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world model decoders, achieving state-of-the-art trajectory prediction accuracy for autonomous driving at answer-only inference latency. It is the first latent CoT method to surpass explicit CoT across four benchmarks.
BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
The paper introduces BalCapRL, a balanced reinforcement learning framework for multimodal large language models that jointly optimizes correctness, coverage, and linguistic quality in image captioning. It demonstrates improved performance over existing methods by addressing trade-offs between utility and fluency through reward decoupling and length-conditional masking.