CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models
Summary
This paper introduces CapVector, a method that decouples auxiliary training objectives from standard supervised finetuning in Vision-Language-Action models. By extracting transferable capability vectors and applying orthogonal regularization, it enhances model performance and generalization while significantly reducing computational overhead.
Source: https://huggingface.co/papers/2605.10903
Abstract
A novel approach decouples auxiliary training objectives from standard supervised finetuning to enhance model capabilities while reducing computational overhead through capability vector merging and orthogonal regularization.
This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from the auxiliary objectives. To combine the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary-objective SFT within the parameter space: enhancing general capabilities and fitting task-specific action distributions. To achieve this, we only need to train the model to convergence on a small-scale task set using two distinct training strategies, yielding two finetuned models. The parameter difference between the two models can then be interpreted as a capability vector contributed by the auxiliary objectives. These vectors are then merged with the pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary-finetuned baselines at reduced computational overhead. Internal and external experiments demonstrate that our capability vectors (1) are effective and versatile across diverse models, and (2) generalize to novel environments and embodiments out of the box.
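The pipeline sketched in the abstract (finetune twice, subtract parameters, merge into the pretrained weights, then run SFT with an orthogonality penalty) is straightforward to express in code. Below is a minimal PyTorch sketch of that idea; all function names, the merge coefficient `alpha`, and the exact form of the orthogonality penalty are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
# Minimal sketch of capability-vector extraction, merging, and a plausible
# orthogonal regularizer. Names and the exact penalty form are assumptions.
import torch

@torch.no_grad()
def extract_capability_vector(aux_model, sft_model):
    """tau = theta_aux - theta_sft: the parameter difference between the
    auxiliary-objective finetune and the plain-SFT finetune of the same
    pretrained model, both trained to convergence on a small task set."""
    sft_params = dict(sft_model.named_parameters())
    return {name: p - sft_params[name]
            for name, p in aux_model.named_parameters()}

@torch.no_grad()
def build_meta_model(pretrained_model, cap_vector, alpha=1.0):
    """theta_meta = theta_pre + alpha * tau. `alpha` is a hypothetical
    merge coefficient; the paper may combine the vectors differently."""
    for name, p in pretrained_model.named_parameters():
        p.add_(alpha * cap_vector[name])
    return pretrained_model

def orthogonal_regularization(model, meta_params, cap_vector, weight=1e-3):
    """Penalize alignment between the ongoing SFT update (theta - theta_meta)
    and the capability vector tau, so fitting task-specific actions does not
    erase the merged capability. One plausible reading of the paper's
    'lightweight orthogonal regularization loss'."""
    dot = 0.0
    tau_sq = 0.0
    for name, p in model.named_parameters():
        delta = p - meta_params[name]   # SFT drift from the meta model
        tau = cap_vector[name]
        dot = dot + (delta * tau).sum()
        tau_sq = tau_sq + (tau * tau).sum()
    # zero when the accumulated update is orthogonal to the capability vector
    return weight * dot.pow(2) / (tau_sq + 1e-8)
```

During standard SFT, the total loss would then be the usual action-prediction loss plus this penalty, e.g. `loss = action_loss + orthogonal_regularization(model, meta_params, tau)`, where `meta_params` are detached copies of the merged weights taken before SFT begins.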
Get this paper in your agent:
hf papers read 2605.10903
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper (1)
haofuly/capvector_models_collection (Robotics), updated about 3 hours ago
Similar Articles
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
D-VLA proposes a high-concurrency distributed asynchronous reinforcement learning framework for Vision-Language-Action models, using plane decoupling and a swimlane pipeline to improve throughput and efficiency in large-scale embodied AI training.
Vokenization: Multimodal Learning for Vision and Language
The article explains 'Vokenization,' a multimodal learning technique that bridges computer vision and natural language processing by using weak supervision to link visual data with language tokens. It contrasts this approach with text-only models like GPT-3 and BERT, highlighting how visual grounding can improve language understanding.
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
This paper introduces ReAD, a reinforcement-guided capability distillation framework that optimizes token budgets by accounting for cross-capability transfer in large language models. It demonstrates improved downstream utility and reduced harmful spillover compared to existing baselines.
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world model decoders, achieving state-of-the-art trajectory prediction accuracy for autonomous driving at answer-only inference latency. It is the first latent CoT method to surpass explicit CoT across four benchmarks.
BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
The paper introduces BalCapRL, a balanced reinforcement learning framework for multimodal large language models that jointly optimizes correctness, coverage, and linguistic quality in image captioning. It demonstrates improved performance over existing methods by addressing trade-offs between utility and fluency through reward decoupling and length-conditional masking.